This week we'll talk about adjusting learning rate and yields of different optimizers and loss functions.

For those of you who aren't familiar of the dataset we use;
We are trying to distinguish different keys from SerpApi's Google Local Results Scraper API Results

## Preliminary Results for Different Sets of Tweaks Graph (explained in detail below)

These results are not from the same training processes with the examples below. Numbers may vary but the average will be the same.

Let me show you the way to load the model we created and how to call it for real work:

Here are real examples of the model's predictions:

[1] pry(PredictValue)> load "Taco Bell"
=> "title"
[2] pry(PredictValue)> load "Quality coffee, chess & late hours are all draws at this relaxed, mural-decorated cafe."
=> "description"
=> "type"
=> "reviews"
[6] pry(PredictValue)> load "Closed ⋅ Opens 8AM Fri"
=> "hours"
=> "hours"
[8] pry(PredictValue)> load ""
=> "price"
=> "rating"
=> "rating"

As you can see, in its current state the output is giving interesting outputs like in the examples of "SerpApi" and "+1 813-876-3354". We are aiming to correct these mistakes and implement a highly efficient model in our parsers.

Let's five deep into different learning rates and how they effect the model.

## Example: Very High Learning Rate

Optimizer: SGD
Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 44 seconds (CUDA Enabled)
Learning Rate: 0.999 (Static)
Accumulative Success: %57.07

This is the problem with the very high learning rate. As you can see, the loss decreases drastically at first, and then goes up or stays same (as seen in the loss graph). This kind of learning rate is equivalent of studying on speed. You think you learn everything but you learn nothing. :)

## Example: Low Learning Rate

Optimizer: SGD
Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 46 seconds (CUDA Enabled)
Learning Rate: 0.001 (Static)
Accumulative Success: %51.9967

At low learning rate, we are able to bring down the loss, however the whole process is too slow for the 30000 epoch length. Also the oscillation in the range of loss is growing telescopically, creating inefficiencies in the trained model.

## High Learning Rate

Optimizer: SGD
Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 44 seconds (CUDA Enabled)
Learning Rate: 0.1 (Static)
Accumulative Success: %59.753299999999996

Learning Rate is too high. You can observe the loss function to get the impression. Roughly %60 seems to be the plateau value for this model with this optimizer and loss function, in this learning rate. So, what can we do to improve the situation? Let's decrease learning weight to a factor of 10 to observe the loss function again.

## Good(ish) Learning Rate

Optimizer: SGD
Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 44 seconds (CUDA Enabled)
Learning Rate: 0.01 (Static)
Accumulative Success: %59.646699999999996

This is more of a healthy line of trend as you can see. However, oscillations are too big. There is also one more problem. The optimal learning rate might change from algorithm to algorithm. What can we do to expand the range of good learning rates without nitty-picking too much? Let's explore in the next part.

First of all, we want our loss function to have a certain shape on the graph. Normally Pytorch has learning rate schedulers that are useful for adapting learning rates when the accumulative success rates are stagnating.

However, I wasn't able to find it in the library we are using. It gives us a good opportunity to come up with our own way. Let's express the shape we want in polynomial form.

For those of you who want to have a closer look, I have used Desmos graphical calculator to come up with it. The functional command is \frac{1000000000}{x-100}+100.

This function gives us a good shape for the loss function. But how can we achieve general adjustment of learning rate that will translate itself in the loss function?

I want the reader to take a closer look. As epoch (x-axis) goes to infinity, loss decreases at a certain rate. We can observe that change with taking the derivative of the ideal loss function we declared.

This is the change we want at loss each epoch. Let's declare it in our code:

Now that we have a way to know what the change in loss should be, let's implement an algorithm that takes note of that and adjusts the learning rate:

Let's break down what we did in here;

if @all_losses.size > 1

We check for @all_losses size to be more than 1. For those of you who didn't read the previous blog post, @all_losses is an array where we collect all loss values at different epoch. The reason we start when it is at least 1 is simple, we want to be able to measure change. You need at least 2 elements to have change.

low_training = (@all_losses[-1] - @all_losses[-2] > 0) || (@all_losses[-1] - @all_losses[-2] > ideal_loss_derivative(epoch-1))

If the change of loss on second to last to last element is greater than 0, we have a problem. Our loss is increasing instead of decreasing. So we can adjust the learning higher to resolve the problem. You can out the graphs for Low Learning Rate and Very High Learning Rate to see that increasing learning rate will do what we need to give our loss graph a shape.

high_training = @all_losses[-1] - @all_losses[-2] < ideal_loss_derivative(epoch-1)

This is the way we will catch high learning rates for different epoch. Derivative of epoch-1 should give us the ideal change we should get. If the real change in loss is greater than what it should be we need to lower it to get a good learning rate.

      if low_training
@learning_rate = @learning_rate * 2
elsif high_training
@learning_rate= @learning_rate * 0.2
end

Without a precise function on how much to increase or decrease, let's apply fractional changes. The reason we are applying fractional changes is simple; to keep the learning rate above 0.

One more factor here to consider is that low training rate is more dangerious than high training rate. Simply put; low training rate will increase the loss function and will lead to deficiencies in the model whereas high training rate at different levels at small scale will result in deficiency in optimization of the function. So, the aim is more important than the optimal state of the function.

Let's implement adjust_lr at where we expand @all_losses array:

      if epoch % @plot_every == 0
@all_losses.append(@current_loss / @plot_every)
success_rates.append(success_rate)
@current_loss = 0
end

Of course, this will still yield bad results if the learning rate is too high or too low. But this function will change the definition of too high and too low and broaden our chances of getting a good learning rate, and provide smoother oscillation in long-range training.

## Trying Different Sets of Tweaks

Optimizer: SGD
Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 49 seconds (CUDA Enabled)
Accumulative Success: %59.1733

Optimizer: SGD
Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 49 seconds (CUDA Enabled)
Accumulative Success: %59.1500

Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 52 seconds (CUDA Enabled)
Accumulative Success: %28.4367

Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 50 seconds (CUDA Enabled)
Accumulative Success: %18.07

Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 50 seconds (CUDA Enabled)
Accumulative Success: %59.2433

Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 48 seconds (CUDA Enabled)
Accumulative Success: %59.56

Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 50 seconds (CUDA Enabled)
Accumulative Success: %60.4333

Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 50 seconds (CUDA Enabled)
Accumulative Success: %60.9199

Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 51 seconds (CUDA Enabled)
Accumulative Success: %60.7733

Loss: CrossEntropyLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 54 seconds (CUDA Enabled)
Accumulative Success: %60.2733

Loss: NLLLoss
Hidden Layers: 128 (Only One)
Epochs: 30000
Runtime: 55 seconds (CUDA Enabled)
Accumulative Success: %60.19