Optimizers

<aside> 💡 Gradient Descent

The basic update: each step moves the weights against the gradient of the loss, scaled by the learning rate $\alpha$:

$W^t = W^{t-1} - \alpha \, dW^t$

</aside>
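A minimal runnable sketch of the plain update rule $W^t = W^{t-1} - \alpha\, dW^t$ on a toy one-dimensional loss $f(w) = (w-3)^2$ (the loss and learning rate are illustrative choices, not from the note):

```python
def gradient_descent(w, lr=0.1, steps=100):
    """Plain gradient descent on f(w) = (w - 3)^2."""
    for _ in range(steps):
        dw = 2.0 * (w - 3.0)  # gradient of (w - 3)^2
        w = w - lr * dw       # W^t = W^{t-1} - alpha * dW^t
    return w

w_final = gradient_descent(0.0)  # converges toward the minimum at w = 3
```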

<aside> 💡 Momentum maintains a per-parameter moving average of past gradients, which smooths the updates and improves performance on problems with noisy or sparse gradients (e.g. natural language and computer vision problems). We add one extra variable, $V^t$, to plain gradient descent:

$V^t = \beta_1 V^{t-1} + (1-\beta_1)\, dW^t$

And we correct the bias toward zero in the early steps with

$V^t_{corr} = V^t/(1-\beta_1^t)$

and we use it in the main update to get gradient descent with momentum: $W^t = W^{t-1} - \alpha \, V^t_{corr}$

</aside>
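A sketch of one momentum step with the bias correction above, again on the toy loss $(w-3)^2$ (hyperparameter values are illustrative):

```python
def momentum_step(w, v, dw, t, lr=0.1, beta1=0.9):
    """One gradient-descent-with-momentum step (t is 1-indexed)."""
    v = beta1 * v + (1 - beta1) * dw  # V^t = beta1*V^{t-1} + (1-beta1)*dW^t
    v_corr = v / (1 - beta1 ** t)     # bias correction for the early steps
    w = w - lr * v_corr               # W^t = W^{t-1} - alpha * V^t_corr
    return w, v

# usage: minimize (w - 3)^2
w, v = 0.0, 0.0
for t in range(1, 301):
    dw = 2.0 * (w - 3.0)
    w, v = momentum_step(w, v, dw, t)
```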

<aside> 💡 RMSProp (Root Mean Square Propagation)

also maintains per-parameter learning rates, adapted based on the average of recent magnitudes of the gradients for each weight (i.e. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy ones). We again add a variable that takes the previous steps into account, $S^t$:

$S^t = \beta_2 S^{t-1} + (1-\beta_2)\,(dW^t)^2$

We apply the same kind of bias correction, $S^t_{corr} = S^t/(1-\beta_2^t)$, and then divide the gradient in the update by the square root of this corrected value (with a small $\epsilon$ added for numerical stability):

$W^t = W^{t-1} - \alpha\, dW^t/(\sqrt{S^t_{corr}} + \epsilon)$

</aside>
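A sketch of the RMSProp step on the same toy loss. Note that classic RMSProp (as usually presented) omits the bias correction; the corrected form here follows the note's equations, and the constants are illustrative:

```python
def rmsprop_step(w, s, dw, t, lr=0.01, beta2=0.999, eps=1e-8):
    """One RMSProp step with bias correction (t is 1-indexed)."""
    s = beta2 * s + (1 - beta2) * dw * dw  # S^t = beta2*S^{t-1} + (1-beta2)*(dW^t)^2
    s_corr = s / (1 - beta2 ** t)          # bias correction
    w = w - lr * dw / (s_corr ** 0.5 + eps)
    return w, s

# usage: minimize (w - 3)^2
w, s = 0.0, 0.0
for t in range(1, 1001):
    dw = 2.0 * (w - 3.0)
    w, s = rmsprop_step(w, s, dw, t)
```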

<aside> 💡 Adam optimizer (Adaptive Moment Estimation): $V^t$ is the first moment (the mean of the gradients) and $S^t$ is the second moment (the uncentered variance). The parameters $\beta_1$ and $\beta_2$ control the decay rates of these two moving averages. We put the two previous ideas together:

$W^t = W^{t-1} - \alpha\, V^t_{corr} / (\sqrt{S^t_{corr}} + \epsilon)$

</aside>
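Combining the momentum and RMSProp pieces gives a minimal Adam step; this is a sketch on the same toy loss with illustrative hyperparameters:

```python
def adam_step(w, v, s, dw, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t is 1-indexed)."""
    v = beta1 * v + (1 - beta1) * dw       # first moment (mean of gradients)
    s = beta2 * s + (1 - beta2) * dw * dw  # second moment (uncentered variance)
    v_corr = v / (1 - beta1 ** t)          # bias corrections
    s_corr = s / (1 - beta2 ** t)
    w = w - lr * v_corr / (s_corr ** 0.5 + eps)
    return w, v, s

# usage: minimize (w - 3)^2
w, v, s = 0.0, 0.0, 0.0
for t in range(1, 5001):
    dw = 2.0 * (w - 3.0)
    w, v, s = adam_step(w, v, s, dw, t)
```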

<aside> 💡 Warm-up in learning-rate schedules

If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.

Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the desired convergence, as the model un-trains those early superstitions.

Many frameworks offer this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is p and the warm-up period is n, then the first batch iteration uses 1*p/n for its learning rate; the second uses 2*p/n, and so on: iteration i uses i*p/n, until we hit the nominal rate at iteration n.

This means that the first iteration gets only 1/n of the primacy effect. This does a reasonable job of balancing that influence.

Note that the ramp-up is commonly on the order of one epoch -- but is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust, depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.

</aside>
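The linear schedule described above can be sketched as a small helper (the function name and the example values p = 0.1, n = 5 are illustrative):

```python
def warmup_lr(i, target_lr, warmup_steps):
    """Linear warm-up: iteration i (1-indexed) uses i * p / n, capped at p."""
    if i >= warmup_steps:
        return target_lr
    return i * target_lr / warmup_steps

# with target rate p = 0.1 and warm-up period n = 5:
rates = [warmup_lr(i, 0.1, 5) for i in range(1, 8)]
# rates climb 0.02, 0.04, 0.06, 0.08 and then stay at the nominal 0.1
```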

Parametric and non-parametric algorithms

Algorithms that make a prior assumption about the distribution of our data and try to find the best parameters to fit it are parametric. Examples: a simple neural network, linear/logistic regression, Naive Bayes (e.g., each word in a document has a probability of being seen in each class).

But the ones that don't assume any distribution or fixed set of parameters, like KNN or SVM, instead try to fit the best decision region for each class; these are non-parametric.

Classification and its challenges

Imbalanced classes: in classification (especially binary tasks like outlier detection or fraud detection), using accuracy alone can be misleading, as we might score highly on the dominant class while performing poorly on the minority class.

What do we do then? Use recall and precision for each class.
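A toy illustration of why accuracy misleads here: with an invented 95/5 class split, a model that always predicts the majority class scores 95% accuracy, yet its recall on the minority class is zero. This is a minimal sketch, not a library API:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Per-class precision and recall computed from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# invented toy data: 95 negatives, 5 positives (fraud-like imbalance)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a useless model that always predicts the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec, rec = precision_recall(y_true, y_pred)
# accuracy is 0.95, yet recall on the minority class is 0.0
```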

Types of algorithms suitable for classification in ML: