DeepQnA
Get answers to all your Machine Learning queries
-
Why do we care about an activation function's output being zero-centred in Deep Learning?
Firstly, what is a zero-centred activation function? An activation function whose output takes both positive and negative values, such as tanh, is zero-centred.
Why is it important? Let $y = f(xw)$, where $y$ is the output, $f$ is a non-zero-centred activation function (such as the sigmoid), $x$ is the input and $w$ are the weights. Let us also assume that $x > 0$ for all values. In that case, $\frac{dy}{dw} = f'(xw)\,x$. Since $f'(xw) > 0$ for the sigmoid and $x > 0$ by assumption, the gradient with respect to every weight has the same sign in a given update step: all positive or all negative, depending on which way the model needs to move. Because the weights can only move together in the all-positive or all-negative direction, the model takes a zig-zag path and converges to the optimal solution very slowly.
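A tiny numpy sketch of this effect, using a single sigmoid unit and a squared-error loss purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=5)    # strictly positive inputs, as assumed above
w = rng.normal(size=5)
target = 0.0                         # arbitrary target for the toy loss

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

y = sigmoid(x @ w)
# squared-error loss L = (y - target)^2, so dL/dw = 2*(y - target) * y*(1 - y) * x
grad = 2 * (y - target) * y * (1 - y) * x

print(np.sign(grad))   # every component shares one sign, since y*(1-y) > 0 and x > 0 elementwise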
But then why does ReLU work so well? Because both zero-centred activations such as tanh and non-zero-centred activations such as the sigmoid suffer from gradient saturation, which leads to little or no learning, whereas ReLU does not saturate for positive inputs. At the same time, the non-zero-centred problem can be mitigated if the input $x$ also contains negative values or if the weights are initialized properly.
But then why are more activation functions being proposed in deep learning?
That's because ReLU has its own issue of "dead neurons" and also converges slowly (slow convergence is typical of non-zero-centred activations). Hence, newer activations like Mish and ELU have achieved state-of-the-art results by addressing such problems.
-
How many samples does the Random Forest classifier consider when bootstrapping and why?
Unlike feature subsampling, where typically only around the square root of the total number of features is considered, each base learner in a random forest receives $N$ samples drawn with replacement from the original dataset of $N$ samples. That's right, the number of samples drawn is the same as the size of the original dataset!!
When we pick $N$ samples with replacement from an $N$-sample dataset, each base learner ends up with roughly $\frac{2}{3}$ of the original samples as unique samples, while the remaining $\frac{1}{3}$ of its draws are duplicates. Okay, but where do this $\frac{2}{3}$ and $\frac{1}{3}$ come from?
The probability of a particular sample getting selected in a single draw from $N$ samples is $\frac{1}{N}$, so the probability of it not getting selected in that draw is $(1-\frac{1}{N})$. Easy so far?
Since bootstrapping makes $N$ independent draws, the probability of a sample never getting selected across all $N$ draws is $(1-\frac{1}{N})^N$.
It turns out that $(1-\frac{1}{N})^N$ is well approximated by $\frac{1}{e} \approx 0.37$, which we can round to roughly $\frac{1}{3}$. Hence, about $\frac{1}{3}$ of the original samples never get selected into a given bootstrap sample, so the fraction of unique samples each base learner sees is about $(1-\frac{1}{3}) = \frac{2}{3}$, with the remaining draws being duplicates.
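You can sanity-check this with a quick simulation (the value of $N$ below is arbitrary):

import numpy as np

rng = np.random.default_rng(42)
N = 10_000
bootstrap = rng.integers(0, N, size=N)   # N draws with replacement
print(len(np.unique(bootstrap)) / N)     # ~0.632, i.e. 1 - 1/e, roughly 2/3
print(1 - 1 / np.e)                      # the theoretical value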
-
Should you use learning rate schedulers with adaptive learning rate based optimisers such as ADAM?
Learning rate schedulers take the idea of a constant learning rate (lr) and make it dynamic, meaning that the lr changes (grows or decays) with the number of epochs or some other validation measure.
Adaptive-learning-rate optimizers such as ADAM give each parameter in a neural network its own effective learning rate, which adapts over time. This means that ADAM already has a dynamic learning rate, and the authors of the original paper state that ADAM is quite robust to the initial learning rate hyperparameter chosen by the user.
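For reference, ADAM's update rule (from the original paper) scales the global step size $\eta$ per parameter using running estimates of the gradient's first and second moments:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \frac{\eta\,\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Since $\hat{v}_t$ is computed per parameter, every parameter effectively gets its own step size.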
Long story short, theoretically speaking, you should NOT use learning rate schedulers with optimizers such as ADAM, RMSProp and the like. Since both mechanisms do pretty much the same thing, using them together can result in unexpected learning rate behaviour.
-
Should BatchNormalization be applied before or after the activation function?
The output of a Neural Network (NN) layer can be written as: $$y = f(Wx + b) $$
where, $f(..)$ is the activation function, $W$ is the weight matrix, $x$ is the output from the previous layer and $b$ is the bias.
Let's denote $BN(..)$ as the batchnorm function. Then, we have 3 options where we can apply this function.
Option 1: $y = BN(f(Wx + b))$
Option 2: $y = f(BN(Wx + b))$
Option 3: $y = f(W(BN(x)) + b)$
Which option is correct? Well, before answering that, remember what BatchNormalization does: it standardizes a layer's input (subtract the mean and divide by the standard deviation). For this to work well, you want an input that is symmetric and non-sparse, in other words something close to a Gaussian distribution. Now, which option do you think is better suited for this purpose?
In Option 1, if $f(..)$ is ReLU, the output becomes sparse, so the chance of getting a non-sparse, Gaussian-like distribution is low.
In Option 3, you have the same issue, because $x$ is simply the output $y$ of the previous layer, which makes it equivalent to Option 1.
In Option 2, after multiplying $x$ by the weight matrix, you are more likely to get something close to a Gaussian distribution (non-sparse and symmetric) than in the other options. Thus, standardizing in Option 2 yields more stable results.
And that is why, folks, BatchNormalization should always be applied before the activation function!!
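As a minimal PyTorch sketch of Option 2 (the layer widths are arbitrary placeholders):

import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # the bias is redundant: BatchNorm's own shift absorbs it
    nn.BatchNorm1d(64),              # standardize the pre-activation Wx
    nn.ReLU(),                       # apply the non-linearity after normalization
)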
-
BatchNormalization does not work for the reasons you think it does
In the original paper, titled "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", the authors state that the distribution of the input to each layer of a neural network (NN) changes whenever the previous layer gets updated.
In a sense, this means every layer receives a completely new input distribution in each iteration. This is known as internal covariate shift and, according to the paper, it leads to barely any learning in deeper neural networks.
In follow-up work, the paper "How Does Batch Normalization Help Optimization?" challenges this belief and shows that BatchNorm's effectiveness does not come from fixing "internal covariate shift" but rather from making the loss landscape of NNs smoother.
Well, what does that mean? You might have heard that NNs pass through a lot of local minima during optimization, i.e. the path to an optimal solution is full of valleys, which makes learning really hard for NNs. BatchNorm smooths out these valleys and makes them easier for NNs to traverse.
A more technical way to think about a "smooth loss landscape": if you perturb the weights of an NN by a small amount, the model's accuracy remains roughly the same, because the loss surface around the solution is nearly flat (smooth) and whichever direction you move in, you are still at about the same height. Hence, in Deep Learning (DL), smoothness is associated with generalization.
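Here is a hypothetical PyTorch sketch of that sanity check, where model, loss_fn, x and y stand in for your own network and data:

import torch

@torch.no_grad()
def loss_after_perturbation(model, loss_fn, x, y, scale=1e-3):
    baseline = loss_fn(model(x), y).item()
    for p in model.parameters():
        p.add_(scale * torch.randn_like(p))   # nudge every weight by a small random amount
    perturbed = loss_fn(model(x), y).item()
    return baseline, perturbed                # on a flat (smooth) solution these stay close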
DeepTricks
The tips and tricks you need to boost your ML game
-
What do you do when you have large data but low compute power?
If the dataset is huge, tuning any model on top of it will take a lot of time. Instead, tune your model on a reasonably sized subset of the data. Once you have settled on a model using this smaller dataset, it should also fit well on the original, larger dataset.
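A rough sklearn sketch of the idea, with a synthetic dataset and an arbitrary model and search space standing in for your own:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_small, _, y_small, _ = train_test_split(X, y, train_size=10_000, stratify=y, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0), {"max_depth": [4, 8, None]}, cv=3)
search.fit(X_small, y_small)                          # cheap tuning on the subset

final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)                                 # refit the chosen configuration on everything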
-
How to do cross-validation specifically for data science hackathons the right way?
The answer is a double K-Fold CV.
When you tune your model and perform feature engineering against a single K-Fold CV, you basically end up overfitting that CV by the end of the competition. A simple way to get rid of this problem is to use two wildly different K-Fold CVs. So if you use sklearn, try something like this:

from sklearn.model_selection import KFold, StratifiedKFold

cv_a = KFold(n_splits=5, shuffle=True, random_state=0)
cv_b = StratifiedKFold(n_splits=3, shuffle=True, random_state=27)

Now you have a really robust CV: unless both of your CV scores increase, you will know that your current tuning or feature engineering is not good enough. Thus, if you overfit one, the other will save you!
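For instance, assuming model, X and y are placeholders for whatever you are currently iterating on, you can check every change against both splits:

from sklearn.model_selection import cross_val_score

score_a = cross_val_score(model, X, y, cv=cv_a).mean()   # model, X, y: your pipeline and data
score_b = cross_val_score(model, X, y, cv=cv_b).mean()
# keep the change only if BOTH score_a and score_b improve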
Note: This is not applicable to time-series tasks.
-
Too lazy to ensemble? Well, the out-of-fold prediction trick is here!
Take your model and create a K-Fold cross-validation split. Train your model on K-1 folds, and instead of only evaluating on the held-out fold, predict directly on your test dataset. Repeat this process K times and take the average of all K sets of test predictions as your final output.
This gives you a robust solution and helps your predictions generalize.
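A minimal sketch of this procedure (the regressor is a placeholder, and X_train, y_train, X_test are assumed to be numpy arrays):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def kfold_test_predictions(X_train, y_train, X_test, n_splits=5):
    preds = np.zeros(len(X_test))
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X_train):
        model = GradientBoostingRegressor(random_state=0)   # placeholder model
        model.fit(X_train[train_idx], y_train[train_idx])
        preds += model.predict(X_test) / n_splits           # running average over the K fold-models
    return preds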
-
Ever heard about nested cross-validation (CV)?
When you have relatively small training data and you want to cross-validate as well as tune your ML model, how do you do so? Well, here comes nested CV to your rescue!!
Create a k-fold CV on your data and let's call this the outer CV. Within each iteration of the outer CV, you basically have a unique train-test (TT) split. Now, take this TT split and apply a separate k-fold CV on the train set. This becomes your inner CV.
How is this helpful? You can use the inner CV to tune your model's hyper-parameters, while the outer CV gives you an estimate of the tuned model's performance using all of your data!
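Here is what that looks like with sklearn (the dataset, model and grid below are just stand-ins):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # used for hyper-parameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # used for the performance estimate

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)          # the search is re-run inside every outer fold
print(scores.mean(), scores.std())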
Note: Nested CV is really popular in research as it allows you to tune as well as test your model using cross-validation which results in robust performance estimates.