Why does regularizing the bias lead to underfitting in neural networks?
The importance of bias
As a child, you might remember writing the equation of a straight line as $y = mx + c$. Now that you are all grown up, you probably write it as $y = w_0 + w_1x$. Both $m$ and $w_1$ denote the slope of a straight line while $c$ and $w_0$ denote its intercept/bias. The value of $c$ or $w_0$ determines the distance of the line from the origin. Fig. 1 provides a quick refresher.
Fig. 1: A visual representation of the straight line equation
What happens when $w_0=0$ such that $y = w_1x$?
When fitting a straight line to a dataset (a process known as linear regression), if the bias term is fixed at $0$, i.e., $w_0=0$, the line of best fit is forced to pass through the origin. This is usually a poor solution because it leads to large residuals (underfitting): the line cannot shift itself away from the origin to match the data. Thus, it is advisable to learn the bias term just like we learn the weights. If the data genuinely call for a line through the origin, the model will simply learn to set $w_0$ to $0$ on its own. Fig. 2 demonstrates this effect.
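To make this concrete, here is a minimal sketch of the idea (the synthetic data and the use of scikit-learn's LinearRegression are my own illustration, not the code behind Fig. 2): fitting the same noisy line once with and once without an intercept.

```python
# Sketch: fitting a line with and without the bias/intercept term.
# Data and model choices here are illustrative assumptions, not the article's.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x.ravel() + 20.0 + rng.normal(0, 1, size=100)  # true line: y = 3x + 20

with_bias = LinearRegression(fit_intercept=True).fit(x, y)
without_bias = LinearRegression(fit_intercept=False).fit(x, y)  # forces w0 = 0

print("with bias   :", with_bias.coef_[0], with_bias.intercept_)
print("without bias:", without_bias.coef_[0], without_bias.intercept_)

# The intercept-free fit is forced through the origin and leaves much larger residuals.
print("MSE with bias   :", np.mean((with_bias.predict(x) - y) ** 2))
print("MSE without bias:", np.mean((without_bias.predict(x) - y) ** 2))
```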
Fig. 2 (a): Linear Regression when bias is turned off
Fig. 2 (b): Linear Regression when bias is turned on
Then what role does the bias play in Neural Networks (NN)?
The general formulation of the output of a single neuron in a neural network is $z = \sum_i w_ix_i + b$, where $w_i$ are the weights associated with each input $x_i$ and $b$ is the bias of that neuron. In the case of a single input, i.e., $i=1$, this reduces to $z = w_1x_1 + b$, which is exactly the straight-line formula shown previously. In other words, each neuron in a NN fits a straight line. Granted, the output $z$ is then passed through a non-linear activation function that bends this line, but the fact remains that a single neuron and linear regression share the same underlying principles. This equivalence is worth keeping in mind because, from this point on, the statements made about linear regression apply to neural networks as well.
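A small illustrative sketch of this equivalence (the function name and the choice of tanh as the activation are my own, purely for demonstration): a single neuron with one input computes exactly $w_1x_1 + b$ before any activation is applied.

```python
# Sketch: a single neuron is a straight line followed by an activation.
import numpy as np

def neuron(x, w, b, activation=np.tanh):
    """Single neuron: z = sum_i w_i * x_i + b, then an activation."""
    z = np.dot(w, x) + b          # the linear (straight-line) part
    return z, activation(z)       # the activation only bends the line afterwards

x = np.array([2.0])               # single input, i = 1
w = np.array([0.5])               # slope (w1)
b = 1.5                           # bias  (b, playing the role of w0)

z, a = neuron(x, w, b)
print(z)   # 2.5 -> identical to y = w1*x1 + b from linear regression
print(a)   # tanh(2.5), the activated output
```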
Thus, bias is crucial for achieving good performance for any line fitting algorithm.
Regularization and Bias
A natural question to ask here is: if the bias is just another parameter to be learned, why isn't it included in regularization like all the other parameters/weights?
The Deep Learning book by Goodfellow, Bengio, and Courville states that regularizing the bias term can introduce significant underfitting. Why is this the case?
Let's take a step back and think about how overfitting is prevented using regularization. In the case of linear regression, if a regularization term (say $L2$) is added to the loss function, then weights with large values are penalized. This helps because when a weight is large, a small change in its corresponding feature produces a substantial change in the model's output, which increases the model's variance and causes overfitting. Penalizing large weights is therefore an effective means of controlling overfitting. Let's quickly have a look at the equations before proceeding further. $$h_w(x) = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$$ $$J(W) = \frac{1}{2m} \left[\sum^m_{i=1} \left(h_w \left(x^i \right) - y^i \right)^2 + \lambda \sum^n_{j=1} w^2_j \right]$$
Here, $h_w(x)$ is the prediction of the linear regression model, $J(W)$ is the loss function, and $\lambda$ is the regularization hyper-parameter. Upon closer inspection, we can see that the regularization sum starts from $j=1$ rather than $j=0$, meaning that the bias $w_0$ is not regularized.
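A short sketch of this loss in code may help (the function name and the convention that the design matrix carries a leading column of ones, so that w[0] is the bias, are my own assumptions mirroring the equations above):

```python
# Sketch: L2-regularized linear regression loss, with the bias left out of the penalty.
import numpy as np

def ridge_loss(w, X, y, lam):
    """J(W): X is assumed to have a leading column of ones so that w[0] is the bias w0."""
    residuals = X @ w - y
    penalty = lam * np.sum(w[1:] ** 2)   # sum starts at j = 1: the bias is not penalized
    return (np.sum(residuals ** 2) + penalty) / (2 * len(y))
```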
For argument's sake, let us assume that $w_0$ (the bias) is included in the regularization term. In order to minimize $J(W)$, the model will shrink all parameters/weights, including $w_0$. This pulls the line of best fit towards the origin, ending up in much the same situation as fixing the bias to $0$ and thus leading to underfitting. Note that there is no constraint on the sign of $w_0$: after regularization, $w_0$ can still be negative or positive, but it is guaranteed to be pushed closer to $0$.
The same reason is applicable for not regularizing bias in neural networks.
Experiment
Now that we have a possible hypothesis for explaining why regularizing the bias leads to underfitting, let us verify it with code.
For this experiment, we are going to generate synthetic data. Code Block 1 contains all the code required to generate the synthetic data as well as the other parts of the experiment. We are going to apply $L2$ regularized linear regression with a single input variable.
In the linear regression equation, the bias is always multiplied by $1$, while the rest of the learnable parameters are multiplied by their corresponding inputs. To force the bias into the regularization term, we create a new feature with all values equal to $1$ and set fit_intercept to False in sklearn's implementation of ridge regression. This fixes the model's built-in bias to zero, while the weight corresponding to the all-ones column acts as the new bias and now takes part in regularization. Code Block 1 demonstrates this idea.
Code Block 1: Simple implementation of L2 regularized linear regression, on synthetic data, once with bias outside the regularization term and once with bias inside the regularization term
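The embedded code is not reproduced above, so here is a minimal sketch of the setup it describes (the synthetic data, the choice of the regularization strength alpha, and the printed diagnostics are my own assumptions, not the original Code Block 1):

```python
# Sketch of the experiment: L2-regularized linear regression with the bias
# outside vs. inside the regularization term.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * x.ravel() + 30.0 + rng.normal(0, 2, size=200)   # true line: y = 2x + 30
alpha = 100.0                                              # L2 strength (lambda)

# Case 1: bias kept outside the regularization term (the usual setup).
ridge = Ridge(alpha=alpha, fit_intercept=True).fit(x, y)

# Case 2: bias forced inside the regularization term.
# Append a column of ones and turn off sklearn's intercept, so the weight on
# the ones-column plays the role of the bias and is penalized like any other weight.
X_ones = np.hstack([np.ones_like(x), x])
ridge_biased = Ridge(alpha=alpha, fit_intercept=False).fit(X_ones, y)

print("bias outside regularization:", ridge.intercept_, ridge.coef_)
print("bias inside  regularization:", ridge_biased.coef_)   # first weight = shrunken bias

print("MSE, bias outside:", np.mean((ridge.predict(x) - y) ** 2))
print("MSE, bias inside :", np.mean((ridge_biased.predict(X_ones) - y) ** 2))
```

Under this setup, the regularized bias is pulled well below the true intercept, and the fit with the bias inside the penalty shows a noticeably larger error, which is the behaviour Fig. 3 illustrates.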
It is evident from Fig. 3 that regularizing the bias leads to substantial underfitting of the model. This agrees with our hypothesis: to reduce the overall loss, the model shrinks the weight acting as the bias towards $0$, which pulls the line of best fit towards the origin and produces a poor fit. The same reasoning extends to Neural Networks, since their neurons rely on the bias in exactly the same way.
Fig. 3: A plot demonstrating that regularizing the bias shifts the line-of-best-fit towards the origin leading to substantial underfitting
And so we have successfully demystified the role the bias plays in both linear regression and NNs, and why regularizing it is not such a good idea.