Why does regularizing the bias lead to underfitting in neural networks?
The importance of bias
As a child, you might remember writing the equation of a straight line as $y = mx + c$. Now that you are all grown up, you probably write it as $y = w_0 + w_1x$. Both $m$ and $w_1$ denote the slope of the straight line, while $c$ and $w_0$ denote its intercept/bias. The value of $c$ or $w_0$ determines where the line crosses the $y$-axis, and hence how far the line is offset from the origin. Fig. 1 provides a quick refresher.
Fig. 1: A visual representation of the straight line equation
Fig. 2 (a): Linear Regression when bias is turned off
Fig. 2 (b): Linear Regression when bias is turned on
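The contrast between the two panels of Fig. 2 can be reproduced numerically. What follows is a minimal sketch rather than the code behind the figures: it assumes ordinary least squares via `np.linalg.lstsq`, and the synthetic data (true slope 2, true intercept 5), the noise level and all variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2x + 5 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 5.0 + rng.normal(0.0, 0.5, size=200)

# Bias turned off: the design matrix holds only x, so the line must pass
# through the origin.
X_no_bias = x[:, None]
w_no_bias, *_ = np.linalg.lstsq(X_no_bias, y, rcond=None)

# Bias turned on: a column of ones lets the model learn the intercept w_0.
X_bias = np.column_stack([np.ones_like(x), x])
w_bias, *_ = np.linalg.lstsq(X_bias, y, rcond=None)


def mse(X, w):
    """Mean squared error of the fitted line on the training data."""
    return float(np.mean((X @ w - y) ** 2))


mse_no_bias = mse(X_no_bias, w_no_bias)
mse_bias = mse(X_bias, w_bias)

print("bias off: slope =", w_no_bias[0], "MSE =", mse_no_bias)
print("bias on : intercept =", w_bias[0], "slope =", w_bias[1], "MSE =", mse_bias)
```

With the intercept column present, the fit lands close to the slope and intercept that generated the data. Without it, the line is forced through the origin, the slope has to compensate for the missing intercept, and the error is noticeably larger.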
Then what role does the bias play in Neural Networks (NN)?
The general formulation of the output of a single neuron in a neural network is $z = \sum_i w_ix_i + b$, where $w_i$ are the weights associated with each input $x_i$ and $b$ is the bias of that neuron. When there is only a single input, the formula becomes $z = w_1x_1 + b$, which is exactly the straight line equation shown previously. In other words, before the activation is applied, each neuron in a NN computes a straight line (a hyperplane when there are several inputs). Granted, the output $z$ then passes through a non-linear activation function, which turns the straight line into a non-linear one, but the fact remains that a single neuron and linear regression rest on the same underlying principle. This equivalence is established so that readers can keep in mind that, from here on, the statements in this article about linear regression also apply to neural networks.
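The correspondence between a single-input neuron and the straight line equation can be checked directly. The sketch below is illustrative only: the helper names `neuron_pre_activation` and `sigmoid`, the choice of a sigmoid activation, and the numeric values of `w1` and `b` are assumptions made for this example.

```python
import numpy as np


def neuron_pre_activation(x, w, b):
    """Pre-activation of one neuron: z = sum_i w_i * x_i + b."""
    return float(np.dot(w, x) + b)


def sigmoid(z):
    """One possible non-linear activation (an assumption for this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))


# Single-input case: the neuron reduces to the line z = w1 * x + b.
w1, b = 3.0, -1.0
xs = np.linspace(-2.0, 2.0, 5)

z_neuron = np.array(
    [neuron_pre_activation(np.array([x]), np.array([w1]), b) for x in xs]
)
z_line = w1 * xs + b

# The two computations agree: before the activation, the neuron is a line.
print(np.allclose(z_neuron, z_line))

# The activation then bends that line into a non-linear curve.
a = sigmoid(z_neuron)
print(a)
```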
Thus, bias is crucial for achieving good performance for any line fitting algorithm.
Regularization and Bias
A natural question to ask here is: if the bias is just another parameter that needs to be learned, why isn't it regularized like all the other parameters/weights?
Code Block 1: Simple implementation of L2 regularized linear regression, on synthetic data, once with bias outside the regularization term and once with bias inside the regularization term
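A minimal sketch of such a comparison might look as follows. It is an illustration under stated assumptions, not a definitive implementation: it uses the closed-form ridge solution with NumPy, a hypothetical helper `ridge_fit`, and synthetic data, a noise level and a regularization strength `lam` chosen purely for demonstration.

```python
import numpy as np


def ridge_fit(x, y, lam, regularize_bias):
    """Closed-form L2-regularized least squares for y ~ w_0 + w_1 * x.

    Solves (X^T X + lam * P) w = X^T y, where P is the identity when the
    bias is penalized, and the identity with a zero in the bias position
    when it is not.
    """
    X = np.column_stack([np.ones_like(x), x])
    penalty = np.eye(2)
    if not regularize_bias:
        penalty[0, 0] = 0.0  # leave the intercept w_0 out of the penalty
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


def mse(x, y, w):
    """Mean squared error of the line w_0 + w_1 * x on (x, y)."""
    return float(np.mean((w[0] + w[1] * x - y) ** 2))


# Synthetic data (assumed): a line with a large intercept, far from the origin.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + 10.0 + rng.normal(0.0, 0.3, size=50)
lam = 50.0  # illustrative regularization strength

w_free = ridge_fit(x, y, lam, regularize_bias=False)  # bias outside the penalty
w_reg = ridge_fit(x, y, lam, regularize_bias=True)    # bias inside the penalty

print("bias not regularized: w =", w_free, "MSE =", mse(x, y, w_free))
print("bias regularized    : w =", w_reg, "MSE =", mse(x, y, w_reg))
```

In this setup the slope is shrunk in both fits, but only the fit that penalizes the bias has its intercept pulled away from the generating value of 10 towards zero, which is what produces the larger error.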
It is evident from Fig. 3 that regularizing the bias leads to substantial underfitting. This agrees with our intuition: to reduce the overall loss, the model shrinks the bias term towards zero just as it does the other weights, which drags the fitted line towards the origin and away from the data. The same reasoning extends to Neural Networks, since the bias plays the same role there.
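The shrinkage of the bias can also be seen analytically. As a sketch, assume a sum-of-squares loss over $n$ points with an L2 penalty of strength $\lambda$ applied to both $w_0$ and $w_1$ (the exact scaling of the loss and of $\lambda$ is an assumption here):

$$L(w_0, w_1) = \sum_{i=1}^{n} (y_i - w_0 - w_1x_i)^2 + \lambda (w_0^2 + w_1^2)$$

Setting the derivative with respect to $w_0$ to zero gives

$$\frac{\partial L}{\partial w_0} = -2\sum_{i=1}^{n} (y_i - w_0 - w_1x_i) + 2\lambda w_0 = 0 \quad\Rightarrow\quad w_0 = \frac{n}{n + \lambda}\,(\bar{y} - w_1\bar{x})$$

where $\bar{x}$ and $\bar{y}$ are the sample means. If the bias is left out of the penalty, the same step yields $w_0 = \bar{y} - w_1\bar{x}$. For a given slope $w_1$, penalizing the bias therefore multiplies the intercept by $n/(n+\lambda) < 1$, pulling it towards zero regardless of where the data actually lie.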
Fig. 3: A plot demonstrating that regularizing the bias shifts the line-of-best-fit towards the origin leading to substantial underfitting