Deep Learning or Machine Learning?

There is no such thing as "Machine Learning vs. Deep Learning", since DL is a subset of ML. Rather, this article distinguishes between the set of algorithms traditionally considered part of ML and those considered part of DL.

What led us to Deep Learning (DL)? Why can’t Machine Learning (ML) work well on text, image, and speech data? Why are Neural Networks so powerful? What is this generalizability that people talk about?

In this blog, we will answer all of these questions. By the end, the reader will have a solid understanding of the major areas where machine learning fails and how deep learning overcomes them.


The Universal Approximation Theorem

Before we talk about the famous Universal Approximation Theorem (UAT), we would like to point out that for every machine learning task you have ever done or will do, there exists a hypothetical function that maps the input data to the output with zero error. This is important to understand, since the entire field of AI is dedicated to approximating such functions.

UAT[1] tells us that a neural network with a single hidden layer containing an arbitrary number of neurons can approximate any continuous function to arbitrary precision[2]. This also holds depth-wise, in the sense that a neural network of arbitrary depth with a fixed number of neurons in each layer can likewise approximate any such function (see Fig. 1). Sadly, no equivalent of UAT exists for traditional ML algorithms.
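For concreteness, here is one standard way of writing the single-hidden-layer statement (this follows Cybenko's formulation for a sigmoidal activation $\sigma$; it is a supplement of ours, not part of the original post): for any continuous function $f$ on a compact set and any tolerance $\epsilon > 0$, there exist a width $N$, weights $w_i$, biases $b_i$, and output coefficients $c_i$ such that

$$\sup_{x} \left| f(x) - \sum_{i=1}^{N} c_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \epsilon.$$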

Based on these facts, one might quickly arrive at the conclusion that throwing a neural network at any task should give us the best possible results, right? Well, yes and no. For example, if you browse through the winning solutions on Kaggle, they are almost always dominated by Gradient Boosting Machines (GBMs) such as XGBoost, LightGBM, CatBoost, etc. So what's going on? Well, the answer isn't that straightforward.


Fig. 1: Neural Nets and the corresponding functions they represent. This figure gives a visual representation of the UAT.
Taken from "A visual proof that neural nets can compute any function" [2]


DL vs ML on Structured vs Unstructured Data

There is no doubt that DL is an absolute beast when it comes to unstructured data, which spans domains such as Computer Vision (CV), Natural Language Processing (NLP), audio synthesis, etc. This is because domains such as NLP try to approximate human language, an extremely complex and arbitrary function. Such approximations require models that can learn intricate patterns from large volumes of data.

We have all heard the phrase, "DL scales well with data, unlike ML". This is because neural networks (NNs) are over-parameterized, meaning that the model has more learnable parameters than there are samples in the data. But how does this help, given that we could also build very large GBMs if we really wanted to? The reason is that NNs create their own features, which are refined over time via backpropagation. These features are optimal for the task at hand, unlike the features we hand-engineer for ML algorithms. Thus, over-parameterized networks do not saturate with more data, unlike their ML counterparts, and therefore scale well as data grows. But more data is not the sole reason behind DL's performance on unstructured data. Enter inductive bias.
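As a rough illustration of what "over-parameterized" means (a toy sketch of ours, not code from the original post; the dataset size and architecture are arbitrary assumptions), even a small dense network can easily carry far more learnable parameters than training samples:

```python
# Toy illustration: a modest dense network vs. a hypothetical 1,000-sample dataset.
import tensorflow as tf

n_samples, n_features = 1000, 20  # assumed dataset size

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])

# ~71k parameters, roughly 70x the number of samples.
print(model.count_params(), "parameters for", n_samples, "samples")
```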

Inductive bias is simply the set of assumptions we use to build a model. For example, the way an RNN processes data is akin to how humans read text: left to right, while keeping the previous words in memory. Similarly, CNNs are modeled on the way we look at images and build up information: we never take in an image as a whole but in local patches. This luxury of specialized architectures for different domains has not been extended to ML, and this is why ML algorithms, in their current state, will never outperform DL on unstructured data.
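To make the CNN case concrete, here is a small sketch (ours, not the author's) of how the "local patches" bias is baked in: a 3x3 convolution looks only at local neighborhoods and shares its weights across the whole image, so it needs orders of magnitude fewer parameters than a dense layer over the same 28x28 input:

```python
# Toy illustration of inductive bias: convolution (local, weight-shared)
# vs. a fully connected layer over the same 28x28 grayscale image.
import tensorflow as tf

conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3)
conv.build((None, 28, 28, 1))   # 3*3*1*32 + 32 = 320 parameters

dense = tf.keras.layers.Dense(units=32)
dense.build((None, 28 * 28))    # 784*32 + 32 = 25,120 parameters

print(conv.count_params(), dense.count_params())
```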

Fig. 2: Winning solution for ASHRAE - Great Energy Predictor III challenge on Kaggle. The dataset was structured and DL was not utilized.

Fig. 3: Winning solution for IEEE-CIS Fraud Detection challenge on Kaggle. Again, the dataset was structured and DL was not utilized.

A totally different situation arises when dealing with structured data (see Fig. 2 and Fig. 3). As we have already pointed out, most of the top data science solutions are built on GBMs. The reason behind this is twofold.

  • Firstly, structured data, unlike its unstructured counterpart, does not usually possess highly abstract patterns/relationships between variables (input features). And even when it does, clever feature engineering can help bring out those patterns. In such a scenario, boosted stumps can quickly fit the data. Based on UAT, one might argue that if boosting can approximate the hypothetical data-to-target mapping, then surely an NN can as well. This is true, but herein lies another issue. A deep NN is essentially a combination of the number of layers, the number of neurons, the types of layers (e.g. batch normalization, dropout, etc.), and the parameters of those layers. This yields a practically infinite search space. Somewhere in this space exists the perfect model, one that could potentially outperform even a highly tuned GBM, but the time required to find such a network via tuning is simply impractical.

  • Secondly, recall that one of the reasons for DL's high performance on unstructured data is inductive bias. As it turns out, no appropriate bias exists for DL on tabular data, though this is changing with the introduction of networks such as TabNet[3].


Local Constancy or Smoothness Prior

In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. For example, the k-nearest neighbors model assumes that two points sharing the same k nearest neighbors should have the same output. Such an assumption, or "prior belief", about the underlying probability distribution can steer the model away from learning the true distribution.

Choosing a machine learning model for any problem results in a biased view of the function that needs to be approximated. This is why we are told to try out multiple algorithms and see which one fits best since each model has its own set of assumptions and therefore its own way of approximating the true function.

The most widely used implicit prior is the smoothness prior, or local constancy prior. It states that the function we learn should not change much within a small region. Thus, most ML algorithms assume that the learned function does not vary much around any point's neighborhood, i.e., $f(x) \approx f(x + \epsilon)$, where $\epsilon$ is a small change in $x$. This basic assumption can turn out to be very wrong in some cases.

The chessboard example

Consider the popular chessboard example, wherein you are provided with samples on a chessboard and asked to predict their color. For ML, the alternating black and white pattern is hard to capture unless it is provided explicitly as a feature. So if a test sample falls outside the region covered by the training set, it may be classified based solely on the nearest training point, which is very likely to have the opposite color. Hence, the smoothness assumption fails. Let's prove this with some code.

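The original code embed is not reproduced in this mirror; below is a minimal sketch matching the description in the text (400 training samples from the bottom-left region, 100 test samples from the top-right, labels alternating like chessboard squares). The exact region bounds, cell size, and random seed are assumptions.

```python
# Sketch: chessboard data with a bottom-left train region and a
# top-right test region. Labels follow the parity of the cell indices,
# so colors alternate like squares on a chessboard.
import numpy as np

rng = np.random.default_rng(42)  # assumed seed

def make_chessboard(n_samples, low, high, cell=1.0):
    X = rng.uniform(low, high, size=(n_samples, 2))
    y = ((np.floor(X[:, 0] / cell) + np.floor(X[:, 1] / cell)) % 2).astype(int)
    return X, y

X_train, y_train = make_chessboard(400, low=0.0, high=4.0)  # bottom-left
X_test, y_test = make_chessboard(100, low=4.0, high=8.0)    # top-right
```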

Code to generate chessboard data with train and test splits

The above code results in a training dataset of 400 samples and a test dataset of 100 samples, each sample having two features specifying its X and Y coordinates. The datasets are balanced. Multiple ML algorithms, namely logistic regression, decision tree, random forest, and XGBoost, were applied, but all of them yielded 50% accuracy on the test set. The reason is evident from Fig. 4, which shows that every algorithm classified all test samples as either positive (blue area) or negative (yellow area), meaning that none of the algorithms performed better than a random guess.

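Again, the embed is missing; here is a minimal sketch of the comparison described in the text, continuing from the data sketch above. Default hyperparameters are used since the original settings are unknown.

```python
# Sketch: fit several classical models on the chessboard data and
# evaluate them on the held-out (top-right) region.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    # Each model scores around 50% here, i.e. a random guess.
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.2f}")
```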

Code to train and test multiple ML models

Fig. 4: (a) The generated chessboard dataset. The bottom-left samples are for training and the top-right samples are for testing. White samples are the positive class and black samples are the negative class. (b) Visualization of logistic regression's decision boundary. The blue area indicates the positive class and the yellow area indicates the negative class. (c) Visualization of the decision tree's decision boundary. (d) Visualization of the random forest's decision boundary. (e) Visualization of XGBoost's decision boundary.

We then deployed a simple neural network for the same task and achieved 80% accuracy on the test set (and 59.9% on the training set). The network has a single hidden layer of 40 ReLU neurons followed by a two-neuron softmax output. Fig. 5 makes the reason for the improved test accuracy evident: the network's decision boundary shows that it does not classify all test samples into the same class, proving our point that DL is better at generalizing.

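The embed is again missing; here is a minimal tf.keras sketch matching the architecture the text describes (one hidden layer of 40 ReLU units, a two-neuron softmax output). The optimizer, loss, and training budget are assumptions, and exact accuracies will vary from those reported.

```python
# Sketch: the single-hidden-layer network described in the text,
# trained on the chessboard data from the earlier sketch.
import tensorflow as tf

nn = tf.keras.Sequential([
    tf.keras.layers.Dense(40, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
nn.compile(optimizer="adam",
           loss="sparse_categorical_crossentropy",
           metrics=["accuracy"])

nn.fit(X_train, y_train, epochs=200, verbose=0)  # assumed training budget
_, acc = nn.evaluate(X_test, y_test, verbose=0)
print(f"neural network: test accuracy = {acc:.2f}")
```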

Code to create a neural network for the chessboard problem

Fig. 5: Visualization of the decision boundary learned by a neural network for the chessboard problem

Adding to this, to achieve true generalization, learners need to generalize non-locally, i.e., understand the behavior of data points in regions outside the training dataset. Deep architectures show more promise in this regard. For example, Bengio and Monperrus[4] demonstrate that points lying outside known regions need not behave like their nearest known neighbors (a direct disagreement with the local constancy prior). They introduce a way of measuring "how far" new, unknown points lie outside known regions, based on which the model decides whether to associate them with existing nearby regions or to use tangent planes to understand their behavior.


Manifold Learning

An important concept underlying many ideas in machine learning is that of a manifold. A manifold is an abstract mathematical space whose global structure might be complex (higher-dimensional) but whose local structure is simple (lower-dimensional). For example, we live on a manifold: the Earth is a sphere (global structure), but to us it looks like a plane (local structure). Similarly, in ML, it has been observed that samples can often be represented well with far fewer dimensions than the original feature space. For example, in a blog by Jake VanderPlas[5], a dummy set of points is generated to look like the word "HELLO" in 3D space. Using the pairwise distances between all points, the author was able to represent the same data in two dimensions while preserving the relationships between points (pairwise distances are unaffected by translation and rotation). This is the goal of manifold learning.
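As a small illustration of the same idea (a sketch of ours, not VanderPlas's code): embed a flat 2D point cloud in 3D, then recover a 2D representation from pairwise distances using metric MDS:

```python
# Sketch: points that are intrinsically 2D, embedded in 3D, can be
# recovered in 2D from their pairwise distances (manifold learning).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
points_2d = rng.uniform(size=(200, 2))                  # intrinsic structure
basis = np.linalg.qr(rng.normal(size=(3, 3)))[0][:, :2] # random orthonormal plane
points_3d = points_2d @ basis.T                         # the same points in 3D

mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
recovered = mds.fit_transform(points_3d)                # back to 2 dimensions
print(recovered.shape)                                  # (200, 2)
```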

Fig. 6: (a) A 3D surface. (b) The same surface in 3D space but with points. (c) and (d) represent the same 3D surface but in 2D space. Taken from the paper "Manifold learning for visualizing and analyzing high-dimensional data" [6]

Extracting such manifolds is challenging. Simple ML algorithms can estimate manifolds properly only if the points exhibit linear relationships with one another (see Fig. 6). When the relationships become non-linear, a common scenario for real-world datasets, these algorithms fail. Deep learning, in this regard, is better at estimating such manifolds and learning meaningful patterns from them[7]. Deep networks automate the feature extraction process to a large extent and, given enough examples, are able to learn better representations of the data.

References

  1. Wikipedia. "Universal approximation theorem." Wikipedia (2020).
  2. Michael Nielsen. "A visual proof that neural nets can compute any function." neuralnetworksanddeeplearning.com (2019).
  3. Arik, Sercan O., and Tomas Pfister. "TabNet: Attentive Interpretable Tabular Learning." arXiv preprint arXiv:1908.07442 (2019).
  4. Bengio, Yoshua, and Martin Monperrus. "Non-local manifold tangent learning." Advances in Neural Information Processing Systems, pp. 129-136 (2005).
  5. Jake VanderPlas. "In-Depth: Manifold Learning." jakevdp.github.io.
  6. Zhang, Junping, Hua Huang, and Jue Wang. "Manifold learning for visualizing and analyzing high-dimensional data." IEEE Intelligent Systems 4, pp. 54-61 (2010).
  7. Brahma, Pratik Prabhanjan, Dapeng Wu, and Yiyuan She. "Why deep learning works: A manifold disentanglement perspective." IEEE Transactions on Neural Networks and Learning Systems 27, no. 10, pp. 1997-2008 (2015).