How to invent your own new ML algorithm: A novel ensembling technique


Introduction

So you want to make it big in the AI industry? Well, what better way is there than to have an algorithm named after you, one that you developed from scratch? But developing a new Machine Learning (ML) algorithm is not trivial, we hear you say? Yes, of course. Your current best option is to read research papers, but let's be honest, most ML papers are written by researchers for fellow researchers only. But do not fret! In this article we will lay out a methodical way of developing a brand new ML algorithm. And we are not just going to stop at giving you an intuition. We will develop a novel ensembling technique from scratch, right here, that you can go out and start using in your latest Kaggle competition!

The following is a brief overview of the step-by-step process involved in developing a new ML algorithm:

  • Devise a Problem: According to the no free lunch theorem [1], there exists no single algorithm that can solve all ML problems. In other words, every single problem in ML requires its own specialized solution, and since no method is perfect, there is always room for improvement. In short, problems are everywhere. You just need to look.
  • Literature Survey: This step requires you to read all previous solutions to your problem.
  • Think, Think & Think: Build upon the solutions found during the literature survey.
  • Code & Experiment a.k.a. The Moment of Truth: Once the idea is in place, code it out in your favourite programming language. Then, perform an exhaustive comparison of your new algorithm against previous solutions in order to evaluate the amount of improvement your method brings to the table. If it isn't much then go back to point three and start over.

Devise a Problem

Ensembling is the process of combining two or more algorithms trained on the same task in such a way that the final method outperforms all of its individual models (algorithms).

A good ensemble is almost always a surefire way to improve the accuracy of an ML pipeline. There are many ensembling techniques, with weighted averaging being arguably the most famous one.

A simple method of combining the outputs of, say, two classification models is to multiply each of their output probabilities by $0.5$ and then sum them up. $$Y = 0.5*Y_1 + 0.5*Y_2$$ $$\text{where } Y_1 \text{ and } Y_2 \text{ are the output} \\ \text{probabilities of the two models}$$ The above equation gives equal importance (equal weight) to both models in the final output. But this is not always appropriate, since one of the models might be better at the given task than the other. In that case, more weight can be given to the better performing model. This is known as weighted averaging, and its equation is given below. $$Y = p*Y_1 + (1-p)*Y_2$$ $$\text{where } p \in [0,1] \text{ is the weight}$$


Fig. 1: A diagrammatic representation of weighted averaging
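In code, static weighted averaging is essentially a one-liner. Below is a minimal sketch in NumPy; note that `proba_1` and `proba_2` are stand-ins for the predicted probabilities of two already-trained models, and the weight $p = 0.7$ is an arbitrary choice for illustration.

```python
# A minimal sketch of static weighted averaging for two classifiers.
# proba_1 and proba_2 are hypothetical (n_samples, n_classes) arrays of
# predicted class probabilities from two already-trained models.
import numpy as np

rng = np.random.default_rng(0)
proba_1 = rng.dirichlet(np.ones(3), size=5)  # stand-in for model 1's output
proba_2 = rng.dirichlet(np.ones(3), size=5)  # stand-in for model 2's output

p = 0.7  # static weight favouring model 1; the same p is used for every sample
blended = p * proba_1 + (1 - p) * proba_2

predictions = blended.argmax(axis=1)  # final class labels
```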

The question is, can we improve upon this ubiquitous method? To answer that, let's dig a little deeper. Why does weighted averaging work? If one model makes a mistake on some sample, the other model can correct that mistake, i.e., if one is wrong, the other fixes the error. This implies that neither model is correct on all samples, which in turn means that the better performing model in the ensemble does not necessarily perform better on every sample. Yet we are giving the same weight to both models for every sample. We have thus identified a problem with the current weighted averaging scheme: it is static, i.e., each model gets the same weight irrespective of its performance on a given sample. What we want instead is a dynamic weighting scheme, i.e., weights that change for each model according to the sample.

Literature Survey

Wait. Don't.

Do not get us wrong. A literature survey is an integral part of any novel algorithm development process, but unless you are a research scholar who needs to publish a paper, we recommend that you take your time and think of a solution by yourself. This will help improve your critical thinking ability.

Think, Think & Think

Our problem statement is to find a way of assigning a different weight to each model in the ensemble for every sample, based on the model's performance on that sample. In ML, we always have a train and a test set for training and evaluating a model, respectively. Finding out a model's performance on a train sample is easy because the corresponding label already exists. But for the test set this is not possible, as the labels are unknown. So, it is clear that in order to achieve dynamic weights, we need to map each test sample to the closest possible train samples.


Fig. 2: A visualization of the curse of dimensionality. A dataset of 20 samples was generated from a random uniform distribution and the distance of each sample from every other sample was computed, resulting in this 20x20 distance matrix. As the number of dimensions increases, the data points move further apart. This can be seen above as the cells in the matrix transition from a deeper color (indicating smaller distances) to a lighter color (indicating larger distances). Thus, for higher dimensions, an exponentially larger number of samples is required to cover the space.

 

You might be tempted to use the Euclidean distance to find the train samples closest to each test sample and call it a day. Hold your horses, because this method is going to fail miserably for multiple reasons. First, as a consequence of the curse of dimensionality, the sample space (a hypothetical space where all your data points live) behaves in a non-Euclidean way as the number of features/dimensions increases; for example, the shortest path between two points on a curved surface is not a straight line. Second, most distance functions do not work for categorical features out-of-the-box. Third, finding the train sample closest to a test sample requires computing the distance of every training sample from that single test sample. Clearly, distance based techniques are too slow and unsuitable for our needs.
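If you would like to see the first problem for yourself, here is a small sketch (the dimensions, sample count, and seed are our own choices) that measures how pairwise distances between uniformly random points concentrate as dimensionality grows, which is the effect animated in Fig. 2:

```python
# A quick numerical illustration of the curse of dimensionality: as the
# number of dimensions grows, pairwise Euclidean distances between uniformly
# random points concentrate, making "nearest" progressively less meaningful.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(20, d))  # 20 random samples in d dimensions
    dists = pdist(X)               # all pairwise Euclidean distances
    # The spread-to-mean ratio shrinks with d: every point becomes
    # roughly equally far from every other point.
    print(f"d={d:5d}  mean={dists.mean():.2f}  std/mean={dists.std() / dists.mean():.3f}")
```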

From analyzing why distance based methods are not a good fit for our problem, we have derived an understanding of the properties of the mapping function that we require. Following are the desired properties:

  • It should not carry any prior beliefs (like "the Euclidean distance will work all the time!") about the dataset it will be applied to, and should thus be able to generalize well across datasets
  • It should work out-of-the-box with categorical features

For the first point, wouldn't it be great if the mapping function could divide the train set into different clusters/sections and assign a test sample to one of these clusters based on certain properties? This would also make our final solution fast, since we won't be comparing against all training samples. It turns out there is an ML algorithm that perfectly fits this line of thought. It uses the target variable (labels) to divide the training data into sections/clusters and handles categorical features without much hassle. Can you guess the name of this algorithm?

If you said Decision Trees, well, sir, you have our respect!

The Algorithm


Fig. 3: A visualization of a decision tree classifier on the Iris dataset. Clearly, the model is creating partitions in the sample space.

 

If you remember your ML basics, then you know that a decision tree learns by splitting a feature on some threshold at each layer, as shown in Fig. 3. Every time a feature is split, the sample space also splits, creating clusters of data points. Once the decision tree has finished learning, each of its leaf nodes can be thought of as the ID of a particular cluster. When a test sample is passed through the model, it reaches a leaf node that points to the cluster containing the most similar train samples. Thus, a decision tree is the perfect mapping function for our case.
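This mapping is already available in standard libraries. As a small sketch (using scikit-learn's built-in Iris dataset purely for illustration), the `apply()` method of a fitted decision tree returns the ID of the leaf that each sample lands in:

```python
# A minimal sketch of using a decision tree's leaf nodes as cluster IDs.
# apply() returns, for each sample, the index of the leaf it lands in.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Keep the tree shallow so each leaf covers a reasonably large cluster.
mapper = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

train_leaf_ids = mapper.apply(X)  # partition ID for every training sample
print(set(train_leaf_ids))       # one ID per leaf (IDs are node indices, not 0..3)

# At test time, the same call maps a new sample to its closest partition.
test_leaf_id = mapper.apply(X[:1])[0]
```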

Therefore, our new dynamic weighted averaging method will contain the following steps:

  1. Learn a shallow decision tree with a small number of leaf nodes on the given train set. Learning a deep tree with a large number of leaf nodes would result in tiny, homogeneous partitions with too few samples to train on, which would impede the learning of downstream models.
  2. Pass the training set back through the model and create a feature that contains the ID of the leaf node that each sample ends up in. This becomes our mapping.
  3. Now group the samples by their leaf node ID to create, let's say, $B$ separate partitions.
  4. Perform K-Fold cross-validation with your favourite set of models on each of the $B$ partitions in order to determine robust weights for each model on every partition.
  5. For every cluster/partition we now have a different weight for each model. Thus, on one partition, one model might perform better and receive the largest weight, while on another partition a different model might perform best and receive the largest weight.
  6. Now, train the ensemble again, but this time on the whole dataset. The previous step was performed only to find the correct weight of each model within the ensemble for a particular partition.
  7. When a test sample is passed through the learned decision tree, it is mapped to a particular partition. The individual model weights corresponding to that partition are then used with the final ensemble, trained on the whole data, to get a dynamically weighted output.
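Before moving to the full experiments, here is a condensed, self-contained sketch of the seven steps above. It is an illustration under our own assumptions, not the reference implementation: the helper name `find_weight`, the weight grid, and the use of scikit-learn's RandomForestClassifier and GradientBoostingClassifier as stand-ins for the LightGBM and XGBoost models used later are all ours, and edge cases (partitions that are too small or contain a single class) are glossed over. The complete, reproducible notebooks are linked in the next section.

```python
# A condensed sketch of the seven-step algorithm above (illustrative only).
# Assumptions: X and y are NumPy arrays; RandomForest/GradientBoosting
# stand in for the LightGBM/XGBoost models used in the actual experiments.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def find_weight(m1, m2, X, y, folds=5):
    """Steps 4-5: grid-search the blend weight p on out-of-fold predictions."""
    p1 = cross_val_predict(m1, X, y, cv=folds, method="predict_proba")
    p2 = cross_val_predict(m2, X, y, cv=folds, method="predict_proba")
    classes = np.unique(y)        # column order used by predict_proba
    grid = np.linspace(0, 1, 21)  # candidate weights 0.00, 0.05, ..., 1.00
    scores = [accuracy_score(y, classes[(p * p1 + (1 - p) * p2).argmax(axis=1)])
              for p in grid]
    return grid[int(np.argmax(scores))]

def fit_dynamic_blend(X, y, n_partitions=4):
    # Steps 1-2: a shallow decision tree is the mapper; leaf IDs are partition IDs.
    mapper = DecisionTreeClassifier(max_leaf_nodes=n_partitions,
                                    random_state=0).fit(X, y)
    leaves = mapper.apply(X)
    # Steps 3-5: one weight per partition, found by inner cross-validation.
    weights = {leaf: find_weight(RandomForestClassifier(random_state=0),
                                 GradientBoostingClassifier(random_state=0),
                                 X[leaves == leaf], y[leaves == leaf])
               for leaf in np.unique(leaves)}
    # Step 6: train both models once on the whole training set.
    m1 = RandomForestClassifier(random_state=0).fit(X, y)
    m2 = GradientBoostingClassifier(random_state=0).fit(X, y)
    return mapper, weights, m1, m2

def predict_dynamic_blend(mapper, weights, m1, m2, X_test):
    # Step 7: look up each test sample's partition, then blend with its weight.
    p = np.array([weights[leaf] for leaf in mapper.apply(X_test)])[:, None]
    blended = p * m1.predict_proba(X_test) + (1 - p) * m2.predict_proba(X_test)
    return blended.argmax(axis=1)
```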

Experiment, Code & Compare

Experimental Settings

We downloaded two multi-class classification datasets from AnalyticsVidhya, namely the Janatahack: Customer Segmentation Hackathon dataset and the Janatahack: ML for Banking Hackathon dataset, in order to evaluate our dynamic weighted averaging (dynamic blending) ensemble method on real world data. We used LightGBM and XGBoost on both datasets. 5-Fold Cross Validation (CV) was used in order to robustly evaluate all methods, and an inner 5-Fold CV was used to find the correct weights.

Code

We have provided the complete notebooks for both datasets below. The results are completely reproducible, and each cell within the notebooks has been richly annotated so that you can easily follow along.

As can be seen in both notebooks, the weights assigned to LightGBM and XGBoost differ in each partition. In certain partitions XGBoost performs better than LightGBM, and vice versa. These differing weights are what give rise to dynamic blending.

For those who would like to take our code for a spin, please check out the following links:

Code 1: Notebook for dynamic weighted averaging (dynamic blending) on AnalyticVidhya’s Janatahack: Customer Segmentation Hackathon dataset

 

Code 2: Notebook for dynamic weighted averaging (dynamic blending) on AnalyticVidhya’s Janatahack: ML for Banking Hackathon dataset

 

Comparison

As can be seen from Code 1 and Code 2, we compared dynamic blending against the individual models as well as static blending. Table 1 provides the results explicitly. Clearly, our method is an improvement over the default blending technique.

| Dataset                           | XGBoost | LightGBM | Static Blending | Dynamic Blending |
|-----------------------------------|---------|----------|-----------------|------------------|
| Janatahack: Customer Segmentation | 53.6191 | 54.0898  | 54.1147         | **54.2014**      |
| Janatahack: ML for Banking        | 53.1522 | 54.3311  | 54.4145         | **54.4428**      |

Table 1: Scores in the first row are in terms of accuracy, while scores in the second row are in terms of F1-micro. Bold values indicate the best performance.

Discussion

We have finally arrived at the end of this roller coaster ride! You are now armed with the knowledge of a brand new, never-before-seen ensembling algorithm that also performs better than its static, old counterpart. Based on Table 1, you might be tempted to go and apply this ensembling technique in all your competitions, but hold your horses. Though dynamic blending does perform better than static blending, it can lead to overfitting, as it increases the variance of the overall system. Thus, our new technique is great but needs to be used with caution, and it is definitely not a replacement for good old static blending.

There are still many avenues for improving this algorithm, for example, using a random forest instead of a single decision tree as the mapper, with voting across trees to determine the partition for a given sample, as sketched below. Other improvements can be made along similar lines. If you have any ideas, please comment down below!
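As one possible reading of the random forest idea (our own interpretation, not a tested implementation), every tree in the forest could partition the data independently, per-tree per-leaf weights could be learned exactly as before (reusing the hypothetical `find_weight` helper from the earlier sketch), and a test sample's final weight would then be the average over all trees, i.e., a soft vote:

```python
# A sketch of the random forest extension (one interpretation only).
# Reuses the hypothetical find_weight helper from the earlier sketch.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_forest_mapper(X, y, m1, m2, n_trees=10, n_partitions=4):
    forest = RandomForestClassifier(n_estimators=n_trees,
                                    max_leaf_nodes=n_partitions,
                                    random_state=0).fit(X, y)
    leaves = forest.apply(X)  # shape (n_samples, n_trees): one leaf ID per tree
    # One weight table per tree, learned exactly as in the single-tree version.
    weight_tables = [{leaf: find_weight(m1, m2,
                                        X[leaves[:, t] == leaf],
                                        y[leaves[:, t] == leaf])
                      for leaf in np.unique(leaves[:, t])}
                     for t in range(n_trees)]
    return forest, weight_tables

def forest_weights(forest, weight_tables, X_test):
    leaves = forest.apply(X_test)  # one partition ID per tree per test sample
    # Soft vote: a sample's final weight is the mean of its per-tree weights.
    return np.array([[weight_tables[t][leaf] for t, leaf in enumerate(row)]
                     for row in leaves]).mean(axis=1)
```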

Now that you know how to go about building new algorithms, we wish you the best of luck on your journey to becoming a famous data scientist with your very own machine learning algorithm!

 

References

[1] Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation 1, no. 1 (1997): 67-82.

Cite us

For attribution in academic contexts or open source libraries, please cite this work as:

@misc{hassanbasu2021blending,
  title={How to invent your own new ML algorithm: A novel ensembling technique},
  author={Hassan, Atif and Basu, Sayantan},
  year={2021},
  howpublished={\url{https://www.deepwizai.com/projects/how-to-invent-your-own-new-ml-algorithm-a-novel-ensembling-technique}}
}
