How to perform Unsupervised Feature Selection using Supervised Algorithms
This article walks you through the step-by-step process by which we came up with a new Machine Learning (ML) algorithm that performs unsupervised feature selection using any supervised algorithm of your choice. Our goal is to give you insight into the process of developing a new ML algorithm. If you want to use the algorithm, it is available on the following GitHub repo as well as a pip-installable Python library, and it works exactly like any scikit-learn model. There is also a Kaggle notebook that tries out FRUFS on a competition dataset, complete with simulations on synthetic data.
Introduction
In supervised ML tasks, feature importance is defined as a score attributed to a feature, or a group of features, for predicting the target variable. This score is important as it forms the very basis of feature selection. A feature (or group of features) that is able to predict the target well receives a much larger score than one that cannot. Hence, feature selection boils down to selecting either the top K individually most important features$^{[1]}$ or the most important group of features$^{[2]}$.
For an unsupervised task having no labels, the definition of feature importance is not immediately apparent.
To find a solution to the above problem, let us first understand why we perform feature selection in a supervised setting. Feature selection reduces overfitting$^{[3]}$, removes noisy variables, reduces space usage, etc. The common thread among all these reasons is that a subset of the features is enough to predict the target variable. In other words, some features are a good enough representation of all the other features.
Based on the above line of reasoning, we can say that the most important feature is the one that can largely represent all other features. Similarly, the second most important feature can approximate all other features but not as well as the most important one and so on. Armed with this idea, we can now define a new feature importance score.
Feature Importance in an unsupervised setting
Let us think about linear regression for a bit. $$y^i = \alpha_1x_1^i + … + \alpha_dx_d^i = \sum_{j=1}^{d} \alpha_jx_j^i$$ where $y^i$ is the $i$-th sample's output, $i \in \{1, 2, …, n\}$, $\alpha_j$ is the coefficient of feature $x_j$, $j \in \{1, 2, …, d\}$, $d$ is the total number of features and $n$ is the number of samples.
In such a form, $\alpha_j$ is also interpreted as the importance of the corresponding feature $x_j$ since larger coefficients mean more weightage in predicting the target variable.
Now, let's say we rewrite the above equation as $$x_k^i = \sum_{j \neq k} \alpha_jx_j^i$$ where $k \in \{1, 2, …, d\}$ and the sum runs over every $j \in \{1, 2, …, d\}$ with $j \neq k$. What this equation means is that we are trying to predict the $k$-th feature using all the other features. In such a scenario, $\alpha_j$ corresponds to how important the $j$-th feature is in predicting the $k$-th feature. Thus, the feature with the largest $\alpha_j$ is the feature that can best represent the feature currently being predicted.
In ordinary regression, a single set of coefficients suffices because there is only one variable to predict. In our case, we need to predict $d$ different variables, so we need a different set of coefficients for each feature. Thus, we rewrite the above equation as $$x_k^i = \sum_{j \neq k} \alpha_j^kx_j^i$$ where $\alpha_j^k$ is the $j$-th feature's coefficient for predicting the $k$-th feature. This notation lets us clearly distinguish between the coefficients used to predict different features.
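To make this concrete, here is a minimal sketch (on a toy random matrix, not the datasets used later) that fits one linear regression per feature and stores the learned coefficients as one row of a $(d \times d)$ matrix $W$:

```python
# Minimal sketch of the per-feature regression idea on toy data:
# for every feature k, regress it on the remaining features and store
# the learned coefficients in row k of W (the diagonal stays 0).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # toy data: n = 200 samples, d = 5 features
n, d = X.shape

W = np.zeros((d, d))
for k in range(d):
    others = [j for j in range(d) if j != k]           # every feature except the k-th
    reg = LinearRegression().fit(X[:, others], X[:, k])
    W[k, others] = reg.coef_                           # alpha_j^k for all j != k
```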
Instead of writing in scalar form, we can rewrite the previous equation in concise matrix notation. Let $X$ be an $(n \times d)$ input matrix with $n$ samples and $d$ features (columns). Let $W$ be a randomly initialized $(d \times d)$ coefficient matrix that we want to learn, with $W_{ii}=0$ (all diagonal elements are set to $0$ so that a feature cannot be both target and input at the same time). Consider the first row of $W$: its values are the coefficients of features $2$ to $d$ that we will learn for predicting feature $1$. Now consider the first column of $W$: its values are the coefficients of feature $1$ in predicting all the other features. Looking back at the second equation, the left-hand side is an input feature while the right-hand side is a sum of the remaining features multiplied by coefficients. Hence, what we want to do is the following,
$$X \approx XW^T$$$$\Rightarrow X - XW^T \approx 0$$$$\therefore \min_W \left\lVert X - XW^T \right\rVert_F$$ Here, $\lVert\cdot\rVert_F$ denotes the Frobenius norm, i.e. the square root of the sum of squares of all the entries of a matrix.
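Continuing the toy sketch above, the matrix objective is just the Frobenius norm of the reconstruction residual left by the learned $W$:

```python
# With the W learned in the previous snippet, evaluate the matrix-form
# objective ||X - X W^T||_F (the reconstruction error of the self-representation).
reconstruction = X @ W.T
frobenius_error = np.linalg.norm(X - reconstruction, ord="fro")
print(f"||X - XW^T||_F = {frobenius_error:.4f}")
```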
Congratulations!! We just re-discovered an algorithm proposed by a highly cited 2015 research paper$^{[4]}$ in the domain of unsupervised feature selection!! So yes, even you can publish great papers!😉😉
Now, how do we find the overall importance of each feature? Consider the first column of $W$, i.e. the entries $W_{i1}$ for $i \in \{1, 2, …, d\}$. The $i$-th value in this column is the first feature's coefficient (importance) in predicting the $i$-th feature. Thus, it makes sense to simply average these values to get an overall importance for the first feature (the zero diagonal entry scales every column's average by the same factor, so it does not affect the ranking). This score tells us how well feature $1$ represents the rest of the dataset. By taking the average of each column, we end up with an importance score for every feature without needing any labels!
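In code, this is a column-wise average of $W$. One detail that is my own assumption rather than something spelled out above: I average absolute values so that large negative coefficients also count towards importance.

```python
# Column j of W holds feature j's coefficients across all d regressions.
# Averaging each column (of absolute values, an assumption on my part)
# gives one label-free importance score per feature.
importance = np.abs(W).mean(axis=0)
ranking = np.argsort(importance)[::-1]   # most important feature first
print("Importance scores:", importance)
print("Feature ranking:", ranking)
```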
Refining our approach
Improving the idea proposed in the previous section is quite straightforward. What we have done so far is basic linear regression, and we all know what comes after linear regression: non-linear regression! If you have used XGBoost or related algorithms, you know that they inherently provide a feature importance score. So, instead of $\alpha_j$, we will use the importance score returned by our favourite supervised algorithm.
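As an illustration of the swap, here is a sketch that reuses the toy data from above but replaces the linear model with a decision tree (any model exposing `feature_importances_`, such as XGBoost, would work the same way):

```python
# Non-linear variant of the same procedure: predict each feature from the
# others with a tree model and record its per-feature importance scores.
from sklearn.tree import DecisionTreeRegressor

W_tree = np.zeros((d, d))
for k in range(d):
    others = [j for j in range(d) if j != k]
    model = DecisionTreeRegressor(max_depth=5, random_state=0)
    model.fit(X[:, others], X[:, k])
    W_tree[k, others] = model.feature_importances_   # importance of each other feature in predicting feature k

importance_tree = W_tree.mean(axis=0)   # tree importances are non-negative, so a plain mean suffices
```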
Experiment
We conduct an initial experiment on the MNIST dataset. Both the train and test data are randomly sub-sampled in a stratified manner, resulting in new train and test sets of $5000$ samples each. MNIST consists of $28 \times 28$ single-channel images; for the purpose of our experiment, we vectorize the images, resulting in samples of $784$ dimensions. K-Means is used as our unsupervised clustering algorithm with K$=10$. We use the NMI metric$^{[5]}$, which, loosely speaking, can be thought of as an accuracy-like metric for clustering.
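For reference, a rough sketch of this setup is given below. Loading MNIST via `fetch_openml` and the exact split parameters are my assumptions; the original notebook may differ.

```python
# Rough sketch of the experimental setup (a single 5000-sample stratified
# subset is shown for brevity; the data-loading route is an assumption).
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

X_mnist, y_mnist = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

X_sub, _, y_sub, _ = train_test_split(
    X_mnist, y_mnist, train_size=5000, stratify=y_mnist, random_state=0
)

# Cluster the 784-dimensional vectors with K-Means (K = 10) and score the
# clustering against the true digit labels using NMI.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_sub)
print("NMI:", normalized_mutual_info_score(y_sub, labels))
```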
As can be seen in the notebook, our feature selection algorithm achieves the maximum NMI while using only $380$ of the $784$ features. The graph at the bottom of the notebook shows how NMI evolves as progressively more of the important features are fed to the clustering algorithm.
Results
For categorical features, one cannot simply perform regression since there is no sense of order; for such features, classification needs to be performed instead. It would also be convenient to rely on the feature importance scores to decide how many top features should be kept for an unsupervised task. Thus, to reduce any coding hassle on your side, we developed a Python library called FRUFS (Feature Relevance based Unsupervised Feature Selection) that can be installed via pip and used just like any other scikit-learn model. All information about this library is available at the following GitHub repository.
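For a sense of how it is meant to be used, here is a rough usage sketch. The constructor arguments shown (`model_r`, `k`, `n_jobs`) are recalled from the README and may differ slightly; treat the GitHub repository as the authoritative reference for the actual API.

```python
# Rough usage sketch of the FRUFS library (pip install FRUFS).
# Argument names below are assumptions recalled from the README.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from FRUFS import FRUFS

df = pd.read_csv("data.csv")   # hypothetical unlabeled dataset

model = FRUFS(
    model_r=DecisionTreeRegressor(),  # supervised model used to predict each continuous feature
    k=0.5,                            # fraction of top features to keep (assumed semantics)
    n_jobs=-1,
)
df_pruned = model.fit_transform(df)   # behaves like a scikit-learn transformer
```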
FRUFS was evaluated on four different datasets in both unsupervised and supervised settings. The results and corresponding Jupyter notebooks are available in Table 1. For the unsupervised tasks we used the NMI metric, while for the supervised ones we used accuracy and the F1-score. As can be seen from Table 1, our method does indeed outperform the baseline that uses all features.
Table 1: Performance of FRUFS on multiple datasets
Analysis
To understand the efficacy of our algorithm, let us visually inspect the feature rankings to make sure that the selected feature subset makes sense, using the MNIST example. Every feature selected by our algorithm is assigned its corresponding feature importance score, while the remaining features are set to $0$. This produces a mask in which the brighter parts indicate important pixels. To judge whether this mask is sensible, we generate a simple baseline: an image that is the average of all images in the MNIST dataset. Here as well, the brighter pixels denote important features.
Fig. 1(a): Feature importance mask generated by FRUFS. Here, pixels are treated as features. Brighter pixels denote important features while darker ones denote useless features.
Fig. 1(b): Feature importance mask generated from the dataset by averaging all images. Here, pixels are treated as features. Brighter pixels denote important features while black pixels denote unnecessary features.
The mask generated by our algorithm in Fig. 1(a) closely matches the pixel importance computed from the dataset in Fig. 1(b). In fact, the black pixel in the centre of Fig. 1(a) corresponds almost exactly to the dim patch in Fig. 1(b). Thus, our algorithm is clearly ordering the features correctly!
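For readers who want to reproduce this comparison, here is a rough plotting sketch. It assumes a per-pixel importance vector (named `importance_784` here, a placeholder) produced by the procedure above, and the MNIST sub-sample `X_sub` from the earlier snippet.

```python
# Sketch of the visual sanity check: reshape the 784-dimensional importance
# vector into a 28x28 mask and compare it with the mean MNIST image.
import numpy as np
import matplotlib.pyplot as plt

importance_784 = np.zeros(784)   # placeholder; replace with the actual importance vector

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(importance_784.reshape(28, 28), cmap="gray")
axes[0].set_title("Feature importance mask")
axes[1].imshow(X_sub.mean(axis=0).reshape(28, 28), cmap="gray")
axes[1].set_title("Mean of all images")
for ax in axes:
    ax.axis("off")
plt.show()
```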
Conclusion
We know that supervised ML techniques such as gradient boosting and neural networks provide state-of-the-art results in a variety of tasks because they can discover underlying interactions between features and the target. So it was quite surprising that, in our search for a research paper proposing a modern unsupervised feature selection algorithm, we did not find any method that utilizes supervised algorithms. Hence, in this article, we developed FRUFS (Feature Relevance based Unsupervised Feature Selection) to fill that void.
References
- Xu, Zhixiang, Gao Huang, Kilian Q. Weinberger, and Alice X. Zheng. "Gradient boosted feature selection." In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 522-531. 2014.
- Hassan, Atif, Jiaul H. Paik, Swanand Khare, and Syed Asif Hassan. "PPFS: Predictive Permutation Feature Selection." arXiv preprint arXiv:2110.10713 (2021).
- Ng, Andrew. "CS229 Lecture notes." CS229 Lecture notes 1, no. 1 (2000): 1-3. (Page 14)
- Zhu, Pengfei, Wangmeng Zuo, Lei Zhang, Qinghua Hu, and Simon C. K. Shiu. "Unsupervised feature selection by regularized self-representation." Pattern Recognition 48, no. 2 (2015): 438-446.
- NMI metric as available in Scikit-Learn.
- LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
- Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. "Classification and Regression Trees." Wadsworth, 1984.
- Sigillito, V. G., S. P. Wing, L. V. Hutton, and K. B. Baker. "Classification of radar returns from the ionosphere using neural networks." Johns Hopkins APL Technical Digest 10 (1989): 262-266.
- Kohavi, Ron. "Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid." In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.