Feature Selection Methods and How to Choose Them
This article was first published on Neptune AI’s blog.
Have you ever found yourself sitting in front of the screen wondering what kind of features will help your machine learning model learn its task best? I bet you have. Data preparation tends to consume vast amounts of data scientists’ and machine learning engineers’ time and energy, and making the data ready to be fed to the learning algorithms is no small feat.
One of the crucial steps in the data preparation pipeline is feature selection. You might know the popular adage: garbage in, garbage out. What you feed your models with is at least as important as the models themselves, if not more so.
In this article, we will look at the place of feature selection among other feature-related tasks in the data preparation pipeline and discuss the multiple reasons why it is so crucial for any machine learning project’s success. Next, we will go over different approaches to feature selection and discuss some tricks and tips to improve their results. Then, we will take a look under the hood of Boruta, the state-of-the-art feature selection algorithm, and finally look at a clever way to combine different feature selection methods. Let’s dive in!
What is feature selection, and what is it not?
Let’s kick off by defining our object of interest.
What is feature selection? In a nutshell, it is the process of selecting the subset of features to be used for training a machine learning model.
This is what feature selection is, but it is equally important to understand what feature selection is not: it is neither feature extraction nor feature engineering, nor is it dimensionality reduction.
Feature extraction and feature engineering are two terms describing the same process of creating new features from the existing ones based on domain knowledge. This yields more features than were originally there, and it should be performed before feature selection. First, we can do feature extraction to come up with many potentially useful features, and then we perform feature selection in order to pick the best subset that will indeed improve the model’s performance.
Dimensionality reduction is yet another concept. It is somewhat similar to feature selection as both aim at reducing the number of features. However, they differ significantly in how they achieve this goal. While feature selection chooses a subset of original features to keep and discards others, dimensionality reduction techniques create projections of original features onto a fewer-dimensional space, thus creating a completely new set of features. Dimensionality reduction, if desired, should be run after feature selection, but in practice, it is either one or the other.
Now we know what feature selection is and how it corresponds to other feature-related data preparation tasks. But why do we even need it?
7 reasons why we need feature selection
A popular claim is that modern machine learning techniques do well without feature selection. After all, a model should be able to learn that particular features are useless and it should focus on the others, right?
Well, this reasoning makes sense to some extent. Linear models could, in theory, assign a weight of zero to useless features and tree-based models should learn quickly not to make splits on them. In practice, however, many things can go wrong with training when the inputs are irrelevant or redundant — more on these two terms later. On top of this, there are many other reasons why simply dumping all the available features into the model might not be a good idea. Let’s look at the seven most prominent ones.
1: Irrelevant and redundant features
Some features might be irrelevant to the problem at hand. This means they have no relation with the target variable and are completely unrelated to the task the model is designed to solve. Discarding irrelevant features will prevent the model from picking up on spurious correlations it might carry, thus fending off overfitting.
Redundant features are a different animal. Redundancy implies that two or more features share the same information and all but one can be safely discarded without information loss. Note that a relevant feature can also be redundant in the presence of another relevant feature. Redundant features should be dropped, as they might pose many problems during training, such as multicollinearity in linear models.
2: Curse of dimensionality
Feature selection techniques are especially indispensable in scenarios with many features but few training examples. Such cases suffer from what is known as the curse of dimensionality: in a very high-dimensional space, each training example is so far from all the other examples that the model cannot learn any useful patterns. The solution is to decrease the dimensionality of the feature space, for instance via feature selection.
3: Training time
The more features, the more training time. The specifics of this trade-off depend on the particular learning algorithm being used, but in situations where retraining needs to happen in real-time, one might need to limit oneself to a couple of best features.
4: Deployment effort
The more features, the more complex the machine learning system becomes in production. This poses multiple risks, including but not limited to high maintenance effort, entanglement, undeclared consumers, or correction cascades.
5: Explainability
With too many features, we lose the explainability of the model. While not always the primary modeling goal, interpreting and explaining the model’s results is often important and in some regulated domains might even constitute a legal requirement.
6: Occam’s Razor
According to this so-called law of parsimony, simpler models should be preferred over the more complex ones as long as their performance is the same. This also has to do with the machine learning engineer’s nemesis, overfitting. Less complex models are less likely to overfit the data.
7: Data-model compatibility
Finally, there is the issue of data-model compatibility. While in principle the approach should be data-first, which means collecting and preparing high-quality data, and then choosing a model which works well on this data, real life may have it the other way around.
You might be trying to reproduce a particular research paper, or your boss might have suggested using a particular model. In this model-first approach, you might be forced to select features that are compatible with the model you set out to train. For instance, many models don’t work with missing values in the data. Unless you know your imputation methods well, you might need to drop the incomplete features.
Different approaches to feature selection
All the different approaches to feature selection can be grouped into four families of methods, each coming with its pros and cons. There are unsupervised and supervised methods. The latter can be further divided into wrapper, filter, and embedded methods. Let’s discuss them one by one.
Just as unsupervised learning looks for patterns in unlabeled data, unsupervised feature selection methods do not make use of any labels. In other words, they don’t need access to the target variable of the machine learning model.
How can we claim a feature to be unimportant for the model without analyzing its relation to the model’s target, you might ask. Well, in some cases this is possible. We might want to discard the features with:
- Zero or near-zero variance. Features that are (almost) constant provide little information to learn from, and thus are irrelevant.
- Many missing values. While dropping incomplete features is not the preferred way to handle missing data, it is often a good start, and if too many entries are missing, it might be the only sensible thing to do, since such features are likely irrelevant.
- High multicollinearity; multicollinearity means a strong correlation between different features, which might signal redundancy issues.
Wrapper methods refer to a family of supervised feature selection methods that use a model to score different subsets of features and select the best one. Each new subset is used to train a model whose performance is then evaluated on a hold-out set. The feature subset that yields the best model performance is selected.
A major advantage of wrapper methods is the fact that they tend to provide the best-performing feature set for the particular chosen type of model.
At the same time, however, it is a limitation. Wrapper methods are likely to overfit to the model type and the feature subsets they produce might not generalize should one want to try them with a different model.
Another significant disadvantage of wrapper methods is their large computational needs. They require training a large number of models which might require time and computing power.
Popular wrapper methods include:
- Backward selection, in which we start with a full model comprising all available features. In subsequent iterations, we remove one feature at a time, always the one whose removal yields the largest gain (or the smallest loss) in a model performance metric, until we reach the desired number of features.
- Forward selection, which works in the opposite direction: we start from a null model with zero features, and add them greedily one at a time to maximize the model’s performance.
- Recursive Feature Elimination, or RFE, which is similar in spirit to backward selection. It also starts with a full model and iteratively eliminates the features one by one. The difference is in the way the features to discard are chosen. Instead of relying on a model performance metric from a hold-out set, RFE makes its decision based on feature importance extracted from the model. This could be feature weights in linear models, impurity decrease in tree-based models, or permutation importance (which is applicable to any model type).
Another member of the supervised family is filter methods. They can be thought of as a simpler and faster to compute alternative for wrappers. In order to evaluate the usefulness of each feature, they simply analyze its statistical relation with the model’s target, using measures such as correlation or mutual information as a proxy for the model performance metric.
Not only are filter methods faster than wrappers, but they are also more general since they are model-agnostic; they won’t overfit to any particular algorithm. They are also pretty easy to interpret: a feature is discarded if it has no statistical relationship to the target.
On the other hand, however, filter methods have one major drawback. They look at each feature in isolation, evaluating its relation to the target. This makes them prone to discarding useful features that are weak predictors of the target on their own, but add a lot of value to the model when combined with other features.
The final approach to feature selection we will discuss is to embed it into the learning algorithm itself. The idea is to combine the best of both worlds: speed of the filters, while getting the best subset for the particular model just like from a wrapper.
The flagship example is LASSO regression. It is basically just regularized linear regression, in which a penalty on the absolute values of the feature weights is added to the loss function, shrinking the weights towards zero. As a result, many features end up with weights of exactly zero, meaning they are discarded from the model, while the rest with non-zero weights are included.
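To make the idea concrete, here is a minimal sketch on synthetic data in which only the first two of six features drive the target. The data, the alpha value, and the variable names are assumptions chosen for illustration, not a recommended setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 6))
# Only features 0 and 1 influence the target; features 2-5 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

# Scale the features so the penalty treats all weights comparably.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls the regularization strength (illustrative value).
lasso = Lasso(alpha=0.2).fit(X_scaled, y)

# Features whose weight was not shrunk to exactly zero are the selected ones.
selected = np.flatnonzero(lasso.coef_)
print(lasso.coef_)
print(selected)
```

A larger alpha discards more features, so in practice it is usually tuned, for instance with cross-validation.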
The problem with embedded methods is that there are not that many algorithms out there with feature selection built-in. Another example next to LASSO comes from computer vision: auto-encoders with a bottleneck layer force the network to disregard some of the least useful features of the image and focus on the most important ones. Other than that, there aren’t many examples.
Filter methods: tricks & tips
As we have seen, wrapper methods are slow, computationally heavy, and model-specific, and there are not many embedded methods. As a result, filters are often the go-to family of feature selection methods.
At the same time, they require the most expertise and attention to detail. While embedded methods work out of the box and wrappers are fairly simple to implement (especially when one just calls scikit-learn functions), filters ask for a pinch of statistical sophistication. Let us now turn our attention to filter methods and discuss them in more detail.
Brush up on your statistics
Filter methods need to evaluate the statistical relationship between each feature and the target. As simple as it sounds, there’s more to it than meets the eye. There are many statistical methods to measure the relationship between two variables. To know which one to choose in a particular case, we need to think back to our first STATS101 class and brush up on data measurement levels.
Data measurement levels
In a nutshell, a variable’s measurement level describes the true meaning of the data and the types of mathematical operations that make sense for these data. There are four measurement levels: nominal, ordinal, interval, and ratio.
Nominal features, such as color (“red”, “green” or “blue”) have no ordering between the values; they simply group observations based on them.
Ordinal features, such as education level (“primary”, “secondary”, “tertiary”) denote order, but not the differences between particular levels (we cannot say that the difference between “primary” and “secondary” is the same as the one between “secondary” and “tertiary”).
Interval features, such as temperature in degrees Celsius, keep the intervals equal (the difference between 25 and 20 degrees is the same as between 30 and 25).
Finally, ratio features, such as price in USD, are characterized by a meaningful zero, which allows us to calculate ratios between two data points: we can say that $6 is twice as much as $2.
In order to choose the right statistical tool to measure the relationship between two variables, we need to think about their measurement levels.
Measuring correlations for various data types
When the two variables we compare, i.e. one of the features and the target, are both either interval or ratio, we are allowed to use the most popular correlation measure out there: the Pearson correlation, also known as Pearson’s r.
This is great, but Pearson correlation comes with two drawbacks: it assumes both variables are normally distributed, and it only measures the linear correlation between them. When the correlation is non-linear, Pearson’s r won’t detect it, even if it’s really strong.
You might have heard about the Datasaurus dataset compiled by Alberto Cairo. It consists of 13 pairs of variables, each with the same very weak Pearson correlation of -0.06. As it quickly becomes obvious once we plot them, the pairs are actually correlated pretty strongly, albeit in a non-linear way.
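The same effect is easy to reproduce on a toy example. In the sketch below (synthetic data, chosen purely for illustration), y is fully determined by x, yet Pearson’s r is essentially zero because the relation is not linear.

```python
import numpy as np
from scipy import stats

# x is symmetric around zero and y is a deterministic function of x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

r, p_value = stats.pearsonr(x, y)
print(abs(r))  # numerically indistinguishable from zero
```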
When non-linear relations are to be expected, one of the alternatives to Pearson’s correlation should be taken into account. The two most popular ones are:
- Spearman’s rank correlation (Spearman’s Rho),
- Kendall rank correlation (Kendall Tau).
Spearman’s rank correlation is an alternative to Pearson’s correlation for ratio/interval variables. As the name suggests, it only looks at the rank values, i.e. it compares the two variables in terms of the relative positions of particular data points within the variables. It is able to capture non-linear relations, as long as they are monotonic, but there are no free lunches: we lose some information due to only considering the rank instead of the exact data points.
Another rank-based correlation measure is the Kendall rank correlation. It is similar in spirit to Spearman’s correlation but formulated in a slightly different way (Kendall’s calculations are based on concordant and discordant pairs of values, as opposed to Spearman’s calculations based on deviations). Kendall is often regarded as more robust to outliers in the data.
If at least one of the compared variables is of ordinal type, Spearman’s or Kendall rank correlations are the way to go. Due to the fact that ordinal data contains only the information on the ranks, they are both a perfect fit, while Pearson’s linear correlation is of little use.
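As a quick illustration of how the rank-based measures differ from Pearson’s r, consider a strictly increasing but strongly non-linear relation. The data below are synthetic and only meant as a sketch.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 10.0, 50)
y = np.exp(x)  # strictly increasing, far from linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

# The ranks of x and y agree perfectly, so both rank correlations are at
# their maximum, while Pearson's r is noticeably lower.
print(pearson_r, spearman_rho, kendall_tau)
```

Note that a non-monotonic relation (such as a parabola centered at zero) would fool the rank-based measures as well.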
Another scenario is when both variables are nominal. In this case, we can choose from a couple of different correlation measures:
- Cramer’s V, which captures the association between the two variables into a number ranging from zero (no association) to one (one variable completely determined by the other).
- Chi-Squared statistic, commonly used for testing for dependence between two variables. Lack of dependence suggests the particular feature is not useful.
- Mutual information, a measure of mutual dependence between two variables that seeks to quantify the amount of information that one can extract from one variable about the other.
Which one to choose? There is no one-size-fits-all answer. As usual, each method comes with some pros and cons. Cramer’s V is known to overestimate the association’s strength. Mutual information, being a non-parametric method, requires larger data samples to yield reliable results. Finally, the Chi-Squared does not provide information about the strength of the relationship, but rather only whether it exists or not.
We have discussed scenarios in which the two variables we compare are both interval or ratio, when at least one of them is ordinal, and when we compare two nominal variables. The final possible encounter is to compare a nominal variable with a non-nominal one.
In such cases, the two most widely-used correlation measures are:
- ANOVA F-score, a chi-squared equivalent for the case when one of the variables is continuous while the other is nominal,
- Point-biserial correlation, a correlation measure specially designed to evaluate the relationship between a binary and a continuous variable.
Once again, there is no silver bullet. The F-score only captures linear relations, while point-biserial correlation makes some strong normality assumptions that might not hold in practice, undermining its results.
Having said all that, which method should one choose in a particular case? The table below will hopefully provide some guidance in this matter.
Transform your variables
As we have seen, each correlation measure that can be used with filter feature selection methods is suited to a particular data type. At the same time, each comes with some advantages and drawbacks. What if you thought a particular method would be a good fit for your use case, but it is not suited for your data type, or its assumptions are not met? No worries — a clever data transformation will save the day.
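For instance, imagine you have a ratio feature in your data set and a ratio target. You are worried, however, that using Pearson correlation will not work, since the feature is non-normally distributed. In such a case, you could apply a normalizing transformation, such as a logarithm or a Box-Cox power transform, to bring the feature’s distribution closer to normal. Keep in mind that a plain z-score only shifts and rescales the feature, so on its own it changes neither the shape of the distribution nor the value of Pearson’s r.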
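A quick way to check whether a transformation actually changes the shape of a distribution is to compare skewness before and after. The sketch below uses synthetic log-normal data (an assumption for illustration) and contrasts z-score standardization with the Box-Cox option of scikit-learn’s PowerTransformer.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
# A strictly positive, heavily right-skewed feature (Box-Cox needs positive values).
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000).reshape(-1, 1)

z_scored = StandardScaler().fit_transform(feature)
power = PowerTransformer(method="box-cox").fit_transform(feature)

skew_raw = stats.skew(feature.ravel())
skew_z = stats.skew(z_scored.ravel())      # same shape as the raw feature
skew_power = stats.skew(power.ravel())     # much closer to symmetric

print(skew_raw, skew_z, skew_power)
```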
Now imagine a different situation: the target is nominal, and the feature is a ratio. Based on domain knowledge, you expect the relation between them to be non-linear, which prohibits the use of the F-score. At the same time, the feature is non-normal, which rules out using point-biserial correlation. What to do? You could discretize the feature into a number of bins and try one of the nominal-nominal measures, such as mutual information.
The takeaway from these two examples is that by transforming the data we can unlock access to more correlation measures than are originally viable.
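The discretization idea can be sketched as follows. The data are synthetic, the helper quantile_bins is a hypothetical name introduced here, and ten quantile bins is an arbitrary choice. The target depends on the feature in a non-linear way: it is 1 whenever the feature is far from zero in either direction.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def quantile_bins(values, n_bins=10):
    """Discretize a 1-D array into n_bins roughly equal-sized bins (labels 0..n_bins-1)."""
    inner_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(values, inner_edges)


rng = np.random.default_rng(0)
feature = rng.uniform(-3.0, 3.0, size=2000)
noise = rng.uniform(-3.0, 3.0, size=2000)      # unrelated to the target
target = (np.abs(feature) > 1.5).astype(int)   # non-linear, non-monotonic link

X_binned = np.column_stack([quantile_bins(feature), quantile_bins(noise)])

# discrete_features=True tells scikit-learn to treat the binned columns as nominal.
mi = mutual_info_classif(X_binned, target, discrete_features=True)
print(mi)  # clearly positive for the binned feature, close to zero for the noise
```

The number of bins is a trade-off: too few bins can hide the relation, while too many leave few observations per bin and make the estimate noisy.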
Scikit-Learn and beyond
Just like most other machine learning tasks, feature selection is served very well by the scikit-learn package, and in particular by its sklearn.feature_selection module. However, in some cases, one needs to reach out to other places. For the remainder of the article, let’s denote by X an array or data frame with all potential features as columns and observations in rows, and by y the targets vector.
Let’s start with the unsupervised feature selection methods:
- The sklearn.feature_selection.VarianceThreshold transformer will by default remove all zero-variance features. We can also pass it a threshold to make it remove features whose variance is lower than the threshold.
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.05)
X_selection = sel.fit_transform(X)
- In order to drop the columns with missing values, pandas’ dropna(axis=1) method can be used on the data frame.
X_selection = X.dropna(axis=1)
- To remove features with high multicollinearity, we first need to measure it. A popular multicollinearity measure is the Variance Inflation Factor, or VIF. It is implemented in the statsmodels package.
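from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF of each column of the data frame X, computed against all the other columns
vif_scores = [variance_inflation_factor(X.values, feature) for feature in range(len(X.columns))]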
By convention, columns with a VIF larger than 10 are considered as suffering from multicollinearity, but another threshold may be chosen if it seems more reasonable.
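To show what the VIF measures and how a threshold can be applied, here is a self-contained sketch that computes it by hand as 1 / (1 - R²), where R² comes from regressing one feature on all the others. The helper names vif and drop_high_vif are hypothetical. Removing features one at a time and recomputing is a design choice made here, since dropping every column above the threshold at once would discard both members of a redundant pair. Because this sketch fits an intercept, its values may differ from the statsmodels function when X contains no constant column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on all other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / max(1.0 - r2, 1e-12)  # guard against division by zero


def drop_high_vif(X, threshold=10.0):
    """Repeatedly drop the column with the largest VIF until all are below threshold."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        scores = [vif(X[:, cols], j) for j in range(len(cols))]
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        cols.pop(worst)
    return cols  # indices of the columns to keep


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a`, hence redundant
X_demo = np.column_stack([a, b, c])

kept = drop_high_vif(X_demo)
X_selection = X_demo[:, kept]
print(kept)
```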
When it comes to wrapper methods, scikit-learn has got us covered:
- Backward and forward feature selection can be implemented with the SequentialFeatureSelector transformer. For instance, in order to use the k-Nearest-Neighbor classifier as the scoring model in forward selection, we could use the following code snippet:
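from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3, direction="forward")
sfs.fit(X, y)  # the selector has to be fitted before it can transform
X_selection = sfs.transform(X)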
- Recursive Feature Elimination is implemented in a very similar fashion. Here is a snippet implementing RFE based on feature importance from a Support Vector Classifier.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
svc = SVC(kernel="linear")
rfe = RFE(svc, n_features_to_select=3)
rfe.fit(X, y)  # the selector has to be fitted before it can transform
X_selection = rfe.transform(X)
Let’s now take a look at implementing various filter methods. These will need some more glue code to implement. First, we need to compute the appropriate correlation measure between each feature and the target. Then, we would sort all features according to the results and keep the desired number (top-K, or top-30%) of the ones with the strongest correlation. Luckily, scikit-learn provides some utilities to help in this endeavor.
- To keep the top two features with the strongest Pearson correlation with the target we can run:
from sklearn.feature_selection import r_regression, SelectKBest
X_selection = SelectKBest(r_regression, k=2).fit_transform(X, y)
- Similarly, to keep the top 30% of features, we would run:
from sklearn.feature_selection import r_regression, SelectPercentile
X_selection = SelectPercentile(r_regression, percentile=30).fit_transform(X, y)
SelectKBest and SelectPercentile will also work with custom or non-scikit-learn correlation measures, as long as they return a vector of length equal to the number of features, with a number for each feature denoting the strength of its association with the target. Let’s then take a look at how to calculate all the different measures we have discussed previously.
- Spearman’s Rho, Kendall Tau, and point-biserial correlation are all available in the scipy package. This is how to get their values for each feature in X.
from scipy import stats
# X is assumed to be a NumPy array here; X.shape[1] is the number of features
rho_corr = [stats.spearmanr(X[:, f], y).correlation for f in range(X.shape[1])]
tau_corr = [stats.kendalltau(X[:, f], y).correlation for f in range(X.shape[1])]
pbs_corr = [stats.pointbiserialr(X[:, f], y).correlation for f in range(X.shape[1])]
- Chi-Squared, Mutual Information, and ANOVA F-score are all in scikit-learn. Note that mutual information has a separate implementation, depending on whether the target is nominal or not.
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import f_classif
# chi2 and f_classif return a (scores, p-values) pair; keep only the scores
chi2_corr = chi2(X, y)[0]
f_corr = f_classif(X, y)[0]
mi_reg_corr = mutual_info_regression(X, y)
mi_class_corr = mutual_info_classif(X, y)
- Cramer’s V can be obtained from a recent scipy version (1.7.0 or higher).
from scipy.stats.contingency import association, crosstab
# association() expects a contingency table, so each feature is cross-tabulated with y first
v_corr = [association(crosstab(X[:, f], y)[1], method="cramer") for f in range(X.shape[1])]
Take no prisoners: Boruta needs no human input
When talking about feature selection, we cannot fail to mention Boruta. Back in 2010 when it was first published as an R package, it was quick to become famous as a revolutionary feature selection algorithm.
All the other methods we have discussed so far require a human to make an arbitrary decision. Unsupervised methods need us to set the variance or VIF threshold for feature removal. Wrappers require us to decide on the number of features we want to keep upfront. Filters need us to choose the correlation measure and the number of features to keep as well. Embedded methods have us select regularization strength. Boruta needs none of these.
Boruta is a simple yet statistically elegant algorithm. It uses feature importance measures from a random forest model to select the best subset of features, and it does so by introducing two clever ideas.
First, the importance scores of features are not compared to one another. Rather, the importance of each feature competes against the importance of its randomized version. To achieve this, Boruta randomly permutes each feature to construct its “shadow” version. Then, a random forest is trained on the whole feature set, including the new shadow features. The maximum feature importance among the shadow features serves as a threshold. Of the original features, only those whose importance is above this threshold score a point. In other words, only features that are more important than random vectors are awarded points.
The process described above is repeated iteratively multiple times. Since each time the random permutation is different, the threshold also differs and so different features might score points. After multiple iterations, each of the original features has some number of points to its name.
The final step is to decide, based on the number of points each feature scored, whether it should be kept or discarded. Here enters the other of Boruta’s two clever ideas: we can model the scores using a binomial distribution.
Each iteration is assumed to be a separate trial. If the feature scored in a given iteration, it is a vote to keep it; if it did not, it’s a vote to discard it. A priori, we have no idea whatsoever whether a feature is important or not, so the expected percentage of trials in which the feature scores is 50%. Hence, we can model the number of points scored with a binomial distribution with p=0.5. If our feature scores significantly more times than this, it is deemed important and kept. If it scores significantly fewer times, it’s deemed unimportant and discarded. If it scores in around 50% of trials, its status is unresolved, but for the sake of being conservative, we can keep it.
For example, if we let Boruta run for 100 trials, the expected score of each feature would be 50. If it’s closer to zero, we discard it, if it’s closer to 100, we keep it.
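The two ideas can be sketched in a few lines. This is a simplified illustration of the procedure described above, not the BorutaPy API and not the package’s exact algorithm, which differs in details. The function name boruta_sketch, the synthetic data, the number of trials, and the significance level are all assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier


def boruta_sketch(X, y, n_trials=20, alpha=0.05, seed=0):
    """Simplified Boruta-style selection: shadow features plus a binomial test."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for trial in range(n_trials):
        # Shadow features: every column shuffled independently, which breaks
        # any relation with the target while keeping the marginal distribution.
        shadows = rng.permuted(X, axis=0)
        forest = RandomForestClassifier(n_estimators=50, random_state=trial)
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        threshold = importances[n_features:].max()  # best shadow feature
        hits += importances[:n_features] > threshold
    # Under the p=0.5 null model: probability of scoring at least / at most
    # this many points in n_trials.
    p_keep = stats.binom.sf(hits - 1, n_trials, 0.5)
    p_drop = stats.binom.cdf(hits, n_trials, 0.5)
    return hits, p_keep < alpha, p_drop < alpha


rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # only features 0 and 1 matter

hits, keep, drop = boruta_sketch(X_demo, y_demo)
print(hits, keep, drop)
```

Features flagged neither in keep nor in drop correspond to the unresolved case mentioned above.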
Boruta has proven very successful in many Kaggle competitions and is always worth trying out. It has also been successfully used for predicting energy consumption for building heating or predicting air pollution.
There is a very intuitive Python package to implement Boruta, called BorutaPy (now part of scikit-learn-contrib). The package’s GitHub readme demonstrates how easy it is to run feature selection with Boruta.
Build yourself a voting selector
We have discussed many different feature selection methods. Each of them has its strengths and weaknesses, makes its own assumptions, and arrives at its conclusions in a different fashion. Which one to choose? Or do we have to choose? In many cases combining all these different methods together under one roof would make the resulting feature selector stronger than each of its subparts.
One way to do it is inspired by ensembled decision trees. In this class of models, which includes random forests and many popular gradient boosting algorithms, one trains multiple different models and lets them vote on the final prediction. In a similar spirit, we can build ourselves a voting selector.
The idea is simple: implement a couple of feature selection methods we have discussed. Your choice could be guided by your time, computational resources, and data measurement levels. Just run as many different methods as you conveniently can afford.
Then, for each feature, write down the percentage of selection methods that suggest keeping this feature in the data set. If more than 50% of the methods vote to keep the feature, keep it — otherwise, discard it.
The idea behind this approach is that while some methods might make wrong judgments with regard to some of the features due to their intrinsic biases, the ensemble of methods should get the set of useful features right. Let’s see how to implement it in practice!
Let’s build a simple voting selector that ensembles three different feature selection methods: a filter method based on Pearson correlation, an unsupervised method based on multicollinearity, and a wrapper, Recursive Feature Elimination. Let’s take a look at what such a voting selector might look like. Next, we will go over the code to discuss it in detail.
Our VotingSelector class comprises four methods on top of the init constructor. Three of them implement the three feature selection techniques we would like to ensemble: _select_pearson() for Pearson correlation filtering, _select_vif() for the Variance Inflation Factor-based unsupervised approach, and _select_rfe() for the RFE wrapper.
Each of these methods takes the feature matrix X and the targets y as inputs. The VIF-based method will not use the targets, but we use this argument anyway to keep the interface consistent across all methods so that we can conveniently call them in a loop later. On top of that, each method accepts a keyword arguments dictionary which we will use to pass method-dependent parameters.
Having parsed the inputs, each method calls the appropriate sklearn or statsmodels functions which we have discussed before to return the list of feature names to keep.
The voting magic happens in the select() method. There, we simply iterate over the three selection methods, and for each feature, we record whether it should be kept (1) or discarded (0) according to this method. Finally, we take the mean over these votes. For each feature, if this mean is greater than the voting threshold of 0.5 (which, with three methods, means that at least two of them voted to keep the feature), we keep it.
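Let’s see it working in practice. We will load the infamous Boston Housing data, which comes built-in with scikit-learn (note that the load_boston loader was removed in scikit-learn 1.2, so this requires an older version).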
Now, running feature selection is as easy as this:
vs = VotingSelector()
X_selection = vs.select(X, y)
As a result, we get the feature matrix with only three features left.
       ZN  CHAS     RM
0    18.0   0.0  6.575
1     0.0   0.0  6.421
2     0.0   0.0  7.185
3     0.0   0.0  6.998
4     0.0   0.0  7.147
..    ...   ...    ...
501   0.0   0.0  6.593
502   0.0   0.0  6.120
503   0.0   0.0  6.976
504   0.0   0.0  6.794
505   0.0   0.0  6.030
[506 rows x 3 columns]
We can also take a glimpse at how each of our methods has voted by printing the votes:
         CRIM  ZN  INDUS  CHAS  NOX  RM  AGE  DIS  RAD  TAX  PTRATIO  B  LSTAT
pearson     0   1      0     1    0   1    0    1    0    0        0  1      0
vif         1   1      0     1    0   0    0    0    0    0        0  0      0
rfe         0   0      0     1    1   1    0    0    0    0        1  0      1
We might not be happy with only 3 out of the initial 13 columns left. Luckily, we can easily make the selection less restrictive by modifying the parameters of the particular methods. This can be done by simply adding appropriate arguments to the call to select, thanks to how we pass kwargs around.
Pearson and RFE methods need a pre-defined number of features to keep. The default has been 5, but we might want to increase it to 8. We can also modify the VIF threshold, that is the value of the Variance Inflation Factor above which we discard a feature due to multicollinearity. By convention, this threshold is set at 10, but increasing it to, say, 15 will result in more features being kept.
vs = VotingSelector()
X_selection = vs.select(X, y, n_features_to_select=8, vif_threshold=15)
This way, we have seven features left.
Our VotingSelector class is a simple but generic template that you can extend to an arbitrary number of feature selection methods. As a possible extension, you could also treat all the arguments passed to select() as hyperparameters of your modeling pipeline and optimize them so as to maximize the performance of the downstream model.
Feature selection at Big Techs
Large technology companies such as GAFAM and the like, with their thousands of machine learning models in production, are prime examples of how feature selection is operated in the wild. Let’s see what these tech giants have to say about it!
Rules of ML is a handy compilation of best practices in machine learning from around Google. In it, Google’s engineers point out that the number of parameters the model can learn is roughly proportional to the amount of data it has access to. Hence, the less data we have, the more features we need to discard. Their rough guidelines (derived from text-based models) are to use a dozen features with 1000 training examples, or 100,000 features with 10 million training examples.
Another crucial point in the document concerns model deployment issues, which can also affect feature selection. First, your set of features to select from might be constrained by what will be available in production at inference time. You may be forced to drop a great feature from training if it won’t be there for the model when it goes live. Second, some features might be prone to data drift. While the topic of tackling drift is a complex one, sometimes the best solution might be to remove the problematic feature from the model altogether.
Thanks for reading! I hope this overview article has convinced you that feature selection is a crucial step in the data preparation pipeline, and has given you some guidance as to how to approach it. Don’t hesitate to hit me up on social media to discuss the topics covered here, or any other machine learning topics, for that matter. Happy feature selection!