Understand Feature Selection in Machine Learning with Python




Techniques of choosing the best set of features from the data


Feature Selection and its types

We all work on bucket loads of data, but not every column is important for building our model.

For example, consider a dataset of students with features like name, age, sex, hours of study, and school name. If you must build a model that predicts the score of students, it is obvious that the feature ‘hours of study’ is the one that helps compared to the others.

When you work on a large dataset, it is not as easy to judge the importance of a feature as in the example above. So, we need a technique that does this automatically for us. Feature selection is one such method.

Feature selection, as the name says, is the technique of choosing the best set of features from the dataset to build a good predictive model. This process is also termed variable selection or attribute selection.


Importance and Advantages of Feature Selection

We all know how important data cleaning is in order to get an accurate model. Feature selection is just as significant. A large number of unwanted features can lead to many problems, such as:

  • Unnecessary allocation of resources to irrelevant information.
  • Longer training time for the algorithm.
  • Unnecessary data may also carry noise, leading to over-fitting models.

So, when feature selection is applied before training the model, the model is more accurate, trains faster, and is less prone to over-fitting.

There are many types of feature selection methods such as:

  • Filter method.
  • Wrapper method.
  • Embedded method.

Filter method

This is a type of feature selection that is done as a pre-processing step. It ranks each variable according to its importance. It is also called a uni-variate method, as it works on one feature at a time. This is the simplest and easiest of the three methods.

This method can be achieved using many techniques like the Chi-square test, Information Gain, Fisher’s Score, and correlation coefficient.
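As a minimal sketch of the filter approach (my own illustrative example, using scikit-learn's built-in iris data rather than the article's dataset), SelectKBest scores every feature independently with the chi-square test and keeps the top k:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# load a small example dataset with 4 features
x, y = load_iris(return_X_y=True)

# score each feature independently with the chi-square test
# and keep only the 2 highest-scoring features
selector = SelectKBest(score_func=chi2, k=2)
x_new = selector.fit_transform(x, y)

print(x.shape, "->", x_new.shape)   # (150, 4) -> (150, 2)
print(selector.get_support())       # boolean mask of the kept features
```

Because each feature is scored on its own, this runs quickly even on wide datasets, which is why filter methods suit a pre-processing step.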

Example using python:

Here’s a chunk of code illustrating the correlation coefficient using a heat-map.

Here I am using an advertising dataset to create a model that will predict whether a user clicks on an ad, based on the data given.

Now, I’ll use a heat map from the seaborn library to visualize the correlations between the features in the dataset.

import seaborn as sns
sns.heatmap(data1.corr(), annot=True)  # data1 is the advertising DataFrame loaded earlier

The correlation coefficient ranges from -1 to 1; values closest to 1 (or -1) indicate the strongest correlation. Here the feature ‘Daily Time Spent on Site’ is the most correlated with the target. We can drop the features that have a low correlation coefficient with the target variable. When the predictor variables are strongly correlated with one another, that is called multi-collinearity.
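To make "drop the features with a low correlation with the target" concrete, here is a small sketch on synthetic data (the column names and the 0.2 threshold are my own illustrative choices, not from the article):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time_on_site": rng.normal(65, 10, 500),
    "age": rng.normal(35, 8, 500),
    "noise": rng.normal(0, 1, 500),
})
# binary target driven only by time_on_site
df["clicked"] = (df["time_on_site"] > 65).astype(int)

# absolute correlation of each feature with the target
target_corr = df.corr()["clicked"].drop("clicked").abs()

# keep only the features whose correlation with the target
# exceeds the chosen threshold
selected = target_corr[target_corr > 0.2].index.tolist()
print(selected)  # only "time_on_site" survives
```

The threshold is a judgment call: too high and you discard weak but useful signals, too low and the noise columns stay in.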

Wrapper method

The wrapper method trains and evaluates a model on candidate subsets of features until the evaluation criterion is met. Its greedy variants add or remove one feature at a time, while the exhaustive variant searches all possible combinations of features.

This method has techniques like forward feature selection, backward feature elimination, exhaustive feature selection, and recursive feature elimination.
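As a sketch of greedy forward selection (my own example using scikit-learn's SequentialFeatureSelector on the built-in iris data, not the article's dataset):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

x, y = load_iris(return_X_y=True)

# start with no features and greedily add the one that most improves
# cross-validated accuracy, stopping once 2 features are selected
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(x, y)
print(sfs.get_support())  # boolean mask of the chosen features
```

Setting direction="backward" instead gives backward feature elimination: start from all features and greedily remove the least useful one.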

Example using python:

Here is a bit of code demonstrating recursive feature elimination (RFE) on the diabetes dataset. It finds the most important features using the feature-selection module from the scikit-learn library.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

dataframe = pd.read_csv("diabetes.csv")
array = dataframe.values
x = array[:, 0:8]
y = array[:, 8]

model = LogisticRegression(solver="liblinear")
# recursively eliminate features until the top 5 remain
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(x, y)
print("Num Features: %s" % (fit.n_features_))
# support_ shows True for the features that are kept and False for the rest
print("Selected Features: %s" % (fit.support_))
# ranking_ gives the rank of each feature (1 = selected)
print("Feature Ranking: %s" % (fit.ranking_))
Num Features: 5
Selected Features: [True True False True True False False True]
Feature Ranking: [1 1 2 3 4 1 1 1]

Embedded method

Machine learning models that have feature selection built into the model itself are called embedded or intrinsic methods. Built-in feature selection means that the model keeps only the predictors that help maximize accuracy.

Models that use embedded methods include Lasso regression, Ridge regression, Decision trees, and the Random Forest algorithm.
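For instance, a Random Forest exposes impurity-based feature importances as a by-product of training; here is a minimal sketch on the built-in iris data (my own choice of dataset, not the article's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
x, y = data.data, data.target

# the forest computes feature importances while fitting, so the
# selection signal comes "built in" with the trained model
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```

The importances sum to 1, so you can rank the features and keep only those above a cutoff, with no separate selection pass.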

Example using python:

Here’s a chunk of code applying L1 regularization, the penalty behind Lasso regression, to a logistic regression model.

import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

dataframe = pd.read_csv("diabetes.csv")
array = dataframe.values
x = array[:, 0:8]
y = array[:, 8]

# set the regularization parameter C=1; the L1 penalty drives the
# coefficients of unhelpful features to exactly zero
logistic = LogisticRegression(C=1, penalty="l1", solver="liblinear",
                              random_state=7).fit(x, y)
model = SelectFromModel(logistic, prefit=True)
# dropped columns have coefficients of zero; keep the other columns
x_new = model.transform(x)
print(model.get_support())  # True for the features that survive

There is also a hybrid method that combines the filter and wrapper methods. Compared to the other methods, hybrid approaches often provide better accuracy.


In this article, we mainly discussed feature selection, its importance, and its types. I hope you got an insight into the topic.


