# Feature Selection and Removal in Machine Learning


## Improving model accuracy on high-dimensional data

As we know, features play a crucial role in how well a machine learning algorithm performs prediction in any field.

When the feature set becomes very complex, there is a high chance of multicollinearity, i.e. high correlation between two or more features. This hurts training and can lead to over-fitting or under-fitting of the data.
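One quick way to see multicollinearity in practice is to inspect the pairwise correlation matrix of the features. The sketch below uses made-up column names and synthetic data: two nearly redundant columns are built on purpose, and the correlation matrix flags them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size_sqft = rng.normal(1500, 300, 100)

# "size_sqm" is just "size_sqft" rescaled plus a little noise,
# a classic multicollinearity situation.
df = pd.DataFrame({
    "size_sqft": size_sqft,
    "size_sqm": size_sqft * 0.093 + rng.normal(0, 1, 100),
    "age_years": rng.uniform(0, 50, 100),
})

# Absolute pairwise correlations; values near 1 flag redundant features.
corr = df.corr().abs()
print(corr.loc["size_sqft", "size_sqm"])   # close to 1: redundant pair
print(corr.loc["size_sqft", "age_years"])  # small: independent feature
```

In a real project, one of any highly correlated pair is typically dropped before training.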

There are several methods to select and remove features, as shown below:

**Feature selection methods**

1. Uni-variate selection
2. Selecting from a model

**Feature removal methods**

1. Low variance method
2. Recursive method

## Uni-variate Selection

Uni-variate methods are a group of methods that examine the strength of the relationship between each individual feature and the target feature.

Methods in uni-variate feature selection:

- Select K best
- Select percentile
- Generic uni-variate select
- Pearson correlation

## K Best Method

In this method, we choose the number of features to select based on the highest scores (and lowest p-values) of the features.

An example of this method in Python is shown below:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

feature, target = load_iris(return_X_y=True)
feature.shape
# output: (150, 4)

feature_new = SelectKBest(chi2, k=2).fit_transform(feature, target)
feature_new.shape
# output: (150, 2)
```

One thing to notice here is that we use the chi2 scoring function: it measures how strongly each (non-negative) feature depends on the target feature.
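After fitting, a `SelectKBest` instance also exposes the per-feature `scores_` and `pvalues_`, plus a `get_support()` mask, which helps explain *why* certain features were kept. A small sketch on the same iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

feature, target = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(feature, target)

# A higher chi2 score (and lower p-value) means a stronger
# dependence between that feature and the target.
print(selector.scores_)       # one chi2 score per input feature
print(selector.pvalues_)      # corresponding p-values
print(selector.get_support()) # boolean mask of the two kept features
```

For iris, the two petal measurements score far higher than the sepal ones, so they are the features that survive.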

## Select Percentile

In this method, we keep the features whose scores fall in the given highest percentile.

An example of this method in Python is shown below (note the digits dataset, which has 64 features):

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

feature, target = load_digits(return_X_y=True)
feature.shape
# output: (1797, 64)

feature_new = SelectPercentile(chi2,
                               percentile=10).fit_transform(feature, target)
feature_new.shape
# output: (1797, 7)
```

We choose the percentile as shown in the code above; note that only 7 of the original 64 features are kept.

## Generic Uni-variate select

In this method, we choose a selection mode (such as 'k_best' or 'percentile') and a parameter that controls how many features are selected.

An example of this method in Python is shown below (the breast cancer dataset has 30 features):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import GenericUnivariateSelect
from sklearn.feature_selection import chi2

feature, target = load_breast_cancer(return_X_y=True)
feature.shape
# output: (569, 30)

transformer = GenericUnivariateSelect(chi2, mode='k_best', param=20)
feature_new = transformer.fit_transform(feature, target)
feature_new.shape
# output: (569, 20)
```

## Pearson Correlation

This is a simple method for measuring the correlation between an input feature and the target feature. The score lies in the range [-1, 1], from strongly negatively correlated to strongly positively correlated.

An example of this method in Python is shown below:

```python
import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)

print("Less random data", pearsonr(x, x + np.random.normal(0, 1, size)))
print("More random data", pearsonr(x, x + np.random.normal(0, 10, size)))
# output:
# Less random data (0.618, 4.743e-49)
# More random data (0.078, 0.23)
```

The main disadvantage of this method is that it only captures linear relationships between features.
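For instance, a feature that determines the target exactly but non-linearly can still score near zero. A minimal sketch of this failure mode:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = x ** 2  # y is fully determined by x, but not linearly

# Pearson only measures linear association, so the score is near zero
# even though the relationship is perfect.
r, p = pearsonr(x, y)
print(r)
```

A rank-based score (e.g. Spearman) or mutual information would be a better fit when the relationship may be non-linear.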

## Selecting from Model

This method finds the important features after the model is fitted. It acts as an add-on (a meta-transformer) to the model, keeping only the features whose importance, such as coefficient magnitude, exceeds some threshold.

An example of this method in Python is shown below:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X = [[ 0.27, -2.34,  0.31],
     [-2.79, -0.09, -0.85],
     [-0.34,  1.34, -2.55],
     [ 1.77,  1.28,  0.54]]
y = [1, 0, 1, 0]

selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)

selector.estimator_.coef_
# output: array([[ 0.35531238, -0.56881882, -0.70144451]])

selector.threshold_
# output: 0.5524527319086916

selector.get_support()
# output: array([False,  True, False])
```

## Low variance method

This method removes features whose variance does not meet a given threshold, on the assumption that a feature that barely varies carries little information about the data.

For boolean features, a commonly suggested threshold removes features that take the same value in more than 80% of the samples:

`.8 * (1 - .8)`
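This formula is the variance of a Bernoulli variable, Var[X] = p(1 − p): with p = .8 the cutoff is 0.16. A small sketch below checks this against a simulated boolean column (the 90/10 split is an assumed example, not from the article):

```python
import numpy as np

# A boolean feature that is 1 in ~90% of rows: p = 0.9,
# so its variance should be near p * (1 - p) = 0.09.
rng = np.random.default_rng(0)
col = (rng.random(10_000) < 0.9).astype(int)

print(col.var())        # near 0.09
print(0.8 * (1 - 0.8))  # the cutoff, 0.16 — this column would be removed
```

Since 0.09 < 0.16, a `VarianceThreshold(threshold=.8 * (1 - .8))` would drop this mostly constant column.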

An example of this method in Python is shown below:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 0], [0, 1, 1]]

sel = VarianceThreshold(threshold=(.5 * (1 - .5)))
sel.fit_transform(X)
# output: ValueError: No feature in X meets the variance threshold 0.25000

sel = VarianceThreshold(threshold=(.6 * (1 - .6)))
sel.fit_transform(X)
# output:
# array([[1],
#        [0],
#        [1],
#        [0],
#        [0],
#        [1]])
```

Here, with the stricter threshold (0.25) no feature qualifies and an error is raised, while with the slightly lower threshold (0.24) the two low-variance columns are removed and only the third column remains.

## Recursive method

In this method (recursive feature elimination), an estimator assigns weights to the features, the least important features are pruned, and the procedure repeats on the smaller set until the desired number of features remains. Features are then ranked by when they were eliminated.

An example of this method in Python is shown below:

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

feature, target = make_friedman1(n_samples=30, n_features=8,
                                 random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(feature, target)
selector.ranking_
# output: array([1, 2, 3, 1, 1, 1, 1, 4])
```

From the above example, we can see a ranking over the 8 features we generated with the n_features parameter: the 5 selected features all have rank 1.

## Conclusion

In this article, we discussed some feature selection methods in machine learning. Feature selection is an important task in every project.

I hope you liked the article. Reach me on LinkedIn and Twitter.
