Feature Selection and Removal in Machine Learning


Improving a model and its accuracy for high-dimensional data


As we know, the features fed into a machine learning algorithm play a crucial role in the quality of its predictions, whatever the application domain.

When the feature set grows large and complex, there is a high chance of multicollinearity, i.e. high correlation between two or more features. This hurts training and can push the model toward over-fitting or under-fitting.
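Before removing anything, it helps to see which features are highly correlated with each other. The snippet below is a minimal, hedged sketch of one common check, a pairwise correlation matrix; the DataFrame name df and the 0.9 cut-off are illustrative assumptions, not part of the original article.

import pandas as pd
import numpy as np
# Hypothetical data: feature "b" is almost a copy of feature "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                  # independent feature
})
# Absolute pairwise correlations between the input features
corr = df.corr().abs()
# Flag feature pairs whose correlation exceeds the chosen cut-off (0.9 here)
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and corr.loc[i, j] > 0.9]
print(high)
#output:
[('a', 'b')]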

There are some methods to select and remove features as shown below:

Feature Selection Methods
1. Uni-variate Selection
2. Selecting from Model
Feature Removal Methods
1. Low variance method
2. Recursive method

Uni-variate Selection

Uni-variate methods examine the strength of the relationship between each individual feature and the target, one feature at a time. Any suitable statistical test can be used to score how informative a single feature is about the target.

Methods in Uni-variate feature selection

  • Select K best
  • Select percentile
  • Generic Uni-variate select
  • Pearson Correlation

K Best Method

In this method, we choose a fixed number k of features to keep, based on the highest test scores (and lowest p-values) of the features.

The example of this method with python is shown below:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
feature, target = load_iris(return_X_y=True)
feature.shape
#output:
(150, 4)
# keep the 2 features with the highest chi-squared scores
feature_new = SelectKBest(chi2, k=2).fit_transform(feature, target)
feature_new.shape
#output:
(150, 2)

Note that we pass chi2 as the scoring function: it computes the chi-squared statistic between each (non-negative) feature and the target, which measures how strongly the feature depends on the target.
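To see what is driving the selection, here is a minimal sketch (reusing the same iris data as above) that fits the selector first and then reads the per-feature chi-squared scores and p-values that SelectKBest exposes:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
feature, target = load_iris(return_X_y=True)
# fit first so the per-feature statistics are available
selector = SelectKBest(chi2, k=2).fit(feature, target)
print(selector.scores_)                    # chi-squared score of each of the 4 features
print(selector.pvalues_)                   # corresponding p-values
print(selector.get_support(indices=True))  # indices of the 2 selected features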

Select Percentile

In this method, instead of a fixed count, we keep the features whose scores fall within the top percentile that we specify.

The example of this method with python is shown below:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2
feature, target = load_digits(return_X_y=True)
feature.shape
#output:
(1797, 64)
# keep the features in the top 10% by chi-squared score
feature_new = SelectPercentile(chi2, percentile=10).fit_transform(feature, target)
feature_new.shape
#output:
(1797, 7)

We pass the percentile as shown in the code above; with percentile=10, only 7 of the original 64 features are kept.
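If we also want to know which of the 64 pixel features survived, a small follow-up sketch (same digits data as above) asks the fitted selector for the retained column indices:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
feature, target = load_digits(return_X_y=True)
selector = SelectPercentile(chi2, percentile=10).fit(feature, target)
# integer indices of the 7 columns kept by the 10th-percentile cut
print(selector.get_support(indices=True))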

Generic Uni-variate select

In this method, we choose a selection strategy via the mode parameter ('percentile', 'k_best', 'fpr', 'fdr' or 'fwe') and pass its setting through param, so it acts as a configurable wrapper around the other uni-variate selectors.

The example of this method with python is shown below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import GenericUnivariateSelect
from sklearn.feature_selection import chi2
feature, target = load_breast_cancer(return_X_y=True)
feature.shape
#output:
(569, 30)
# 'k_best' mode with param=20 keeps the 20 highest-scoring features
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=20)
feature_new = transformer.fit_transform(feature, target)
feature_new.shape
#output:
(569, 20)

Pearson Correlation

This is a very simple way to measure the correlation between an input feature and the target. The score lies in the range [-1, 1], from strongly negatively correlated through uncorrelated to strongly positively correlated, and pearsonr also returns a p-value.

The example of this method with python is shown below:

import numpy as np
from scipy.stats import pearsonr
np.random.seed(0)
count = 300
x = np.random.normal(0, 1, count)
# small added noise: the correlation stays strong
print("Less random data", pearsonr(x, x + np.random.normal(0, 1, count)))
# large added noise: the correlation almost disappears
print("More random data", pearsonr(x, x + np.random.normal(0, 10, count)))
#output:
Less random data (0.618, 4.743e-49)
More random data (0.078, 0.23)

The main disadvantage of this method is that it only captures linear relationships between a feature and the target.
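A quick illustration of this blind spot: in the sketch below, y is fully determined by x (y equals x squared), yet the Pearson score is close to zero because the dependence is not linear.

import numpy as np
from scipy.stats import pearsonr
np.random.seed(0)
x = np.random.normal(0, 1, 300)
y = x ** 2  # perfectly predictable from x, but not linearly
r, p = pearsonr(x, y)
print(r)  # close to 0 despite the strong quadratic relationship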

Selecting from Model

This method selects the important features with the help of a fitted model: it wraps an estimator and keeps only the features whose coefficients (or feature importances) exceed some threshold.

The example of this method with python is shown below:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
X = [[ 0.27, -2.34,  0.31],
     [-2.79, -0.09, -0.85],
     [-0.34,  1.34, -2.55],
     [ 1.77,  1.28,  0.54]]
y = [1, 0, 1, 0]
selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)
selector.estimator_.coef_
#output:
array([[ 0.35531238, -0.56881882, -0.70144451]])
selector.threshold_
#output:
0.5524527319086916
selector.get_support()
#output:
array([False,  True, False])
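As a follow-up to the snippet above, calling transform on the fitted selector returns the reduced data containing only the columns whose coefficient magnitude passed threshold_ (a minimal sketch continuing the same example):

X_new = selector.transform(X)
print(X_new.shape)  # (4, n_selected): one column per feature marked True in get_support()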

Low variance method

This method looks only at the variance of each feature: features whose variance does not meet a given threshold are removed, on the assumption that near-constant features carry little information.

For Boolean (Bernoulli) features the variance is p * (1 - p), so to drop features that take the same value in more than 80% of the samples the threshold would be:

.8 * (1 - .8)  # = 0.16

The example of this method with python is shown below:

from sklearn.feature_selection import VarianceThreshold
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.5 * (1 - .5)))
sel.fit_transform(X)
#output:
ValueError: No feature in X meets the variance threshold 0.25000
sel = VarianceThreshold(threshold=(.6 * (1 - .6)))
sel.fit_transform(X)
#output:
array([[1],
       [0],
       [1],
       [0],
       [0],
       [1]])

Here, with the first threshold (0.5 * 0.5 = 0.25) no feature's variance is high enough, so scikit-learn raises an error. With the slightly lower threshold (0.6 * 0.4 = 0.24), only the third column (variance 0.25) survives and the other two columns are removed from the data.

Recursive method

In this method (recursive feature elimination, RFE), an estimator is trained on the full feature set, the features are ranked by the estimator's weights (coefficients or importances), the least important ones are pruned, and the process is repeated recursively on smaller and smaller feature sets until the desired number of features remains.

The example of this method with python is shown below:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
feature, target = make_friedman1(n_samples=30, n_features=8, random_state=0)
estimator = SVR(kernel="linear")
# keep 5 features, eliminating 1 feature per round
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(feature, target)
selector.ranking_
#output:
array([1, 2, 3, 1, 1, 1, 1, 4])

From the above example, we can see that ranking_ assigns a rank to each of the 8 features created via the n_features parameter: the 5 selected features get rank 1, and the 3 eliminated features get ranks 2, 3 and 4.
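If we only need the boolean mask of kept columns rather than the full ranking, the fitted RFE object also exposes support_ and can reduce the data directly; a minimal sketch continuing the example above:

print(selector.support_)                  # True for the 5 selected features
print(selector.transform(feature).shape)  # (30, 5): data reduced to the selected features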

Conclusion

In this article, we discussed several feature selection and removal methods in machine learning. Feature selection is an important task in almost every project, especially with high-dimensional data.

I hope you liked the article. Reach me on my LinkedIn and Twitter.

Recommended Articles

1. 8 Active Learning Insights of Python Collection Module
2. NumPy: Linear Algebra on Images
3. Exception Handling Concepts in Python
4. Pandas: Dealing with Categorical Data
5. Hyper-parameters: RandomSearchCV and GridSearchCV in Machine Learning
6. Fully Explained Linear Regression with Python
7. Fully Explained Logistic Regression with Python
8. Data Distribution using Numpy with Python
9. Decision Trees vs. Random Forests in Machine Learning
10. Standardization in Data Preprocessing with Python
