A genetic algorithm is a technique for optimization problems based on natural selection. In this post, I show how to use genetic algorithms for feature selection.
While there are many well-known feature selection methods in scikit-learn, feature selection goes well beyond what is available there.
Feature selection is a crucial aspect of any machine learning pipeline. These days, however, there is a surplus of available data, and as a consequence there is often a surplus of features.
With so many features, many are inevitably redundant. They add noise to your model and make model interpretation problematic.
The problem is determining which features are relevant to the task at hand. The aim is to keep only quality features.
This post makes use of the ‘sklearn-genetic’ package.
This package is compatible with existing sklearn models and provides a great deal of functionality and options for genetic selection.
For this post, I am using a genetic algorithm for feature selection, but a genetic algorithm can also be used for hyper-parameter optimization. Because the steps are straightforward and general, the approach applies to many different areas.
Selecting features is an NP-hard problem: given a set of features, the optimal configuration is some subset of them, making this a discrete selection problem. With a combinatorial number of possible subsets, it is very costly to determine the optimal feature set exhaustively.
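To make the cost concrete: the search space doubles with every feature added, so even the modest 30-feature dataset used below has over a billion candidate subsets.

```python
# The subset search space doubles with every feature added.
# For the 30-feature breast cancer dataset used below:
n_features = 30
n_subsets = 2 ** n_features - 1  # every non-empty subset of the features
print(n_subsets)  # 1073741823 -- over a billion candidates
```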
Genetic algorithms use an approach to determine an optimal set based on evolution. For feature selection, the first step is to generate a population based on subsets of the possible features.
From this population, the subsets are evaluated using a predictive model for the target task. Once each member of the population is scored, a tournament is performed to determine which subsets will continue into the next generation. The next generation is composed of the tournament winners, with some crossover (updating the winning feature sets with features from other winners) and mutation (introducing or removing some features at random).
- An initial population is produced.
- A score is attached to the members of the population.
- A subset is selected for reproduction with a tournament.
- Select genetic material to pass on.
- Apply mutations.
- Repeat over multiple generations.
The algorithm runs for a set number of generations (iterations), after which the best member of the final population is taken as the selected feature set.
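The steps above can be sketched in plain Python. This is a minimal illustration, not the sklearn-genetic implementation: the toy fitness function stands in for a cross-validated estimator score, and all names and constants here are invented for the example.

```python
import random

random.seed(0)

N_FEATURES = 10          # size of the full feature set
POP_SIZE = 20
N_GENERATIONS = 30
TOURNAMENT_SIZE = 3
MUTATION_PROBA = 0.05    # per-feature flip probability

# Toy fitness: pretend only the first three features are informative,
# and penalize larger subsets (a stand-in for a cross-validated score).
def fitness(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)

def random_member():
    return [random.randint(0, 1) for _ in range(N_FEATURES)]

def tournament(population):
    # A few members compete; the best scorer becomes a parent.
    contenders = random.sample(population, TOURNAMENT_SIZE)
    return max(contenders, key=fitness)

def crossover(a, b):
    # Uniform crossover: each feature comes from either parent.
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(mask):
    # Flip each feature independently with a small probability.
    return [bit ^ (random.random() < MUTATION_PROBA) for bit in mask]

population = [random_member() for _ in range(POP_SIZE)]
for _ in range(N_GENERATIONS):
    population = [
        mutate(crossover(tournament(population), tournament(population)))
        for _ in range(POP_SIZE)
    ]

best = max(population, key=fitness)
print(best)
```

With the toy fitness above, the loop tends to converge toward masks that keep the three informative features and drop the rest.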
The experiments are based on the UCI breast cancer dataset, which contains 569 instances and 30 features. With this dataset, I test several classifiers with all of the features, the subset of features from the genetic algorithm, and five features using the chi-squared test for comparison.
Below is the code used to select up to five features using a genetic algorithm.
from sklearn.datasets import load_breast_cancer
from genetic_selection import GeneticSelectionCV
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop(['target'], axis=1)
y = df['target'].astype(float)

estimator = DecisionTreeClassifier()
model = GeneticSelectionCV(
    estimator, cv=5, verbose=0,
    scoring="accuracy", max_features=5,
    n_population=100, crossover_proba=0.5,
    mutation_proba=0.2, n_generations=50,
    crossover_independent_proba=0.5,
    mutation_independent_proba=0.04,
    tournament_size=3, n_gen_no_change=10,
    caching=True, n_jobs=-1)
model = model.fit(X, y)
print('Selected features:', X.columns[model.support_])
The initial population (of size ‘n_population’) is generated at random from the sample space of feature sets. These sets are limited in scope by the parameter ‘max_features’, which sets the maximum size of each feature subset.
For each member of the initial population, a score is measured with the target metric. This measurement is the performance of the estimator specified.
A tournament selection is performed to determine which members will continue to the next generation. The number of members within each tournament is set with ‘tournament_size’. In a tournament, a few members of the population compete against one another based on the scoring metric, and the winner is chosen as a parent for the next generation.
The number of members in the tournament should remain small. When the value is large, the current best member is almost always selected, so the weaker members are never chosen. While this provides temporary performance gains, it ultimately reduces overall performance, as the weaker options are never given a chance to improve.
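One way to see why: with uniform sampling without replacement, the single best member wins any tournament it appears in, and the chance of appearing in a sample of k members out of n is simply k/n. A quick check (illustrative arithmetic, not part of the package):

```python
from fractions import Fraction

def p_best_wins(population_size, tournament_size):
    # The best member wins every tournament it is drawn into, and the
    # chance of being drawn into a uniform sample of k out of n is k/n.
    return Fraction(tournament_size, population_size)

print(p_best_wins(20, 3))   # 3/20 -- weaker members still win often
print(p_best_wins(20, 15))  # 3/4  -- the best member dominates selection
```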
In natural selection, genetic information is stored in a chromosome. During reproduction, some genetic material is passed from each parent to the child, so the child contains genetic material from both parents. This property is represented with the parameter ‘crossover_proba’, which specifies the chance of a crossover from one generation to the next. There is also the parameter ‘crossover_independent_proba’, which is the probability that an individual feature will cross over to the child.
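A sketch of how these two probabilities can interact (hypothetical, the package's internals may differ): the first gates whether a parent pair recombines at all, the second decides each feature independently.

```python
import random

random.seed(42)

CROSSOVER_PROBA = 0.5              # chance that a parent pair recombines at all
CROSSOVER_INDEPENDENT_PROBA = 0.1  # per-feature chance of taking the other parent's gene

def crossover(parent_a, parent_b):
    if random.random() >= CROSSOVER_PROBA:
        return list(parent_a)  # no recombination for this pair
    return [
        b if random.random() < CROSSOVER_INDEPENDENT_PROBA else a
        for a, b in zip(parent_a, parent_b)
    ]

child = crossover([1, 1, 1, 0, 0], [0, 0, 0, 1, 1])
print(child)
```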
A critical aspect of evolution is mutation. Mutation mitigates the risk of the search falling into a local optimum and getting stuck. At each generation, in addition to the crossover, a random mutation is applied. The probability that a mutation will happen is set with the parameter ‘mutation_proba’. This parameter is combined with ‘mutation_independent_proba’, which is the per-feature chance of being flipped into or out of the feature set.
Notably, setting this probability too high transforms the algorithm into a random selection process. So, you will want to keep this value relatively low. Randomly introducing features in each generation effectively acts as a regularization for the genetic process.
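The mutation step can be sketched the same way (again with hypothetical names and values): one probability gates whether a child is mutated at all, the other flips each feature independently once it is.

```python
import random

random.seed(7)

MUTATION_PROBA = 0.2               # chance a child is mutated at all
MUTATION_INDEPENDENT_PROBA = 0.05  # per-feature flip chance once mutation fires

def mutate(mask):
    if random.random() >= MUTATION_PROBA:
        return list(mask)  # most children pass through unchanged
    # XOR flips a feature in or out when the independent draw succeeds
    return [bit ^ (random.random() < MUTATION_INDEPENDENT_PROBA) for bit in mask]

mutated = mutate([1, 0, 1, 0, 1, 0])
print(mutated)
```

Pushing both values toward 1.0 would flip roughly every feature each generation, which is exactly the degeneration into random selection described above.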
The genetic search used here also has an ‘n_gen_no_change’ parameter, which monitors whether the best member of the population has changed over several generations. If it hasn’t, the search has likely settled on an optimum. In that scenario, consider increasing the mutation or crossover probabilities to vary the selection further.
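The stopping rule can be sketched as a simple check over the history of best scores (an illustration of the idea, not the package's code):

```python
def should_stop(best_scores, n_gen_no_change=10):
    # Stop once the best score has failed to improve for
    # n_gen_no_change consecutive generations.
    if len(best_scores) <= n_gen_no_change:
        return False
    return max(best_scores[-n_gen_no_change:]) <= max(best_scores[:-n_gen_no_change])

print(should_stop([0.90, 0.92, 0.95] + [0.95] * 10))  # True -- plateaued
print(should_stop([0.90, 0.92, 0.95]))                # False -- still early
```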
The results of the genetic vs. the chi-squared feature selection are shown below. The baseline performance using all features is also listed. The results are from cross-validation, the performance measured is accuracy, and the number of features used is in parentheses.
While these results are by no means conclusive, they show the benefits of the genetic algorithm: the models built on the feature subsets selected by the genetic algorithm consistently outperformed both the baseline model and the chi-squared feature subset. The single exception was the logistic regression model, where the results were still comparable.
Additionally, the optimal feature subsets produced were smaller than the maximum of five features. Models with fewer features are ultimately preferred to larger models as they are simpler and more interpretable.
Genetic Algorithms are incredibly versatile and apply to a wide range of scenarios.
This post explored how genetic algorithms are used for feature selection using the sklearn-genetic package. These algorithms have also been shown to be effective in hyper-parameter searches and generative design.
While less conventional than the readily available methods in sklearn, genetic algorithms offer a distinct and practical approach to feature selection. The way that these algorithms optimize is far different from most other feature selection methods. The process is based on a pure natural selection approach.
I encourage data scientists to take the time to understand and implement genetic algorithms in their work.