Feature engineering A-Z



Feature extraction

Let’s say we have consumption data of some kind, and each record carries a timestamp:

Data with a timestamp

In this example, the “Date” column could easily be used to extract additional features and generate powerful insights such as variations of consumption on weekdays or weekends or at a particular time in the year (see yellow highlights below).

Data with feature engineering
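As a minimal sketch of this kind of extraction in pandas: the data below is made up, and the column names other than “Date” are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical consumption records; values and the "Consumption" column are illustrative.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-07-05"]),
    "Consumption": [120.5, 98.3, 143.0],
})

# Derive calendar features from the timestamp column.
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["day_of_week"] = df["Date"].dt.day_name()
df["is_weekend"] = df["Date"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

print(df)
```

With features like these in place, consumption can be grouped by weekday, weekend, or month to look for seasonal patterns.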

Feature synthesis

Feature synthesis is the opposite of feature extraction. In this case, two or more features are combined to create a new feature that is more informative than any of them individually.

Let’s say a house price dataset has two columns: floor_space (sqft) and total_house_price (US$). You could use them individually in your analysis, but you could also create a new calculated feature called price_per_sqft (US$/sqft).
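A short sketch of that calculation in pandas, using made-up numbers and the column names from the example above:

```python
import pandas as pd

# Hypothetical house data; the values are illustrative only.
houses = pd.DataFrame({
    "floor_space": [1000, 1500, 2000],              # sqft
    "total_house_price": [250000, 300000, 500000],  # US$
})

# Combine two existing columns into one synthesized feature (US$/sqft).
houses["price_per_sqft"] = houses["total_house_price"] / houses["floor_space"]

print(houses)
```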

Feature scaling

Feature scaling/transformation refers to a variety of methods applied in data preprocessing to rescale or normalize data into a different range. The purpose of scaling is to transform features so that they are dimensionless and/or have similar distributions. Four popular methods are:

a) Rescaling: also known as “min-max normalization”, it is the simplest of all methods and maps values to the range [0, 1]. It is calculated as:

x' = (x - min(x)) / (max(x) - min(x))

b) Mean normalization: this method uses the mean of the observations in the transformation process:

x' = (x - mean(x)) / (max(x) - min(x))

c) Standardization: also known as Z-score normalization, this technique uses the Z-score or “standard score” for feature scaling. It is widely used in machine learning algorithms such as SVM and logistic regression:

z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature

d) Log transformation: each value x of a feature is replaced with log(x), which requires x to be positive. A popular application is in building linear regression models, where a skewed continuous variable is transformed so that its distribution is closer to Gaussian and better meets modeling assumptions.

Log transformation is also popular because it makes plots easier to read and interpret, especially for high-variance data (see the figure below).

Scatterplot showing the relationship between area and population with and without log transformation. Source: Wikipedia.
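The four transformations in this section can be sketched with NumPy as follows. The input array is made up, the formulas are the standard textbook definitions, and the choice of base 10 for the logarithm and of the population standard deviation are assumptions made for this example.

```python
import numpy as np

# Illustrative feature spanning several orders of magnitude.
x = np.array([1.0, 10.0, 100.0, 1000.0])

# a) Rescaling (min-max normalization): values end up in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# b) Mean normalization: centered on the mean, scaled by the range.
mean_norm = (x - x.mean()) / (x.max() - x.min())

# c) Standardization (Z-score): zero mean, unit standard deviation.
#    np.std uses the population formula (ddof=0) by default.
z_score = (x - x.mean()) / x.std()

# d) Log transformation: only defined for positive values.
log_x = np.log10(x)

print(min_max, mean_norm, z_score, log_x, sep="\n")
```

In a real project the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused for new data.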

Feature encoding

Feature encoding refers to transforming categorical string values of a feature into numeric ones.

For example, if you have a “gender” column in your dataset and the values are presented as female and male, you could convert those strings into a numeric representation such as male = 1, female = 2. Encoding features this way is known as Label Encoding.

Of course, label encoding imposes an artificial order on the categories, so an algorithm may put a higher weight on female (= 2) than on male (= 1). To avoid this, people often use One Hot Encoding instead to create dummy variables, where each category becomes its own column and the values are either 1 or 0.

Here’s an example, where one-hot encoding is applied to a dataset with categorical features:

import seaborn as sns
# load the example "tips" dataset that ships with seaborn
df = sns.load_dataset('tips')

Categorical features without dummies

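Now creating dummies in pandas:

import pandas as pd
# one 0/1 column per category; applying get_dummies to the whole frame is an assumption
df = pd.get_dummies(df)

Categorical features converted into dummy variables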
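For comparison, here is a small self-contained sketch that applies both label encoding and one-hot encoding to the “gender” example from earlier. The data frame is made up, and the numeric codes follow the male = 1, female = 2 mapping used in the text.

```python
import pandas as pd

# Hypothetical data for illustration.
people = pd.DataFrame({"gender": ["female", "male", "female"]})

# Label encoding: map each category to an integer code.
people["gender_label"] = people["gender"].map({"male": 1, "female": 2})

# One-hot encoding: one 0/1 column per category, with no implied order.
dummies = pd.get_dummies(people["gender"], prefix="gender", dtype=int)

print(people.join(dummies))
```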

Outlier treatment

Almost every data scientist runs into outliers in their projects. What counts as an outlier is domain- and context-specific. If you are building a model to predict house prices in a neighborhood where the average price is $300K and your dataset contains a house that sold for $10M, that house is almost certainly an outlier. Similarly, a house that sold for $30K probably does not belong in the model either.

So what options do you have for handling those outliers? The following are your choices:

  • you can leave them in the dataset, if they are barely outside the acceptable range
  • you can drop the observations, if you can afford to (e.g. your dataset is fairly large)
  • you can cap/clip the outliers, i.e. trim the outlier values to a certain range (e.g. bounds derived from the inter-quartile range, IQR).
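The dropping and capping options can be sketched in pandas as below. The prices are made up, and the 1.5 × IQR fences are a common convention assumed here for illustration, not a rule taken from the text.

```python
import pandas as pd

# Hypothetical sale prices with one very low and one very high value.
prices = pd.Series([280_000, 295_000, 300_000, 310_000, 320_000, 30_000, 10_000_000])

# Compute the inter-quartile range and the fences around it.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option: cap/clip values that fall outside the fences.
capped = prices.clip(lower=lower, upper=upper)

# Option: drop the observations outside the fences instead.
dropped = prices[prices.between(lower, upper)]

print(capped.tolist())
print(dropped.tolist())
```

Whether to cap or drop depends on how much data you can afford to lose and on whether the extreme values are errors or genuine observations.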

Missing value treatment

Missing values are common in most datasets and are usually recorded as NA or NaN. There is no single standard way of dealing with missing data; people take different approaches depending on the dataset. Here are the three most popular methods:

  • Deletion: delete the complete record that contains a missing value. This works when the dataset is large and the number of missing values is relatively low.
  • Substitution/imputation: replace missing values with a suitable substitute such as the column mean, the median, the mean of nearest neighbors, or a moving average.
  • Statistical imputation: a more sophisticated approach, where the missing values are predicted by a model such as linear regression, as a function of other columns in the dataset.
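The three methods above can be sketched as follows. The data frame is made up, and the regression step uses a simple one-variable fit with np.polyfit as a stand-in for whatever regression model you would use in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
df = pd.DataFrame({
    "floor_space": [1000.0, 1500.0, 1200.0, 2500.0, 3000.0],
    "price": [200000.0, 300000.0, np.nan, 500000.0, 600000.0],
})

# 1) Deletion: drop every row that contains a missing value.
deleted = df.dropna()

# 2) Substitution: fill the gap with the column median.
median_filled = df["price"].fillna(df["price"].median())

# 3) Statistical imputation: fit price ~ floor_space on the complete rows,
#    then predict the missing price from the fitted line.
known = df.dropna(subset=["price"])
slope, intercept = np.polyfit(
    known["floor_space"].to_numpy(), known["price"].to_numpy(), deg=1
)
predicted = slope * df["floor_space"] + intercept
regression_filled = df["price"].fillna(predicted)

print(median_filled.tolist())
print(regression_filled.round(2).tolist())
```

Here the median fill ignores the size of the house, while the regression fill uses it, which is why the two methods give different values for the same gap.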

Other kinds of feature engineering

Besides what I’ve just described, there are a few other techniques in the feature engineering space, including:

  • Reducing the number of categories: re-categorizing a categorical feature so that it has fewer classes. For example, if there are 6 categories of education level in a column, you might want to get them down to 3 categories.
  • Binning: creating intervals for numeric variables. For example, ‘age’ can be grouped into <20, 20–30, 30–40, etc. instead of being left as individual values (1, 2, 3, 4, …, 40).
  • Polynomial fit: in regression analysis, instead of a linear model of the form y = b0 + b1x, one can fit a polynomial function of the form y = b0 + b1x + b2x² to get a better fit.
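A brief sketch of the three techniques in the list above. All data, category names, and bin edges are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Reducing the number of categories: 6 hypothetical education levels -> 3 groups.
education = pd.Series(["primary", "secondary", "vocational", "bachelor", "master", "phd"])
mapping = {
    "primary": "school", "secondary": "school",
    "vocational": "college", "bachelor": "college",
    "master": "postgraduate", "phd": "postgraduate",
}
education_grouped = education.map(mapping)

# Binning: left-closed intervals [0, 20), [20, 30), [30, 40).
age = pd.Series([15, 20, 25, 38])
age_group = pd.cut(age, bins=[0, 20, 30, 40], right=False,
                   labels=["<20", "20-30", "30-40"])

# Polynomial fit: y = b0 + b1*x + b2*x^2 on noise-free synthetic data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x**2
b2, b1, b0 = np.polyfit(x, y, deg=2)  # coefficients come highest degree first

print(education_grouped.tolist())
print(age_group.tolist())
print(round(b0, 6), round(b1, 6), round(b2, 6))
```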

