5 Useful Encoding Techniques in Machine Learning



A pre-processing step in machine learning modeling

Introduction

A data analyst spends most of their time preparing and cleaning data, because raw data is unstructured and noisy and cannot be used by machine learning models directly. The data must therefore be cleaned and filtered, which improves the quality of the model and also helps with feature engineering.

The main goals of data cleaning are encoding categorical data, handling missing values, dropping redundant features, and reducing dimensionality with standard dimensionality reduction techniques. This step prepares the data as a whole so that it can be fed to any machine learning algorithm. Data falls into two categories: structured data and unstructured data.

Data Encoding is a pre-processing step in machine learning modeling that converts or encodes categorical data into numerical form.

Variables in Machine learning / Data science

There are several data encoding techniques, which I will discuss with the help of Python in this article.

In machine learning, models mostly deal with numerical variables, but what happens when categorical variables come into the picture? We need to convert these categorical variables to numeric form before fitting and evaluating the model, so that the model can understand them and extract information from them. To do this, we first need to understand categorical data.

Categorical Data

In statistics, a categorical variable takes one of a limited number of possible values, each of which represents a particular category.

For example:

  • The cities where a person lives: Noida, Delhi, Gurgaon, Mumbai, Bangalore, etc.
  • The department of a person where he/she works: HR, Finance, IT, Production, etc.
  • The highest degree of a person: Ph.D., Masters, Bachelors, SSC, Diploma, SC, etc.
  • The grades of a student: A, B, C, D, etc.

From the examples above, we can see that categorical variables come in two types: nominal, where the categories have no inherent order (cities, departments), and ordinal, where they do (degrees, grades).

The techniques for converting categorical values into numeric values are listed below:

  1. Label Encoding/ordinal encoding
  2. One Hot Encoding
  3. Binary Encoding
  4. Hash Encoding
  5. Mean Encoding or Target encoding

Ordinal Encoding or Label Encoding

This encoding technique is used when the categorical feature is ordinal. In label encoding, every label is converted into an integer value. We will create a variable that contains the categories representing the qualification of a person.

Python code

import category_encoders as ce
import pandas as pd

In the first line of our code, we import category_encoders, a set of scikit-learn-style transformers for encoding categorical variables as numbers.

train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma',
                                    'Bachelors', 'Bachelors', 'Masters', 'Phd',
                                    'High school', 'High school']})
# create object of Ordinal encoding; the mapping keys must match the labels exactly
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'None': 0, 'High school': 1, 'Diploma': 2,
                                                  'Bachelors': 3, 'Masters': 4, 'Phd': 5}}])
# Original data
train_df
# fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)
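To make the result of this mapping concrete, here is a minimal plain-Python sketch of the same idea without any library. The degree ranking follows the one used above; the sample list `degrees` is made up for illustration.

```python
# Assumed ranking of the degrees, lowest to highest (illustrative only).
degree_order = {'None': 0, 'High school': 1, 'Diploma': 2,
                'Bachelors': 3, 'Masters': 4, 'Phd': 5}

# Hypothetical sample column.
degrees = ['High school', 'Masters', 'Diploma', 'Bachelors', 'Phd']

# Replace each label with its rank; a KeyError would signal an unmapped label.
encoded = [degree_order[d] for d in degrees]
print(encoded)  # [1, 4, 2, 3, 5]
```

Because the integers follow the real order of the degrees, comparisons such as "Masters > Diploma" stay meaningful after encoding.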

The disadvantage of label encoding

Label encoding implies an order among the categories. When it is applied to nominal features, which have no such order, this can mislead the model.
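The following small sketch illustrates the problem on a nominal feature. The city list and the alphabetical numbering are assumptions chosen only for the example.

```python
# Hypothetical nominal feature: cities have no natural order.
cities = ['Delhi', 'Mumbai', 'Chennai', 'Delhi']

# Number the distinct cities alphabetically: Chennai=0, Delhi=1, Mumbai=2.
codes = {city: i for i, city in enumerate(sorted(set(cities)))}
encoded = [codes[c] for c in cities]
print(encoded)  # [1, 2, 0, 1]

# A model that treats these codes as numbers sees Mumbai (2) as "greater than"
# Chennai (0), although that ordering is purely an artifact of the numbering.
```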

One Hot Encoding

This encoding technique is used when the features are nominal, that is, when they have no order. In one-hot encoding, we create a new variable for every level of a categorical feature. Each category is represented by a binary variable containing either 1 or 0, where 1 represents the presence of that category and 0 its absence. One-hot encoding overcomes the problem of label encoding, because it does not impose any order on the categories.

These newly created binary features are known as dummy variables. The number of dummy variables depends on the number of levels in the categorical variable. For example, suppose we have a dataset with a feature Animal containing different animals such as cat, dog, lion, cow, sheep, and horse, and we apply one-hot encoding to it.

After encoding, we have one dummy variable for each category in the feature Animal, and every row holds a 1 in the column of its own category and 0 in all the others. One-hot encoding can be implemented as given below.

Python Code

import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad',
                              'Chennai', 'Bangalore', 'Delhi', 'Hyderabad',
                              'Bangalore', 'Delhi']})
# Create object for one-hot encoding
encoder = ce.OneHotEncoder(cols=['City'], handle_unknown='return_nan',
                           return_df=True, use_cat_names=True)
# Original Data
data
# Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded
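To show what the encoder produces conceptually, here is a minimal library-free sketch of one-hot encoding. The short city list and the sorted column order are assumptions made for the example; a library encoder may order its columns differently.

```python
# Hypothetical sample column.
cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Delhi']

# One column per distinct category, sorted for a reproducible column order.
categories = sorted(set(cities))  # ['Delhi', 'Hyderabad', 'Mumbai']

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if city == cat else 0 for cat in categories] for city in cities]

for city, row in zip(cities, one_hot):
    print(city, row)
# Delhi [1, 0, 0]
# Mumbai [0, 0, 1]
# Hyderabad [0, 1, 0]
# Delhi [1, 0, 0]
```

Every row sums to 1, and no column is "larger" than another, which is why this representation suits nominal features.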

Binary Encoding

Binary encoding is a more compact alternative to one-hot encoding. Each category is first assigned an integer, that integer is written in binary, and each binary digit (0 or 1) becomes its own column.

For example, the binary code for 7 is 111.

This technique is preferable when there is a large number of categories. Suppose you have 100 different categories: one-hot encoding will create 100 columns, but binary encoding needs only 7 columns to represent them, because 2^7 = 128 is the smallest power of two that is at least 100.

Binary coding of a two-level variable:

  • YES: 1
  • NO: 0
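The column count and the bit patterns can be sketched in plain Python. The helper name `binary_encode` and the zero-based alphabetical numbering are assumptions for illustration; a library encoder may number the categories differently, so its bit patterns can differ.

```python
def binary_encode(values):
    """Assign each category an integer, then spell that integer out in bits."""
    categories = sorted(set(values))
    # Bits needed to give every category a distinct code (at least one column).
    n_bits = max(1, (len(categories) - 1).bit_length())
    codes = {cat: [int(bit) for bit in format(i, f'0{n_bits}b')]
             for i, cat in enumerate(categories)}
    return codes, n_bits

animals = ['cat', 'dog', 'lion', 'cow', 'sheep', 'horse']
codes, n_bits = binary_encode(animals)
print(n_bits)         # 3
print(codes['cat'])   # [0, 0, 0]
print(codes['lion'])  # [1, 0, 0]

# Columns needed for 100 categories, as in the example above.
print((100 - 1).bit_length())  # 7
```

Six animals fit in three columns here, whereas one-hot encoding would need six.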

Python Code

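import pandas as pd
from category_encoders import BinaryEncoder
# df is assumed to be an existing DataFrame with a categorical column 'ord_2'
encoder = BinaryEncoder(cols=['ord_2'])
# transforming the column after fitting (pass a one-column DataFrame)
newdata = encoder.fit_transform(df[['ord_2']])
# concatenating the encoded columns with the dataframe
df = pd.concat([df, newdata], axis=1)
# dropping old column
df = df.drop(['ord_2'], axis=1)
df.head(10)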

Hash Encoding

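Hashing is the process of converting a string of characters into a fixed-size hash value by applying a hash function. Hash encoding can handle a large number of categories with low memory usage, because the number of output columns is fixed in advance. The trade-off is that different categories can collide, that is, be mapped to the same column.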

Python Code

from sklearn.feature_extraction import FeatureHasher

Here we import FeatureHasher, which implements feature hashing: the hash value of each feature is used as its column index.

# n_features sets the number of output columns (the size of the hash space)
# df is assumed to be an existing DataFrame with a categorical column 'nom_0'
h = FeatureHasher(n_features=3, input_type='string')
# each sample must be an iterable of strings, so wrap every value in a list
hashed_Feature = h.fit_transform([[value] for value in df['nom_0']])
hashed_Feature = hashed_Feature.toarray()
df = pd.concat([df, pd.DataFrame(hashed_Feature, index=df.index)], axis=1)
df.head(10)
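The idea behind hash encoding can also be sketched with the standard library alone. This is not how FeatureHasher works internally (it uses a signed MurmurHash3, so its output can contain -1); the sketch swaps in MD5 from `hashlib` purely because it is deterministic and always available. The function name `hash_encode` and the colour values are hypothetical.

```python
import hashlib

def hash_encode(value, n_features=3):
    """Return a vector of length n_features with a 1 at the hashed position."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % n_features  # column index in [0, n_features)
    vector = [0] * n_features
    vector[bucket] = 1
    return vector

colours = ['Red', 'Green', 'Blue', 'Yellow']
vectors = {c: hash_encode(c) for c in colours}
for colour, vector in vectors.items():
    print(colour, vector)
```

With four categories and only three columns, at least two categories must share a column. This collision is the price paid for the fixed, small output size.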

Mean Encoding or Target encoding

Target encoding is an effective encoding technique because it captures information that helps explain the target, and it is widely used in machine learning competitions. The basic idea is to replace each categorical value with the mean of the target variable for that category.

Code:

# Inserting a target column into the dataset, as target encoding needs a target
# (df is assumed to have 10 rows and at least 6 columns)
df.insert(6, "Target", [0, 1, 1, 0, 0, 1, 0, 0, 0, 1], True)
# importing TargetEncoder
from category_encoders import TargetEncoder
Target_enc = TargetEncoder()
# transforming the column after fitting
points = Target_enc.fit_transform(X = df.nom_0, y = df.Target)
# concating values with dataframe
df = pd.concat([df, points], axis = 1)
df.head(10)
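The core computation, the per-category mean of the target, can be sketched without any library. The sample data is made up, and the sketch deliberately omits the smoothing that library encoders such as TargetEncoder typically apply, so its numbers may differ from theirs.

```python
from collections import defaultdict

# Hypothetical categorical feature and binary target.
categories = ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Red']
target = [1, 0, 1, 0, 1, 0]

# Accumulate the target sum and the row count for each category.
sums = defaultdict(float)
counts = defaultdict(int)
for cat, y in zip(categories, target):
    sums[cat] += y
    counts[cat] += 1

# Mean of the target per category: Red = 2/3, Blue = 1/2, Green = 0.
means = {cat: sums[cat] / counts[cat] for cat in sums}

# Replace every category with its mean.
encoded = [means[cat] for cat in categories]
print(means)
```

Since the encoding is computed from the target itself, the means should be learned on the training data only and then applied to validation or test data; otherwise information about the target leaks into the features.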

Conclusion

In this article, I focused on data encoding techniques, which are an important pre-processing step for handling categorical data before fitting and evaluating a model.
