4 Categorical Encoding Concepts to Know for Data Scientists

3. Hash Encoding

One-Hot Encoding’s major weakness is that the number of features it produces equals the cardinality of the categorical variable, which causes dimensionality issues when the cardinality is too high. One way to alleviate this problem is to represent the categorical data with a smaller number of columns, and that is what Hash Encoding does.

Hash Encoding represents categorical data as numerical values via a hashing function. Hashing is often used in data encryption and data comparison, but the core idea is the same — transform one feature into another using a hashing function.
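As a minimal sketch of the idea (not the exact implementation inside any particular library), each categorical value can be hashed with md5 and reduced modulo the desired number of columns, so every category deterministically lands in one of a fixed number of buckets. The `hash_bucket` helper below is hypothetical, written only to illustrate the concept:

```python
import hashlib

def hash_bucket(value, n_buckets=5):
    """Map a categorical value to one of n_buckets via md5 (illustrative only)."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

# The same category always maps to the same bucket,
# and every bucket index stays within [0, n_buckets).
print(hash_bucket('70'), hash_bucket('71'), hash_bucket('82'))
```

Because the mapping is a pure function of the value, no lookup table has to be stored — new categories at prediction time are hashed the same way.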

I will not explain the hashing process in depth, but you could read the following paper to understand one of the most commonly used hashing functions, md5.

The main advantage of Hash Encoding is that you can control the number of numerical columns produced by the process. You could represent categorical data with 25 or 50 unique values using just five columns (or any number you want). Let’s try Hash Encoding with a coding example.

First, let’s install a package for categorical encoding called category_encoders.

pip install category_encoders

This Python package contains many functions for the Categorical Encoding process and works well with the Featuretools package (category_encoders was developed to work with Featuretools). You could check my following article to learn what Featuretools does.

Using category_encoders, let’s try to Hash Encode the categorical data in the sample mpg dataset. Our dataset has a ‘model_year’ column with a cardinality of 13, and I want to transform it into five numerical features. To do that, we could try the following code.

import category_encoders as ce

# Hash-encode 'model_year' into five numerical columns
encoder = ce.HashingEncoder(n_components=5)
hash_res = encoder.fit_transform(mpg['model_year'])
Hash Encoding result (Image by Author)

Hash Encoding returns numerical features that represent the categorical data. Let’s concatenate the Hash Encoding result with the original data for comparison.

pd.concat([encoder.fit_transform(mpg['model_year']), mpg], axis=1).sample(5)
Image by Author

As we can see from the image above, the ‘model_year’ data have been transformed into numerical values (0 or 1), which a Machine Learning model can use for training.

However, Hash Encoding has two significant weaknesses. First, because we transform the data into fewer features, there is information loss. Second, since a large number of categorical values are mapped to a smaller number of features, different categorical values can end up represented by the same hash values — this is called a collision.
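The collision problem can be demonstrated with the same illustrative md5-based bucketing idea (the `hash_bucket` helper is hypothetical, not part of category_encoders): squeezing 13 model years into 5 buckets guarantees, by the pigeonhole principle, that at least two distinct years share a bucket.

```python
import hashlib

def hash_bucket(value, n_buckets=5):
    """Map a categorical value to one of n_buckets via md5 (illustrative only)."""
    return int(hashlib.md5(str(value).encode('utf-8')).hexdigest(), 16) % n_buckets

# 13 distinct model years forced into 5 buckets: collisions are unavoidable.
years = [str(y) for y in range(70, 83)]
buckets = {}
for y in years:
    buckets.setdefault(hash_bucket(y), []).append(y)

# Show which buckets hold more than one year (the collisions).
colliding = {b: ys for b, ys in buckets.items() if len(ys) > 1}
print(colliding)
```

After a collision, the model can no longer distinguish the colliding categories, which is the price paid for the reduced dimensionality.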

Still, many Kaggle competitors have used Hash Encoding in winning solutions, so it is worth a try.
