Stratified Sampling: You May Have Been Splitting Your Dataset All Wrong




Randomly generating splits of the dataset is not always the optimal solution, because the proportions of the target variable can differ drastically between splits. Let me introduce you to stratified sampling in Python.


During the development of a machine learning model, it is common to divide the dataset into training and test splits, and often a validation split as well, to obtain more representative results. However, there is something that can affect the quality of your predictions and is most often forgotten when generating these splits: the distribution of your target variable(s).

These dataset divisions are usually generated at random, without regard to the target variable. When doing so, the proportions of the target variable can differ between splits, especially in the case of small datasets. This means that we are training and evaluating on heterogeneous subgroups, which leads to prediction errors.

The solution is simple: stratified sampling. This technique consists of forcing the distribution of the target variable(s) to be the same across all splits. This small change means the model is trained on the same population on which it is evaluated, yielding better predictions.
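A minimal sketch of the idea, using scikit-learn's `train_test_split` with its `stratify` parameter on the Iris dataset (the 80/20 split ratio and random seed here are illustrative assumptions, not values from the article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris as a DataFrame: 150 rows, 50 per class in the "target" column
df = load_iris(as_frame=True).frame

# stratify=... forces each split to mirror the class distribution of the target
train, test = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# Both splits keep the original 1/3-per-class proportions
print(train["target"].value_counts(normalize=True))
print(test["target"].value_counts(normalize=True))
```

Without `stratify`, the same call would shuffle rows purely at random, and the per-split class proportions would be left to chance.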

Implementation

To illustrate the advantages of stratification, I will show the difference in the distribution of the target variable when dividing a dataset into training, test, and validation sets with and without stratified sampling. Because it is small enough to make the differences extreme, the Iris dataset will be used as input.

Note that in both cases the overall distribution of the target variable is the same: the Iris dataset contains 150 records, with 50 (33.3%) for each of the three species.

Traditional Split

Dividing the 150 records into training, test, and validation sets without considering the target variable produced noticeably different class proportions in each of the splits.

The smaller the data set, the greater the probability of finding such different proportions of the target variable among the different splits. As you can see in this example, the subpopulations are extremely different.
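To see this effect for yourself, you can reproduce a naive 60/20/20 split by shuffling and slicing the DataFrame; the exact sizes, seed, and column names below are illustrative assumptions:

```python
from sklearn.datasets import load_iris

# Iris as a DataFrame: 150 rows, 50 per class in the "target" column
df = load_iris(as_frame=True).frame

# Shuffle, then cut into 60/20/20 train/test/validation with no stratification
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
train = shuffled.iloc[:90]
test = shuffled.iloc[90:120]
val = shuffled.iloc[120:]

# On a dataset this small, the class proportions typically drift between splits
for name, part in [("train", train), ("test", test), ("val", val)]:
    print(name, part["target"].value_counts(normalize=True).round(2).to_dict())
```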

Stratified Split

On the other hand, when the target variable was taken into account and the data was grouped by it before generating the splits, every split ended up with the same distribution as the full dataset: one third per class.
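One way to produce such a three-way stratified split is to apply `train_test_split` twice, stratifying on the target at each step; the 60/20/20 ratios and seed below are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris as a DataFrame: 150 rows, 50 per class in the "target" column
df = load_iris(as_frame=True).frame

# First carve out a stratified 20% test set...
rest, test = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)
# ...then split the remainder 75/25 into train/validation (60/20 of the total),
# stratifying again so all three splits share the same class proportions
train, val = train_test_split(
    rest, test_size=0.25, stratify=rest["target"], random_state=42
)
```

Stratifying at every step is what keeps the guarantee: a single stratified cut followed by a plain random cut would reintroduce the drift in the second split.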

Python Code

This example is publicly available on Gist, where I provide a utility method get_dataset_partitions_pd that you can use to easily generate your stratified splits with a Pandas DataFrame.
