How to Work with Million-row Datasets Like a Pro




Reduce the memory size

Next, we have memory issues. Even a 200k-row dataset can exhaust 16 GB of RAM during complex computations.

I have experienced this first-hand twice in last month’s TPS competition on Kaggle. The first time was while projecting the training data to 2D with UMAP: I ran out of RAM. The second was while computing SHAP values with XGBoost on the test set: I ran out of GPU VRAM. What is shocking is that the training and test sets had only 250k and 150k rows with about a hundred features, and I was using Kaggle kernels.

The dataset we are using today has ~960k rows and 120 features, so memory issues are much more likely.

Calling the memory_usage method on a DataFrame with deep=True gives an exact count of how much RAM each feature consumes: roughly 7 MB per column here, and close to 1 GB overall.
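That check looks roughly like the snippet below, assuming the data is already loaded into a DataFrame called df (the variable name and file path are illustrative):

import pandas as pd

# Assumed: the competition data is already loaded, e.g.
# df = pd.read_csv("train.csv")  # path is illustrative

# Per-column memory usage in bytes; deep=True also counts the memory
# held by object (string) columns, not just the pointers to them.
per_column = df.memory_usage(deep=True)

# Convert to megabytes for readability and report the total.
print((per_column / 1024 ** 2).round(2))
print(f"Total: {per_column.sum() / 1024 ** 3:.2f} GB")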

Now, there are certain tricks you can use to decrease memory usage by up to 90%. They mostly come down to changing the data type of each feature to the smallest subtype possible.

Python represents data with a handful of built-in types such as int, float, and str. Pandas, in contrast, builds on NumPy and offers several sized alternatives for each of them:

Source: http://pbpython.com/pandas_dtypes.html

The number next to each dtype refers to how many bits of memory a single value consumes in that format. To reduce memory as much as possible, choose the smallest NumPy subtype that still fits your data. Here is a good table to understand the ranges:

Source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html
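If the table is not at hand, NumPy can report the representable range of every subtype directly; here is a quick sketch:

import numpy as np

# Value ranges of the signed integer subtypes.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{dtype.__name__:>6}: {info.min} to {info.max}")

# Unsigned integers start at zero, which doubles the positive range.
for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
    info = np.iinfo(dtype)
    print(f"{dtype.__name__:>7}: {info.min} to {info.max}")

# Floats trade range for precision; finfo reports the limits.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__:>8}: up to about {info.max:.3g}")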

In the table above, uint refers to unsigned integers, which hold only non-negative values but cover twice the positive range for the same number of bits. I have found a handy function that reduces the memory of a pandas DataFrame based on that table (shout-out to this Kaggle kernel):
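The kernel’s exact code isn’t reproduced here, so the following is a minimal sketch in the same spirit, built only on standard pandas/NumPy calls; the original may differ in detail:

import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast each numeric column to the smallest subtype that fits its values."""
    start_mb = df.memory_usage(deep=True).sum() / 1024 ** 2

    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            c_min, c_max = df[col].min(), df[col].max()
            # Walk the signed integer subtypes from smallest to largest.
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(int_type).min <= c_min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break
        elif pd.api.types.is_float_dtype(df[col]):
            c_min, c_max = df[col].min(), df[col].max()
            # Walk the float subtypes; float16 can be too lossy for modelling,
            # so some variants of this trick start at float32 instead.
            for float_type in (np.float16, np.float32, np.float64):
                if np.finfo(float_type).min <= c_min and c_max <= np.finfo(float_type).max:
                    df[col] = df[col].astype(float_type)
                    break
        # Non-numeric columns (object, category, datetime, ...) are left untouched.

    end_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"Memory usage: {start_mb:.1f} MB -> {end_mb:.1f} MB "
          f"({100 * (start_mb - end_mb) / start_mb:.0f}% reduction)")
    return df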

Based on the minimum and maximum values of each numeric column and the ranges in that table, the function converts the column to the smallest subtype that can hold it. Let’s use it on our data:
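Applied to our DataFrame, the call would look something like this (the df variable and the printed numbers are illustrative):

# Downcast the columns and keep the returned DataFrame.
df = reduce_mem_usage(df)

# Confirm the new per-column dtypes and the overall footprint.
print(df.dtypes.value_counts())
print(f"{df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB")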

A 70% memory reduction is pretty impressive. However, note that reducing memory won’t speed up computation in most cases, so if memory size is not an issue, you can skip this step.

Regarding non-numeric data types, never use the object dtype in pandas, as it consumes the most memory. Use the string dtype instead, or category if the feature has few unique values. In fact, the pd.Categorical data type can speed things up by as much as 10x when paired with LightGBM’s built-in categorical handling; see the sketch below.
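Here is a minimal sketch of that memory comparison; the values are made up, and the actual speed-up depends on the data and on how LightGBM is configured:

import pandas as pd

# The same low-cardinality text column stored three different ways.
s_object = pd.Series(["A", "B", "C"] * 100_000)       # object dtype (pandas default for text)
s_string = s_object.astype("string")                  # dedicated pandas string dtype
s_category = s_object.astype("category")              # integer codes + small lookup table

for name, s in [("object", s_object), ("string", s_string), ("category", s_category)]:
    print(f"{name:>9}: {s.memory_usage(deep=True) / 1024 ** 2:.2f} MB")

# category is typically far smaller here, and LightGBM can consume pandas
# categorical columns directly through its categorical-feature handling.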

For other data types like datetime or timedelta, use the native formats offered in pandas since they enable special manipulation functions.
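As a small illustration of what those native formats buy you (the column names and timestamps below are made up):

import pandas as pd

# Parsing strings into the native datetime64[ns] dtype unlocks the .dt accessor.
df = pd.DataFrame({"pickup": ["2021-08-01 14:05:00", "2021-08-02 09:30:00"],
                   "dropoff": ["2021-08-01 14:25:00", "2021-08-02 10:10:00"]})

df["pickup"] = pd.to_datetime(df["pickup"])
df["dropoff"] = pd.to_datetime(df["dropoff"])

# Special manipulation functions: calendar attributes and timedelta arithmetic.
df["hour"] = df["pickup"].dt.hour
df["trip_minutes"] = (df["dropoff"] - df["pickup"]).dt.total_seconds() / 60

print(df)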
