Never Worry About Optimization. Process GBs of Tabular Data 25x Faster With No-Code Pandas

https://miro.medium.com/max/1200/0*hVH1wGrK8QKXX55p

Original Source Here

Never Worry About Optimization. Process GBs of Tabular Data 25x Faster With No-Code Pandas

Photo by freestocks on Unsplash

Motivation

Pandas makes the tasks of analyzing tabular datasets an absolute breeze. The sleek API design offers a wide range of functionalities that covers almost every tabular data use case.

However, it’s only when someone transitions towards scale that they experience the profound limitations of Pandas. I have talked about this before in the blog below:

In a gist, almost all limitations of Pandas arise from its single-core computational framework.

In other words, even if your CPU has multiple cores available (or idle), Pandas always relies on a single core, which inhibits its performance.

Moreover, Pandas DataFrames are inherently bulky. Pandas never optimizes the datatypes of DataFrame’s columns. Thus, if you have to deal with a large dataset (with size in GBs):

  1. You should only load a chunk of data in memory at a time, process it, discard it and load the next chunk, or,
  2. If, for some reason, you need the whole dataset, then you should adjust the data types to fit it into memory.

Also, in many cases, data is often split across multiple CSV files. Thus, if you want to analyze the entire data together, one has to merge the data of multiple CSV files and then process it.

There are a few more ways which I have discussed in one of my previous blogs:

Unfortunately, you are out of luck if you need the whole dataset in memory, but despite optimizing the memory usage, you are not able to load it.

Even if you somehow succeed in loading the dataset, there’s no guarantee that you will be able to process it.

As intermediate computations also require memory, this may eventually lead to the following:

Kernel Died (Image by Author)

Back to square one.

I am sure many of you have been there. I know it is frustrating, especially when your objective is to conduct a quick analysis, such as plotting some bar graphs, estimating group statistics, etc.

Also, honestly speaking, unless you are building a data science pipeline intended to serve an end-user, why should a Data Scientist spend hours optimizing the memory? This, in my opinion, inhibits their productivity.

Solution

I hope by this point, you understand that the problem is real.

To this end, let’s discuss a potential solution — Gigasheet, which in my opinion, is a revolutionary alternative to Pandas, not just for large datasets but for small datasets too.

I covered a detailed overview of transitioning from Pandas to “No-code Pandas” using Gigasheet in the blog below.

The focus of this blog is, however, slightly different.

In this blog, I will briefly demonstrate how challenging it is to work with large datasets in Pandas and how Gigasheet makes it a piece of cake.

Additionally, the two biggest advantages that come with a tool like Gigasheet are:

  1. You don’t have to write any code.
  2. You don’t have to worry about run-time or memory optimization.

That’s amazing. Let’s begin 🚀!

Dataset

For demonstration purposes, I will use a dummy dataset created using Faker. As shown below, it takes up a little over 9 GBs of disk space.

Dummy Dataset Size (Image by Author)

The first five rows are shown below:

First five rows of the Dummy Dataset (Gif by Author)

Also, as shown below, the dataset has just over 30M rows and 25 columns.

Rows and Columns as shown in Gigasheet Dashboard (Image by Author)

Pandas vs. Gigasheet

Next, let’s perform some common operations in Pandas, measure their run-time, and compare that to Gigasheet.

As Gigasheet is a no-code tool, measuring the run-time is a bit challenging. Therefore, for comparison, I shall use the duration of the screen recording uploaded here to approximate the run-time of Gigasheet.

#1 Filtering

Filtering is a common operation performed during tabular data analysis. Let’s filter the DataFrames on the Country_Code column and select all rows where the value is MY.

Pandas

Gigasheet

The timer starts right after selecting “Apply.” (Gif by Author)

Filtering in Gigasheet took just over two seconds. Compared to Pandas, which took close to 25 seconds, Gigasheet is ~92% faster.

Pandas vs. Gigasheet Scorecard (Image by Author)

#2 Sorting

Next, let’s sort the DataFrame on the BBAN column, in ascending order.

Pandas

Here, we can use the sort_values() method.

Gigasheet

The steps are demonstrated below:

The timer starts right after selecting “Sort Sheet.” (Gif by Author)

Sorting in Gigasheet took a little over 5 seconds. On the other hand, Pandas took close to 2 minutes — a performance gain of 96%.

Pandas vs. Gigasheet Scorecard (Image by Author)

#3 Grouping

Next, let’s group the DataFrame on the Country column and find a couple of things:

  1. The number of records in each group.
  2. The number of unique Zip_Codes.

Pandas

Recalling what we discussed in the Introduction:

Even if you somehow succeed in loading the dataset, there’s no guarantee that you will be able to process it.

→ Gigasheet

Grouping in Gigasheet (Gif by Author)

Gigasheet computes the required aggregations without any hassle.

Pandas vs. Gigasheet Scorecard (Image by Author)

Yet again, Gigasheet emerges as the clear winner.

Conclusion

In this blog, I demonstrated how you can leverage Gigasheet and perform three typical operations on a large dataset.

While in Pandas, performing these operations did take a fair amount of time to execute. The same operations in Gigasheet, however, were highly efficient.

In fact, with Pandas, we never obtained the results for the grouping operation as it crashed the Jupyter Kernel. On the other hand, Gigasheet, doing justice to its name, handled everything with ease.

Before wrapping up, one point that I wish to add is that through this article, I am not claiming that Gigasheet is (or will be) the ultimate replacement for Pandas.

One can do many more things with Pandas or Excel, which aren’t possible with Gigasheet (yet). However, besides being no-code, its ability to easily handle large datasets makes it an absolute winner in such situations.

Also, the biggest issue with Excel is its max row limit. This inhibits working on large-scale data analytics projects, which Excel does not support.

In fact, even if you are working with moderately sized datasets, Gigasheet can come in handy there too. I wrote a detailed article recently on the transition from Pandas to Gigasheet, which you can read here:

As always, thanks for reading!

AI/ML

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: