The Downsides of PandaSQL that No One Talks About
Pandas + SQL = PandaSQL = A big mess!
Both Structured Query Language (SQL) and Pandas are undoubtedly the go-to tools for Data Scientists for tabular data management, processing, and analysis.
While Pandas is a popular Python library for data analysis used by Data Scientists, SQL is an entire programming language of its own for interacting with databases, with applicability spanning various domains of computer science. One thing they have in common is that both are incredible tools for handling tabular data.
PandaSQL, a popular Python package, is a blend of both Pandas and SQL, allowing you to leverage the power of SQL syntax in a pythonic environment. As a result, PandaSQL empowers you to query Pandas DataFrames using SQL syntax, making it a big win for anyone unfamiliar or uncomfortable with Pandas syntax.
This post serves as an introduction to PandaSQL. I’ll demonstrate how to use PandaSQL with example queries. Towards the end, I’ll compare the run-time difference between Pandas and PandaSQL.
You can find the notebook for this article here.
Let’s begin 🚀!
Getting Started With PandaSQL
As mentioned above, PandaSQL is a Python library that gives you the flexibility to execute SQL queries over a Pandas DataFrame. The step-by-step process is demonstrated in detail below.
Step 1: Install PandaSQL
First and foremost, install the PandaSQL library using pip. You can find the instructions here, or install it with the command pip install pandasql.
Step 2: Import Requirements
Next, we import the library installed above, along with the other pre-installed libraries required in this project.
Step 3: Create a Dummy Dataset
To understand how to use SQL in a pythonic environment using PandaSQL, we shall create a dummy dataset of ten thousand employees. The image following the code below shows a glimpse of the dataset.
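The dataset-creation code isn’t reproduced here; a sketch that builds a comparable dummy dataset of ten thousand employees might look like the following. The column names and value ranges are assumptions for illustration, not the article’s exact schema.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # make the dummy data reproducible
n = 10_000

# Hypothetical schema: an ID, an organization label, and a salary
data = pd.DataFrame({
    "Employee_ID": np.arange(1, n + 1),
    "Organization": np.random.choice(["Org_A", "Org_B", "Org_C", "Org_D"], size=n),
    "Salary": np.random.randint(40_000, 120_000, size=n),
})

print(data.head())
```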
Step 4: Create the PandaSQL Environment
The next step is to set up the SQL environment, which is done as follows:
We create a lambda function that takes a query q along with the variables in the scope of the program, obtained via globals(), which returns a Python dictionary.
Essentially, SQL queries are written as strings in Python. For PandaSQL to interpret the table name and its column names specified in the string, we retrieve all the global variables using the globals() method and pass them to the defined lambda function. The data DataFrame defined in Step 3 can be referenced using globals() as follows:
Step 5: Run SQL Queries on Pandas DataFrame
Once you have set up the SQL environment, you can proceed with executing SQL queries. The SQL syntax used with PandaSQL is exactly the same as the traditional SQL syntax.
For example, let’s write a query to count the number of records in the DataFrame. This is demonstrated below:
Next, if you want to query the DataFrame based on a condition, you can do this using the WHERE clause in SQL, as shown below:
Lastly, if you want to obtain the number of records corresponding to every organization in the given data, you can do so using the GROUP BY clause, as shown below:
Similar to what I demonstrated in the above three sample queries, you can write any SQL query of your choice and execute it over the Pandas DataFrame using the PandaSQL library.
The caveats of PandaSQL
PandaSQL is undoubtedly an excellent alternative for someone from a SQL background who is unfamiliar with the pythonic syntax that Pandas demands. For instance, filtering a DataFrame, or grouping it on a particular column while aggregating multiple columns, might feel a little overwhelming or confusing in Pandas. But these operations are easy to implement and interpret in SQL.
However, this ease of leveraging SQL in Pandas comes at a substantial run-time cost. To understand this better, let’s compare the run-time of PandaSQL and Pandas. We shall write a query in SQL and then compare its run-time with that of the corresponding Pandas implementation. If you are unfamiliar with translating fundamental SQL queries into Pandas operations, I would recommend reading my post below:
Syntactical Caveat
Python already reserves numerous names as keywords, such as as, in, for, etc. Introducing SQL adds further restrictions around naming: SQL keywords such as having can no longer be used as names for the DataFrames you query. I have implemented such a scenario below:
Execution Run-time Caveat
While the syntactical caveat can be handled for once by altering the variable names, the overhead introduced in terms of run-time is profoundly concerning. Consider the queries below to understand better.
First, let’s compare the run-time of computing the number of records in the Dataframe. This is demonstrated below using both PandaSQL and Pandas:
To count the number of rows, PandaSQL takes close to 10⁷ times as long as Pandas.
Next, let’s compare the run-time to find the number of records corresponding to each organization in the given DataFrame. This is implemented below:
Once again, Pandas outperforms PandaSQL by a considerable margin. In this case, Pandas is over 70 times faster than PandaSQL.
Now, I will demonstrate the difference in run-time for filtering the dataframe based on a condition using PandaSQL and Pandas. The SQL query and its corresponding Pandas method are implemented below:
PandaSQL takes over 160 times more time for the filtering operation compared to the Pandas-based method.
Lastly, let’s also consider the run-time differences for joins. We shall implement a self-join on the given dataframe. This is demonstrated below:
Yet again, Pandas is the clear winner, outperforming PandaSQL by 27 times.