Knowing Pandas But Not SQL? Understand Common SQL Usages From Pandas Operations

1. Select Data Based on Specific Criteria

In pandas, we select the rows that meet certain criteria using the following approach. Suppose that we want to select the cars whose origin is the USA and whose weight is more than 4900 pounds.

# Pandas
df[(df['origin'] == 'usa') & (df['weight'] > 4900)]

The corresponding SQL statement is below.

select * 
from mpgtbl
where origin = 'usa' and weight > 4900
  • select *: select is the keyword that starts the query, and the asterisk means all the columns. If you want only some columns, you can simply list them out: select mpg, horsepower (see the sketch after this list).
  • from the_table: specify the table from which you select data.
  • where: specify the selection criteria. Please note that the comparison operator is simply =, not ==, and the logical AND operator is written and.
  • Strings: it’s usually a good idea to use single quotes for strings, although some databases do support double quotes.
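Putting these together, here is a minimal sketch that combines a column list with a filter (the chosen columns are arbitrary here):

# Pandas
df.loc[df['origin'] == 'usa', ['mpg', 'horsepower']]
# SQL
select mpg, horsepower
from mpgtbl
where origin = 'usa'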

2. Select Top 3 Rows

In pandas, we can use the head method to select the top N rows. The corresponding SQL statement uses the limit keyword, which specifies the top N rows that match the criteria, if any are defined.

# Pandas
df.head(3)
# SQL
select *
from mpgtbl
limit 3

What if you want to select rows 4–6? Here's a possible solution for each.

# Pandas
df.iloc[3:6, :]
# SQL
select *
from mpgtbl
limit 3
offset 3;

In pandas, we can use the iloc property for data selection. Please note that iloc uses integer positions, which start at 0, so to get rows 4–6, we specify 3:6, which stops before position 6.

In SQL, we simply supply an additional restriction, an offset of 3, which skips the first three rows so that limit 3 returns rows 4–6.

3. Select The Unique Values

Pandas has the built-in unique method when it comes to selecting the distinct values of a particular column.

# Pandas
df['origin'].unique()
# SQL
select distinct origin
from mpgtbl

In SQL, we will simply place the distinct keyword before the column or the column list.
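When distinct is applied to multiple columns, it returns the distinct combinations of those columns. A possible pandas counterpart (the column choice is illustrative) uses drop_duplicates:

# Pandas
df[['origin', 'model_year']].drop_duplicates()
# SQL
select distinct origin, model_year
from mpgtbl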

4. Sort Rows

We use sort_values to sort the rows in pandas. In SQL, we use the order by clause instead. By default, both sort values in ascending order. When you need to sort data in descending order, each offers its own option.

# Pandas
df.sort_values(['mpg', 'weight'], ascending=[False, True])
# SQL
select *
from mpgtbl
order by mpg desc, weight

In SQL, when you need to reverse the order for a column, you add desc after the column name.

5. Membership Checking

This operation is similar between pandas and SQL, and both involve the use of the in-related keyword in a certain form.

# Pandas
df[df['origin'].isin(['japan', 'europe'])]
df[~df['origin'].isin(['usa'])]
# SQL
select *
from mpgtbl
where origin in ('japan', 'europe');
select *
from mpgtbl
where origin not in ('usa')
  • Pandas uses the isin method to determine whether the row’s value is contained in the list of items. To negate the boolean value, simply use the ~ sign preceding the expression.
  • SQL uses in or not in for membership checking. Instead of square brackets, SQL uses a pair of parentheses to list the items.
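As a side note, pandas also offers the query method, which reads closer to SQL. A small sketch of the same membership checks:

# Pandas
df.query("origin in ['japan', 'europe']")
df.query("origin not in ['usa']")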

6. Frequency Table by Group

We often need to know the frequency of items grouped by one or multiple columns. Here's how you do it in both.

# Pandas
df.groupby('origin').size()
df.groupby(['origin', 'model_year']).size()
# SQL
select origin, count(*)
from mpgtbl
group by origin;

select origin, model_year, count(*)
from mpgtbl
group by origin, model_year
  • The groupby function creates a GroupBy object, with which the size method will count the number of rows for each group.
  • In SQL, the count function counts the number of records. When the data are grouped, it counts each group separately. The key syntax here is group by col1, col2, col3….
  • In both, you can specify multiple columns.
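A pandas shortcut for the same frequency tables is value_counts, which, unlike size, sorts the groups by count in descending order (the DataFrame form with a column list requires a recent pandas version):

# Pandas
df['origin'].value_counts()
df.value_counts(['origin', 'model_year'])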

7. Create Table for the Frequency

It's a follow-up to the previous operation. Suppose that we want to create a data table for long-term storage. In pandas, we create a new DataFrame. To introduce a new feature, let's suppose that we want to give the frequency count column the name car_count.

# Pandas
>>> df_summary = df.groupby('origin', as_index=False).size().rename(columns={"size": "car_count"})
>>> df_summary
   origin  car_count
0  europe         70
1   japan         79
2     usa        249

As shown above, this operation sets the as_index parameter to False in groupby, so that origin remains a regular column instead of becoming the index.
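An equivalent sketch without as_index uses reset_index, whose name parameter names the count column directly:

# Pandas
df.groupby('origin').size().reset_index(name='car_count')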

Let’s see how we do it in SQL.

# SQL
create table car_origin as
select origin, count(*) as car_count
from mpgtbl
group by origin;
  • The syntax for creating a new table is create table tbl_name as the_select_stmt. In essence, the selected data from the statement will be saved as the data in the newly created table.
  • In the previous section, we already used the count function; what makes this operation different is giving a name to the count field, with the syntax count(*) as the_name.
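On the pandas side, if you want the same long-term storage in a database, to_sql is one option. A minimal sketch, assuming engine is a SQLAlchemy engine you have already created (hypothetical here):

# Pandas
# `engine` is an assumed SQLAlchemy engine, e.g. create_engine("sqlite:///cars.db")
df_summary.to_sql('car_origin', engine, index=False)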

8. Overall Aggregation Operations

Besides the frequency, there are several aggregation operations that we perform with our datasets, such as max, min, and mean.

# Pandas
df.agg({"mpg": ["mean", "min", "max"]})

If you want to apply these aggregation functions to other columns, you can add more key-value pairs to the dictionary parameter. The corresponding SQL operation is shown below.

# SQL
select avg(mpg), min(mpg), max(mpg) from mpgtbl

If you want to give names to these calculated fields, you can use the calculated_field as the_name format, as shown below.
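A possible sketch that aggregates a second column and names every SQL field (the weight column is an arbitrary choice):

# Pandas
df.agg({"mpg": ["mean"], "weight": ["min", "max"]})
# SQL
select avg(mpg) as mean_mpg, min(weight) as min_weight, max(weight) as max_weight
from mpgtbl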

9. Aggregation Operations By Group

Extending the previous operation, we often need to create aggregations by group. In pandas, this is achieved by calling agg on the GroupBy object returned by groupby, as shown below.

# Pandas
df.groupby('origin').agg(
    mean_mpg=("mpg", "mean"),
    min_hp=("horsepower", "min")
)

The above operation uses named aggregation, which specifies not only the column but also the function you want to apply. Certainly, as mentioned at the beginning, there are alternative ways to create aggregations; one is sketched below, and you can find more operations with the GroupBy object in my previous article.
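For instance, a dict-based alternative produces a similar result, though you give up control over the output column names:

# Pandas
df.groupby('origin').agg({"mpg": "mean", "horsepower": "min"})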

When you need aggregations in SQL, its syntax looks a little bit cleaner to me.

# SQL
select avg(mpg) as mean_mpg, min(horsepower) as min_hp
from mpgtbl
group by origin

So far, everything should be straightforward.

10. Join Tables

When we have multiple DataFrames and want to merge them, we can use the merge method in pandas, as shown below. Please note that df_summary was created in #7 and has two columns: origin and car_count.

df.merge(df_summary, on="origin")

In SQL, we can have pretty much similar logic for this join.

select *
from mpgtbl
join car_origin using (origin)

The new keywords here are join and using. join connects the two tables to be joined, while using specifies the joining column. Please note that this syntax applies when both tables use the same column name for joining.

In a more general situation, the tables may have different columns for joining. Here’s what you can do in both approaches.

# Pandas
df.merge(another_df, left_on="colx", right_on="coly")
# SQL
select *
from tbl1 x
join tbl2 y on x.col1 = y.col2

Instead of using, you now use on to specify the columns used for joining in SQL. Another thing to note is that you can create an alias, a shortened reference to a table, for easier referencing. In the example, we call the two tables x and y, respectively.

By default, the join is the inner kind. There are other kinds of joins available in SQL, too. Here’s a useful reference you can start with.
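For example, a left join keeps every row of the left table even when there is no match on the right. A sketch of the same join as above, made left-flavored:

# Pandas
df.merge(df_summary, on="origin", how="left")
# SQL
select *
from mpgtbl
left join car_origin using (origin)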

11. Update Record

When you need to update a record for a specific column, you can do the following in pandas. Suppose that the correct weight for “ford torino” should be 3459 instead of 3449.

>>> df.loc[df["name"] == "ford torino", "weight"]
4    3449
Name: weight, dtype: int64
>>> df.loc[df["name"] == "ford torino", "weight"] = 3459
>>> df.loc[df["name"] == "ford torino", "weight"]
4    3459
Name: weight, dtype: int64

We use the loc property to access the specific cell and update the record successfully. SQL has a similar logic, and a possible solution is shown below.

update mpgtbl
set weight = 3459
where name = 'ford torino'

The new keywords here are update and set. The general syntax is below:

update tbl_name
set col1 = val1, col2 = val2, ...
where the_condition
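To mirror the multi-column form of set in pandas, you can pass a list of columns and values to loc. A sketch with made-up values (the horsepower figure is purely illustrative):

# Pandas (values are hypothetical, for illustration only)
df.loc[df["name"] == "ford torino", ["weight", "horsepower"]] = [3459, 140]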

12. Insert Record(s)

When you need to add a record to an existing DataFrame, there are multiple ways to do it. Here's a possible solution. For simplicity, let's suppose that you want to add two more records, and these records are simply the last two rows of the existing DataFrame object.

new_records = df.tail(2)
pd.concat([df, new_records]).reset_index(drop=True)

When you insert new records, you can make them another DataFrame object and then use the concat function to concatenate the DataFrame objects. Here's how you do the same operation in SQL.

insert into mpgtbl
select *
from mpgtbl order by "index" desc limit 2
  • We use the insert into syntax to add new records to an existing table.
  • The select portion is simply a way to extract the last two records in the table. Please note that the table includes a column named index, which happens to be a SQL keyword, so we use double quotes to qualify it as a column identifier. As a side note, I prefer non-keywords for any names, just to avoid hassles like that.

More often, we want to add individual data records instead of records that are extracted from the existing table. In SQL, a more common insertion operation is like below.

insert into tbl_name (col_list)
values (value_list1), (value_list2), ...

Specifically, we supply the values for each record following the values keyword. Please note that if you provide values for all the columns, you can omit the (col_list) part.
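The pandas counterpart of inserting literal records is to build a small DataFrame and concatenate it. A sketch with made-up column values (only a few columns shown; the rest would become NaN):

# Pandas (the record's values are hypothetical)
new_row = pd.DataFrame([{"name": "my new car", "origin": "usa", "weight": 3000}])
df = pd.concat([df, new_row], ignore_index=True)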

13. Delete Records

The dataset may contain records that we don't need. In most cases, we drop the records that meet specific criteria. The drop method is perfect for this in pandas. A possible solution is shown below.

df.drop(df[df['name'] == 'ford torino'].index)

The SQL solution involves the delete keyword. Please be extremely cautious with the condition, because if you leave it out, everything will be deleted from the table.

delete from mpgtbl
where name = 'ford torino'
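Back in pandas, plain boolean filtering is a common alternative to drop; you simply keep the rows that don't match the condition:

# Pandas
df = df[df["name"] != "ford torino"]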

14. Add Columns

Sometimes we need to add new columns to an existing dataset. In pandas, we can do something like the following. Please note that it's just a trivial example: we know the dataset only contains years in the 20th century, so we simply add 1900 to make the years 4-digit numbers.

df['full_year'] = df['model_year'] + 1900

To do the same operation in SQL, we need to introduce a new keyword, alter, which modifies the table. Let's see it.

alter table mpgtbl
add full_year as (model_year + 1900)

We just need to specify the new column following the add keyword and include the calculation. If you want the computed column physically stored in the database, you can add the persisted keyword after the calculation. Note that computed-column syntax varies by database; the add ... as (...) persisted form shown here is SQL Server's.
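For comparison, a sketch of the PostgreSQL flavor, which calls these generated columns and requires the stored keyword when adding one via alter:

alter table mpgtbl
add column full_year int generated always as (model_year + 1900) stored;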

15. Remove Columns

As you may know, in pandas, the drop method works not only on rows but also on columns. The following code snippet shows you how to drop columns in pandas.

df.drop(columns=["mpg", "horsepower"])

Similar to adding a column, removing a column is a kind of alteration of the existing table. Thus, as you may guess, this operation uses alter.

alter table mpgtbl drop column name;
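To mirror the pandas snippet, which removes two columns, the most portable approach is one statement per column (some databases also accept several drop clauses in a single alter):

alter table mpgtbl drop column mpg;
alter table mpgtbl drop column horsepower;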

16. Concatenate Datasets

In a previous section, when we learned about inserting records, we used the concat function in pandas. As you know, this function is more often used to concatenate datasets that share the same columns.

pd.concat([df, df])

In SQL, this operation involves the union keyword, as shown below.

select * from mpgtbl
union all
select * from mpgtbl

Please note that you can optionally leave out all, in which case only the distinct records across the combined results will be kept.
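If you want the pandas counterpart of union without all, dropping duplicates after concatenating achieves it; a sketch:

# Pandas (equivalent of union: concatenate, then remove duplicate rows)
pd.concat([df, df]).drop_duplicates().reset_index(drop=True)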
