5 Must-Know Pandas Functions for Data Science



This article uses Kaggle’s house price prediction dataset. You can download it from here.

Let’s have a look at our data first.

import pandas as pd
df = pd.read_csv("House data.csv")
Data Screenshot

This is what our data looks like. Because this is house price prediction data, we have bedrooms, bathrooms, floors, and other features that can help us estimate the price of a house with given specifications.

Let’s now apply some pandas functions to this data.

1. count() Function

Let’s say you want to quickly check whether there are any null values in your table. The count() function returns the number of non-null cells in each column, so a column whose count is lower than the number of rows contains nulls.

Count Function

Great news, we have no null values in our dataset. So, let’s assign a null value and see the changes.

import numpy as np  # needed for np.nan
df.at[0, 'price'] = np.nan  # set the price in the first row to null
After Assigning null value

Now, if I check the count again, I get the result below.

Count with null
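Here is a minimal sketch of the same check on a tiny made-up table, in case you want to try it without the full dataset. The `toy` frame and its values are my own stand-in for the house data, not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the house price file
toy = pd.DataFrame({
    "price": [313000.0, 2384000.0, 342000.0],
    "bedrooms": [3, 5, 3],
})

before = toy.count()          # non-null cells per column
toy.at[0, "price"] = np.nan   # blank out one cell
after = toy.count()           # "price" now has one value fewer

print(before)
print(after)

# isna().sum() reports the number of missing cells per column directly
print(toy.isna().sum())
```

If you only care about how many values are missing, `isna().sum()` saves you from comparing the count against the number of rows.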

2. idxmin() and idxmax() functions

These functions return the index label of the row that holds the minimum or maximum value.

Let’s say you want to get the details of the house with the minimum price. There are many ways to do this by subsetting the data, but the most direct way is to use these functions.


Passing the label returned by idxmin() to df.loc gives me the details of the house with the minimum price.
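The lookup can be sketched like this. The three rows below are made up for illustration (only the Federal Way example mirrors the result discussed in this article), and the variable names are my own.

```python
import pandas as pd

# Hypothetical toy data; the real file has many more rows and columns
toy = pd.DataFrame({
    "price": [313000.0, 0.0, 2384000.0],
    "bedrooms": [3, 3, 5],
    "city": ["Shoreline", "Federal Way", "Seattle"],
})

# idxmin() returns the index label of the row with the smallest price
cheapest_label = toy["price"].idxmin()

# .loc turns that label back into the full row
cheapest_house = toy.loc[cheapest_label]
priciest_house = toy.loc[toy["price"].idxmax()]

print(cheapest_house)
print(priciest_house)
```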

So, we get a house with three bedrooms in the city of Federal Way at zero price. 😁

I know this is a data error, as we are playing with open-source dummy data. But I think you get the idea. 🙂 In the same way, we can use idxmax() to get the house with the maximum price.

What if you have more than one house with the minimum or maximum price? In that case, these functions return only the first occurrence. Later in this article, we will see how we can tackle this case. 😉
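To see the first-occurrence behaviour for yourself, here is a small sketch on made-up data with two zero-price rows. The boolean mask at the end is just one possible way to pull out every tied row. It is my own suggestion, and the city names are placeholders.

```python
import pandas as pd

# Hypothetical toy data with a tie for the minimum price
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "price": [0.0, 500000.0, 0.0, 750000.0],
})

# idxmin() reports only the first row that holds the minimum
first_label = toy["price"].idxmin()

# Comparing against min() keeps every row tied for the minimum
all_cheapest = toy[toy["price"] == toy["price"].min()]

print(first_label)
print(all_cheapest)
```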

3. cut() Function

Let’s say you have a variable with continuous values. But, as per your business understanding, this variable should be treated as a categorical variable.

The cut() function can help you bucket a continuous variable by splitting its range of values into bins and assigning each value to the bin it falls in.

In this data, I want to bucket the price column, as the price ranges from 0 to 26590000. Bucketing it can make decision making a bit easier. Passing an integer as the second argument asks for that many equal-width bins.

pd.cut(df["price"], 4)
Bucketing Data

You can also assign labels to each bucket as shown below.
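A minimal sketch of labelled buckets follows. The label names, the four sample prices, and the explicit bin edges in the second call are all my own assumptions for illustration. Pick whatever makes sense for your data.

```python
import pandas as pd

# Hypothetical sample of prices spanning the range mentioned above
prices = pd.Series([0, 250000, 900000, 26590000], name="price")

# Four equal-width bins, each given a readable (made-up) label
labels = ["low", "medium", "high", "very high"]
buckets = pd.cut(prices, bins=4, labels=labels)
print(buckets)

# Equal-width bins can be lopsided when one value is extreme,
# so explicit edges are an alternative (edges here are assumptions)
explicit = pd.cut(
    prices,
    bins=[0, 300000, 1000000, 30000000],
    labels=["low", "medium", "high"],
    include_lowest=True,  # keep the zero-price row in the first bin
)
print(explicit)
```

With equal-width bins, one very expensive house pushes almost everything else into the first bucket, which is why you may prefer explicit edges on skewed data like prices.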

Looks good, right? We can either replace the price column with this or create a new column.

4. pivot_table()

Anyone who works with Excel has probably used pivot tables. We can do the same with pandas.

Let’s say we want to find the average house price in each city for each number of bedrooms.

df.pivot_table(index="city", columns="bedrooms", values="price", aggfunc="mean")

Here you can find null values, because not every city has houses with every number of bedrooms. It depends on the data.
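Here is a small sketch of where those nulls come from, using made-up rows. The city names and prices are placeholders, and filling the gaps with 0 via `fill_value` is just one option you might choose, not something the original analysis does.

```python
import pandas as pd

# Hypothetical toy data: Kent has no two-bedroom house
toy = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Seattle", "Kent"],
    "bedrooms": [2, 2, 3, 3],
    "price": [400000.0, 600000.0, 800000.0, 300000.0],
})

# Mean price per city (rows) and bedroom count (columns)
table = toy.pivot_table(
    index="city", columns="bedrooms", values="price", aggfunc="mean"
)
print(table)  # the Kent / 2-bedroom cell is NaN

# fill_value replaces the missing combinations in the result
filled = toy.pivot_table(
    index="city", columns="bedrooms", values="price",
    aggfunc="mean", fill_value=0,
)
print(filled)
```

Be careful with filling: a 0 in the table then means "no such house", not "a free house", so leaving the NaN in place is often the more honest choice.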

5. nsmallest() and nlargest() functions

We have seen how we can use the idxmin() and idxmax() functions to get the minimum and maximum observations.

What if you want to get the three houses with the highest prices? In that case, these functions can save us time.

df.nlargest(3, "price")[["city","price"]]
df.nsmallest(3, "price")[["city","price"]]

Here we go! We now have three cities with houses that have zero price. 🙂
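If more rows are tied at the cut-off than you asked for, the `keep` parameter decides what happens. Below is a minimal sketch on made-up data with four zero-price rows. The city names are placeholders.

```python
import pandas as pd

# Hypothetical toy data with four houses tied at zero price
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E"],
    "price": [0.0, 0.0, 0.0, 0.0, 500000.0],
})

# Default keep="first": exactly three rows, ties beyond that are dropped
first_three = toy.nsmallest(3, "price")

# keep="all": every row tied with the last selected value is kept,
# so the result can have more than three rows
with_ties = toy.nsmallest(3, "price", keep="all")

print(first_three[["city", "price"]])
print(with_ties[["city", "price"]])
```

The same `keep` argument works with nlargest(), so this is also a way to handle the tie case that idxmin() and idxmax() cannot.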

