5 Common Excel Tasks Simplified with Python

https://miro.medium.com/max/1200/0*KrMU6jKdQ6_fOYXP

Original Source Here

5 Common Excel Tasks Simplified with Python

Using Pandas, OS, and the datetime module to simplify Excel tasks with Python.

Photo by Mélody P on Unsplash

One of my first jobs involved doing simple Excel tasks that at scale got tedious and time-consuming. I could do all that stuff manually and become an expert in doing mundane tasks, but I decided to learn Python to simplify them instead.

Excel is a good tool for working with data but has some limitations when it comes to working with hundreds of files that need to be modified. Here’s where Python can help.

In this article, we will simplify 5 Excel tasks that I (and probably you) often had to do at work.

First things first

Install The Libraries

To simplify these Excel tasks with Python I will use Pandas, OS, and datetime. The last two libraries come by default with Python, while Pandas need to be installed. To do so, write the following command in your terminal.

pip install pandas

In this article, I will assume you know how to import and export an Excel file using Pandas. But in case you don’t know how to it, here’s how:

import pandas as pd# Read an Excel file and save it to a dataframe
df = pd.read_excel(excel_file_path)
... # (all the operations we'll do in this article)# Export dataframe to a new Excel file
df.to_excel(file_path, index=False)

The Data

For this guide, I’m using 2 Excel files that are located in the same directory my Python script is. The spreadsheets contain dummy data that looks like this.

The data isn’t so relevant. The main idea is that all the data is divided into multiple chunks (Book1, Book2, Book3, …) and our task is to perform operations to all of them.

You can create your own dummy data, use your own Excel files or download these two files available on my Github. Put those files in the same directory your Python script is located.

1) Concatenate Multiple Excel Files with Python

Our first task is to combine all these chunks in one Excel file (say we want to share all the data with our boss).

If there are 2 Excel files, doing this will be as easy as copying the data in one file and pasting it into the other file; however, doing this with hundreds of files becomes impractical.

A simple solution to this will be using Pandas pd.concat().

import os
import pandas as pd
my_list = []
for file in os.listdir():
if file.endswith('.xlsx'):
df = pd.read_excel(file)
my_list.append(df)

df_concat = pd.concat(my_list, axis=0)
df_concat.to_excel('concatenate.xlsx', index=False)

First, we list all the files in the current directory with os.listdir(). Then we filter files out with the .endswith(‘.xlsx’) method.

All the Excel files will be stored into a list called my_list and then combined with pd.concat() that return a dataframe we name df_concat. This dataframe is exported to an Excel file.

Note that you can use read_csv() and to_csv(), in case you’re working with comma-separated values (CSV files)

2) Change Name or Extension of Multiple CSV Files with Python

CSV is a common file extension that is used for storing and reading data. It can be opened with Excel and is widely used at work.

I often had the task to rename CSV files so anyone could easily identify them (for example, adding the date at the beginning of the name) and also had to convert .csv files to .txt files and vice-versa,

Both are simple tasks, but they become a headache when working with hundreds of files. Fortunately, I managed to do these tasks in a couple of seconds using the .split() and os.rename() methods as shown in the code below.

import os
import datetime
# 1.Getting the current date in YYYYMMDD format
today = datetime.date.today()
today_str = today.strftime('%Y%m%d')
# 2.Adding date and changing extension to CSV files within directory
for file in os.listdir():
if file.endswith('.csv'):
file_name, file_extension = file.split('.')
new_file_name = f'{today_str}_{file_name}'
new_extension = '.txt'
new_name = f'{new_file_name}{new_extension}'
os.rename(file, new_name)

First, we have to use the datetime module to obtain the current date and format it as we want (in this case year, month, day).

Then we filter out files other than CSV and use the .split('.') to split the file name every time it reaches a dot sign (.), so we separate the extension from the name of the file. We concatenate names with the f-string (f’’) and rename the file with os.rename(). This takes the current file as a first argument and the new name as second argument

3) Using Excel’s String Functions LEFT-RIGHT in Python

Imagine you have to add the country code to each phone number of the “Number” column in the spreadsheet we’ve seen before.

We could execute Excel’s LEFT function to each file, but that could take a lot of time. Instead, we can use a for loop, read each Excel file into a Pandas dataframe and then use lambda and f-string to add the country code to each phone number.

import os
import pandas as pd
for file in os.listdir():
if file.endswith('.xlsx'):
file_name = file.split('.')[0]
df = pd.read_excel(file)
df = df.astype({'Number': str})
df['Number'] = df['Number'].apply(lambda x: f'44{x}')
df.to_excel(f'country_code_{file_name}.xlsx', index=False)

If we use theapply andlambda functions together, we can modify the value inside a column. In this case, I added the “44” code at the beginning of each value within the column “Number.” Note that we have to make sure the values in the column “Number” are string (str), so we used the .astype() method to change data types.

Finally, we exported the dataframe to a new Excel file that contains the word “country_code” in the name.

4) Remove Duplicates Within a Column and Drop NaN Values

Some basic data cleaning performed in Excel like dropping duplicates and NaN values can be easily done with Pandas.

You can easily get rid of duplicate rows based on all columns using .drop_duplicates().

df = df.drop_duplicates()

But in case you want to remove duplicates on a specific column add the subset argument. Let’s say we want to drop duplicated within the “Number” column.

df = df.drop_duplicates(subset=['Number'])

On the other hand, dropping NaN values can be achieved with .dropna()

df = df.dropna(subset=[‘column_name’])

You can also replace null values instead of dropping the whole row. Let’s say we want to replace null values with zero.

df = df.fillna(0)

In addition to that, it’s possible to fill those null values with values above (forward filling) or below (backward filling) a row.

# Forward filling
df.fillna(method='ffill')
# Backwards filling
df.fillna(method='bfill')

5) Sort Values

Sorting a Pandas dataframe is as easy as sorting an Excel spreadsheet. We only have to use the .sort_values() and specify either ascending or descending (ascending=True, ascending=False) and also the column(s) to sort by.

df = df.sort_values(ascending=True, by=['Year'])

In case we want to save the changes without having to set the value equal to df as shown above, we can use the inplace argument

df.sort_values(ascending=True, by=['Number'], inplace=True)

AI/ML

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: