Data extraction involves pulling data from different sources and converting it into a useful format for further processing or analysis. It is the first step of the Extract-Transform-Load (ETL) pipeline in the data engineering process.
Data is not always available as a CSV file. Sometimes you will need to combine data that is available in multiple file formats such as JSON, XML, and SQL.
In this tutorial, we will use Python libraries such as pandas, json, and requests to read data from different sources and load it into a Jupyter notebook as pandas dataframes.
1. CSV files
This refers to a ‘comma-separated values’ file that is used to store data in a tabular format, similar to a spreadsheet. Each line in the file is an observation (or record), and each record has one or more features separated by commas.
Often, but not always, the first line contains the feature names.
We will use the math results data from the data.gov website.
Before we create a pandas dataframe from the data, it is important to first peek into the file using Python’s built-in file handling. The following code displays the first 5 lines of the downloaded file.
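A minimal sketch of such a peek, written as a small helper (the local filename in the usage example is illustrative, assuming the file was saved from the link below):

```python
def head(path, n=5):
    """Return the first n lines of a text file (empty strings past EOF)."""
    with open(path) as f:
        return [next(f, '') for _ in range(n)]

# Example usage (assumes the CSV was downloaded locally):
# for line in head('math_results.csv'):
#     print(line, end='')
```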
What to look out for:
- Whether we will need extra arguments with pd.read_csv(), such as a different separator (sep)
- If the first line contains the variable names. If not, we would use header=None
- If there are any empty rows at the top we need to skip using skiprows
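These checks map directly onto pd.read_csv arguments. A small sketch with invented sample data, showing skiprows and header=None together:

```python
import pandas as pd
from io import StringIO

# A file with two junk lines at the top and no header row (sample is invented).
raw = "downloaded from example\nlast updated 2021\n1,3,1000\n2,4,900\n"

# skiprows drops the leading lines; header=None + names supplies column names.
df = pd.read_csv(StringIO(raw), skiprows=2, header=None,
                 names=['district', 'grade', 'number_tested'])
```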
The code below downloads the file directly from the website and creates a Pandas dataframe.
import pandas as pd

file_link = 'https://data.cityofnewyork.us/api/views/7yig-nj52/rows.csv'
df_csv = pd.read_csv(file_link)
2. XML data
Extensible Markup Language, or XML, is a markup language very similar to HTML. You can recognize XML-formatted data because it is marked up with unique tags specific to the data they represent, unlike the pre-defined tags of HTML.
We will continue using the math results data from the data.gov website which provides an option to download the data in an XML format.
Running the code below in a Jupyter notebook will print the first 20 lines.
To read XML data, we use a Python library called BeautifulSoup. To install BeautifulSoup, run the command below in a Jupyter notebook cell.
!pip install bs4
Then import the library, open the file and load its contents into BeautifulSoup.
from bs4 import BeautifulSoup

with open("math results.xml") as f:
    soup = BeautifulSoup(f, "lxml")
From the XML data, we will use BeautifulSoup’s find_all() method, which returns all matching tags as a list.
For our case, the tags of interest are
‘district’, ‘grade’, ‘number_tested’, and ‘mean_scale_score’.
We will then loop through the rows and, for each tag, retrieve the value inside it using the .text attribute.
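The extraction loop can be sketched as follows. This example parses a miniature two-record sample in the same shape as the downloaded file; the enclosing <row> wrapper tag and all values are illustrative assumptions, and html.parser stands in for the article’s "lxml" so the snippet runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the downloaded XML (tag names from the article;
# the <row> wrapper and the values are assumptions for illustration).
sample = """
<response>
  <row>
    <district>1</district><grade>3</grade>
    <number_tested>1000</number_tested><mean_scale_score>650</mean_scale_score>
  </row>
  <row>
    <district>2</district><grade>4</grade>
    <number_tested>900</number_tested><mean_scale_score>655</mean_scale_score>
  </row>
</response>
"""
soup = BeautifulSoup(sample, "html.parser")

columns = ['district', 'grade', 'number_tested', 'mean_scale_score']
data = []
for row in soup.find_all('row'):
    # .text pulls the string inside each tag of interest
    data.append([row.find(col).text for col in columns])
```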
The list of lists named ‘data’ represents all our records.
The next step is to create a pandas dataframe and pass the columns that will be used for the data.
df_xml = pd.DataFrame(data, columns=['district', 'grade','number_tested', 'mean_scale_score'])
3. JSON data
JSON was developed after XML but has become more popular over time. The two formats are frequently used for sending data via web APIs which are discussed in point 5 below.
The pandas library offers a simple way to read in a JSON file: pd.read_json(). You can pass the orient option depending on how the data is formatted. The pandas documentation explains the orient options.
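A quick sketch of two common orient values, using invented sample strings (wrapped in StringIO, which recent pandas versions expect for literal JSON):

```python
import pandas as pd
from io import StringIO

# orient='records': the JSON is a list of objects, one object per row.
records = '[{"district": 1, "grade": 3}, {"district": 2, "grade": 4}]'
df = pd.read_json(StringIO(records), orient='records')

# orient='split': separate 'columns', 'index', and 'data' keys.
split = '{"columns": ["district", "grade"], "index": [0, 1], "data": [[1, 3], [2, 4]]}'
df2 = pd.read_json(StringIO(split), orient='split')
```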
Our JSON data is formatted slightly differently, and pd.read_json() returns an error. The file carries a lot of meta-information, nearly 920 lines of it; the actual data starts at line 921.
The image below shows a screenshot of the data in the Notepad++ code editor. Note that the first 8 fields per row are metadata we will not need.
We will use the json library, which reads the JSON data in as a dictionary.
import json

with open('math results.json') as f:
    json_data = json.load(f)

type(json_data)  # dict
In our case, there are two dictionary entries: ‘meta’ and ‘data’. We can therefore get the data records using
data = json_data.get('data')
data = [x[8:] for x in data]  # drop the first 8 metadata fields from each record
The left image below shows our data as a list of lists. The right image uses a list comprehension to keep only the items from index 8 onward in each list.
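The cleaned list of lists can then be turned into a dataframe, just as with the XML data. A minimal sketch using a miniature stand-in for the file: the meta structure, field names, and values below are assumptions modeled on the description above, and this sample drops 1 leading field per record where the real file drops 8:

```python
import pandas as pd

# Miniature stand-in for the downloaded JSON: a 'meta' block followed by
# the records (structure and values are invented for illustration).
json_data = {
    "meta": {"view": {"columns": [
        {"fieldName": ":sid"},        # metadata field, to be dropped
        {"fieldName": "district"},
        {"fieldName": "grade"},
    ]}},
    "data": [
        ["row-1", "1", "3"],
        ["row-2", "2", "4"],
    ],
}

# Drop the leading metadata field(s) from each record and take the
# matching column names from the meta block.
data = [x[1:] for x in json_data["data"]]
cols = [c["fieldName"] for c in json_data["meta"]["view"]["columns"]][1:]
df_json = pd.DataFrame(data, columns=cols)
```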