What is Data Extraction? Python Review of JSON, XML, APIs, SQL, and CSV Formats

Data extraction involves pulling data from different sources and converting it into a useful format for further processing or analysis. It is the first step of the Extract-Transform-Load pipeline (ETL) in the data engineering process.

Data is not always available as a CSV file. Sometimes you will need to combine data that is available in multiple file formats such as JSON, XML, and SQL.

In this tutorial, we will use Python libraries such as pandas, json, and requests to read data from different sources and load it into a Jupyter notebook as a pandas DataFrame.

1. CSV files

This refers to a ‘comma-separated values’ file that stores data in a tabular format, similar to a spreadsheet. Each line in the file is an observation (or record), and each record has one or more features separated by commas.
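For illustration, a tiny CSV with two features might look like this (the column names and values are made up):

student,score
Alice,90
Bob,85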

Image of a csv file in a text editor

Often, but not always, the first line represents the feature names.

We will use the math results data from the data.gov website.

Before we create a pandas DataFrame from the data, it is important to first peek into the file using Python’s built-in file handling. The following code displays the first 5 lines of the downloaded file.
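A minimal sketch of that peek, assuming the file has been saved locally as ‘math results.csv’ (the filename is an assumption):

# Print the first 5 lines of the raw CSV file
with open('math results.csv') as f:
    for _ in range(5):
        print(f.readline().strip())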

What to look out for:

  • Whether we will need extra arguments with pd.read_csv()
  • If the first line contains the variable names. If not, we would use pd.read_csv(file, header=None).
  • If there are any empty rows at the top that we need to skip using pd.read_csv(file, skiprows=n). Both cases are sketched below.
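A quick illustration of those two arguments; ‘no_header.csv’ and ‘padded.csv’ are hypothetical files used only for this sketch:

import pandas as pd

# File with no header row: pandas assigns integer column names
df_no_header = pd.read_csv('no_header.csv', header=None)

# File with two empty rows before the header: skip them
df_skipped = pd.read_csv('padded.csv', skiprows=2)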

The code below reads the file directly from the website and creates a pandas DataFrame.

import pandas as pd

file_link = 'https://data.cityofnewyork.us/api/views/7yig-nj52/rows.csv'
df_csv = pd.read_csv(file_link)

2. XML data

Extensible Markup Language, or XML, is a markup language very similar to HTML. You can recognize XML-formatted data because it is marked up with custom tags specific to the data they represent, unlike the pre-defined tags of HTML.
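For illustration, a single record might be marked up like this (the tag names and values are made up):

<student>
  <name>Alice</name>
  <score>90</score>
</student>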

We will continue using the math results data from the data.gov website which provides an option to download the data in an XML format.

Running the code below in a Jupyter notebook will print the first 20 lines.
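A minimal sketch of that step, using the same local file name that appears in the BeautifulSoup example further down (‘math results.xml’):

# Print the first 20 lines of the raw XML file
with open('math results.xml') as f:
    for _ in range(20):
        print(f.readline().strip())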

Image of XML file on a code editor

To read the XML data, we use a Python library called BeautifulSoup. To install BeautifulSoup, run the code below in a Jupyter notebook cell.

!pip install bs4

Then import the library, open the file and load its contents into BeautifulSoup.

from bs4 import BeautifulSoup

# Open the downloaded XML file and parse it with the lxml parser
with open("math results.xml") as f:
    soup = BeautifulSoup(f, "lxml")

To pull the values out of the XML data, we will use BeautifulSoup’s find_all() method, which returns every occurrence of a tag as a list.

Image of find_all by author

For our case, the tags of interest are ‘district’, ‘grade’, ‘number_tested’, and ‘mean_scale_score’.

We will then loop through the rows and, for each tag of interest, retrieve the value inside it using the get_text() method.

The resulting list of lists, named ‘data’, represents all our records.
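A minimal sketch of that step, assuming every record contains each of the four tags exactly once so that the four find_all() lists line up:

columns = ['district', 'grade', 'number_tested', 'mean_scale_score']

# find_all() returns every occurrence of a tag as a list;
# zip the four lists so that each iteration yields one record
rows = zip(*[soup.find_all(col) for col in columns])

# get_text() retrieves the value inside each tag
data = [[tag.get_text() for tag in row] for row in rows]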

Image of the first 5 records by author

The next step is to create a pandas DataFrame, passing the column names to be used for the data.

df_xml = pd.DataFrame(data, columns=['district', 'grade', 'number_tested', 'mean_scale_score'])

3. JSON data

JavaScript Object Notation, or JSON, stores data compactly in a dictionary-style format. The JSON file for the math results is available at this link. Below is a screenshot of the file in a code editor.

A screenshot of a json file by author

JSON was developed after XML but has become more popular over time. The two formats are frequently used for sending data via web APIs, which are discussed in point 5 below.

The pandas library offers a simple way to read in a JSON file: pd.read_json(). You can pass the orient argument depending on how the data is formatted; the pandas documentation explains the available orient options.
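For a JSON file that is already laid out as a flat list of records, the call is a one-liner; ‘records.json’ is a hypothetical file name used only for this sketch:

# Read a JSON file where the top level is a list of objects (records)
df = pd.read_json('records.json', orient='records')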

Our JSON data is formatted slightly differently, and pd.read_json() returns an error. The file has a lot of meta-information, almost 920 lines; the actual data starts at the 921st line.

Image of collapsed meta and data tags of json file by author

The image below shows a screenshot of the data in the Notepad++ code editor. Note that the first 8 fields in each row are metadata we do not need.

Instead, we will use the json library, which reads in the JSON data as a Python dictionary.

import json

# Load the JSON file into a Python dictionary
with open('math results.json') as f:
    json_data = json.load(f)

type(json_data)
### dict

In our case, there are two dictionary entries: ‘meta’ and ‘data’. We can therefore retrieve the records using dict.get(key).

data = json_data.get('data')

# Drop the first 8 metadata fields from each record
data = [x[8:] for x in data]

The left image below shows our data as a list of lists. The right image shows the result of the list comprehension, which keeps only the items from index 8 onward in each list, dropping the 8 unneeded metadata fields.
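As with the XML data, the cleaned list of lists can then be turned into a DataFrame. The sketch below assumes the ‘meta’ entry follows the usual data.gov layout, with a ‘view’ object whose ‘columns’ list describes every field; that structure is an assumption:

# Column names come from the file's meta-information (structure assumed);
# skip the first 8 descriptors to match the sliced data
column_names = [col['name'] for col in json_data['meta']['view']['columns'][8:]]
df_json = pd.DataFrame(data, columns=column_names)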
