What is Data Extraction?


Original Source Here

What is Data Extraction?

A Python Guide to Real-World Datasets

image from unsplash

Data extraction consists of the extraction and conversion of data from various sources to a useable format for subsequent processing or analysis. It is the initial phase in the data engineering process of the extract-transform-load pipeline. You may need to merge data in several file formats such as JSON, XML, CSV and SQL as a data scientist. In this, we utilise python libraries like pandas, json, and requests for data to be read from various sources and loaded as data pandas to a Jupyter notebook.

CSV files

This refers to a ‘comma-separated values file, comparable to a table, which is used to store data in a table format. Each line in the file represents an observation, and each record includes one or more commas. It is vital to first look inside the file using the integrated python read library before we generate a Pandas data frame from the data. The first five lines of the downloaded file are shown in the following code. The following code downloads the file from the internet and builds a data frame for Pandas.

file_link = 'https://data.cityofnewyork.us/api/views/7yig-nj52/rows.csv'
df_csv = pd.read_csv(file_link)

XML data

Extensible Markup Language is an HTML-like markup language. The XML structured data may be recognised because, unlike the default HTML elements, it has unique tags specific to the data they represent. The math data on the data.gov website will continue to be used, which allows you to download the data in XML format. We utilise a BeautifulSoup package for reading XML data. We utilise it. Paste the code below on the Jupyter notebook to instal BeautifulSoup.

!pip install bs4

The library should then be imported and the file opened, loading in BeautifulSoup.

from bs4 import BeautifulSoupwith open("math results.xml") as f:
soup = BeautifulSoup(f, "lxml")

The BeautifulSoup find_all() function that returns tags as a list is used from the XML data. For us it is ‘district,’ ‘grade,’ ‘number tested’ and ‘mean scale score’ that are the tags of interest. We loop through the rows and, using the get_text function for each tag, recover the value within. The following step consists of creating the pandas and passing the columns required for the data.

df_xml = pd.DataFrame(data, columns=['district', 'grade','number_tested', 'mean_scale_score'])

JSON data

JavaScript Object Notation, or JSON, compacts data in the format of a dictionary. This site provides the json file for the math results. The next picture shows a code editor snapshot of the file. JSON has been created following XML but over time it is becoming popular. Both formats are often used for data transmission using Web APIs explained at point 5 below. The Pandas library gives a straightforward approach to read in a file of json; Depending on how data is structured, you can pass the orientation option. The documentation on pandas outlines the possibilities for orientation.

We style our json data somewhat different and the error is returned by pd.read json(). It has plenty of metadata, almost 920 lines. The current data begins on line 921.

import jsonwith open('math results.json') as f:
json_data = json.load(f)type(json_data)

In our example, 2 dictionary items will be available: meta and data. Therefore we may use dict.get(key) for the data information.

data = json_data.get('data')
data = [x[8:] for x in data]

Our data are shown as a list of lists in the left picture below. The right image utilises the list understanding just to return elements in each list from the 8th element. Next, construct a data frame for the pandas. We will pass the names of the columns based on our data set information.

SQL databases

Structured query language, or SQL, permits access to and use of databases. The panda’s package allows you to extract data from SQL files and databases. You can open a database, or do a SQL query. Depending on the kind of database, two pythons libraries may be used for the connection; sqlite3 or the library of sqlalchemy. We have an SQLite database file for our needs, which has the math results we work with. For single apps and devices, SQLite enables local data storage. Therefore, we will utilise the library of sqlite3. Only one table is in the database file.

import sqlite3
import pandas as pd
conn = sqlite3.connect('maths_results.db')
data_sql = pd.read_sql('SELECT * FROM maths_results', conn)

Use the sqlalchemy library if you work with big SQL client/server databases like MySQL and PostgreSQL. Seek how to load different databases including SQLite that we are using the code below. See SQLAlchemy manual.

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///maths_results.db')
df_sql = pd.read_sql("SELECT * FROM maths_results", engine)

Extracting data from the web using APIs

Application programming interfaces or APIs give a means of sharing their data with consumers by enterprises such as Facebook, Twitter, Google, etc. It prevents online scraping that presents a lawful region with grey areas. They give web URLs for publicly accessible APIs that enable developers to access, analyse or construct their own apps using information. Some demand login information while others are public when submitting queries. Documents regarding how to format URL applications are also provided. A collection of downloadable indicators with the API. We shall look at “the number of female entrepreneurs” showing the results by nation every year.

Click on the ‘API’ access option and scroll down to the indicator data you would need. It redirects you to a new website containing raw json information. To launch a new Jupyter notebook, copy the URL from the address bar. Write the following code, change the copied URL content. Python offers a library of requests to transmit python to HTTP requests. The following code returns an answer object.

import requests
import pandas as pdurl = 'https://api.worldbank.org/v2/en/country/all/indicator/IC.WEF.LLCO.FE?format=json&per_page=20000&source=14'r = requests.get(url)

We will call the response as the data is in json format. json() to provide back info from json.


This produces a list of two portions of the dictionary. The first portion includes the meta-info. The second section contains the real information. Therefore, list[1] which indexes the second member of the list is used to generate the data framework.

data_api = pd.DataFrame(r.json()[1])data_api.head()


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: