CSV Files for Storage? Absolutely Not. Use Apache Avro Instead




How to work with Avro in Python?

There are two installable libraries for working with Avro files: avro, the official reference implementation, and fastavro.

The fastavro documentation describes the official library as dog slow: it takes about 14 seconds to process 10K records. You'll stick with fastavro for that reason.

Here’s how to set up a new virtual environment and install necessary libraries (for Anaconda users):

conda create --name avro_env python=3.8
conda activate avro_env

conda install -c conda-forge pandas fastavro jupyter jupyterlab

Execute the following command to start a JupyterLab session:

jupyter lab

You’ll use the NYSE stock prices dataset for the hands-on part. The dataset comes in CSV format — around 50 MB in size. Use the following snippet to import the required libraries and load the dataset:

import pandas as pd
from fastavro import writer, reader, parse_schema

df = pd.read_csv('prices.csv')
df.head()

Here’s what the stock prices dataset looks like:

Image 3 — NYSE stock prices dataset (image by author)

Converting a Pandas DataFrame to an Avro file is a three-step process:

  1. Define the schema — You’ll have to define a JSON-like schema to specify what fields are expected, alongside their respective data types. Write it as a Python dictionary and parse it using fastavro.parse_schema().
  2. Convert the DataFrame to a list of records — Use to_dict('records') function from Pandas to convert a DataFrame to a list of dictionary objects.
  3. Write to Avro file — Use fastavro.writer() to save the Avro file.

Here’s how all three steps look in code:

# 1. Define the schema
schema = {
    'doc': 'NYSE prices',
    'name': 'NYSE',
    'namespace': 'stocks',
    'type': 'record',
    'fields': [
        # The date column is kept as a plain string; Avro's date/time
        # logical types (e.g. time-millis) apply to int/long, not string
        {'name': 'date', 'type': 'string'},
        {'name': 'symbol', 'type': 'string'},
        {'name': 'open', 'type': 'float'},
        {'name': 'close', 'type': 'float'},
        {'name': 'low', 'type': 'float'},
        {'name': 'high', 'type': 'float'},
        {'name': 'volume', 'type': 'float'}
    ]
}
parsed_schema = parse_schema(schema)

# 2. Convert pd.DataFrame to records - list of dictionaries
records = df.to_dict('records')

# 3. Write to Avro file
with open('prices.avro', 'wb') as out:
    writer(out, parsed_schema, records)

It’s not as straightforward as calling a single function, but it isn’t that difficult either. It could get tedious if your dataset has hundreds of columns, but that’s the price you pay for efficiency.

There’s also room for automating name and type generation. Get creative. I’m sure you can handle it.
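As a rough sketch of that automation: you can derive the `fields` list from the DataFrame's dtypes. The dtype-to-Avro mapping below is an assumption of mine, not part of fastavro (which only consumes the finished schema), so extend it to match your data:

```python
import pandas as pd

# Hypothetical mapping from pandas dtype names to Avro primitive types;
# add entries for whatever dtypes your dataset actually uses
DTYPE_TO_AVRO = {
    'int64': 'long',
    'float64': 'double',
    'bool': 'boolean',
    'object': 'string',
}

def fields_from_dataframe(df):
    """Build the 'fields' list of an Avro schema from a DataFrame's dtypes."""
    return [
        {'name': col, 'type': DTYPE_TO_AVRO.get(str(dtype), 'string')}
        for col, dtype in df.dtypes.items()
    ]

demo = pd.DataFrame({'symbol': ['AAPL'], 'open': [170.5], 'volume': [100]})
print(fields_from_dataframe(demo))
# → [{'name': 'symbol', 'type': 'string'}, {'name': 'open', 'type': 'double'}, {'name': 'volume', 'type': 'long'}]
```

Note that this maps float64 to Avro's 64-bit `double` rather than the 32-bit `float` used above, which avoids losing precision at the cost of file size.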

Going from Avro back to a Pandas DataFrame is also a three-step process:

  1. Create a list to store the records — This list will store dictionary objects you can later convert to Pandas DataFrame.
  2. Read and parse the Avro file — Use fastavro.reader() to read the file and then iterate over the records.
  3. Convert to Pandas DataFrame — Call pd.DataFrame() and pass in a list of parsed records.

Here’s the code:

# 1. List to store the records
avro_records = []

# 2. Read the Avro file
with open('prices.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

# 3. Convert to pd.DataFrame
df_avro = pd.DataFrame(avro_records)

# Print the first couple of rows
df_avro.head()

And here’s what the first couple of rows look like:

Image 4 — NYSE stock prices dataset read from Avro file (image by author)

Both CSV and Avro versions of the dataset are identical — but which one should you use? Let’s answer that next.
