Stop Using CSVs for Storage — This File Format is Faster and Lighter




ORC — What is it?

ORC stands for Optimized Row Columnar. It’s a data format optimized for reads and writes in Hive — a data query and analysis tool for big data environments.

If you have any experience with Hive, you know it can be painfully slow: even simple queries often take a long time, regardless of dataset size. Engineers at Hortonworks set out to speed up Hive back in 2013, and the ORC file format was one result of that effort.

ORC files are made of stripes, and each stripe contains index data, row data, and a stripe footer. The following diagram shows approximately how an ORC file looks beneath the surface:

Image 1 — ORC file format structure (image by author)

The index data for each stripe includes the min and max values for every column, along with the row positions they refer to. The index also stores offsets, so a reader can seek directly to the right block of rows. In other words, ORC can skip rows that cannot possibly match a query, which makes filtered reads faster than in formats that lack such an index.
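To make the row-skipping idea concrete, here is a toy sketch in plain Python. It does not touch real ORC files. The names Stripe, build_stripes, and find_equal are hypothetical and exist only for this illustration, and the model is simplified: it keeps a single min/max pair per stripe for a single column.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """Toy stand-in for an ORC stripe: rows plus min/max index statistics."""
    rows: list
    min_value: int
    max_value: int


def build_stripes(values, stripe_size):
    """Split values into fixed-size stripes and record min/max per stripe."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def find_equal(stripes, target):
    """Return matching rows and how many stripes were actually scanned."""
    matches, scanned = [], 0
    for stripe in stripes:
        # The index shows the target cannot be in this stripe, so skip its rows
        if target < stripe.min_value or target > stripe.max_value:
            continue
        scanned += 1
        matches.extend(v for v in stripe.rows if v == target)
    return matches, scanned


# 1,000 sorted values split into 10 stripes of 100 rows each (illustrative sizes)
stripes = build_stripes(list(range(1000)), stripe_size=100)
matches, scanned = find_equal(stripes, 742)
print(matches, scanned)  # [742] 1
```

Only one of the ten stripes is scanned here because the data is sorted. With unsorted data the min/max ranges overlap more, so fewer stripes can be skipped.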

The file footer contains a list of the stripes in the ORC file, plus metadata such as the number of rows per stripe, the column data types, and summary statistics.

In Python, you can read ORC files with Pandas using the read_orc function. Older Pandas versions have no matching function for writing ORC files (DataFrame.to_orc only arrived in Pandas 1.5), and both functions rely on the pyarrow library under the hood, so you’ll need it either way. Here’s how to install it:

# Pip
pip install pyarrow

# Anaconda
conda install -c conda-forge pyarrow

There are other libraries for working with ORC, but PyArrow works on every major OS, including M1 Macs.

You now have everything needed to get started. Open up JupyterLab or any other data science IDE, as the next section covers the basics of ORC.
