Use the Datasets library of Hugging Face in your next NLP project




A quick guide to use Hugging Face’s datasets library!

Photo by Marco on Unsplash

Data science is all about data. There are various sources on the web to get data for your Data Analysis or Machine Learning project. One of the most popular sources is Kaggle, which I am sure each one of us must have used in our data journey.

Recently, I came across a new source of data for my NLP projects, and I would love to talk about it. This is Hugging Face’s datasets library, a fast and efficient library to easily share and load datasets and evaluation metrics. So, if you are working in Natural Language Processing (NLP) and want data for your next project, look no further than Hugging Face. 😍

Motivation: The dataset format provided by Hugging Face is different from a pandas data frame, so using a Hugging Face dataset might look daunting at first.😱 Hugging Face has great documentation, but that’s a lot of information. I have written up just a few of the basic steps we perform while working with a dataset. 😄 The article is by no means exhaustive, and I highly encourage you to look at the documentation if you want to do more with your dataset.

Let’s first know a bit about Hugging Face and the datasets library and then take an example to know how to use a dataset from this library. 😎

Hugging Face 🤗 is an open-source provider of natural language processing (NLP) technologies. You can use Hugging Face’s state-of-the-art models (under the Transformers library) to build and train your own models, and you can use the Hugging Face datasets library to share and load datasets. You can even use this library for evaluation metrics.

Datasets Library

As per the Hugging Face website, the Datasets library currently has over 100 public datasets. 😳 The datasets are not only in English but in other languages and dialects too. 👌 It supports one-liner data loaders for a majority of these datasets which makes loading of data a hassle-free task. 🏄🏻 As per the information given on the website, besides easy access to the dataset, the library has the following interesting features:

  • Thrive on large datasets: Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
  • Smart caching: never wait for your data to process several times.
  • Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
  • Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2, and JAX.

Wow! That’s quite a lot of benefits. 👏

In this article, I will show some of the steps we normally perform in our data science or analysis tasks to understand the data or transform it into the required format. So, let’s quickly dive into this library and write some easy Python code. 🐍 Please note that this article only covers datasets and not metrics.

Datasets version: 1.7.0

Installation using pip

!pip install datasets

Import

from datasets import list_datasets, load_dataset
from pprint import pprint

From the datasets library, we can import list_datasets to see the list of datasets available in this library. The pprint module provides the ability to “pretty-print” arbitrary data structures. You can learn more about this module here. 👈🏼

The datasets library has 928 datasets as of June 7, 2021. 🙌 We can see the list of available datasets using the following code:

datasets = list_datasets()
print("Number of datasets in the Datasets library: ", len(datasets), "\n\n")
#list of datasets in pretty-print format
pprint(datasets, compact=True)
A snippet of the datasets list

What if you want to know the attributes of a dataset before even downloading it? We can do that with a one-liner code. ☝️ Simply set the index to be the name of the dataset and you are good to go!

#dataset attributes
squad = list_datasets(with_details=True)[datasets.index('squad')]
#calling the python dataclass
pprint(squad.__dict__)
Dataset attributes

Quite interesting! 😬

Load the dataset

squad_dataset = load_dataset('squad')

What happened under the hood? 🤔 The datasets.load_dataset() did the following:

  1. Downloaded and imported the SQuAD Python processing script from the Hugging Face GitHub repo or AWS bucket (if it was not already stored in the library).
  2. Ran the SQuAD script to download the dataset, then processed and cached SQuAD in an Arrow table.
  3. Returned a dataset based on the split asked for by the user. By default, it returns the entire dataset.

Let’s understand the dataset we got.

print(squad_dataset)

The squad dataset has two splits — train and validation. The features object contains information about the columns — column name and data type. We can also see the number of rows (num_rows) for each split. Quite informative!
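
Since we did not ask for a specific split, load_dataset returned a DatasetDict, which behaves like a dictionary keyed by split name. A minimal sketch of accessing the individual splits:

#each split is a Dataset object inside the DatasetDict
print(squad_dataset['train'])
print(squad_dataset['validation'])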

We can also specify the split while loading the dataset.

squad_train = load_dataset('squad', split='train')
squad_valid = load_dataset('squad', split='validation')

This will save the training set in squad_train and validation set in squad_valid.

However, you will find that loading some datasets throws an error, and on inspecting the error you will realize that you need a second parameter, config.

Here is an example:

amazon_us_reviews = load_dataset('amazon_us_reviews')

Error message:

Error is thrown while loading a dataset

Some datasets comprise several configurations, each of which defines a sub-part of the dataset, and one of them needs to be selected.

Solution:

amazon_us_reviews = load_dataset('amazon_us_reviews', 'Watches_v1_00')

This will load the amazon_us_reviews dataset with the Watches_v1_00 configuration.

So, if loading any dataset throws an error simply follow the traceback as Hugging Face has given nice information regarding the error. 👍

Let’s move on to our dataset. 🏃🏻

We saw the number of rows in the dataset information. We can even get that using our standard len function.

print("Length of training set: ", len(squad_train))

Length of training set: 87599

Inspecting the dataset

To see an example of the dataset:

print("First example from the dataset: \n")
pprint(squad_train[0])
Squad train dataset first example

Want to get a slice with several examples? The code is the same as the one we use with a pandas data frame.

print("Two examples from the dataset using slice operation: \n")
pprint(squad_train[14:16])
Slice of examples from the dataset

Want to see values in a column? Index the dataset with the column name. Here is a slice of the column ‘question’.

print("A column slice from the dataset: \n")
pprint(squad_train['question'][:5])
A column slice of squad

You can see that a slice of rows returned a dictionary while a slice of a column returned a list. The __getitem__ method returns a different format depending on the type of the query. For example, an item like dataset[0] will return a dictionary of elements, a slice like dataset[2:5] will return a dictionary of lists of elements, while a column like dataset['question'] or a slice of a column will return a list of elements. This looks surprising at first, but Hugging Face has done this because it is actually easier to use for data processing than returning the same format for every view.
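
To make this concrete, here is a quick check (not part of the original snippet) of the types returned by each kind of query:

print(type(squad_train[0]))           # <class 'dict'> -> a single example
print(type(squad_train[14:16]))       # <class 'dict'> -> a dictionary of lists
print(type(squad_train['question']))  # <class 'list'> -> a list of values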

See this interesting example:

print(squad_train['question'][0])
print(squad_train[0]['question'])

Output:

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Both returned the same output. Let’s verify! 🕵

print(squad_train['question'][0] == squad_train[0]['question'])

The output will be True. Nice! A common mistake we make while using a pandas data frame is not a mistake here.

Note: The dataset is backed by one or several Apache Arrow tables which are typed and allows for fast retrieval and access. You can load the datasets of arbitrary size without worrying about the RAM limitation as the dataset takes no space in RAM and is directly read from the drive as and when needed.
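
As a small illustration of this, recent versions of the library expose a cache_files attribute that points to the Arrow file(s) backing the dataset on disk (a minimal sketch, assuming that attribute is available in your version):

#list the cached Arrow files that back squad_train on disk
pprint(squad_train.cache_files)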

Let’s inspect the dataset more.

print("Features: ")
pprint(squad_train.features)
print("Column names: ", squad_train.column_names)
Dataset features and column names
print("Number of rows: ", squad_train.num_rows)
print("Number of columns: ", squad_train.num_columns)
print("Shape: ", squad_train.shape)

Output:

Number of rows: 87599
Number of columns: 5
Shape: (87599, 5)

Note that you can get the number of rows using the len function as well.

Add/Remove a new column

Add a column named “new_column” with entries “foo”.

new_column = ["foo"] * len(squad_train)
squad_train = squad_train.add_column("new_column", new_column)
print(squad_train)
A new column added to the dataset

Let’s now remove this column.

squad_train = squad_train.remove_columns("new_column")

Rename a column

squad_train = squad_train.rename_column("title", "heading")
print(squad_train)
Title column renamed to Heading

Modify/Update dataset

To modify or update the dataset, we can use dataset.map(). map() is a powerful method inspired by the tf.data.Dataset map method. We can apply a function to just one example or to a batch of examples, and even generate new rows or columns.

Modifying example by example:

updated_squad_train = squad_train.map(lambda example: {'question': 'Question: ' + example['question']})
pprint(updated_squad_train['question'][:5])

Output:

Use of map to prepend “Question: ” to each row of the question column

Let’s add a new column using an existing column and remove the old one.

updated_squad_train = squad_train.map(lambda example: {'new_heading': "Context: " + example['heading']}, remove_columns=['heading'])
pprint(updated_squad_train.column_names)
pprint(updated_squad_train['new_heading'][:5])

Output:

Column “new_heading” has been added using the content from column “heading” (and a prefix), and column “heading” has been removed from the dataset

You can use the map to do multiple things with your dataset. Do try out new things based on your requirements. 🙃

Apart from this, you can also process data in batches.
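
Here is a minimal sketch of batched processing with map using batched=True; the mapped function then receives a whole batch of examples as a dictionary of lists (prefix_questions is just an illustrative name):

#process the dataset in batches of 1000 examples
def prefix_questions(batch):
    #batch['question'] is a list of strings for the whole batch
    batch['question'] = ['Question: ' + q for q in batch['question']]
    return batch

batched_squad_train = squad_train.map(prefix_questions, batched=True, batch_size=1000)
pprint(batched_squad_train['question'][:3])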

Display examples like Pandas data frame

We always like to see our dataset as a nicely formatted table, just like a pandas data frame. We can convert our dataset to that format for display purposes.

import random
import pandas as pd
from IPython.display import display, HTML

def display_random_examples(dataset=squad_train, num_examples=5):
    assert num_examples < len(dataset)

    random_picks = []
    for i in range(num_examples):
        random_pick = random.randint(0, len(dataset)-1)
        random_picks.append(random_pick)

    df = pd.DataFrame(dataset[random_picks])
    display(HTML(df.to_html()))

display_random_examples(squad_train, 3)

Output is a nicely formatted table. 👌

Nicely formatted table of our dataset

That’s it for this article. From here, you can pre-process your data based on your project requirements and build your model or create nice visualizations. It is not possible to cover everything in one article, but by going through it you should have a good idea of how to use the methods available in the datasets library. If you need to do more with your dataset, please look at the documentation. There are many more methods out there, such as sorting, shuffling, sharding, select, filter, concatenating datasets, etc. You can also format your dataset for PyTorch, TensorFlow, NumPy, and pandas. And if you want to share your own dataset, you can do that too. Read about it here!
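
For instance, here is a minimal sketch of a few of these methods (the exact behavior and signatures may vary slightly between versions of the library):

#reproducible shuffle of the rows
shuffled = squad_train.shuffle(seed=42)
#pick rows by index
first_hundred = squad_train.select(range(100))
#keep only the rows whose question mentions 'France'
about_france = squad_train.filter(lambda example: 'France' in example['question'])
#sort the rows by a column
sorted_by_id = squad_train.sort('id')
#return query results as pandas objects instead of python dicts/lists
squad_train.set_format(type='pandas')
print(squad_train[:3])
#switch back to plain python objects
squad_train.reset_format()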

If you want to look at the code, please refer to this link to my GitHub repo.

References:

  1. For datasets: https://huggingface.co/datasets
  2. For datasets documentation: https://huggingface.co/docs/datasets/
