High-Quality Data Comes With High-Quality Validations
Here’s how you can ensure the quality of Pandas dataframes in every stage of your data pipeline
You can build a castle out of garbage by picking the right trash.
Most data science projects are like that — building castles 🏰 from the trash. In a pool of thousands of datasets in the data lake, you need to pick the right one and repair the almost-right ones.
You need a robust dataset validation tool for it.
Data quality is a fundamental aspect of any modern analytics project. But my old-school techniques to validate datasets have more bugs 🐛 than butterflies.
I used to write my own validation code, with lots of exception handling.
Trying a different validation logic meant significant time re-coding 😦. I also had to write dedicated documentation just to explain what my checks did.
But I recently discovered a Python library that takes most of the boilerplate out. With it, you can write more intuitive validation logic for Pandas dataframes in a schema.
It replaces a massive chunk of my code: a few lines now give me more powerful ways of handling validation errors.
Let’s learn how to use Pandera, the Pandas validation toolkit, to ensure high-quality data.
In this post, we’ll discuss
- using YAML configurations to validate Pandas dataframes;
- reusing validation annotations at any point in your data pipeline;
- defining on-the-fly validations; and
- validating dataframes against complex hypotheses.
But before we do anything, let’s have Pandera installed on your computer.
pip install pandera
Let’s also create a dummy dataset to work along with the examples.
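The article's original snippet isn't shown here, but a plausible dummy shopping dataset for following along might look like this sketch (the column names — customer_id, gender, age, order_value — are assumptions based on the checks discussed later):

```python
import pandas as pd

# A small dummy shopping dataset to run the validation examples against.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "gender": ["M", "F", "M", "F", "M"],
    "age": [25, 41, 33, 29, 52],
    "order_value": [120.5, 300.0, 45.9, 210.0, 180.75],
})
```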
Validating Pandas dataframes with YAML configurations
I love YAML configurations! They are easy to understand and flexible to extend. You don’t need a 130 IQ to make or modify one.
YAML files are hierarchical key-value mappings. Think of them like dictionaries in Python with no curly braces. A list of items can be placed in separate lines starting with hyphens. You can use indentation to branch out sub configurations.
For a more in-depth understanding of YAML files, please see this guide by Alek Sharma.
The following YAML file does some basic checks on our dataset. Isn’t the configuration straightforward?
Please read the comments on the code to understand the validation checks better.
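The article's YAML file isn't reproduced here; a minimal sketch following Pandera's to_yaml layout might look like the following (the column names and exact keys are assumptions — older Pandera versions use pandas_dtype instead of dtype and allow_duplicates instead of unique):

```yaml
schema_type: dataframe
version: 0.7.0
columns:
  customer_id:
    dtype: int64
    nullable: false       # IDs can never be missing
    unique: true          # no duplicate customers allowed
  gender:
    dtype: str
    nullable: false
    checks:
      isin: ["M", "F"]    # values must come from a predefined list
  age:
    dtype: int64
    nullable: false
  order_value:
    dtype: float64
    nullable: false
    checks:
      greater_than: 0     # an order value must be positive
```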
For every column in the dataset, we specify if the field can have null values, duplicates, etc. We could also test if field values are in a predefined list or confirm logic such as less than a specific amount.
Let’s use this on the project to validate our dummy dataset.
The above code reads the YAML file from the filesystem and creates a schema from it. We can then use the schema object to validate dataframes.
The validation call above simply returns the df, because every check passes and nothing happens. 😧
But let’s slightly change our YAML file. Note that the newer version has an additional validation to check if the age variable is less than 60.
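The changed age entry might look like this sketch (only the checks block is new):

```yaml
  age:
    dtype: int64
    nullable: false
    checks:
      less_than: 60       # new: every age value must be below 60
```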
You’ll see the following error if you try to validate the sample dataframe once again.
This error is instrumental. Besides raising it, Pandera also prints out records that don’t meet the requirement.
You could implement this logic in plain Python code, but a YAML file is more comfortable for us and for anyone who reads our code.
Use validation annotation to test dataframes in your pipeline conveniently.
In complex pipelines, you need to test your dataframes at different points. Often, we need to check data integrity before and after a transformation.
You could use the validation method of a loaded schema as before. But, Pandera has a more elegant way to do this using function annotation.
Pandera has check_input and check_output annotations. When you annotate a function with them, Pandera tests the dataframe passed as an argument and the dataframe returned.
To use it, we need to define two schemas; one for the input (argument) and the other for the return dataframe.
In addition to the previous YAML file, let's create another one as specified below and load it. Let's call it schema2.
The following is an example of using annotations to validate the input and output dataframes of a function. Please note that Pandera assumes the first argument is a dataframe; if you have more than one argument, make sure the first one is your dataframe.
Pandera tests the above function twice. It tests the input dataset with schema just before the function executes, and it tests the output dataframe with schema2 right after. Play around with schema2 and induce some errors.
Define schemas on the fly for quick validations.
Loading from a YAML file isn’t the only way to create a schema. It’s, however, my favorite way. You could create a schema entirely inside your Python code.
The following is the Python version of our first YAML file. Note the similarities.
Defining schemas inside Python code has an extra benefit. You could define custom checks in addition to the built-in ones.
For instance, the following code checks if the sum of the order_value column is greater than 1000.
But you'd rarely need to write custom checks, because the most frequently used ones are already covered in the module. You just grab the right one.
Validate dataframes in the pipeline with complex hypotheses.
Out of all the great features, this one is my favorite.
Checking a dataframe for common anomalies is fine. But doing full hypothesis testing is a game-changer: we can validate datasets against more complex hypotheses.
For instance, in our shopping dataset, we could check if male customers purchase more than female customers.
The following code does it. Note that for this to work, you need to install hypothesis and scipy:
pip install hypothesis scipy pandera[strategies,hypothesis]
Note that the relationship we're trying to test is ‘greater_than.’ Since our sample dataset has no evidence to support this hypothesis, the validation will fail. But if you change the relationship to ‘equal,’ the test will pass. Also, ‘equal’ is the default, so you can simply remove it.
Validating dataframes has never been easier.
I used to write dedicated code for every single check I needed to perform, and that's a hassle when you work with many datasets, which is often the case in a data pipeline.
You need to write similar code for every dataset, and my checks often ended up spread across several notebooks.
But since discovering Pandera, I can manage them all in one place. I prefer to keep them in YAML files, but we can also define them inside our Python code.
Pandera strips a ton of boilerplate out of our code, makes it more readable and less error-prone, and raises more informative error messages.
Besides, we can also run hypothesis tests on our dataset columns, which is super awesome.