A Data Scientist’s Guide to Identifying and Resolving Data Quality Issues




A Real-world Messy Example


We will use the transformed census dataset for this example, which you can download from this GitHub repository. You can find all the code used in this tutorial in this Jupyter Notebook. I recommend you either clone the repository or download the notebook to follow along with the example.

Step 1: Load the dataset

As a first step, we will load the dataset and the necessary libraries. Note that the package has multiple modules (Bias & Fairness, Data Expectations, Data Relations, Drift Analysis, Duplicates, Erroneous Data, Labelling, and Missing) for separate data quality issues, but we can start with the DataQuality engine, which wraps all the individual engines into a single class.

import pandas as pd
from ydata_quality import DataQuality

# load the transformed census dataset
df = pd.read_csv('../datasets/transformed/census_10k.csv')

Step 2: Analyze its quality issues

This would normally be a lengthy process, but the DataQuality engine does an excellent job of abstracting away the details. Simply create the main class and call the evaluate() method.

# create the main class that holds all quality modules
dq = DataQuality(df=df)
# run the tests
results = dq.evaluate()

Running this presents us with a report of the detected data quality issues.

Let’s dissect the report:

  • Warnings: These contain the details for issues detected during the data quality analysis.
  • Priority: For every detected issue, a priority is assigned (a lower value indicates high priority) based on the expected impact of the issue.
  • Modules: Every detected issue is linked to a data quality test carried out by a module (e.g., Data Relations, Duplicates).
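
Beyond the printed report, the warnings can also be inspected programmatically. Here is a minimal sketch, assuming that calling get_warnings() without a filter returns every QualityWarning object:

# list every detected warning, highest priority (lowest value) first
for warning in sorted(dq.get_warnings(), key=lambda w: w.priority.value):
    print(warning.priority, warning.category, warning.test)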

Tying everything together, we notice that five warnings have been identified, one of which is a high-priority issue: the “Duplicates” module found an entirely duplicated column that needs fixing. To dive deeper into this issue, we use the get_warnings() method.

Simply type in the following:

dq.get_warnings(test="Duplicate Columns")

We can see the detailed output specific to the issue we want to resolve:

[QualityWarning(category='Duplicates', test='Duplicate Columns', description='Found 1 columns with exactly the same feature values as other columns.', priority=<Priority.P1: 1>, data={'workclass': ['workclass2']})]

Based on the evaluation, we can see that the columns workclass and workclass2 are entirely duplicated, which can have serious consequences downstream.
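
As a quick sanity check (a plain pandas sketch, not part of the library), we can confirm that the two columns are identical before dropping one:

# confirm the two flagged columns hold exactly the same values
print(df['workclass'].equals(df['workclass2']))  # expected: True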

Step 3: Analyze quality issues using specific modules

A complete picture of data quality requires multiple perspectives, and hence the need for eight separate modules. Though they are encapsulated in the DataQuality class, some modules will not run unless we provide specific arguments.

For example, the DataQuality class did not execute the Bias & Fairness quality tests since we didn’t specify the sensitive features. But the beauty of the library is that we can also treat each module as a standalone engine and run it on its own.

Let’s understand it better by performing Bias and Fairness tests.

from ydata_quality.bias_fairness import BiasFairness

# create the standalone Bias & Fairness engine
bf = BiasFairness(df=df, sensitive_features=['race', 'sex'], label='income')
# run the tests
bf_results = bf.evaluate()

Running the code above generates another report, this time specific to the chosen module.

From the report, we understand that we may have a proxy feature leaking information about a sensitive attribute, as well as severe under-representation of some feature values of a sensitive attribute. To investigate the first warning, we can fetch more details with the get_warnings() method, filtering for the specific test.

bf.get_warnings(test='Proxy Identification')

We can see the detailed output specific to the issue we want to resolve:

[QualityWarning(category='Bias&Fairness', test='Proxy Identification', description='Found 1 feature pairs of correlation to sensitive attributes with values higher than defined threshold (0.5).', priority=<Priority.P2: 2>, data=features
relationship_sex 0.650656
Name: association, dtype: float64)]

Based on the detailed warning, we inspect the columns relationship and sex and notice that some relationship statuses (e.g., Husband, Wife) are gender-specific, thus impacting the correlation. We could change these categorical values to be gender-neutral (e.g., Married).
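
To see the proxy effect for ourselves, a quick pandas cross-tabulation (again a sketch, not part of the library) shows how the relationship categories split across the sensitive attribute:

# gender-specific categories such as 'Husband' and 'Wife' appear almost
# exclusively under one value of 'sex', which drives the high association
print(pd.crosstab(df['relationship'], df['sex']))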

Step 4: Resolving the identified issues

Let’s be practical: we can never have 100% clean data, and it’s all about tackling the most impactful issues in the time available. As a data scientist, that is a decision you need to make based on your constraints.

For this example, let’s aim to have no high-priority (P1) issues and to tackle at least one Bias & Fairness warning. A simple data cleaning function based on the warnings raised could look like the sketch below.
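
This is a minimal sketch; the gender-neutral replacement value ('Married') is an assumption for illustration, so adapt the mapping to the categories in your copy of the dataset:

def improve_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Resolve the issues raised by the DataQuality and BiasFairness reports."""
    # resolve the P1 warning: drop the fully duplicated column
    df = df.drop(columns=['workclass2'])
    # weaken the proxy between 'relationship' and 'sex' by mapping
    # gender-specific categories to a gender-neutral one (assumed mapping)
    df['relationship'] = df['relationship'].replace(
        {'Husband': 'Married', 'Wife': 'Married'}
    )
    return df

clean_df = improve_quality(df)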

We drop the duplicated column workclass2 and replace the relationship values with more general, gender-neutral ones.

If you’d like to do further data cleaning, please feel free to. I would love to see what your data cleaning looks like, should you choose to progress further. Remember, you’re the data scientist, and that decision is always in your hands.

Step 5: Run a final quality check

You may skip this step, but I have peace of mind when I run my processed data through one final check. I highly recommend you do the same, so you know the state of the data after completing your cleaning transformations.

You can simply instantiate each quality engine on the cleaned data and call its evaluate() method to retrieve the report again. Here’s how the reports for the DataQuality engine and the BiasFairness engine look after we have cleaned the data.
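
A minimal sketch, reusing the clean_df produced by the cleaning function above:

# re-run both engines against the cleaned dataset
dq_clean = DataQuality(df=clean_df)
dq_clean.evaluate()

bf_clean = BiasFairness(df=clean_df, sensitive_features=['race', 'sex'], label='income')
bf_clean.evaluate()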

We can infer from the two reports that our high-priority issue has been resolved and that one lower-priority Bias & Fairness warning has been addressed, just as we aimed for.
