I was involved in a Kaggle competition on an NLP multi-class classification task: given descriptions of various jobs, the objective was to match each description to one of 28 job titles such as dentist, teacher, and lawyer.
Note: The data in this Kaggle competition was collected from CommonCrawl, which is free to use.
Here, an explicit gender feature is assigned to every job description. There is also implicit gender information in the text of the descriptions, such as pronouns (he/she) and gendered words (mother, father, son, daughter, etc.).
From the screenshot of the dataset, we can see gender pronouns in the job descriptions as well as the explicit gender feature.
Let’s check the gender disparity between various jobs in the dataset.
Top 10 jobs with more female examples:
I computed the ratio of male to female examples in the dataset for each job title. From the image, you can see that jobs like dietician, nurse, and teacher are dominated by female examples.
Jobs like poet and journalist have a ratio close to 1, with roughly equal proportions of both genders.
Top 10 jobs with more male examples:
At the other end of the spectrum, jobs like rapper, surgeon, and DJ are mostly male-dominated.
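The per-job male:female ratio described above can be sketched in pandas. The column names (`job`, `gender`) and the toy data here are assumptions for illustration; the competition dataset's schema may differ.

```python
import pandas as pd

# Toy stand-in for the competition data; real column names may differ.
df = pd.DataFrame({
    "job":    ["nurse", "nurse", "nurse", "rapper", "rapper", "rapper", "poet", "poet"],
    "gender": ["F",     "F",     "M",     "M",      "M",      "F",      "M",    "F"],
})

# Count examples per (job, gender), then compute the male:female ratio.
counts = df.groupby(["job", "gender"]).size().unstack(fill_value=0)
counts["male_female_ratio"] = counts["M"] / counts["F"]

# Low values are female-dominated jobs, high values male-dominated;
# values near 1 indicate roughly balanced classes.
print(counts.sort_values("male_female_ratio"))
```

Sorting by this ratio and taking the head and tail gives the "top 10 female-dominated" and "top 10 male-dominated" lists shown above.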
This is the disparity we are about to feed as input to the AI model. Trained on these disparities and gender-sensitive features, it would become a biased predictive model. We might end up with an accurate model that performs well on the test set, but it would not be a fair model.
In this case, the bias can creep into the model from:
- Gender feature
- Implicit gender information in the description
- Highly imbalanced job classes such as dietician and rapper
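A first mitigation step for the first two bias sources is to drop the explicit gender column and neutralize gendered words in the text before training. A minimal sketch, assuming hypothetical column names and a small illustrative word list (a real list would be much larger, and naive substitution can break grammar):

```python
import re
import pandas as pd

# Illustrative mapping of gendered words to neutral equivalents.
# This list is an assumption for the sketch, not from the competition.
GENDERED = {
    r"\bhe\b": "they", r"\bshe\b": "they",
    r"\bhis\b": "their", r"\bher\b": "their",
    r"\bhim\b": "them",
    r"\bmother\b": "parent", r"\bfather\b": "parent",
    r"\bson\b": "child", r"\bdaughter\b": "child",
}

def neutralize(text: str) -> str:
    """Replace gendered pronouns and family words with neutral tokens."""
    for pattern, repl in GENDERED.items():
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text

df = pd.DataFrame({
    "description": ["She treats patients kindly.", "He is a father of two."],
    "gender": ["F", "M"],
})

df["description"] = df["description"].map(neutralize)
df = df.drop(columns=["gender"])  # remove the explicit gender feature
```

The third source, class imbalance, would need a separate fix such as class weights or resampling during training.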