18 Non-Cliché Datasets for Beginner Data Scientists to Build a Strong Portfolio
Unique datasets ranging from microbiology to sports
You have probably grown a little sick of seeing people use the same datasets over and over (and over) again.
It can’t be helped: everyone wants the easy option. Beginners use datasets like Titanic, Iris, and Ames Housing because they are stupidly straightforward; most course creators and bloggers use them because they are a single Google search away (or already bookmarked).
In the 115+ articles I have written, I honestly can’t remember using any of those clichés (if my memory is letting me down and I did use them once or twice, I apologize!). This is mainly thanks to the many hours I have spent searching for good datasets so that I can deliver content to my audience in novel ways.
Today, I have decided to share a list of curated example datasets I used in my posts and as part of my learning. Enjoy!
Regression datasets
1. Diamond prices and carat regression
My favorite from this list is the diamonds dataset. It is an ideal size for practice (50k+ samples) and has multiple targets you can predict, as either a regression or a multi-class classification task:
🎯 Targets: ‘carat’ or ‘price’
🔗 Link: Kaggle
📦Dimensions: (53940, 10)
⚙Missing values: No
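Every entry in this list reports the dimensions and whether values are missing. Here is a minimal sketch of how those two fields could be checked with pandas. The three-row inline sample and the `summarize` helper are made up for illustration; they are not the real diamonds data or part of any library.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a downloaded CSV file.
# The price in the second row is left empty on purpose.
SAMPLE_CSV = """carat,cut,price
0.23,Ideal,326
0.21,Premium,
0.29,Good,334
"""


def summarize(df: pd.DataFrame) -> dict:
    """Return the two facts listed for every dataset in this post."""
    return {
        "dimensions": df.shape,  # (rows, columns)
        "missing_values": bool(df.isna().any().any()),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(df)
print(summary)
```

With a real download, you would pass the path of the CSV file to `pd.read_csv` instead of the in-memory sample.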
2. Age of abalone shells
This is a unique dataset from the field of zoology. The task is to predict the age of abalone (a type of mollusk) from several physical measurements. Traditionally, their age is determined by cutting the shell through the cone, staining it, and counting the rings under a microscope.
For zoologists, this might be fun, but for data scientists, not so much:
🎯 Target: ‘Rings’
🔗 Link: Kaggle
📦Dimensions: (4177, 9)
⚙Missing values: No
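The target here is a ring count rather than an age. The UCI description of this dataset states that adding 1.5 to the number of rings gives the age in years, so a prediction can be converted with a helper like the one sketched below. The function name and the sample ring counts are illustrative assumptions.

```python
def rings_to_age(rings: int) -> float:
    """Approximate age in years from a ring count (rings + 1.5)."""
    if rings < 0:
        raise ValueError("ring count cannot be negative")
    return rings + 1.5


# Made-up ring counts, not rows from the dataset.
ages = [rings_to_age(r) for r in (7, 9, 15)]
print(ages)
```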
3. King County house sales
This dataset is for those who are still interested in real estate and house price regression:
🎯 Target: ‘price’
🔗 Link: Kaggle
📦Dimensions: (21613, 17)
⚙Missing values: Yes
4. Cancer death rate
This dataset challenges you to predict the cancer mortality rate per 100,000 people using several demographic variables:
🎯 Target: ‘TARGET_deathRate’
🔗 Link: Data.world
📦Dimensions: (3047, 33)
⚙Missing values: Yes
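A rate per 100,000 simply rescales a raw death count by the size of the population, which makes regions of different sizes comparable. The arithmetic can be sketched as follows. The helper name and the figures are invented for illustration and are not taken from the dataset.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Scale a raw count to a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return deaths * 100_000 / population


# A hypothetical county: 450 deaths in a population of 250,000.
example_rate = rate_per_100k(deaths=450, population=250_000)
print(example_rate)
```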
5. Life expectancy
How long will a person live? This is one of the hardest unanswered questions in science. Many studies have tried to understand human life and longevity, and this dataset, provided by the WHO (World Health Organization), comes from one such effort:
🎯 Target: ‘Life expectancy’
🔗 Link: Kaggle
📦Dimensions: (2938, 21)
⚙Missing values: Yes
6. Car prices
The title says it all: predict car prices using variables like mileage, fuel type, transmission, and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles:
🎯 Target: ‘selling_price’
🔗 Link: Kaggle
📦Dimensions: (8128, 12)
⚙Missing values: Yes
Binary classification datasets
7. NBA rookie stats
The first binary classification dataset in the list requires you to predict if a rookie basketball player will last more than 5 years in the league:
🎯 Target: ‘TARGET_5Yrs’
🔗 Link: Data.world
📦Dimensions: (1340, 21)
⚙Missing values: Yes
8. Stroke prediction
This medical dataset asks you to predict whether a patient will have a stroke, based on their medical history and several other interesting features:
🎯 Target: ‘stroke’
🔗 Link: Kaggle
📦Dimensions: (5110, 11)
⚙Missing values: Yes
9. Water potability
Safe drinking water is one of the most basic human rights and a major factor in health. Using this dataset, you should classify water bodies as potable (drinkable) or not potable using several chemical properties:
🎯 Target: ‘Potability’
🔗 Link: Kaggle
📦Dimensions: (3276, 10)
⚙Missing values: Yes
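Datasets marked "Missing values: Yes" need the gaps handled before most models will accept them. One simple option, sketched below under stated assumptions, is to fill each numeric column with its median. The rows and feature column names are invented for illustration, and median imputation is just one of several reasonable strategies.

```python
import pandas as pd

# Hypothetical measurements; None marks a missing reading.
water = pd.DataFrame(
    {
        "ph": [7.0, None, 8.0, 6.0],
        "Sulfate": [330.0, 350.0, None, 310.0],
        "Potability": [1, 0, 0, 1],
    }
)

# Separate the features from the target before imputing.
features = water.drop(columns="Potability")

# Fill the gaps in each column with that column's median.
filled = features.fillna(features.median())
print(filled)
```

In a real project, the medians should be computed on the training split only and then reused for the validation and test splits, so that no information leaks from the held-out data.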
10. Smart grid stability
This is an augmented version of the “Electrical Grid Stability Simulated Dataset” created by Vadim Arzamasov. It was donated to UCI and made available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever that means):
🎯 Target: ‘stabf’
🔗 Link: Kaggle
📦Dimensions: (60000, 13)
⚙Missing values: No
11. IBM HR analytics & employee attrition
This fictional dataset, created by IBM data scientists, asks you to uncover which factors lead to employee attrition (whether they will leave their role):
🎯 Target: ‘Attrition’
🔗 Link: Kaggle
📦Dimensions: (1470, 35)
⚙Missing values: No
12. Can I eat this mushroom?
Another one-of-a-kind dataset asks you to classify mushrooms as edible or poisonous. It also presents a unique challenge: all features are categorical:
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (8124, 23)
⚙Missing values: Yes
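Because every feature is categorical, the columns have to be turned into numbers before most models can use them. One common approach is one-hot encoding, sketched here with `pandas.get_dummies`. The rows, the feature column names, and the single-letter codes are illustrative assumptions, as is treating ‘p’ as the poisonous label.

```python
import pandas as pd

# Hypothetical rows with single-letter category codes.
mushrooms = pd.DataFrame(
    {
        "class": ["e", "p", "e"],
        "cap_shape": ["x", "b", "x"],
        "odor": ["n", "f", "a"],
    }
)

# One indicator column per (feature, category) pair.
X = pd.get_dummies(mushrooms.drop(columns="class"))

# Binary target: 1 for poisonous, 0 for edible (assumed coding).
y = (mushrooms["class"] == "p").astype(int)

print(X.columns.tolist())
print(y.tolist())
```

One-hot encoding can produce many columns when features have a lot of categories, so tree-based models with ordinal or native categorical encoding are a reasonable alternative to try.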
13. Banknote authentication
Even though this dataset has very few features, I wanted to include it because the task is really interesting: using features extracted from images of banknotes, you should classify them as forged or genuine:
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (1372, 5)
⚙Missing values: No
14. Adult income dataset
Predict whether a person earns more than $50k a year using factors like age, education, background, gender, marital status, etc.:
🎯 Target: ‘income’
🔗 Link: Kaggle
📦Dimensions: (48842, 15)
⚙Missing values: Yes
Multi-class classification datasets
15. Yeast classification
This dataset will give you a small taste of the world of microbiology. You are tasked with classifying proteins from yeast (a type of fungus) by where they are localized in the cell:
🎯 Target: ‘class_protein_localization’
🔗 Link: OpenML
📦Dimensions: (1484, 9)
⚙Missing values: No
16. Kaggle TPS May 2021
Kaggle hosts monthly competitions called the “Tabular Playground Series” with tasks of beginner-to-intermediate difficulty. Most importantly, a new synthetic dataset of considerable size is created each month using the CTGAN framework. This one is from the May edition:
🎯 Target: ‘target’
🔗 Link: Kaggle
📦Dimensions: (100000, 52)
⚙Missing values: No
17. Kaggle TPS June 2021
A similar dataset with more features and samples:
🎯 Target: ‘target’
🔗 Link: Kaggle
📦Dimensions: (200000, 77)
⚙Missing values: No
18. Diamonds, again
I am mentioning the diamonds dataset again because it has three categorical features, each of which can serve as a multi-class target on its own:
🎯 Targets: ‘cut’, ‘color’, ‘clarity’
🔗 Link: Kaggle
📦Dimensions: (53940, 10)
⚙Missing values: No
Finding a good, novel dataset is hard, especially if you are a beginner. I hope this list makes the process easier and is worth bookmarking. Thanks for reading.