18 Non-Cliché Datasets for Beginner Data Scientists to Build a Strong Portfolio



Unique datasets ranging from microbiology to sports

Photo by Doğukan Şahin on Pexels. All images are by the author unless specified otherwise.

You have probably developed a slightly queasy feeling from seeing people use the same datasets over and over (and over) again.

You just can’t help it — everyone wants the easy thing. Beginners use datasets like Titanic, Iris, and Ames Housing Dataset because they are stupidly straightforward; most course creators and bloggers use them because they are just a single Google search away (or even bookmarked).

In the 115+ articles I have written, I honestly can’t remember using any of those clichés (if my memory is letting me down and I did use them once or twice, I apologize!). This is mainly thanks to the many hours I have spent searching for good datasets so that I could deliver content in novel ways to my precious audience.

Today, I have decided to share a curated list of datasets I have used in my posts and as part of my own learning. Enjoy!

Regression datasets

1. Diamond prices and carat regression

My favorite from this list is the diamonds dataset. It is an ideal size for practice (50k+ samples) and has multiple targets, which you can predict as either a regression or a multi-class classification task:

🎯 Targets: ‘carat’ or ‘price’

🔗 Link: Kaggle

📦Dimensions: (53940, 10)

⚙Missing values: No

📚Starter notebook
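
The "Dimensions" and "Missing values" entries in fact sheets like the one above are easy to check yourself before you commit to a dataset. Here is a minimal sketch using only Python's standard library. The `describe_csv` helper and the tiny inline sample are made up for illustration (the values are invented, not taken from the real file), and treating empty strings, "NA", "NaN", and "?" as missing is an assumption you should adjust per dataset.

```python
import csv
import io

# Assumed missing-value markers; real datasets may use others.
MISSING_MARKERS = {"", "NA", "NaN", "?"}


def describe_csv(text):
    """Return ((n_rows, n_cols), has_missing) for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    has_missing = any(
        cell.strip() in MISSING_MARKERS for row in body for cell in row
    )
    return (len(body), len(header)), has_missing


# Hypothetical three-row sample shaped like a slice of the diamonds data.
sample = """carat,cut,price
0.23,Ideal,326
0.21,Premium,
0.29,Good,334
"""

shape, has_missing = describe_csv(sample)
print("Dimensions:", shape)
print("Missing values:", "Yes" if has_missing else "No")
```

For a real file, you would read its contents with `open(path, newline="").read()` and pass the text to the helper.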

2. Age of Abalone shells

This is a unique dataset from the field of zoology. The task is to predict the age of abalones (a type of mollusk) from several physical measurements. Traditionally, their age is found by cutting the shell through the cone, staining it, and counting the number of rings under a microscope.

For zoologists, this might be fun, but for data scientists, not so much:

🎯 Target: ‘Rings’

🔗 Link: Kaggle

📦Dimensions: (4177, 9)

⚙Missing values: No

📚Starter notebook
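
Since the target is a ring count rather than an age, you need one extra step to report predictions in years. The dataset's original UCI description states that adding 1.5 to the number of rings gives the age in years. A minimal sketch of that conversion follows; the `rings_to_age` name and the sample ring counts are hypothetical.

```python
def rings_to_age(rings):
    """Convert a ring count to an approximate age in years (rings + 1.5)."""
    if rings < 0:
        raise ValueError("ring count cannot be negative")
    return rings + 1.5


# Hypothetical ring counts, e.g. rounded model predictions.
predicted_rings = [7, 9, 15]
ages = [rings_to_age(r) for r in predicted_rings]
print(ages)
```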

3. King County house sales

This is the dataset for those who are still interested in real estate and house price regression:

🎯 Target: ‘price’

🔗 Link: Kaggle

📦Dimensions: (21613, 17)

⚙Missing values: Yes

📚Starter notebook

4. Cancer death rate

This dataset challenges you to predict the cancer mortality rate per 100,000 people using several demographic variables:

🎯 Target: ‘TARGET_deathRate’

🔗 Link: Data.world

📦Dimensions: (3047, 33)

⚙Missing values: Yes
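
If rates "per 100,000" are new to you, the arithmetic is simple: divide the number of deaths by the population and scale up to 100,000 people. The sketch below illustrates the idea with invented numbers. The function name is hypothetical, and I am assuming the target column follows this standard definition.

```python
def rate_per_100k(deaths, population):
    """Scale a raw death count to a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return deaths * 100_000 / population


# Invented example: 45 deaths in a county of 25,000 people.
example_rate = rate_per_100k(45, 25_000)
print(example_rate)
```

Expressing the target this way lets you compare a county of 25,000 people with one of 2.5 million on the same scale.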

5. Life expectancy

How long will a person live? This is one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life and longevity, and this dataset provided by the WHO (World Health Organization) comes from one of them:

🎯 Target: ‘Life expectancy’

🔗 Link: Kaggle

📦Dimensions: (2938, 21)

⚙Missing values: Yes

📚Starter notebook

6. Car prices

The title says it all: predict car prices using variables like mileage, fuel type, transmission, and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles:

🎯 Target: ‘selling_price’

🔗 Link: Kaggle

📦Dimensions: (8128, 12)

⚙Missing values: Yes

📚Starter notebook

Binary classification datasets

7. NBA rookie stats

The first binary classification dataset in the list requires you to predict if a rookie basketball player will last more than 5 years in the league:

🎯 Target: ‘TARGET_5Yrs’

🔗 Link: Data.world

📦Dimensions: (1340, 21)

⚙Missing values: Yes

📚Starter notebook

8. Stroke prediction

Another medical dataset, this one asks you to predict whether a patient will have a stroke based on their medical history and several other interesting features:

🎯 Target: ‘stroke’

🔗 Link: Kaggle

📦Dimensions: (5110, 11)

⚙Missing values: Yes

📚Starter notebook

9. Water potability

Safe drinking water is a basic human right and a major factor in health. Using this dataset, you classify water bodies as potable (drinkable) or not potable based on several chemical properties:

🎯 Target: ‘Potability’

🔗 Link: Kaggle

📦Dimensions: (3276, 10)

⚙Missing values: Yes

📚Starter notebook

10. Smart grid stability

This is an augmented version of the “Electrical Grid Stability Simulated Dataset” created by Vadim Arzamasov. It was donated to UCI and is also available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever that means):

🎯 Target: ‘stabf’

🔗 Link: Kaggle

📦Dimensions: (60000, 13)

⚙Missing values: No

📚Starter notebook

11. IBM HR analytics & employee attrition

This fictional dataset, created by IBM data scientists, tasks you with uncovering which factors lead to employee attrition (whether employees will leave their role):

🎯 Target: ‘Attrition’

🔗 Link: Kaggle

📦Dimensions: (1470, 35)

⚙Missing values: No

📚Starter notebook

12. Can I eat this mushroom?

Another one-of-a-kind dataset asks you to classify mushrooms as edible or poisonous. It also presents a unique challenge, since all of its features are categorical:

🎯 Target: ‘class’

🔗 Link: Kaggle

📦Dimensions: (8124, 23)

⚙Missing values: Yes

📚Starter notebook
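
Because every feature here is categorical, most models will need the columns encoded as numbers first. One common option is one-hot encoding, sketched below with only the standard library so you can see what it does. The `one_hot_encode` helper and the three sample rows are invented for illustration. In practice you would probably reach for `pandas.get_dummies` or scikit-learn's `OneHotEncoder` instead.

```python
def one_hot_encode(rows, columns):
    """One-hot encode a list of dict rows over the given categorical columns."""
    # Sort the levels of each column so the output order is deterministic.
    levels = {col: sorted({row[col] for row in rows}) for col in columns}
    names = [f"{col}={lvl}" for col in columns for lvl in levels[col]]
    encoded = [
        [1 if row[col] == lvl else 0 for col in columns for lvl in levels[col]]
        for row in rows
    ]
    return names, encoded


# Invented sample rows using single-letter codes.
rows = [
    {"cap-shape": "x", "odor": "p"},
    {"cap-shape": "b", "odor": "a"},
    {"cap-shape": "x", "odor": "a"},
]

names, encoded = one_hot_encode(rows, ["cap-shape", "odor"])
print(names)
for vector in encoded:
    print(vector)
```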

13. Banknote authentication

Even though this dataset has very few features, I wanted to include it because the task is really interesting. Using features extracted from images of banknotes, you classify them as forged or genuine:

🎯 Target: ‘class’

🔗 Link: Kaggle

📦Dimensions: (1372, 5)

⚙Missing values: No

📚Starter notebook

14. Adult income dataset

Predict whether a person earns more than $50k a year using factors like age, education, background, gender, marital status, etc.:

🎯 Target: ‘income’

🔗 Link: Kaggle

📦Dimensions: (48842, 15)

⚙Missing values: Yes

📚Starter notebook

Multi-class classification datasets

15. Yeast classification

This dataset will give you a small taste of the world of microbiology. You are tasked with predicting where in the cell each protein of a fungus called yeast is localized:

🎯 Target: ‘class_protein_localization’

🔗 Link: OpenML

📦Dimensions: (1484, 9)

⚙Missing values: No

16. Kaggle TPS May 2021

Kaggle hosts monthly competitions called the “Tabular Playground Series” with beginner-to-intermediate tasks. The most important point is that a new synthetic dataset of considerable size is created each month using the CTGAN framework. This one is from the May edition:

🎯 Target: ‘target’

🔗 Link: Kaggle

📦Dimensions: (100000, 52)

⚙Missing values: No

📚Starter notebook

17. Kaggle TPS June 2021

A similar dataset with more features and samples:

🎯 Target: ‘target’

🔗 Link: Kaggle

📦Dimensions: (200000, 77)

⚙Missing values: No

📚Starter notebook

18. Diamonds, again

Just mentioning the diamonds dataset again because it has three categorical features, which can be multi-class targets on their own:

🎯 Targets: ‘cut’, ‘color’, ‘clarity’

🔗 Link: Kaggle

📦Dimensions: (53940, 10)

⚙Missing values: No

📚Starter notebook
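
To use one of these columns as a multi-class target, you first need to turn its text labels into integers. Here is a minimal sketch for ‘cut’, whose grades in this dataset run from Fair (worst) to Ideal (best). The `encode_cut` helper and the sample values are hypothetical, and keeping the quality order in the labels is a design choice of mine, not a requirement.

```python
# Cut grades from worst to best.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
CUT_TO_LABEL = {name: idx for idx, name in enumerate(CUT_ORDER)}


def encode_cut(values):
    """Map cut grade names to integer class labels (0 = Fair, 4 = Ideal)."""
    return [CUT_TO_LABEL[value] for value in values]


# Hypothetical values from the 'cut' column.
labels = encode_cut(["Ideal", "Fair", "Very Good"])
print(labels)
```

An unknown grade raises a `KeyError`, which is usually what you want: it flags typos or unexpected categories early.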

Finding a good, novel dataset is hard, especially if you are a beginner. I hope this list makes the process easier and is one you can bookmark. Thanks for reading.
