All the Datasets You Need to Practice Data Science Skills and Make a Great Portfolio

A Great Collection of Different kinds of Datasets

Every time I start a project, whether to learn a new topic or to build something for my portfolio, I spend a significant amount of time finding a suitable dataset. As a result, I have collected quite a lot of datasets that helped me learn and build some cool portfolio projects. I am sharing those datasets in this article so that you have data to practice with and to build your own portfolio.

Olympic Dataset

This dataset has information on the Olympic results. Each row contains the data of a country. This dataset will give you a taste of data cleaning to start with.

I learned Python’s libraries like Numpy and Pandas using this dataset.

Download this dataset from here

Housing Price dataset

This dataset is commonly used to teach and learn Regression Models. It can be used for other tasks as well.

This dataset contains these columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15.

Here is the link.
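
As a minimal sketch of the regression use case, the snippet below fits a one-variable least-squares line of price on sqft_living using only the Python standard library. The helper names (fit_line, load_xy) are hypothetical, and the three sample rows are made-up toy values, not rows from the real dataset.

```python
import csv
import io


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def load_xy(handle, x_col="sqft_living", y_col="price"):
    """Read two numeric columns from an open CSV file handle."""
    rows = list(csv.DictReader(handle))
    xs = [float(r[x_col]) for r in rows]
    ys = [float(r[y_col]) for r in rows]
    return xs, ys


# Toy rows for illustration only (not taken from the real dataset).
TOY_CSV = """price,sqft_living
200000,1000
300000,1500
400000,2000
"""

xs, ys = load_xy(io.StringIO(TOY_CSV))
slope, intercept = fit_line(xs, ys)
print(f"price ~ {intercept:.0f} + {slope:.1f} * sqft_living")
```

To try it on the downloaded file, pass an open file handle to load_xy in place of the io.StringIO object. A single predictor is only a starting point. The other columns listed above are natural candidates for a multiple regression.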

Heart Disease Dataset

This dataset is from Kaggle. I used it in several articles for demonstration purposes.

These are two examples:

The Kaggle page also includes some exploratory data analysis and details about the features.

Download this dataset from this link.

Mushrooms Dataset

I found this dataset in the Applied Data Science with Python Specialization on Coursera.

I used it for Classification problems. It can be used for other purposes as well.

It contains these columns: class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat.

Here is the link to this dataset

NHANES Dataset

This is a big dataset that includes a lot of continuous and categorical features, so you can use the whole dataset or part of it for many different purposes. The column names may not look very understandable in the beginning. But once you get used to them, it can be a very useful dataset to practice Data Analysis, Visualization, Statistical Modeling, and Machine Learning models (both classification and regression).

In this article, I cut a piece of the dataset and used it for multiple linear regression:

Here I used it for some visualization demonstration:

Download it from here

Titanic Dataset

Another very popular dataset. I have used it a lot myself, and I have seen many experienced people use it to present a concept.

This dataset contains these columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.

This dataset is good for Exploratory Data Analysis, Machine Learning Models (especially Classification Models), Statistical Analysis, and Data Visualization practice.

This is a tutorial where I used this dataset:

Here is a demonstration of some pandas functions using this dataset:

Here is the link to this dataset
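
As a small illustration of the kind of exploratory analysis this dataset invites, the sketch below computes the survival rate per group from the Survived and Sex columns listed above. The function name survival_rate_by is hypothetical, and the six rows are toy values invented for the example, not real passengers.

```python
import csv
import io
from collections import defaultdict


def survival_rate_by(rows, key):
    """Return {group: fraction survived} for the given grouping column."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        survived[group] += int(row["Survived"])  # Survived is 0 or 1
    return {group: survived[group] / totals[group] for group in totals}


# Toy rows for illustration only (not real passenger records).
TOY_CSV = """PassengerId,Survived,Sex
1,0,male
2,1,female
3,1,female
4,0,male
5,1,male
6,0,female
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
rates = survival_rate_by(rows, "Sex")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

The same function works for any categorical column, for example Embarked, by changing the key argument.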

Census Dataset

If you want to get a taste of how to explore a very big dataset, work with this one.

This one is great for Exploratory Data Analysis, Statistical Analysis & Modeling, and Data Visualization practice.

Here is some practice of data analysis with this dataset:

Download this dataset from here.

Credit Card Fraud Dataset

This dataset is different from the other datasets mentioned here because it has no feature names. Sometimes Data Scientists have to deal with datasets like that.

This dataset is about credit card fraud detection. It is very likely that a bank will not share its client information with a data scientist. So, the feature names won’t be available. This dataset gives a flavor of that. It has a binary column that indicates if a transaction is fraudulent or not. This dataset can be used for classification models.

Here is an example on this GitHub page:

This dataset can also be used for Exploratory Data Analysis and Visualization.

Download this dataset from this link.
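
Before training a classifier on a binary label like the fraud indicator described above, a reasonable first step is to check how the two classes are balanced. The sketch below does that with the standard library. The function name class_balance is hypothetical, and the label list is synthetic and says nothing about the real proportions in this dataset.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: share of rows} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Synthetic labels for illustration: 1 = fraudulent, 0 = not fraudulent.
toy_labels = [0] * 98 + [1] * 2

balance = class_balance(toy_labels)

# Accuracy of a model that always predicts the most common class.
majority_baseline = max(balance.values())

print(balance)
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

The majority-class baseline is a useful yardstick: a classifier whose accuracy does not beat it has not learned anything from the features.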

Movie Dataset

This dataset contains features related to different movies. This is a good dataset for some Natural Language Processing projects.

These are the features:

index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew, director.

Here is a demonstration of a movie recommendation algorithm using this dataset:

Download this dataset from this link.

People Wiki Dataset

This dataset contains the Wikipedia profiles of many different kinds of people. It has three features: URI, name (name of the person), and text (the Wikipedia profile). As you may have already guessed, this is also a good dataset for Natural Language Processing.

Here is an example of a project with this dataset:

Here is the link to this dataset

Amazon Product Review Dataset

This dataset contains millions of reviews of Amazon products.

It has three columns: name of the product, review, and rating. This dataset is close to real-world data and is very good for Natural Language Processing.

I have a sentiment analysis project and an article where I used this dataset. Please check it out here:

Download this dataset from this link.

BBC Text Dataset

Another wonderful dataset for Natural Language Processing.

This dataset contains information on different types of news from BBC archives. It’s a big text dataset.

It is popular for Multiclass Classification problems.

The dataset is big but it has only two columns: text and category.

Here is the link for this dataset
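
With only a text column and a category column, a common first step toward multiclass classification is turning each text into word counts. The sketch below builds per-category word counts using the standard library. The function names and the three rows are hypothetical toy values written for this example, not taken from the dataset.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [word for word in text.lower().split() if word.isalpha()]


def word_counts_by_category(rows):
    """Return {category: Counter of words} from (text, category) pairs."""
    counts = {}
    for text, category in rows:
        counts.setdefault(category, Counter()).update(tokenize(text))
    return counts


# Toy (text, category) pairs for illustration only.
toy_rows = [
    ("Team wins the final match", "sport"),
    ("Market shares fall again", "business"),
    ("Coach praises the team", "sport"),
]

counts = word_counts_by_category(toy_rows)
for category, counter in counts.items():
    print(category, counter.most_common(2))
```

Word counts like these are the raw material for simple text classifiers. In practice you would also remove very common words and handle punctuation more carefully than this whitespace split does.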

Digits dataset

This dataset contains pixel values of 10 digits. It is commonly used for image recognition problems.

I used this dataset for a few different types of Multiclass Classification problems.

This is a logistic regression algorithm:

Here is a demonstration of a neural network in Python:

Download this dataset from this link.

CIFAR Dataset

This is another dataset that contains the pixel values of different images. The difference from the digits dataset is that the pixel values of each image form a three-dimensional matrix.

Here is a project where I tried different neural network structures using Tensorflow and Keras with this dataset:

Cats vs Dogs

Very commonly used to practice Image Classification.

This dataset contains images of cats and dogs.

It is good for computer vision problems.

Here is the link

Malignant vs Benign

Another useful dataset for Computer Vision problems.

This dataset contains images of two types of skin cancer.

It is good for Image Classification problems.

Download this dataset from here

Cars Dataset

This is a reasonably sized dataset that can be used to practice some Regression Models and Exploratory Data Analysis.

This dataset contains these columns: YEAR, Make, Model, Size, (kW), Unnamed: 5, TYPE, CITY (kWh/100 km), HWY (kWh/100 km), COMB (kWh/100 km), CITY (Le/100 km), HWY (Le/100 km), COMB (Le/100 km), (g/km), RATING, (km), TIME (h).

Here is the link for this dataset

Canada Immigration Dataset

This dataset provides information about how many immigrants came from which country by year.

A great dataset to practice Exploratory Data Analysis and Data Visualization.

Here is the link

Facebook Stock Data

It provides Facebook stock performance per day.

The columns in this dataset are Date, Open, High, Low, Close, Adj Close, Volume.

This one can be very useful for Time Series Analysis and Visualization and other time-series-related problems.

Here are some time series analysis and visualization tutorials using this dataset:

Here is the link
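
As a minimal sketch of a typical time series step, the snippet below computes a simple moving average of the Close column using only the standard library. The function name moving_average is hypothetical, and the four rows are toy values, not actual Facebook prices.

```python
import csv
import io


def moving_average(values, window):
    """Simple moving average; returns one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Toy rows for illustration only (not real stock prices).
TOY_CSV = """Date,Close
2020-01-02,10
2020-01-03,20
2020-01-06,30
2020-01-07,40
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
closes = [float(row["Close"]) for row in rows]
sma = moving_average(closes, window=2)
print(sma)
```

A longer window gives a smoother line at the cost of reacting more slowly to price changes, so it is worth plotting a few window sizes side by side.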

Airbnb Dataset

I received this dataset as a part of an interview a while ago.

I was asked to do an Exploratory Data Analysis and develop a Machine Learning Model using this dataset.

This dataset has a lot of text data and numerical data. You can use this dataset to practice a lot of different types of projects.

You will see several datasets in this link. But I was asked to download the listings.csv file for my interview.

Florida Subsidence Incidents Report

I wanted to add one dataset that includes latitude and longitude data in case you are interested in working on some geospatial analysis. I used this dataset for some visualization practice:


These are all the datasets I wanted to share today. You should find enough datasets, and some project ideas as well, on this page to practice the necessary skills and build a portfolio. I hope this helps.

Please feel free to follow me on Twitter.

