The first step to solving any data science problem is data collection. Sometimes, this data will be available in the form of an SQL database or an Excel sheet. At other times, you will need to extract data yourself — either by using APIs or web scraping.
Below, I will list some of the most common data collection libraries in Python. I use these libraries very often depending on the type of data I need to collect, and they have made my data science workflow a lot easier.
If the data you need to extract is in the form of an SQL database, you will need to load the database into Python before pre-processing and analyzing it.
MySQLConnector (MySQL Connector/Python) is a library that allows you to establish a connection with a MySQL database using Python.
You can load database tables easily with the help of this library, then convert the tables into Pandas data frames to perform further data manipulation.
You can also create databases and write to them with the help of this library.
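As a rough sketch of that load-then-convert pattern, the helper below runs a query through a database connection and builds a Pandas data frame from the result. To keep the example runnable without a MySQL server, it uses an in-memory SQLite database as a stand-in; the `load_table` helper, the `products` table, and its columns are all hypothetical. With MySQLConnector, the connection would instead come from `mysql.connector.connect(host=..., user=..., password=..., database=...)`.

```python
import sqlite3

import pandas as pd


def load_table(conn, query):
    """Run a query on a DB-API connection and return the rows as a DataFrame."""
    cursor = conn.cursor()
    cursor.execute(query)
    # cursor.description has one entry per column; the first field is the name.
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    cursor.close()
    return pd.DataFrame(rows, columns=columns)


# Stand-in database (assumption): SQLite in memory instead of a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("widget", 9.99), ("gadget", 24.50)],
)
conn.commit()

products_df = load_table(conn, "SELECT name, price FROM products ORDER BY name")
print(products_df)
conn.close()
```

Because the helper only relies on the standard cursor interface, the same function should work unchanged once `conn` points at a real MySQL database.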
Get started with MySQLConnector:
Companies often depend on external data when making business decisions — they might want to compare prices of competitor products, analyze competitor brand reviews, etc.
BeautifulSoup is a Python library that parses the HTML of a web page so that you can extract the data you need from it.
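Here is a minimal sketch of what extracting competitor prices might look like. It parses an inline HTML snippet rather than a live page, so the markup, class names, and products are all hypothetical; on a real site you would first download the page (for example with the `requests` library) and adapt the selectors to its structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded competitor product page.
html = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">$24.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
for item in soup.find_all("li", class_="product"):
    name = item.find("span", class_="name").get_text(strip=True)
    price_text = item.find("span", class_="price").get_text(strip=True)
    # Drop the currency symbol and convert the price to a number.
    products.append({"name": name, "price": float(price_text.lstrip("$"))})

print(products)
```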
Here is a tutorial to help you get started with BeautifulSoup in Python.
Social media APIs
Social media platforms like Twitter, Facebook, and Instagram generate large amounts of data on a daily basis.
This data can be useful for many data science projects, such as:
Company A has just released a product and come up with a special discount. How are their customers responding to the product and this discount? Are people talking more about the brand than usual? Is the promotion driving higher brand awareness? How good is the overall product sentiment when compared to competitor brands?
It is difficult for a company to gauge things like overall brand sentiment (on a large scale) solely with internal data.
Social media analysis plays a huge role in collecting data for tasks like churn prediction and customer segmentation.
And it really isn’t difficult to collect data from social platforms, since there are a lot of publicly available APIs that can help you do this quickly. Some of them include:
Here are some tutorials to help you get started:
Here is an example of a sentiment analysis project I created with a Twitter API.
Real world data is dirty. It doesn’t always come in the format of an Excel sheet or a .csv file. It could come in the format of an SQL database, text file, JSON dictionary, or even a PDF file.
As a data scientist, a huge portion of your time will be dedicated to creating data frames, cleaning them, and merging them together.
Some Python libraries that can help with data preparation include:
Numpy is a package that allows you to perform operations quickly on large amounts of data.
You can convert data frames into arrays, manipulate matrices, and easily find basic statistics (like the median or standard deviation) of a population with the help of Numpy.
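A small sketch of these operations, using made-up numbers (the `orders` array is a hypothetical sample):

```python
import numpy as np

# Hypothetical sample: daily order counts for one week.
orders = np.array([12, 15, 11, 30, 14, 13, 15])

median = np.median(orders)
std = np.std(orders)  # population standard deviation (ddof=0 by default)

# Matrix manipulation: build a 2 x 3 matrix and transpose it to 3 x 2.
matrix = np.arange(6).reshape(2, 3)
transposed = matrix.T

print(median, std)
print(transposed.shape)
```

If you are starting from a Pandas data frame, `DataFrame.to_numpy()` gives you the underlying values as an array.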
Some tutorials to help you get started with Numpy:
Pandas is one of the most popular and widely used Python packages for data science.
You can easily read different file types and create data frames with the help of Pandas. Then, you can create functions to pre-process this data really quickly — you can clean the data frame, remove missing/invalid values, and perform data scaling/standardization.
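As an illustration, here is a minimal sketch that reads a small CSV, drops rows with missing values, and standardizes the numeric columns. The data and column names are hypothetical, and the CSV is held in memory only to keep the example self-contained; `pd.read_csv` would normally take a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
raw_csv = io.StringIO(
    "customer,age,spend\n"
    "a,34,120.0\n"
    "b,,80.0\n"
    "c,29,100.0\n"
    "d,41,140.0\n"
)

df = pd.read_csv(raw_csv)

# Remove rows with missing values (customer "b" has no age).
clean = df.dropna().reset_index(drop=True)

# Standardize the numeric columns to z-scores: zero mean, unit sample std.
numeric_cols = ["age", "spend"]
clean[numeric_cols] = (
    clean[numeric_cols] - clean[numeric_cols].mean()
) / clean[numeric_cols].std()

print(clean)
```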
To learn Pandas, you can take the following tutorials:
Have you ever encountered invalid values, weird symbols, or whitespaces when working with Pandas data frames?
Although RegEx isn’t a library specifically built for data scientists, I’m adding it to this list because it is incredibly useful.
You can use RegEx (or Regular Expressions) to identify a set of characters within data. In Python, regular expressions are available through the built-in `re` module. You can use them to find rows of data that satisfy a certain condition, to pre-process data, and to remove invalid values that don’t match a specific format.
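To make this concrete, here is a short sketch using Python's built-in `re` module. The sample values and the "valid" phone format are assumptions made up for illustration.

```python
import re

# Hypothetical raw values, some with stray whitespace or symbols.
raw_phone_numbers = [" 555-1234 ", "555-9876", "n/a", "555_0000#", "5551234"]

# Valid format (assumption): three digits, a hyphen, four digits.
phone_pattern = re.compile(r"\d{3}-\d{4}")

# Trim surrounding whitespace, then split values by whether they match.
cleaned = [value.strip() for value in raw_phone_numbers]
valid = [value for value in cleaned if phone_pattern.fullmatch(value)]
invalid = [value for value in cleaned if not phone_pattern.fullmatch(value)]

# Remove every non-digit character from the valid values.
digits_only = [re.sub(r"\D", "", value) for value in valid]

print(valid)
print(invalid)
print(digits_only)
```

The same patterns carry over to Pandas: `Series.str.contains` can filter rows by a pattern, and `Series.str.replace(..., regex=True)` can strip unwanted characters from a column.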
Some tutorials to start using RegEx with Pandas data frames:
The most important library to perform data analysis is Pandas. I’ve explained the use of Pandas for data pre-processing above, so I will now go through one of the best tools for data analysis built on top of Pandas:
Pandas-profiling is an incredibly useful package for data analysis. It is installed separately from Pandas. Once you run pandas-profiling on a data frame, it provides you with summary statistics of the data as shown below:
It also provides you with a description of each variable, along with the correlations between variables, their distributions, and their cardinality.
To learn more about Pandas profiling, read this article.
Another crucial part of any data science project is visualization. It is important to visualize the spread of variables, check their skewness, and understand the relationship between them.
Seaborn is a library you can use for this purpose. It is quick to import and you can make charts easily, with only one or two lines of code.
Here are some learning resources to help you get started with Seaborn:
Here is a data visualization tutorial I created in Seaborn:
Plotly is another visualization library I’m adding to this list. With Plotly, you can make beautiful, interactive visualizations.
It takes slightly more code and a bit more effort to customize Plotly visualizations.
I generally use Seaborn if I want to quickly check the distribution/relationship between variables. I use Plotly if I need to present visualizations to others, since Plotly’s charts are interactive and look nice.
Plotly also allows you to build interactive choropleth maps, which make it easy to plot location data. If you need to present data by region, country, or latitude/longitude, Plotly’s choropleth maps are the best way to do so.
Some learning resources to get started with Plotly: