Statistics for Data Science
Some basic statistical concepts for data scientists and how to use them
Math and statistics are powerful tools in the world of data science: together they form the foundation of every machine learning algorithm, and to succeed as a data scientist you must know your basics.
Statistics is the application of math to the technical analysis of data in order to gain meaningful insights. With statistics, we can operate on the data in an informed, targeted manner.
So, how is data science different from statistics? While the fields are closely related in the sense that both data scientists and statisticians aim to extract knowledge from the data, the main difference is the way in which these two communities approach things. Data science is often defined as the confluence of three areas: computer science, mathematics/statistics, and specific domain knowledge.
Josh Wills once said,
“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”
In this article, we will look at some basic statistical concepts for data science that every data scientist should know and how to apply them effectively.
1. Decoding the summary table
The describe table is also called the five-number summary because it characterizes the dataset using five key values: the minimum, the 25th, 50th, and 75th percentiles, and the maximum. It also reports the count, mean, and standard deviation. The 50% value is the median of the variable. For example, the median applicant income in this dataset is $3622.50.
If mean > median, the distribution has a longer tail on the right and is termed right-skewed; the reverse holds for left-skewed data. This is exactly what you will see when you plot the distribution. For a symmetrical distribution, mean = median.
The 25% value is the lower quartile: 25% of the data lies between the minimum of $150 and $2732. The 75% value is the upper quartile: 75% of the data lies between $150 and $5000. The span between these two quartiles is the interquartile range, so the middle 50% of the population makes between $2732 and $5000. If we want to target that middle 50% of the population for our product, we know how much they make, and we can price our product accordingly.
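In pandas, this table is produced by `DataFrame.describe()`. A minimal sketch, assuming pandas is available (the income values below are invented to echo the article's example, not taken from its dataset):

```python
import pandas as pd

# Hypothetical applicant incomes; chosen only so the median matches
# the article's example value of $3622.50.
df = pd.DataFrame({"ApplicantIncome": [150, 2732, 3622, 3623, 5000, 81000]})

summary = df["ApplicantIncome"].describe()
print(summary)  # count, mean, std, min, 25%, 50% (median), 75%, max

# Comparing the mean with the median (the 50% row) hints at skew:
mean = df["ApplicantIncome"].mean()
median = df["ApplicantIncome"].median()
print("right-skewed" if mean > median else "left-skewed or symmetric")
```

Here the single very large income pulls the mean far above the median, so the comparison reports a right skew.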
2. Bayes’ Theorem
Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule), describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
Let’s understand this with an example. Suppose we want to determine the probability that an email is spam given that it contains the word ‘winner’. In other words: given that the word ‘winner’ appears in the email, what is the probability that the email is spam?
Bayes’ theorem gives us:

P(spam | winner) = P(winner | spam) × P(spam) / P(winner)

where the total probability of ‘winner’ appearing is P(winner) = P(winner | spam) × P(spam) + P(winner | not spam) × P(not spam).
Assuming values for simple math:

P(spam) = 30%, so P(not spam) = 100% − 30% = 70%

From the training data we can estimate:

P(winner | spam) = 75%

P(winner | not spam) = 35% (note: this is not 100% − 75% as above, since ‘winner’ is not the only word that indicates spam, and not every email containing ‘winner’ is spam)
After substituting the values, the numerator is 0.75 × 0.30 = 0.225 and the denominator is 0.75 × 0.30 + 0.35 × 0.70 = 0.47, so the probability that the email is spam given that it contains the word ‘winner’ is 0.225 / 0.47 ≈ 0.48, or 48%.
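As a quick check, the same calculation in plain Python, using the article's assumed probabilities:

```python
# Bayes' theorem with the article's assumed values:
# P(spam | winner) = P(winner | spam) * P(spam) / P(winner)
p_spam = 0.30
p_not_spam = 1 - p_spam          # 0.70
p_winner_given_spam = 0.75       # estimated from training data
p_winner_given_not_spam = 0.35   # estimated from training data

# Total probability of seeing "winner" in any email:
p_winner = (p_winner_given_spam * p_spam
            + p_winner_given_not_spam * p_not_spam)

p_spam_given_winner = p_winner_given_spam * p_spam / p_winner
print(round(p_spam_given_winner, 2))  # 0.48
```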
3. Binomial distribution
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated ‘n’ number of times.
Let’s understand this with an example. Suppose a bank wants to determine, out of a sample of 10 customers, the probability that exactly 4 will pay their credit card bill on time.
Assuming that the probability of any one person paying on time is 0.6, the binomial formula gives:

P(X = 4) = C(10, 4) × 0.6^4 × 0.4^6
Therefore, the probability that exactly 4 out of 10 customers pay on time is 210 × 0.1296 × 0.004096 ≈ 0.11, or 11%.
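The same binomial calculation can be done with the standard library's `math.comb`:

```python
from math import comb

# Binomial PMF: P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
n, k, p = 10, 4, 0.6   # 10 customers, each pays on time with probability 0.6
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(round(prob, 3))  # 0.111
```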
4. Over Sampling and Under Sampling
Over-sampling and under-sampling are techniques used for classification problems. Sometimes, the dataset can be heavily imbalanced. For example, we may have 4000 observations for diabetic patients, but only 400 for non-diabetic patients. That’ll throw off a lot of the machine learning algorithms and the models would perform well only when predicting the outputs for diabetic patients. Over-sampling and under-sampling can tackle this problem.
In under-sampling, we randomly select records from the majority class to match the number of records in the minority class (400 records in this example). Random selection preserves the class's probability distribution. This approach is usually not preferred, however, since we end up discarding so much data.
Over-sampling, on the other hand, creates copies of the minority class so that it has the same number of examples as the majority class (4000 records in this example). This can be done either by simply duplicating minority-class records or by synthesizing new ones with the Synthetic Minority Over-sampling Technique (SMOTE).
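A minimal sketch of both techniques using only the standard library; the labels and counts mirror the article's 4000/400 example, and a real pipeline would typically use a dedicated library such as imbalanced-learn:

```python
import random

random.seed(0)

# Toy imbalanced dataset: 4000 majority-class and 400 minority-class rows.
majority = [("diabetic", i) for i in range(4000)]
minority = [("non_diabetic", i) for i in range(400)]

# Under-sampling: randomly keep only as many majority rows as minority rows.
undersampled = random.sample(majority, len(minority)) + minority

# Over-sampling (by duplication): draw minority rows with replacement
# until the classes are balanced. SMOTE would synthesize new points instead.
oversampled = majority + random.choices(minority, k=len(majority))

print(len(undersampled), len(oversampled))  # 800 8000
```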
5. Dimensionality Reduction
The number of input variables or features of a dataset is referred to as its dimensionality: a dataset with 3 features is 3-dimensional, one with 5 features is 5-dimensional, and so on. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
One route to dimensionality reduction is feature selection, where we use a scoring or statistical method to decide which features to keep and which to drop. Another is feature extraction, and the most common technique there is Principal Component Analysis (PCA), which projects the original features onto a smaller set of new components (linear combinations of the features) chosen to retain as much of the data's variance as possible. The trade-off is interpretability: because each component mixes many original features, explaining model outputs in terms of the original variables becomes much harder.
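A minimal PCA sketch using only NumPy; the data here is synthetic, generated so that two underlying directions carry almost all the variance (it is not the article's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 5-feature dataset built from 2 latent factors plus small noise,
# so a couple of principal components should capture most of the variance.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

# PCA via singular value decomposition of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per component
X_reduced = Xc @ Vt[:2].T         # project 5 features onto 2 components

print(X_reduced.shape)            # (200, 2)
print(round(explained[:2].sum(), 3))  # close to 1: two PCs dominate
```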