Basics of Statistics for Data Science


Descriptive and Inferential Statistics

A descriptive statistic quantitatively describes or summarizes data, providing measures such as the mean, median, and mode. Descriptive statistics also use graphical representations to visualize the distribution and show how the variables relate to one another.

Inferential statistics allows making predictions (“inferences”) from data: samples are taken, and generalizations about the population are made.

Data is categorized into population and sample. The population is the collection of all items of interest; a sample is a subset of the population. Sample data can be further categorized as a random sample or a representative sample: a random sample is a randomly drawn subset of the population, whereas a representative sample accurately reflects the population's parameters. Data is also categorized as numerical or categorical.

Representation of Categorical Data

1. Frequency distribution data
2. Bar chart
3. Pie chart
4. Pareto diagram

The Pareto diagram combines a bar graph and a line graph: the bars represent the value for each element, and the line represents the cumulative total. The Pareto principle states that 80% of the effect comes from 20% of the causes; e.g., software developers fix 20% of the bugs, which solves 80% of the problems.
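The cumulative line of a Pareto diagram can be sketched in a few lines of Python; the bug-category counts below are made up for illustration:

```python
from itertools import accumulate

# Hypothetical bug counts per category (assumed data)
bug_counts = {"UI": 120, "API": 80, "DB": 30, "Auth": 15, "Docs": 5}

# Bars: category values sorted in descending order
items = sorted(bug_counts.items(), key=lambda kv: kv[1], reverse=True)
total = sum(count for _, count in items)

# Line: running cumulative percentage of the total
cum_pct = [round(100 * c / total, 1)
           for c in accumulate(count for _, count in items)]

for (name, count), pct in zip(items, cum_pct):
    print(f"{name:5} {count:4} {pct:6.1f}%")
```

Note that the top two of five categories already account for 80% of the total, which is the pattern the Pareto principle describes.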

Representation of Numerical Data

1. Histogram
2. Crosstable
3. Box plot
4. Scatter plot

Interval width = (largest value − smallest value)/number of desired intervals

An interval (10, 20] ranging from 10 to 20 includes values above 10 and up to 20; 10 itself is not included, as it belongs to the previous interval [0, 10]. The first interval is closed on both ends and contains both of its end values.
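As a sketch, the interval-width formula and the half-open binning rule might be applied like this (the sample values and the interval count are assumed):

```python
import math

data = [3, 7, 12, 18, 24, 29, 35, 41, 47, 50]  # assumed values
k = 5                                          # desired number of intervals

# Interval width = (largest - smallest) / number of desired intervals
width = (max(data) - min(data)) / k            # (50 - 3) / 5 = 9.4

def bin_index(x, lo=min(data)):
    """Index of the half-open interval (lo, lo + width] containing x;
    the first interval is closed on both ends, so the minimum fits in bin 0."""
    if x == lo:
        return 0
    return math.ceil((x - lo) / width) - 1

bins = [bin_index(x) for x in data]
```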

A box plot depicts numerical data through its quartiles.

Central Tendency

The tendency for the values of a random variable to cluster around its mean, mode, or median.

Skewness

Right skew or positive skew is when: mean>median>mode

Left skew or negative skew is when: mode>median>mean

Variance

Variance is the dispersion of data points around the mean.

Population variance: σ² = Σ(x − μ)²/N, where N = total points, μ = population mean, x = data points.

Sample variance: s² = Σ(x − x̅)²/(n − 1), where x̅ = sample mean.

Standard deviation = √s² (sample) or √σ² (population)
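A quick check of these definitions with Python's stdlib `statistics` module (the data are an assumed sample):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # assumed sample, mean = 5

# Population variance: squared deviations divided by N
pop_var = statistics.pvariance(data)   # 32 / 8 = 4.0

# Sample variance: divided by n - 1 (Bessel's correction)
samp_var = statistics.variance(data)   # 32 / 7 ≈ 4.57

# Standard deviation is the square root of the variance
pop_std = statistics.pstdev(data)      # sqrt(4.0) = 2.0
```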

Coefficient of Variation

The coefficient of variation is the relative standard deviation; it is calculated as standard deviation/mean. It is used to compare the spread of data that are on different scales.

e.g., the price of pizza in \$ and ₹: say the std of pizza prices in \$ is 3.27 and the std in ₹ is 61.56.

The coefficient of variation (std/mean) works out to 0.60 for \$ and 0.60 for ₹. Note the spread is the same in both; it is just on a different scale, so we can compare data on different scales.
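A minimal sketch of the comparison, with the means assumed so that std/mean works out to 0.60 on both scales, as in the example:

```python
# Standard deviations from the example; the means are assumed
# so that std / mean = 0.60 on both scales
usd_std, usd_mean = 3.27, 5.45
inr_std, inr_mean = 61.56, 102.6

cv_usd = usd_std / usd_mean  # ≈ 0.60
cv_inr = inr_std / inr_mean  # ≈ 0.60
# Same relative spread, despite the very different scales
```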

Covariance

Covariance is a measure of the relationship between two variables.

cov(x, y) = Σ(x − x̅)(y − y̅)/(n − 1), where x̅, y̅ = means of x and y

• Positive covariance: indicates that the two variables tend to move in the same direction.
• Negative covariance: indicates that the two variables tend to move in inverse directions.
• Zero covariance: indicates no linear relationship between the variables.
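The formula above can be sketched directly (the size/price pairs are assumed data chosen to move together):

```python
def covariance(x, y):
    """Sample covariance: sum((xi - mean_x) * (yi - mean_y)) / (n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Assumed data: house sizes and prices that move together
size = [50, 60, 70, 80, 90]
price = [150, 180, 200, 240, 260]

cov = covariance(size, price)  # positive -> the variables move together
```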

Correlation Coefficient

r = cov(x, y)/(sₓ × s_y), where sₓ, s_y = standard deviations of x and y

The correlation coefficient adjusts covariance so that the relation between two variables is easy and intuitive to interpret.

Covariance: can be 0, positive, or negative (covariance tells you whether the two variables move in the same direction or in reverse, but it does not tell the magnitude of the relationship)

Correlation coefficient: −1 ≤ r ≤ 1

A correlation of 1 means the entire variability of one variable is explained by the other (e.g., house size and price: as size increases or decreases, the price increases or decreases with it).

A correlation of −1 means the same, but the variables move in opposite directions (e.g., ice cream and umbrella sales: in summer, ice cream sales rise and umbrella sales fall, and vice versa in the rainy season).

A correlation of 0 indicates that the two variables are unrelated (e.g., house prices and the price of coffee in a different country).

A correlation of ±r means one variable explains r² of the other variable's variance (e.g., r = 0.5 explains 25% of the variance).

Types of Correlation

1. Pearson's correlation: Pearson's correlation measures the strength and direction of the linear association between two variables. It is used on continuous data (formula above).
2. Spearman's correlation: Spearman's correlation is the nonparametric version of Pearson's correlation. It measures the strength and direction of the monotonic association between two ranked variables (assign ranks by giving the highest value rank 1, the second highest rank 2, and so on; the correlation is then computed on the ranks). Spearman's correlation can be applied to ordinal data and continuous data; its range is [−1, 1]. (Monotonic relationship: as one variable increases, the other consistently increases or consistently decreases, though not necessarily at a constant rate.)

ρ = 1 − (6Σdᵢ²)/(n(n² − 1))

dᵢ = difference between the two ranks of each observation

n = number of observations
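A sketch of Spearman's formula, ranking the highest value as 1 as described above (ties are ignored for simplicity; the data are assumed):

```python
def rank(values):
    """Rank values, giving the highest value rank 1 (ties ignored)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) on the ranked data."""
    n = len(x)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(rank(x), rank(y)))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

x = [10, 20, 30, 40, 50]  # assumed continuous data
y = [1, 3, 2, 4, 5]       # mostly monotonic with x
rho = spearman(x, y)      # 0.9
```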

Causation

Two events can be correlated without one causing the other; correlation does not imply causation.

e.g., Andy gets an A+ when it is sunny: sunny weather is correlated with the A+, but it did not cause the A+.

Quantiles and Percentiles

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities.

A percentile is the score below which a given percentage of scores in the frequency distribution falls.

e.g., the median is a quantile; it splits the data in two (the 50% quantile). In the original example, the value of the median, or 50% quantile, is 4.5: the value a quantile gives is a percentile. Quantiles are used to see what value the 25%, 50%, or 75% cut point holds in a distribution.
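Python's `statistics.quantiles` illustrates this; for the assumed sample below, the 50% quantile (median) happens to be 4.5, as in the example:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # assumed sample

# Quartiles: the cut points at 25%, 50%, and 75%
q1, q2, q3 = statistics.quantiles(data, n=4)

median = statistics.median(data)  # 4.5 -- identical to the 50% quantile
```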

The 5 point summary

• 1st Quartile: The 25th percentile.
• Median(2nd Quartile): The middle value in the sample, also called the 50th percentile or the 2nd quartile.
• 3rd Quartile: The 75th percentile.
• Minimum: The smallest observation in the sample.
• Maximum: The largest observation in the sample.

Inferential statistics

Z Statistics

Standardization: transform every element of a distribution to obtain a new distribution with the same characteristics, putting different variables on the same scale.

Standard normal distribution→a normal distribution that has a mean of 0 and standard deviation of 1

On the standard normal distribution, each step of 1 in x corresponds to a z score of 1. The Z score tells how many standard deviations an observation is from the center (mean): e.g., Z = −2 means the observation is 2 standard deviations to the left of the mean, and Z = 1.5 means 1.5 standard deviations to the right.

z = (x − μ)/σ, where x = data point, μ = mean, σ = std. deviation (population)

e.g., with μ = 1010 and σ = 20:

(1010 − 1010)/20 → 0

(1030 − 1010)/20 → 1

(990 − 1010)/20 → −1

Now take p(x < 980): what is the probability that an observation lies in the area below 980? Converting to z: (980 − 1010)/20 = −1.5, so p(z < −1.5).

to calculate the area, we need to look at the z table

p(z< -1.5)=0.0668 →p(x<980)=0.0668
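Instead of a z-table lookup, the same area can be computed with the error function; a sketch using the example's values:

```python
import math

def normal_cdf(z):
    """P(Z < z) for the standard normal, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 1010, 20
x = 980

z = (x - mu) / sigma  # (980 - 1010) / 20 = -1.5
p = normal_cdf(z)     # ≈ 0.0668, matching the z-table lookup
```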

T Statistics

Similar to the Z statistic, but used when the number of samples is low and the population variance is unknown; the sample variance is used instead of the population variance.

t = (x̅ − μ)/(s/√n)

x̅ = sample mean, μ = population mean, n = total points

s = standard deviation (sample)

Central Limit Theorem

No matter the underlying distribution, the sampling distribution will approximate a normal distribution. The samples are taken with replacement from the population.

Original distribution → mean= μ , variance=σ2

Sampled distribution → mean = μ, variance = σ²/n, where n is the number of samples taken from the population (n should be greater than 30). The standard error is σ/√n (the standard error decreases as n increases).
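A small simulation can illustrate the theorem; the exponential population and the sample sizes below are assumed for illustration:

```python
import random
import statistics

random.seed(42)

# A clearly non-normal population: exponential with mean 1, std 1
n, trials = 40, 2000
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

# The sample means cluster around the population mean, and their
# spread is close to the standard error sigma / sqrt(n)
observed_se = statistics.stdev(sample_means)
theoretical_se = 1 / n ** 0.5  # 1 / sqrt(40) ≈ 0.158
```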

Confidence interval

Instead of saying the average is 22.50 (point estimate), it can be said the average is between 20–25 (confidence interval). The confidence interval quantifies the uncertainty on an estimated population variable, such as the mean or standard deviation

level of confidence

90% confidence → α=10%

95% confidence → α=5%

99% confidence → α=1%

A higher confidence level gives a wider interval. A 90% confidence level indicates that there is a 90% chance the interval contains the true population parameter. The formula for the confidence interval is:

[Point estimate ± reliability factor × Standard Error]

reliability factor = Z(α/2) if the population variance is known; otherwise T(DOF, α/2), where DOF (degrees of freedom) = n − 1 for a single sample

Confidence interval for one sample

e.g., Suppose a data with 50 samples with a mean of 70 and a standard deviation of 20 at a 95% confidence level

reliability factor (Zα/2) α=5% →0.05/2=0.025

Z0.025 = 1.96 (from the Z-table: the area 1 − 0.025 = 0.975 sits at row 1.9, column 0.06, i.e., Z = 1.96)

Standard error = 20/√50 = 2.828; margin of error = 2.828 × 1.96 = 5.543

CI = [70 ± 5.543] → [64.46, 75.54] at a 95% confidence level
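The same interval can be computed in a few lines (values taken from the example above):

```python
import math

# Values from the example: n = 50, mean = 70, std = 20, 95% confidence
n, mean, std, z = 50, 70, 20, 1.96

se = std / math.sqrt(n)              # 20 / sqrt(50) ≈ 2.828
margin = z * se                      # ≈ 5.54
ci = (mean - margin, mean + margin)  # ≈ (64.46, 75.54)
```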

The result is statistically significant if the confidence interval does not include the null-hypothesis value.

Confidence interval for two populations

Two samples each from the respective population can be categorized as :

1. Dependent: both samples are dependent, e.g., A training program assessment takes pre-test and post-test scores from the same group of people.
2. Independent: Both samples are independent, e.g., a random sample of 100 females and another random sample of 100 males. The result would be two samples that are independent of each other.

Independent samples can be further classified into three cases:

1. Two samples with known population variance
2. Two samples with unknown population variance but assumed to be equal
3. Two samples with unknown population variance but assumed to be not equal

Dependent Samples

CI = x̅d ± z(α/2) × σ/√n (x̅d = the mean of the differences between the two samples, z = reliability factor, σ/√n = standard error)

CI = x̅d ± t(DOF, α/2) × s/√n (if the sample size is less than 30, the t statistic is used instead of the z statistic for the reliability factor; s is the sample standard deviation)

e.g., compare systolic blood pressures before and after a training program

assume null hypothesis: the difference of the two means is 0

1. The differences between the paired values are found.
2. The mean of the differences is calculated: x̅d = −81/10 → −8.1.
3. The squared deviations of each difference from x̅d are summed.
4. The sample variance = 1848.9/14 (for a sample, divide by n − 1; for a population, divide by n).
5. Standard error = 34.08.
6. T statistic for 95% confidence = 2.145.
7. CI = −8.1 ± 2.145 × 34.08
8. CI = [−81.2, 65.0]

Since the null-hypothesis value (0) lies inside the confidence interval, we fail to reject the null hypothesis: there is no statistically significant difference between blood pressure before and after the training.

Independent sample, Known population variance

e.g., marks of engineering students and management students: find a 95% confidence interval for the difference between the grades of engineering students and management students.

The standard error of the difference → 1.36

CI = −7 ± 1.96 × 1.36

CI = [−9.67, −4.33]

We are 95% confident that the difference between engineering and management grades lies in [−9.67, −4.33]. The interval is negative because engineering students scored lower than management students; had we taken x̅ as management and y̅ as engineering, the CI would be [4.33, 9.67].

Independent sample, unknown population variance but assumed equal

e.g., the price of apples in city A and city B: the population variance is not known, as we cannot get the apple price across an entire city; samples can be collected from only a few shops, so the population variance is unknown and is assumed to be equal.

Note: T Statistics is used in the reliability factor as there is no information about population variance

CI = (3.94 − 3.25) ± 2.12 × √(0.05/10 + 0.05/8)

CI = [0.47, 0.91]

Independent sample, unknown population variance, assumed not equal

In the previous example, instead of comparing apples in 2 cities, we can compare apples and oranges in a town. Since we compare apple and orange, which have different prices and demand, we can’t assume the population variance to be equal.

Hypothesis Testing

The hypothesis is a claim about a population parameter

H0: Null hypothesis: the idea to be tested

H1: Alternate hypothesis: An idea that contradicts the null hypothesis

e.g., suppose the average age of the student is 23

H0: μ=23

H1: μ≠23

Significance level →α (Probability of rejection of the null hypothesis if it’s true)

A hypothesis is tested using the Z-test or the T-test.

Z = (x̅ − μ)/(σ/√n), where x̅ = sample mean, μ = hypothesized mean, σ/√n = standard error

Rejection Region For Null Hypothesis

The rejection region is the interval beyond which the null hypothesis is rejected.

Two Tail Test

H0: μ=23

H1: μ≠23 ( the alternate hypothesis can be > or < 23 )

One Tail Test

H0: μ≥23

H1: μ<23 (the alternate hypothesis is < 23)

The alternate hypothesis decides whether the test is one-tail or two-tail: if H1 is ≠, it is a two-tail test; if H1 is <, it is a one-tail (left-tail) test; and if H1 is >, it is a one-tail (right-tail) test.

e.g.,

Let the average weight of a group be 168 lbs (population mean) with a std of 3.9 (population standard deviation). A nutritionist believes the average weight to be different, so she samples 36 individuals and finds a mean weight of 169.5 lbs. At 95% confidence, is this enough to discard the group average?

H0: μ=168

H1: μ≠168 (Two-tail test)

x̅=169.5 , n=36, σ=3.9, μ=168 , α=1–0.95=0.05

Zc (critical values) = Zα/2 → Z0.025 → ±1.96 (values taken from the Z-table; two critical values since this is a two-tail test)

Z(Test Statistics)= (169.5–168)/(3.9/√36)

Z=2.31

There are two ways to check whether the null hypothesis is accepted or rejected:

1. Critical value: the null hypothesis is accepted while −Zc < Z < Zc; since 2.31 > 1.96, the test statistic has crossed into the rejection region, thus rejecting the null hypothesis.
2. P-value: the P-value tells us how unlikely it would be to observe such a test statistic in the direction of H1 if the null hypothesis were true. If P-value < α, the null hypothesis is rejected; if P-value > α, the null hypothesis is not rejected.

The P-value for the one-tail test

1. Left-tail test: P = (area value of Z (test statistic))
2. Right-tail test: P = 1 − (area value of Z (test statistic))

The P-value for a two-tail test

1. For the two-tail test, if the (area value of Z) is < 0.5, then P=2×(area value of Z)
2. If the (area value of Z ) is >0.5, then P=((1-(area value of Z))×2

area value of Z → in the above example Z=2.31

The area value of Z=2.31 is 0.9896

Since this is a two-tail test and 0.9896 > 0.5

P = (1 − 0.9896) × 2 = 0.0208

0.0208<0.05(α)

H0 is rejected
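The whole two-tail z-test above can be sketched as follows; note the exact P-value differs slightly from the rounded table lookup (≈0.021 vs. 0.0208):

```python
import math

def normal_cdf(z):
    """P(Z < z) for the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Values from the weight example: sample mean 169.5, mu = 168,
# sigma = 3.9, n = 36
z = (169.5 - 168) / (3.9 / math.sqrt(36))  # ≈ 2.31
p = 2 * (1 - normal_cdf(z))                # two-tail P-value, ≈ 0.021
alpha = 0.05
reject = p < alpha                         # True -> H0 is rejected
```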

If the population variance is unknown, the T statistic is used instead of the Z statistic in the above example, with the sample standard deviation in place of the population value. Suppose the sample standard deviation = 3.

H0: μ=168

H1: μ≠168 (Two-tail test)

x̅=169.5 , n=36, s=3, μ=168 , α=1–0.95=0.05

T= (169.5–168)/(3/√36)=3

DOF=36–1=35

α=0.05

Two tail test

Tc = ±2.030 (t-table, DOF = 35, α/2 = 0.025)

Since T = 3 > Tc, H0 is rejected

P-value

The P-value is the probability that a random chance generated the data or something else equal or even rarer.

e.g., two coin tosses both resulted in heads; does the coin have some advantage for heads?

H0: the coin is fair; the probability of heads on any toss is 0.5

H1: the probability of getting heads is greater than 0.5

Tossing 2 coins gives {HH, HT, TH, TT}

The P-value of 2 heads has 3 parts:

1. the probability of the observed event {H,H} = 0.25
2. the probability of events that are equally rare: {T,T} = 0.25
3. the probability of events rarer than the observed one = 0, since no outcome has a probability smaller than 0.25

P-value of {H,H} = 0.25 + 0.25 + 0 = 0.5
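The three parts can be checked by enumerating the outcomes; a sketch over the number-of-heads statistic:

```python
from itertools import product

# Distribution of the test statistic: number of heads in two fair tosses
counts = {}
for outcome in product("HT", repeat=2):        # HH, HT, TH, TT
    heads = outcome.count("H")
    counts[heads] = counts.get(heads, 0) + 1
probs = {k: c / 4 for k, c in counts.items()}  # {2: 0.25, 1: 0.5, 0: 0.25}

observed = 2                                   # we saw two heads
p_obs = probs[observed]                        # 0.25

# P-value = P(observed) + P(equally rare) + P(rarer)
p_value = sum(p for p in probs.values() if p <= p_obs)  # 0.25 + 0.25 = 0.5
```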