Fully Explained P-Distribution with Python example
The statistics concept for data science analysis
UNDERSTANDING THE P-VALUE.
Hypothesis testing is one of the most misunderstood concepts in statistics. Let’s try to understand it with a very simple example.
Hypothesis testing is simply checking whether an assumption holds up against the data you have recorded.
Here is an example to make the picture clearer.
Suppose I say that Akash is the best student in my class. Akash remains the best student in my class until someone challenges that claim. So I collect evidence that can support what I said, i.e. evidence that supports Akash being the best student. If that evidence supports me very much, my claim stands and we keep believing that Akash is the best student. But if the evidence fails to support me, we will have to conclude that Akash is not the best student in my class.
Now, in terms of terminology, what I assumed to be correct was ‘Akash is the best student’. This is the null hypothesis.
The null hypothesis is the statement that has been assumed correct.
Someone arguing against me and saying that ‘Akash is not the best student’ is stating the alternate hypothesis.
The alternate hypothesis is the statement that makes claims against the null hypothesis.
We discussed that if the evidence supports me “very much”, my claim stands, and if it does not support me “that much”, I will fail to make my point. The question that now arises is what counts as “very much” and “not that much”.
p-value: the p-value is a score that measures how compatible the evidence is with the null hypothesis. More precisely, it is the probability of seeing evidence at least as extreme as what we observed, assuming the null hypothesis is true.
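To make the idea of a p-value concrete, here is a minimal sketch using a hypothetical coin-flip example that is not part of the article's data. The null hypothesis is "the coin is fair", and the p-value is the probability, under that assumption, of a result at least as far from 50% heads as the one observed. The function name and the numbers (60 heads in 100 flips) are assumptions chosen for illustration.

```python
from math import comb


def two_sided_p_value_fair_coin(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (hypothetical helper)."""
    expected = flips / 2
    distance = abs(heads - expected)
    # Count every outcome that is at least as far from the expected
    # number of heads as the observed one, in either direction.
    favourable = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= distance
    )
    # Under a fair coin, each of the 2**flips sequences is equally likely.
    return favourable / 2 ** flips


p = two_sided_p_value_fair_coin(60, 100)
print(p)
```

A small p-value here would mean that such a lopsided result is rare for a fair coin, which is the sense in which the evidence does not support the null hypothesis.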
If this value is less than 0.05, i.e. 5%, it means that there is very little evidence that supports the null hypothesis, and so we will have to reject the null hypothesis. In our example, we will have to reject the claim that Akash is the best student in my class.
Sometimes the p-value comes out greater than 0.05, the usual alpha value of 5%. This indicates that the evidence is reasonably compatible with the null hypothesis, and hence we cannot reject it. In other words, we have failed to reject the claim that Akash is the best student in the class. Strictly speaking, failing to reject a claim is not the same as proving it, but we keep working with it for now.
0.05 can be thought of as a threshold value on which the decision to reject, or fail to reject, the null hypothesis is based.
Now a question arises: why 0.05 and not something else? Generally, 0.05 is taken as a tolerance value. Let’s see what is meant by a tolerance value.
0.05 means 5%. Anything less than or equal to 5% can be 1%, 2%, 3%, and so on. Now, if only 1% of the evidence supports my hypothesis, i.e. the null hypothesis ‘Akash is the best student’, I have two options:
Insisting that this 1% of the evidence has proven me correct and hence, I am correct!
Tolerating and accepting that this 1% support from the evidence may have been due to some error or coincidence, and hence accepting that I may be wrong and Akash may not be the best student.
It is the same with 2%, 3%, 4%, and 5%, but that’s it! 5% is the most I’ll tolerate, and I will assume that more than 5% can’t be a coincidence or the result of an error. If more than 5% of the evidence supports my hypothesis, I keep it, and that is why, if the p-value is greater than 0.05, we cannot reject the null hypothesis.
So, generally, the tolerance level is taken as 5%, but in some cases where the decision is not very important, or where we can tolerate more, it can be changed to 10% or some other value.
Summary: If the p-value is less than or equal to 0.05, we can reject the null hypothesis. If it is greater than 0.05, we cannot reject the null hypothesis.
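The decision rule in the summary can be sketched as a small helper. The function name `decide` and the returned strings are assumptions made for illustration; the `alpha` parameter is the tolerance level discussed above, defaulting to 0.05.

```python
def decide(p_value, alpha=0.05):
    """Apply the summary rule: reject H0 when p_value <= alpha (hypothetical helper)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))              # p-value below the 5% tolerance
print(decide(0.20))              # p-value above the 5% tolerance
print(decide(0.07, alpha=0.10))  # same rule with a looser 10% tolerance
```

Note that this follows the summary and treats a p-value exactly equal to alpha as a rejection; a strict `<` comparison differs only in that boundary case.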
Let’s try to implement an example for better understanding.
We have the ages of 32 people, and we will draw 10 random values from this population as a sample.
As our assumption, let the null hypothesis (H0) be ‘There is no difference between the sample mean age and the population mean age’.
The alternate hypothesis i.e. (Ha) will be ‘There is some significant difference between the sample mean age and the population mean age’.
Practical with Python
# Creating the data
ages = [10, 20, 35, 50, 28, 40, 55, 18, 16, 55, 30, 25, 43, 18, 30, 28, 14, 24,
        16, 17, 32, 35, 26, 27, 65, 18, 43, 23, 21, 20, 19, 70]
# Finding the length of the data
len(ages)
Now, we will find the mean of the ages
import numpy as np
ages_mean = np.mean(ages)
Now, we will set the sample size and draw the sample.
sample_size = 10
# np.random.choice samples with replacement by default
age_sample = np.random.choice(ages, sample_size)
age_sample  # the sampled ages; they change on every run
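Now, we will compute the p-value with a one-sample t-test from the scipy.stats library.
from scipy.stats import ttest_1samp
# Compare the sample mean with the population mean computed earlier
ttest, p_value = ttest_1samp(age_sample, ages_mean)
print(p_value)  # the value changes with the random sample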
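To see what the t-test is doing, the statistic can be computed by hand as t = (sample mean - mu0) / (s / sqrt(n)), where s is the sample standard deviation and n the sample size, and the two-sided p-value is read from a t-distribution with n - 1 degrees of freedom. The sketch below checks this against `ttest_1samp`. The fixed sample (the first ten ages in the list) and `mu0 = 30.0` (an illustrative hypothesized mean close to the population mean) are assumptions chosen so the result is reproducible.

```python
import numpy as np
from scipy import stats

# Fixed illustrative sample: the first ten ages from the article's list.
sample = np.array([10, 20, 35, 50, 28, 40, 55, 18, 16, 55])
mu0 = 30.0  # assumed hypothesized population mean, for illustration only

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample std
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value: probability of a |t| at least this large under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Both lines should print the same pair of numbers, up to floating-point rounding.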
To decide whether or not to reject the null hypothesis, we compare the p-value with the standard threshold, i.e. 0.05.
# alpha value is 0.05 or 5%
if p_value < 0.05:
    print("We reject the null hypothesis")
else:
    print("We can not reject the null hypothesis")
In this run, we fail to reject the null hypothesis. Because the sample is drawn at random, the p-value will differ from run to run.
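Since the sample is random, the p-value and the decision can change each time the code runs. The sketch below gathers the steps into one function and fixes a random seed so that a run can be repeated. The function name `run_test`, the seed value 42, and the use of NumPy's `default_rng` generator are assumptions made for illustration; the data and the 0.05 threshold come from the walkthrough above.

```python
import numpy as np
from scipy.stats import ttest_1samp

ages = [10, 20, 35, 50, 28, 40, 55, 18, 16, 55, 30, 25, 43, 18, 30, 28, 14, 24,
        16, 17, 32, 35, 26, 27, 65, 18, 43, 23, 21, 20, 19, 70]


def run_test(seed, sample_size=10, alpha=0.05):
    """Draw a seeded sample and test it against the population mean."""
    rng = np.random.default_rng(seed)
    # Sampling with replacement, matching the default of np.random.choice.
    sample = rng.choice(ages, size=sample_size)
    _, p_value = ttest_1samp(sample, np.mean(ages))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return sample, p_value, decision


sample, p_value, decision = run_test(seed=42)
print(sample)
print(p_value)
print(decision)
```

Calling `run_test` with the same seed gives the same sample and p-value, while changing the seed shows how much the result depends on which ten ages happen to be drawn.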