Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Statistics Introduction
Statistics Variable
Statistics Sample & population
Statistics Measure of central tendency
Statistics Measure of Dispersion
Statistics Distribution
Statistics Z-score
Statistics PDF & CDF
Statistics Center Limit Theorem
Statistics Correlation
Statistics P value & hypothesis test
Statistics Counting rule
Statistics Outlier
Learn Math
Learn MATLAB
Learn Machine learning
Learn Github
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
The P-value is the probability of getting a result at least as extreme as the observed one, assuming that the null hypothesis is true.
The null hypothesis is an assumption that treats everything as equal and similar: it assumes there is no difference
and no effect.
Suppose the GDP of a country is 10 million. After 5 years the GDP is still 10 million. Is that possible? It is
unlikely that a country's GDP is exactly the same after 5 years. But the null hypothesis treats everything as equal,
so the null hypothesis will say it is possible: the GDP of the country before and after 5 years is the same.
Another example: the null hypothesis will say that all the students of a class are equally intelligent.
Using the P-value, you test whether the null hypothesis is true or false.
In the example above, the p-value helps decide whether the GDP of the country before and after 5 years is the
same or not; it may have increased or decreased. So you can say that the P-value is the tool used to accept or
reject the null hypothesis.
Calculation:
1. First, collect the data.
2. Significance level: a common significance level is 0.05. It means there is a 5% chance of rejecting the null
hypothesis when it is actually true.
For example, if you test 100 countries, you accept that in about 5 of them you may reject the null hypothesis even though it is true.
3. Decision: accept or reject the null hypothesis. Accept means what is said by the null hypothesis is
taken as true, and reject means it is not true.
Symbol:
H0 = null hypothesis (accepted when there is no evidence against it)
Ha = alternative hypothesis (accepted when the null hypothesis is rejected)
The range of P-values for rejecting or accepting the null hypothesis:
1. P value <= 0.01 --> very strong evidence against the null hypothesis
2. 0.01 < P value <= 0.05 --> strong evidence against the null hypothesis
3. 0.05 < P value <= 0.1 --> mild evidence against the null hypothesis
4. P value > 0.1 --> no evidence against the null hypothesis. If you get this, accept the null
hypothesis.
After deciding the significance level, run a test to get the P-value.
The tests are:
1. T test
2. Chi-square test
3. ANOVA test
4. Z test
Step 1: Get the data:
Here you have a table. It shows people's marital status by level of education.
Here the significance level is 0.05. This table is called the observed value table:
Marital Status | Bachelor's | Masters | Ph.D | Total |
---|---|---|---|---|
Never married | 19 | 16 | 21 | 56 |
Married | 12 | 36 | 45 | 93 |
Divorced | 6 | 9 | 9 | 24 |
Widowed | 3 | 9 | 9 | 21 |
Total | 39 | 84 | 84 | 194 |
In the table's Total column you have the total of each row, and in the Total row you have the total of each
column. Here 194 is the grand total.
Step 2: Now calculate the expected values from the observed values.
Formula for the expected value: exv = (row total * column total) / grand total.
Marital Status | Bachelor's | Masters | Ph.D |
---|---|---|---|
Never married | 11.27 | 24.25 | 24.25 |
Married | 18.69 | 40.27 | 40.27 |
Divorced | 4.82 | 2.57 | 2.57 |
Widowed | 4.22 | 9.09 | 9.09 |
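As a quick check, the expected values can be reproduced with NumPy: the outer product of the row totals and column totals, divided by the grand total, applies the formula above to every cell at once. This is only a sketch using the totals from the observed table; the printed values may differ slightly from the table above because of rounding.

```python
import numpy as np

# Totals taken from the observed value table
row_totals = np.array([56, 93, 24, 21])   # Never married, Married, Divorced, Widowed
col_totals = np.array([39, 84, 84])       # Bachelor's, Masters, Ph.D
grand_total = 194

# Expected value of each cell = (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```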
Step 3: Find deviation between observed and expected values.
Formula: (observed value - expected value)²/expected value
Observed value(O) | Expected value(E) | (O-E)²/E |
---|---|---|
19 | 11.27 | 5.30 |
16 | 24.25 | 0.65 |
21 | 24.25 | 0.43 |
12 | 18.69 | 1.73 |
36 | 40.27 | 0.45 |
45 | 40.27 | 0.56 |
6 | 4.82 | 0.29 |
9 | 2.57 | 16.09 |
9 | 2.57 | 16.09 |
3 | 4.22 | 0.35 |
9 | 9.09 | 0.00 |
9 | 9.09 | 0.00 |
 | | Total = 41.94 (chi-square value) |
Step 4: Calculate the degrees of freedom.
Formula: (columns-1)*(rows-1)
Here,
Columns means the number of columns in the table that contain numerical values used in the calculation.
Rows means the number of rows in the table that contain numerical values used in the calculation.
So,
Degrees of freedom = (3-1)*(4-1) = 2*3 = 6
Now go to the chi-square distribution table and find the value for this degree of freedom and significance value.
Then compare the chi-square value you get from the calculation with the value you get from the
table.
In the chi-square test, if the calculated value > the tabular value,
then the null hypothesis is rejected.
Here,
the calculated value means the chi-square value you get from the calculation, and the tabular value means the value
you get from the chi-square distribution table.
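If you want to verify the whole chi-square procedure in code, a minimal sketch with SciPy is shown below. `scipy.stats.chi2_contingency` computes the expected values, the chi-square statistic, the degrees of freedom, and the p-value directly from the observed table; because it recomputes the totals from the data itself, its numbers can differ slightly from the hand calculation above.

```python
from scipy.stats import chi2, chi2_contingency

# Observed table: rows = marital status, columns = Bachelor's, Masters, Ph.D
observed = [
    [19, 16, 21],   # Never married
    [12, 36, 45],   # Married
    [6,  9,  9],    # Divorced
    [3,  9,  9],    # Widowed
]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
critical = chi2.ppf(1 - 0.05, dof)   # tabular (critical) value at the 0.05 level

print("chi-square:", round(chi2_stat, 2), "df:", dof, "p-value:", round(p_value, 4))
# Reject the null hypothesis if the calculated value is greater than the tabular value
print("reject null hypothesis:", chi2_stat > critical)
```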
The T test is used to compare the means of two groups. It is used to determine whether a process affects the
population of interest, or whether two groups are different from each other. The t-test is used for small samples,
where the sample size is less than 30 and the population variance is unknown. In the t-test the population
correlation coefficient is assumed to be zero.
Formula:
t = (x1 - x2) / √{s²(1/n1 + 1/n2)}
Here,
t = t test value
x1 and x2 are the means of the two groups
s² is the pooled variance of the two groups
n1 and n2 are the numbers of observations in each of the groups
Example calculation:
Subject | Student 1 score(X) | Student 2 score(Y) |
---|---|---|
1 | 3 | 13 |
2 | 3 | 13 |
3 | 15 | 29 |
4 | 12 | 20 |
5 | 23 | 25 |
6 | 16 | 32 |
7 | 17 | 23 |
8 | 19 | 20 |
9 | 32 | 30 |
10 | 24 | 15 |
11 | 3 | 20 |
Step 1: find X-Y
Subject | Student 1 score(X) | Student 2 score(Y) | X-Y |
---|---|---|---|
1 | 3 | 13 | -10 |
2 | 3 | 13 | -10 |
3 | 15 | 29 | -14 |
4 | 12 | 20 | -8 |
5 | 23 | 25 | -2 |
6 | 16 | 32 | -16 |
7 | 17 | 23 | -6 |
8 | 19 | 20 | -1 |
9 | 32 | 30 | 2 |
10 | 24 | 15 | 9 |
11 | 3 | 20 | -17 |
Step 2: Calculate sum of X-Y column
Subject | Student 1 score(X) | Student 2 score(Y) | X-Y |
---|---|---|---|
1 | 3 | 13 | -10 |
2 | 3 | 13 | -10 |
3 | 15 | 29 | -14 |
4 | 12 | 20 | -8 |
5 | 23 | 25 | -2 |
6 | 16 | 32 | -16 |
7 | 17 | 23 | -6 |
8 | 19 | 20 | -1 |
9 | 32 | 30 | 2 |
10 | 24 | 15 | 9 |
11 | 3 | 20 | -17 |
 | | | sum=-73 |
Step 3: Calculate the square of the X-Y column and its sum
Subject | Student 1 score(X) | Student 2 score(Y) | X-Y | (X-Y)² |
---|---|---|---|---|
1 | 3 | 13 | -10 | 100 |
2 | 3 | 13 | -10 | 100 |
3 | 15 | 29 | -14 | 196 |
4 | 12 | 20 | -8 | 64 |
5 | 23 | 25 | -2 | 4 |
6 | 16 | 32 | -16 | 256 |
7 | 17 | 23 | -6 | 36 |
8 | 19 | 20 | -1 | 1 |
9 | 32 | 30 | 2 | 4 |
10 | 24 | 15 | 9 | 81 |
11 | 3 | 20 | -17 | 289 |
 | | | sum=-73 | sum=1131 |
This example compares two scores from the same subjects, so it is a paired t-test, and the formula can be written as:
Formula:
t = (∑D/N) / √[{∑D² - (∑D)²/N} / {(N-1)(N)}]
Here,
∑D = sum of the X-Y column (step 2)
∑D² = sum of the (X-Y)² column (step 3)
(∑D)² = square of the sum of the differences in step 2
N = number of subjects (here 11)
So,
t = (-73/11) / √[{1131 - (-73)²/11} / {(11-1)(11)}]
t = -2.74
Step 4:
Subtract 1 from the sample size to get the degrees of freedom. You have 11 subjects, so 11-1=10.
Step 5:
Find the critical t value in the t-table, using the degrees of freedom from step 4.
At significance level 0.05 (5%) with df=10, the t value is 2.228.
Step 6:
Compare the t-table value from step 5 (2.228) with the calculated t value (-2.74).
Here the calculated t value (ignoring the sign) is greater than the table value at significance level 0.05, and the
p-value is less than the significance level. So you can reject the null hypothesis that there is no difference
between the means.
Note:
You can ignore the minus or plus sign while comparing the two t values. A larger t value shows that the
difference between the group means is greater than the pooled standard error, which indicates a more significant
difference between the groups. You can also compare the calculated t value against the values in a critical
value chart; if the calculated t value is greater than the critical value, reject the null
hypothesis.
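Since the example above measures the same subjects twice, the same paired t-test can be run with SciPy's `ttest_rel`, as sketched below; the sign of t only depends on the order in which the two groups are passed.

```python
from scipy.stats import ttest_rel

# Scores of the 11 subjects from the table above
X = [3, 3, 15, 12, 23, 16, 17, 19, 32, 24, 3]     # Student 1 score
Y = [13, 13, 29, 20, 25, 32, 23, 20, 30, 15, 20]  # Student 2 score

t_stat, p_value = ttest_rel(X, Y)
print("t =", round(t_stat, 2))          # about -2.74, matching the hand calculation
print("p-value =", round(p_value, 4))

# Reject the null hypothesis of equal means if the p-value is below the significance level
print("reject null hypothesis:", p_value < 0.05)
```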
The Z test is used if the population correlation coefficient is not zero. In the z test the sample size is
large (n>30). The Z test is used to check whether two population means are different, and the population variance
must be known. The Z test is based on the standard normal distribution.
Formula: Z=(x-m)/(σ/√n)
Here,
x=sample mean
m=population mean
σ=population standard deviation
n=sample size
Let's see an example:
You get a complaint that the boys in a primary school are underfed. Now you have to find out whether the
complaint is true or false. Let's solve this problem using the Z test.
Consider the weight of the boys as the variable. From previous statistical analysis, you know that the average
weight of boys of age 10 is 32 kg, and the SD is 9 kg.
Take a sample of 25 students; the sample average is 29.5 kg.
Here the significance level is 0.05.
Here,
x=29.5
m=32
σ=9
n=25
So, Z= (29.5-32)/(9/√25)=-1.39
Now find the p-value using the z-table.
In this case the p-value is 0.0823.
Now compare the p-value with the significance level.
Here the p-value is greater than the significance level: 0.0823 > 0.05.
Because the p-value is greater than the significance level, accept the null hypothesis. Here the null
hypothesis is that there is no significant difference between the sample mean weight and the known average
weight, which means the complaint is false.
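A small sketch of the same z-test in Python is shown below; it uses the normal CDF from SciPy to get the one-tailed p-value, so the numbers should match the hand calculation up to rounding.

```python
import math
from scipy.stats import norm

x_bar = 29.5   # sample mean (kg)
mu = 32        # population mean (kg)
sigma = 9      # population standard deviation (kg)
n = 25         # sample size
alpha = 0.05   # significance level

z = (x_bar - mu) / (sigma / math.sqrt(n))
p_value = norm.cdf(z)   # one-tailed: P(Z <= z), because the complaint is "underfed"

print("z =", round(z, 2))              # about -1.39
print("p-value =", round(p_value, 4))  # about 0.0823
print("reject null hypothesis:", p_value < alpha)
```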
There are two types of ANOVA test:
1. One-way ANOVA
2. Two-way ANOVA
Low noise | | Medium noise | | High noise | |
---|---|---|---|---|---|
Student | Question(X) | Student | Question(X) | Student | Question(X) |
1 | 10 | 5 | 8 | 9 | 4 |
2 | 9 | 6 | 4 | 10 | 3 |
3 | 6 | 7 | 6 | 11 | 6 |
4 | 7 | 8 | 7 | 12 | 4 |
The table shows the amount of noise and the number of questions students can solve. There are three
categories: low, medium, and high noise. Here the number of students in each category is 4. Find the relation
between noise and the number of questions solved. You can see that as noise increases, the number of
questions solved decreases, so there is a negative correlation. Using ANOVA you can find how significant the
effect of noise is.
To calculate:
Step 1: Find X² and calculate the sum of the Question(X) and X² columns
Low noise | | | Medium noise | | | High noise | | |
---|---|---|---|---|---|---|---|---|
Student | Question(X) | X² | Student | Question(X) | X² | Student | Question(X) | X² |
1 | 10 | 100 | 5 | 8 | 64 | 9 | 4 | 16 |
2 | 9 | 81 | 6 | 4 | 16 | 10 | 3 | 9 |
3 | 6 | 36 | 7 | 6 | 36 | 11 | 6 | 36 |
4 | 7 | 49 | 8 | 7 | 49 | 12 | 4 | 16 |
 | sum=32 | sum=266 | | sum=25 | sum=165 | | sum=17 | sum=77 |
step 2:
Calculate correction term
Formula: C = (∑X)²/N
Here,
C = (32+25+17)²/12 = 456.33
Step 3:
Sum all the squared values (the X² columns) and subtract the correction term.
SST = ∑X² - C = (266+165+77) - 456.33 = 51.67
Step 4:
Square each group total, divide it by the number of observations in that group, add them up, and subtract C.
SSA = ∑{(group total)²/(group size)} - C = (32²/4) + (25²/4) + (17²/4) - 456.33 = 28.17
Step 5:
Here calculate SST-SSA.
SSW=SST-SSA=51.67-28.17=23.5
Step 6:
Mean sum of squares among groups:
Formula: MSSA = SSA/(k-1)
Here k is the number of categories.
So MSSA = 28.17/(3-1) = 14.085
Step 7:
Mean sum of squares within groups:
MSSW = SSW/(N-k) = 23.5/(12-3) = 2.611
Step 8:
Here we calculate F ratio.
F ratio=MSSA/MSSW=14.085/2.611=5.394
Step 9:
Calculate the degrees of freedom for among groups and within groups. Let's summarize the
calculation.
 | df | SS | MSS | F ratio |
---|---|---|---|---|
Among Groups | k-1=2 | 28.17 | 14.085 | 5.394 |
Within Groups | N-k=9 | 23.5 | 2.611 | |
Total | N-1=11 | 51.67 | | |
Now, using the degrees of freedom among groups and within groups, you get the value 4.2565 from the F table. If you compare this value with the F ratio, you will see that the F ratio (5.394) is greater than the table value. When this happens, reject the null hypothesis and accept the alternative hypothesis, which is that there is a correlation between noise and the number of questions solved.
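The same one-way ANOVA can be reproduced with `scipy.stats.f_oneway`, as sketched below; it returns the F ratio and the p-value, and the critical value can be looked up from the F distribution. Small differences from the hand calculation come from rounding.

```python
from scipy.stats import f, f_oneway

# Number of questions solved in each noise group (from the table above)
low = [10, 9, 6, 7]
medium = [8, 4, 6, 7]
high = [4, 3, 6, 4]

f_stat, p_value = f_oneway(low, medium, high)
critical = f.ppf(1 - 0.05, dfn=2, dfd=9)   # F critical value at 0.05, df = (k-1, N-k)

print("F ratio =", round(f_stat, 3))       # about 5.39
print("p-value =", round(p_value, 4))
print("F critical =", round(critical, 3))  # about 4.256
print("reject null hypothesis:", f_stat > critical)
```

The next table is the data used for the two-way ANOVA example.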
Student | Low noise | Medium noise | High noise |
---|---|---|---|
Male students | | | |
 | 10 | 7 | 4 |
 | 12 | 9 | 5 |
 | 11 | 8 | 6 |
 | 9 | 12 | 5 |
Female students | | | |
 | 12 | 13 | 6 |
 | 13 | 15 | 6 |
 | 10 | 12 | 4 |
 | 13 | 12 | 4 |
Using this table, determine whether noise and gender affect the marks of a student, and whether gender affects
how a student reacts to noise.
Step 1:
Calculate the row total and column total.
Student | Low noise | Medium noise | High noise | Row total |
---|---|---|---|---|
Male students | | | | |
 | 10 | 7 | 4 | R1=98 |
 | 12 | 9 | 5 | |
 | 11 | 8 | 6 | |
 | 9 | 12 | 5 | |
Female students | | | | |
 | 12 | 13 | 6 | R2=120 |
 | 13 | 15 | 6 | |
 | 10 | 12 | 4 | |
 | 13 | 12 | 4 | |
Column total | C1=90 | C2=88 | C3=40 | |
Step 2:
Correction term (Cx) = (∑X)²/(number of observations)
Here,
Cx = (10+12+11+9+7+..+4)²/24 = 1980
Step 3:
Find the sum of squares of the total: SST = ∑X² - Cx
Here,
SST = (10²+12²+11²+9²+7²+..+4²) - 1980 = 274
Step 4:
Sum of squares of columns (SSC) = {(90²+88²+40²)/8} - 1980 = 200
Here 8 is the number of readings in each column.
Step 5:
Sum of squares of rows (SSR) = {(98²+120²)/12} - 1980 = 20
Here each row (gender group) contains 12 readings.
Step 6:
Sum of squares within groups (SSG), i.e. the interaction term:
SSG = {(42²+48²+..+20²)/4} - 1980 - 200 - 20 = 16.33
Here 42, 48, .., 20 are the six subgroup totals (gender x noise), and each subgroup contains 4 readings.
Step 7:
Residual sum of squares (SSE):
SSE = SST - SSC - SSR - SSG = 37
Step 8:
 | df | SS | MSSX (SS/df) | F ratio (MSSX/MSSE) |
---|---|---|---|---|
Noise | C-1=2 | 200 | MSSC=100 | 48.73 |
Gender | R-1=1 | 20 | MSSR=20 | 9.81 |
Interaction | (C-1)*(R-1)=2 | 16.33 | MSSG=8.167 | 3.97 |
Residual | C*R*(n-1)=18 | 37 | MSSE=2.06 | |
Total | N-1=23 | 273.33 | | |
In the table,
C=No of columns(in this example noise categories)=3
R=No of rows(here student categories)=2
N=Total number of students=24
n=No of students in a group=4
Step 9:
 | df | F ratio (MSSX/MSSE) | F critical at 0.05 |
---|---|---|---|
Noise | C-1=2 | 48.73 | F(2,18)=3.55 |
Gender | R-1=1 | 9.81 | F(1,18)=4.41 |
Interaction | (C-1)*(R-1)=2 | 3.97 | F(2,18)=3.55 |
Residual | C*R*(n-1)=18 | | |
Total | N-1=23 | | |
Now, if the critical value is less than the F ratio, reject the null hypothesis; if it is greater, accept the null hypothesis. In this example the F ratios for noise, gender, and the interaction are all greater than their critical values, so the null hypothesis is rejected in each case: noise and gender both affect the marks, and gender also affects how a student reacts to noise.
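A two-way ANOVA like this one is easiest to reproduce with statsmodels (assuming it is installed): build a long-format DataFrame with one row per student, fit an ordinary least squares model with an interaction term, and call `anova_lm`. This is only a sketch; the F ratios should be close to the table above, up to rounding.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Long-format data: one row per student (marks from the table above)
marks = [10, 12, 11, 9, 7, 9, 8, 12, 4, 5, 6, 5,       # male: low, medium, high
         12, 13, 10, 13, 13, 15, 12, 12, 6, 6, 4, 4]   # female: low, medium, high
noise = (["low"] * 4 + ["medium"] * 4 + ["high"] * 4) * 2
gender = ["male"] * 12 + ["female"] * 12

df = pd.DataFrame({"marks": marks, "noise": noise, "gender": gender})

# Two-way ANOVA with interaction: marks ~ noise + gender + noise:gender
model = ols("marks ~ C(noise) * C(gender)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)   # sum_sq, df, F and p-value for noise, gender and the interaction
```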