Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Statistics Introduction
Statistics Variable
Statistics Sample & population
Statistics Measure of central tendency
Statistics Measure of Dispersion
Statistics Distribution
Statistics Z-score
Statistics PDF & CDF
Statistics Center Limit Theorem
Statistics Correlation
Statistics P value & hypothesis test
Statistics Counting rule
Statistics Outlier
Learn Math
Learn MATLAB
Learn Machine learning
Learn Github
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
The P-value is the probability of getting a result at least as extreme as the observed one, assuming that the null hypothesis is true.
The null hypothesis is an assumption that treats everything as equal and similar: it assumes there is no difference
and no effect.
Suppose the GDP of a country is 10 million. After 5 years the GDP is still 10 million. Is that possible? It is
unlikely that a country's GDP is exactly the same after 5 years. But the null hypothesis treats everything as equal,
so the null hypothesis will say it is possible: the GDP of the country before and after 5 years is the same.
Another example: the null hypothesis will say that all the students of a class are equally intelligent.
Using the P-value, you test whether the null hypothesis is true or false.
In the example above, the p-value helps decide whether the GDP of the country before and after 5 years is the
same or not; it may have increased or decreased. So you can say that the P-value is the tool used to accept or
reject the null hypothesis.
Calculation:
1. First, collect the data.
2. Significance level: a common significance level is 0.05. It means there is a 5% chance of rejecting the null
hypothesis when it is actually true.
For example, if you test 100 countries, you accept that in about 5 of them you may reject the null hypothesis even though it is true.
3. Decision: accept or reject the null hypothesis. Accept means what is said by the null hypothesis is
taken as true, and reject means it is not true.
Symbol:
H0 = null hypothesis (accepted when there is no evidence against it)
Ha = alternative hypothesis (accepted when the null hypothesis is rejected)
The range of P-values for rejecting or accepting the null hypothesis:
1. P value <= 0.01 --> very strong evidence against the null hypothesis
2. 0.01 < P value <= 0.05 --> strong evidence against the null hypothesis
3. 0.05 < P value <= 0.1 --> mild evidence against the null hypothesis
4. P value > 0.1 --> no evidence against the null hypothesis. If you get this, accept the null
hypothesis.
After deciding the significance level, run a test to get the P-value.
The tests are:
1. T test
2. Chi-square test
3. ANOVA test
4. Z test
Step 1: Get the data:
Here you have a table. It shows people's marital status by level of education.
Here the significance level is 0.05. This table is called the observed value table:
Marital Status | Bachelor's | Masters | Ph.D | Total |
---|---|---|---|---|
Never married | 19 | 16 | 21 | 56 |
Married | 12 | 36 | 45 | 93 |
Divorced | 6 | 9 | 9 | 24 |
Widowed | 3 | 9 | 9 | 21 |
Total | 39 | 84 | 84 | 194 |
In the table's Total column you have the total of each row, and in the Total row you have the total of each
column. Here 194 is the grand total.
Step 2: Now calculate the expected values from the observed values.
Formula for the expected value: exv = (row total * column total) / grand total.
Marital Status | Bachelor's | Masters | Ph.D |
---|---|---|---|
Never married | 11.27 | 24.25 | 24.25 |
Married | 18.69 | 40.27 | 40.27 |
Divorced | 4.82 | 2.57 | 2.57 |
Widowed | 4.22 | 9.09 | 9.09 |
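As a quick check, the expected values can be reproduced with NumPy: the outer product of the row totals and column totals, divided by the grand total, applies the formula above to every cell at once. This is only a sketch using the totals from the observed table; the printed values may differ slightly from the table above because of rounding.

```python
import numpy as np

# Totals taken from the observed value table
row_totals = np.array([56, 93, 24, 21])   # Never married, Married, Divorced, Widowed
col_totals = np.array([39, 84, 84])       # Bachelor's, Masters, Ph.D
grand_total = 194

# Expected value of each cell = (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```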
Step 3: Find deviation between observed and expected values.
Formula: (observed value - expected value)²/expected value
Observed value(O) | Expected value(E) | (O-E)²/E |
---|---|---|
19 | 11.27 | 5.30 |
16 | 24.25 | 0.65 |
21 | 24.25 | 0.43 |
12 | 18.69 | 1.73 |
36 | 40.27 | 0.45 |
45 | 40.27 | 0.56 |
6 | 4.82 | 0.29 |
9 | 2.57 | 16.09 |
9 | 2.57 | 16.09 |
3 | 4.22 | 0.35 |
9 | 9.09 | 0.00 |
9 | 9.09 | 0.00 |
 | | Total = 41.94 (chi-square value) |
Step 4: Calculate the degrees of freedom.
Formula: (columns-1)*(rows-1)
Here,
Columns means the number of columns in the table that contain numerical values used in the calculation.
Rows means the number of rows in the table that contain numerical values used in the calculation.
So,
Degrees of freedom = (3-1)*(4-1) = 2*3 = 6
Now go to the chi-square distribution table and find the value for this degree of freedom and significance value.
Then compare the chi-square value you get from the calculation with the value you get from the
table.
In the chi-square test, if the calculated value > the tabular value,
then the null hypothesis is rejected.
Here,
the calculated value means the chi-square value you get from the calculation, and the tabular value means the value
you get from the chi-square distribution table.
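If you want to verify the whole chi-square procedure in code, a minimal sketch with SciPy is shown below. `scipy.stats.chi2_contingency` computes the expected values, the chi-square statistic, the degrees of freedom, and the p-value directly from the observed table; because it recomputes the totals from the data itself, its numbers can differ slightly from the hand calculation above.

```python
from scipy.stats import chi2, chi2_contingency

# Observed table: rows = marital status, columns = Bachelor's, Masters, Ph.D
observed = [
    [19, 16, 21],   # Never married
    [12, 36, 45],   # Married
    [6,  9,  9],    # Divorced
    [3,  9,  9],    # Widowed
]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
critical = chi2.ppf(1 - 0.05, dof)   # tabular (critical) value at the 0.05 level

print("chi-square:", round(chi2_stat, 2), "df:", dof, "p-value:", round(p_value, 4))
# Reject the null hypothesis if the calculated value is greater than the tabular value
print("reject null hypothesis:", chi2_stat > critical)
```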
The T test is used to compare the means of two groups. It is used to determine whether a process affects the
population of interest, or whether two groups are different from each other. The t-test is used for small samples,
where the sample size is less than 30 and the population variance is unknown. In the t-test the population
correlation coefficient is assumed to be zero.
Formula:
t = (x1 - x2) / √{s²(1/n1 + 1/n2)}
Here,
t = t test value
x1 and x2 are the means of the two groups
s² is the pooled variance of the two groups
n1 and n2 are the numbers of observations in each of the groups
Example calculation:
Subject | Student 1 score(X) | Student 2 score(Y) |
---|---|---|
1 | 3 | 13 |
2 | 3 | 13 |
3 | 15 | 29 |
4 | 12 | 20 |
5 | 23 | 25 |
6 | 16 | 32 |
7 | 17 | 23 |
8 | 19 | 20 |
9 | 32 | 30 |
10 | 24 | 15 |
11 | 3 | 20 |
Step 1: find X-Y
Subject | Student 1 score(X) | Student 2 score(Y) | X-Y |
---|---|---|---|
1 | 3 | 13 | -10 |
2 | 3 | 13 | -10 |
3 | 15 | 29 | -14 |
4 | 12 | 20 | -8 |
5 | 23 | 25 | -2 |
6 | 16 | 32 | -16 |
7 | 17 | 23 | -6 |
8 | 19 | 20 | -1 |
9 | 32 | 30 | 2 |
10 | 24 | 15 | 9 |
11 | 3 | 20 | -17 |
Step 2: Calculate sum of X-Y column
Subject | Student 1 score(X) | Student 2 score(Y) | X-Y |
---|---|---|---|
1 | 3 | 13 | -10 |
2 | 3 | 13 | -10 |
3 | 15 | 29 | -14 |
4 | 12 | 20 | -8 |
5 | 23 | 25 | -2 |
6 | 16 | 32 | -16 |
7 | 17 | 23 | -6 |
8 | 19 | 20 | -1 |
9 | 32 | 30 | 2 |
10 | 24 | 15 | 9 |
11 | 3 | 20 | -17 |
 | | | sum=-73 |
Step 3: Calculate the square of the X-Y column and its sum
Subject | Student 1 score(X) | Student 2 score(Y) | X-Y | (X-Y)² |
---|---|---|---|---|
1 | 3 | 13 | -10 | 100 |
2 | 3 | 13 | -10 | 100 |
3 | 15 | 29 | -14 | 196 |
4 | 12 | 20 | -8 | 64 |
5 | 23 | 25 | -2 | 4 |
6 | 16 | 32 | -16 | 256 |
7 | 17 | 23 | -6 | 36 |
8 | 19 | 20 | -1 | 1 |
9 | 32 | 30 | 2 | 4 |
10 | 24 | 15 | 9 | 81 |
11 | 3 | 20 | -17 | 289 |
 | | | sum=-73 | sum=1131 |
This example compares two scores from the same subjects, so it is a paired t-test, and the formula can be written as:
Formula:
t = (∑D/N) / √[{∑D² - (∑D)²/N} / {(N-1)(N)}]
Here,
∑D = sum of the X-Y column (step 2)
∑D² = sum of the (X-Y)² column (step 3)
(∑D)² = square of the sum of the differences in step 2
N = number of subjects (here 11)
So,
t = (-73/11) / √[{1131 - (-73)²/11} / {(11-1)(11)}]
t = -2.74
Step 4:
Subtract 1 from the sample size to get the degrees of freedom. You have 11 subjects, so 11-1=10.
Step 5:
Find the critical t value in the t-table, using the degrees of freedom from step 4.
At significance level 0.05 (5%) with df=10, the t value is 2.228.
Step 6:
Compare the t-table value from step 5 (2.228) with the calculated t value (-2.74).
Here the calculated t value (ignoring the sign) is greater than the table value at significance level 0.05, and the
p-value is less than the significance level. So you can reject the null hypothesis that there is no difference
between the means.
Note:
You can ignore the minus or plus sign while comparing the two t values. A larger t value shows that the
difference between the group means is greater than the pooled standard error, which indicates a more significant
difference between the groups. You can also compare the calculated t value against the values in a critical
value chart; if the calculated t value is greater than the critical value, reject the null
hypothesis.
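Since the example above measures the same subjects twice, the same paired t-test can be run with SciPy's `ttest_rel`, as sketched below; the sign of t only depends on the order in which the two groups are passed.

```python
from scipy.stats import ttest_rel

# Scores of the 11 subjects from the table above
X = [3, 3, 15, 12, 23, 16, 17, 19, 32, 24, 3]     # Student 1 score
Y = [13, 13, 29, 20, 25, 32, 23, 20, 30, 15, 20]  # Student 2 score

t_stat, p_value = ttest_rel(X, Y)
print("t =", round(t_stat, 2))          # about -2.74, matching the hand calculation
print("p-value =", round(p_value, 4))

# Reject the null hypothesis of equal means if the p-value is below the significance level
print("reject null hypothesis:", p_value < 0.05)
```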
The Z test is used if the population correlation coefficient is not zero. In the z test the sample size is
large (n>30). The Z test is used to check whether two population means are different, and the population variance
must be known. The Z test is based on the standard normal distribution.
Formula: Z=(x-m)/(σ/√n)
Here,
x=sample mean
m=population mean
σ=population standard deviation
n=sample size
Let's see an example:
You get a complaint that the boys in a primary school are underfed. Now you have to find out whether the
complaint is true or false. Let's solve this problem using the Z test.
Consider the weight of the boys as the variable. From previous statistical analysis, you know that the average
weight of boys of age 10 is 32 kg, and the SD is 9 kg.
Take a sample of 25 students; the sample average is 29.5 kg.
Here the significance level is 0.05.
Here,
x=29.5
m=32
σ=9
n=25
So, Z= (29.5-32)/(9/√25)=-1.39
Now find the p-value using the z-table.
In this case the p-value is 0.0823.
Now compare the p-value with the significance level.
Here the p-value is greater than the significance level: 0.0823 > 0.05.
Because the p-value is greater than the significance level, accept the null hypothesis. Here the null
hypothesis is that there is no significant difference between the sample mean weight and the known average
weight, which means the complaint is false.
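A small sketch of the same z-test in Python is shown below; it uses the normal CDF from SciPy to get the one-tailed p-value, so the numbers should match the hand calculation up to rounding.

```python
import math
from scipy.stats import norm

x_bar = 29.5   # sample mean (kg)
mu = 32        # population mean (kg)
sigma = 9      # population standard deviation (kg)
n = 25         # sample size
alpha = 0.05   # significance level

z = (x_bar - mu) / (sigma / math.sqrt(n))
p_value = norm.cdf(z)   # one-tailed: P(Z <= z), because the complaint is "underfed"

print("z =", round(z, 2))              # about -1.39
print("p-value =", round(p_value, 4))  # about 0.0823
print("reject null hypothesis:", p_value < alpha)
```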
There are two types of ANOVA test:
1. One-way ANOVA
2. Two-way ANOVA
Low noise | | Medium noise | | High noise | |
---|---|---|---|---|---|
Student | Question(X) | Student | Question(X) | Student | Question(X) |
1 | 10 | 5 | 8 | 9 | 4 |
2 | 9 | 6 | 4 | 10 | 3 |
3 | 6 | 7 | 6 | 11 | 6 |
4 | 7 | 8 | 7 | 12 | 4 |
The table shows the amount of noise and the number of questions students can solve. There are three
categories: low, medium, and high noise. Here the number of students in each category is 4. Find the relation
between noise and the number of questions solved. You can see that as noise increases, the number of
questions solved decreases, so there is a negative correlation. Using ANOVA you can find how significant the
effect of noise is.
To calculate:
Step 1: Find X² and calculate the sum of the Question(X) and X² columns
Low noise | | | Medium noise | | | High noise | | |
---|---|---|---|---|---|---|---|---|
Student | Question(X) | X² | Student | Question(X) | X² | Student | Question(X) | X² |
1 | 10 | 100 | 5 | 8 | 64 | 9 | 4 | 16 |
2 | 9 | 81 | 6 | 4 | 16 | 10 | 3 | 9 |
3 | 6 | 36 | 7 | 6 | 36 | 11 | 6 | 36 |
4 | 7 | 49 | 8 | 7 | 49 | 12 | 4 | 16 |
 | sum=32 | sum=266 | | sum=25 | sum=165 | | sum=17 | sum=77 |
step 2:
Calculate correction term
Formula: C = (∑X)²/N
Here,
C = (32+25+17)²/12 = 456.33
Step 3:
Sum all the squared values (the X² columns) and subtract the correction term.
SST = ∑X² - C = (266+165+77) - 456.33 = 51.67
Step 4:
Square each group total, divide it by the number of observations in that group, add them up, and subtract C.
SSA = ∑{(group total)²/(group size)} - C = (32²/4) + (25²/4) + (17²/4) - 456.33 = 28.17
Step 5:
Here calculate SST-SSA.
SSW=SST-SSA=51.67-28.17=23.5
Step 6:
Mean sum of squares among groups:
Formula: MSSA = SSA/(k-1)
Here k is the number of categories.
So MSSA = 28.17/(3-1) = 14.085
Step 7:
Mean sum of squares within groups:
MSSW = SSW/(N-k) = 23.5/(12-3) = 2.611
Step 8:
Here we calculate F ratio.
F ratio=MSSA/MSSW=14.085/2.611=5.394
Step 9:
Calculate the degrees of freedom for among groups and within groups. Let's summarize the
calculation.
 | df | SS | MSS | F ratio |
---|---|---|---|---|
Among Groups | k-1=2 | 28.17 | 14.085 | 5.394 |
Within Groups | N-k=9 | 23.5 | 2.611 | |
Total | N-1=11 | 51.67 | | |
Now, using the degrees of freedom among groups and within groups, you get the value 4.2565 from the F table. If you compare this value with the F ratio, you will see that the F ratio (5.394) is greater than the table value. When this happens, reject the null hypothesis and accept the alternative hypothesis, which is that there is a correlation between noise and the number of questions solved.
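The same one-way ANOVA can be reproduced with `scipy.stats.f_oneway`, as sketched below; it returns the F ratio and the p-value, and the critical value can be looked up from the F distribution. Small differences from the hand calculation come from rounding.

```python
from scipy.stats import f, f_oneway

# Number of questions solved in each noise group (from the table above)
low = [10, 9, 6, 7]
medium = [8, 4, 6, 7]
high = [4, 3, 6, 4]

f_stat, p_value = f_oneway(low, medium, high)
critical = f.ppf(1 - 0.05, dfn=2, dfd=9)   # F critical value at 0.05, df = (k-1, N-k)

print("F ratio =", round(f_stat, 3))       # about 5.39
print("p-value =", round(p_value, 4))
print("F critical =", round(critical, 3))  # about 4.256
print("reject null hypothesis:", f_stat > critical)
```

The next table is the data used for the two-way ANOVA example.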
Student | Low noise | Medium noise | High noise |
---|---|---|---|
Male students | | | |
 | 10 | 7 | 4 |
 | 12 | 9 | 5 |
 | 11 | 8 | 6 |
 | 9 | 12 | 5 |
Female students | | | |
 | 12 | 13 | 6 |
 | 13 | 15 | 6 |
 | 10 | 12 | 4 |
 | 13 | 12 | 4 |
Using this table, determine whether noise and gender affect the marks of a student, and whether gender affects
how a student reacts to noise.
Step 1:
Calculate the row total and column total.
Student | Low noise | Medium noise | High noise | Row total |
---|---|---|---|---|
Male students | | | | |
 | 10 | 7 | 4 | R1=98 |
 | 12 | 9 | 5 | |
 | 11 | 8 | 6 | |
 | 9 | 12 | 5 | |
Female students | | | | |
 | 12 | 13 | 6 | R2=120 |
 | 13 | 15 | 6 | |
 | 10 | 12 | 4 | |
 | 13 | 12 | 4 | |
Column total | C1=90 | C2=88 | C3=40 | |
Step 2:
Correction term (Cx) = (∑X)²/(number of observations)
Here,
Cx = (10+12+11+9+7+..+4)²/24 = 1980
Step 3:
Find the sum of squares of the total: SST = ∑X² - Cx
Here,
SST = (10²+12²+11²+9²+7²+..+4²) - 1980 = 274
Step 4:
Sum of squares of columns (SSC) = {(90²+88²+40²)/8} - 1980 = 200
Here 8 is the number of readings in each column.
Step 5:
Sum of squares of rows (SSR) = {(98²+120²)/12} - 1980 = 20
Here each row (gender group) contains 12 readings.
Step 6:
Sum of squares within groups (SSG), i.e. the interaction term:
SSG = {(42²+48²+..+20²)/4} - 1980 - 200 - 20 = 16.33
Here 42, 48, .., 20 are the six subgroup totals (gender x noise), and each subgroup contains 4 readings.
Step 7:
Residual sum of squares (SSE):
SSE = SST - SSC - SSR - SSG = 37
Step 8:
 | df | SS | MSSX (SS/df) | F ratio (MSSX/MSSE) |
---|---|---|---|---|
Noise | C-1=2 | 200 | MSSC=100 | 48.73 |
Gender | R-1=1 | 20 | MSSR=20 | 9.81 |
Interaction | (C-1)*(R-1)=2 | 16.33 | MSSG=8.167 | 3.97 |
Residual | C*R*(n-1)=18 | 37 | MSSE=2.06 | |
Total | N-1=23 | 273.33 | | |
In the table,
C=No of columns(in this example noise categories)=3
R=No of rows(here student categories)=2
N=Total number of students=24
n=No of students in a group=4
Step 9:
 | df | F ratio (MSSX/MSSE) | F critical at 0.05 |
---|---|---|---|
Noise | C-1=2 | 48.73 | F(2,18)=3.55 |
Gender | R-1=1 | 9.81 | F(1,18)=4.41 |
Interaction | (C-1)*(R-1)=2 | 3.97 | F(2,18)=3.55 |
Residual | C*R*(n-1)=18 | | |
Total | N-1=23 | | |
Now, if the critical value is less than the F ratio, reject the null hypothesis; if it is greater, accept the null hypothesis. In this example the F ratios for noise, gender, and the interaction are all greater than their critical values, so the null hypothesis is rejected in each case: noise and gender both affect the marks, and gender also affects how a student reacts to noise.
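A two-way ANOVA like this one is easiest to reproduce with statsmodels (assuming it is installed): build a long-format DataFrame with one row per student, fit an ordinary least squares model with an interaction term, and call `anova_lm`. This is only a sketch; the F ratios should be close to the table above, up to rounding.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Long-format data: one row per student (marks from the table above)
marks = [10, 12, 11, 9, 7, 9, 8, 12, 4, 5, 6, 5,       # male: low, medium, high
         12, 13, 10, 13, 13, 15, 12, 12, 6, 6, 4, 4]   # female: low, medium, high
noise = (["low"] * 4 + ["medium"] * 4 + ["high"] * 4) * 2
gender = ["male"] * 12 + ["female"] * 12

df = pd.DataFrame({"marks": marks, "noise": noise, "gender": gender})

# Two-way ANOVA with interaction: marks ~ noise + gender + noise:gender
model = ols("marks ~ C(noise) * C(gender)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)   # sum_sq, df, F and p-value for noise, gender and the interaction
```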