Dispersion is the extent to which data values are spread out. Measures of dispersion describe how far the data are spread around a chosen central point (mean, median, or mode), and so help you understand the distribution of the data.
The range is the difference between the largest and smallest values in the data.
Formula: Range = Largest value - Smallest value
Example: 4, 6, 7, 10, 15
Range = 15 - 4 = 11
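The range formula above can be sketched in a few lines of Python. This is a minimal illustration; the function name `value_range` is a hypothetical helper chosen for this example.

```python
def value_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

data = [4, 6, 7, 10, 15]
print(value_range(data))  # 11
```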
Quartiles divide sorted data into four equal parts.
Q1 is the 25th percentile: 25% of the data lies below it.
Q2 is the 50th percentile (the median): 50% of the data lies below it.
Q3 is the 75th percentile: 75% of the data lies below it.
Q4 is the 100th percentile: it is the largest value, so all of the data lies at or below it.
To find the quartiles, first sort the data in ascending order.
Example:
Data: 5,10,15,20,25,30,35,40
Q1,
i=(25*8)/100=2, Q1=(10+15)/2=12.5
The first quartile is at 25%, which is why you multiply by 25/100. There are 8 values in total, so you write
(25*8)/100 and get 2.
Because the result is 2, you take the value in the second position and the value in the next position, add
them, and divide the sum by 2. In the example, 10 is in the second position and 15 is in the next position.
Q2,
i=(50*8)/100=4, Q2=(20+25)/2=22.5
Here you use 50 because Q2 is at 50%. The value of i is 4, so you take the fourth value and the next value
and average them. The fourth value is 20 and the next value is 25.
Q3,
i=(75*8)/100=6, Q3=(30+35)/2=32.5
Here you use 75 because Q3 is at 75%. The value of i is 6, so you take the sixth value and the next value
and average them. The sixth value is 30 and the next value is 35.
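The position rule used in this example can be sketched as follows. The function name `quartile` is hypothetical, and the branch for a non-integer position (round up and take that value) is an assumption based on a common textbook convention; the text above only covers the case where i is a whole number.

```python
import math

def quartile(data, percent):
    """Quartile by the position rule: i = percent * n / 100."""
    values = sorted(data)
    n = len(values)
    i = percent * n / 100
    if i == int(i):
        k = int(i)
        if k >= n:
            # 100th percentile: there is no next value, return the largest.
            return values[-1]
        # Whole-number position: average the k-th and (k+1)-th values.
        return (values[k - 1] + values[k]) / 2
    # Assumption (not stated in the text): round up to the next position.
    return values[math.ceil(i) - 1]

data = [5, 10, 15, 20, 25, 30, 35, 40]
print(quartile(data, 25))  # 12.5
print(quartile(data, 50))  # 22.5
print(quartile(data, 75))  # 32.5
```

Library routines such as `numpy.percentile` interpolate by default, so they may return slightly different values than this rule.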
The interquartile range (IQR) is the range of values between the first and third quartiles.
The data must be sorted in ascending order. First, find the median of the data. Then split the data into two
parts around the median. Finally, find the median of each part: the median of the first part is Q1 and the
median of the second part is Q3.
Formula: Interquartile range=Q3-Q1
Example: 2,4,6,7,10,11,12,14,16
Median=10
First part = 2,4,6,7
Second part=11,12,14,16
Median of first part,
Q1=(4+6)/2=5
Median of second part,
Q3=(12+14)/2=13
Interquartile range=Q3-Q1=13-5=8
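The median-split procedure described above might look like this in Python. It is a sketch under one assumption taken from the worked example: when the number of values is odd, the median itself is left out of both halves. The function name `iqr_median_split` is hypothetical.

```python
from statistics import median

def iqr_median_split(data):
    """Return (Q1, Q3, IQR) using the median-split method."""
    values = sorted(data)
    n = len(values)
    lower = values[: n // 2]        # values below the median
    upper = values[(n + 1) // 2 :]  # values above the median
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = iqr_median_split([2, 4, 6, 7, 10, 11, 12, 14, 16])
print(q1, q3, iqr)  # 5.0 13.0 8.0
```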
Variance is the average of the squared deviations of a variable from its mean. It measures how far a set of
numbers is spread out from the mean. Variance is equal to the square of the standard deviation. If the
variance is high, the data is widely scattered around its mean; if the variance is low, the data is clustered
close to the mean. That's why variance is also called a measure of the spread of data around the mean.
Variance is represented by the symbol σ^2.
Ex: 20,34,30,35,45,40,55
Step 1: Find the mean.
(20+34+30+35+45+40+55)/7=37
Step 2: Subtract the mean from each value.
20-37=-17, 34-37=-3, 30-37=-7, 35-37=-2, 45-37=8, 40-37=3, 55-37=18
Step 3: Square each difference from step 2.
(-17)^2=289, (-3)^2=9, (-7)^2=49, (-2)^2=4, (8)^2=64, (3)^2=9, (18)^2=324
Step 4: Add up all the squared values.
289+9+49+4+64+9+324=748
Step 5: Divide the sum by the total number of elements.
748/7=106.857
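The five steps can be followed one by one in Python. This is a minimal sketch; the function name `population_variance` is hypothetical. Because the last step divides by the number of elements, the result is the population variance, which the standard library also provides as `statistics.pvariance`.

```python
def population_variance(values):
    mean = sum(values) / len(values)         # Step 1: mean
    deviations = [x - mean for x in values]  # Step 2: subtract the mean
    squared = [d ** 2 for d in deviations]   # Step 3: square each deviation
    total = sum(squared)                     # Step 4: sum of squares
    return total / len(values)               # Step 5: divide by the count

data = [20, 34, 30, 35, 45, 40, 55]
print(round(population_variance(data), 3))  # 106.857
```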
For grouped data, first find the class boundaries and then the frequency of each class. Each boundary is 0.5 below the lower class limit or 0.5 above the upper class limit, the frequency (f) is the number of values that fall within the boundaries, and the midpoint (m) is the sum of the two boundaries divided by 2.
Class boundaries | Frequency (f) | Midpoint (m) | f*m | f*m^2 |
---|---|---|---|---|
3.5-8.5 | 2 | 6 | 12 | 72 |
8.5-13.5 | 3 | 11 | 33 | 363 |
13.5-18.5 | 1 | 16 | 16 | 256 |
18.5-23.5 | 3 | 21 | 63 | 1323 |
23.5-28.5 | 5 | 26 | 130 | 3380 |
28.5-33.5 | 4 | 31 | 124 | 3844 |
Total | n=18 | | ∑f*m=378 | ∑f*m^2=9238 |
Formula:
σ^2 = {n(∑f*m^2) - (∑f*m)^2} / {n(n-1)} = {18(9238) - (378)^2} / {18(18-1)} = 23400/306 = 76.47
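The grouped-data formula can be checked with a short Python sketch. The list layout (lower boundary, upper boundary, frequency) and the function name `grouped_variance` are assumptions made for this illustration. Note that this formula divides by n(n-1), unlike the ungrouped example, which divides by n.

```python
# Each tuple is (lower boundary, upper boundary, frequency) from the table.
classes = [
    (3.5, 8.5, 2),
    (8.5, 13.5, 3),
    (13.5, 18.5, 1),
    (18.5, 23.5, 3),
    (23.5, 28.5, 5),
    (28.5, 33.5, 4),
]

def grouped_variance(classes):
    n = sum(f for _, _, f in classes)
    # The midpoint of each class is the average of its two boundaries.
    sum_fm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    sum_fm2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)
    return (n * sum_fm2 - sum_fm ** 2) / (n * (n - 1))

print(round(grouped_variance(classes), 2))  # 76.47
```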
The standard deviation is the positive square root of the variance. It is represented by the symbol σ.
How to calculate standard deviation?
Ex: 20,34,30,35,45,40,55
Step 1: Find the mean.
(20+34+30+35+45+40+55)/7=37
Step 2: Subtract the mean from each value.
20-37=-17, 34-37=-3, 30-37=-7, 35-37=-2, 45-37=8, 40-37=3, 55-37=18
Step 3: Square each difference from step 2.
(-17)^2=289, (-3)^2=9, (-7)^2=49, (-2)^2=4, (8)^2=64, (3)^2=9, (18)^2=324
Step 4: Add up all the squared values.
289+9+49+4+64+9+324=748
Step 5: Divide the sum by the total number of elements.
748/7=106.857
Step 6: Take the square root.
√106.857=10.34
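As a quick check of these steps, the standard library can compute the same quantities. This sketch uses `statistics.pvariance`, which divides by the number of elements as in step 5, and then takes the square root as in step 6; `statistics.pstdev` does both in one call.

```python
import math
import statistics

data = [20, 34, 30, 35, 45, 40, 55]

variance = statistics.pvariance(data)  # steps 1 to 5
std_dev = math.sqrt(variance)          # step 6

print(round(std_dev, 2))  # 10.34
```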
For grouped data, first find the class boundaries and then the frequency of each class. Each boundary is 0.5 below the lower class limit or 0.5 above the upper class limit, the frequency (f) is the number of values that fall within the boundaries, and the midpoint (m) is the sum of the two boundaries divided by 2.
Class boundaries | Frequency (f) | Midpoint (m) | f*m | f*m^2 |
---|---|---|---|---|
3.5-8.5 | 2 | 6 | 12 | 72 |
8.5-13.5 | 3 | 11 | 33 | 363 |
13.5-18.5 | 1 | 16 | 16 | 256 |
18.5-23.5 | 3 | 21 | 63 | 1323 |
23.5-28.5 | 5 | 26 | 130 | 3380 |
28.5-33.5 | 4 | 31 | 124 | 3844 |
Total | n=18 | | ∑f*m=378 | ∑f*m^2=9238 |
Formula:
Variance (σ^2) = {n(∑f*m^2) - (∑f*m)^2} / {n(n-1)} = {18(9238) - (378)^2} / {18(18-1)} = 23400/306 = 76.47
Standard deviation (σ) = √76.47 = 8.74
Mean absolute deviation (MAD) is the average absolute difference between each data point and the mean of the
dataset. It is used as a measure of variability or dispersion in a data set. To calculate it, first calculate
the mean of the data. Then subtract the mean from each element, take the absolute value of each difference,
and average those absolute values.
Example:
Data: 4,6,7,8,10
Mean=(4+6+7+8+10)/5=7
Deviations from the mean:
(4-7)=-3, (6-7)=-1, (7-7)=0, (8-7)=1, (10-7)=3
Here a minus (-) sign means the element lies below the mean. So if you go 3 steps backward from the mean you
reach element 4, and if you go 3 steps forward you reach element 10.
X | X- µ | |X- µ| |
---|---|---|
4 | -3 | +3 |
6 | -1 | +1 |
7 | 0 | 0 |
8 | +1 | +1 |
10 | +3 | +3 |
Sum | 0 | 8 |
In the table, X is an element of the data and X- µ is that element minus the mean. If you add up all the
values of X- µ, the sum is 0, which tells you nothing about the spread. So you take the absolute value
(modulus) of each difference, which makes every value non-negative. The sum of the absolute values is 8.
Now divide 8 by the total number of elements.
Formula: Mean absolute deviation = Sum of (|X- µ|) / total number of elements
MAD=8/5=1.6
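The calculation in the table can be sketched in Python as follows. The function name `mean_absolute_deviation` is a hypothetical helper for this illustration.

```python
def mean_absolute_deviation(values):
    mean = sum(values) / len(values)
    # Average of the absolute differences from the mean.
    return sum(abs(x - mean) for x in values) / len(values)

print(mean_absolute_deviation([4, 6, 7, 8, 10]))  # 1.6
```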
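When each value X occurs with a frequency F, the mean and the mean absolute deviation are weighted by F.
X | F | FX |
---|---|---|
1 | 3 | 3 |
2 | 2 | 4 |
3 | 1 | 3 |
8 | 4 | 32 |
7 | 7 | 49 |
4 | 3 | 12 |
Sum | 20 | 103 |
Mean (µ) = ∑FX / ∑F = 103/20 = 5.15
Mean absolute deviation = ∑(F*|X- µ|) / ∑F = 48.70/20 = 2.435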
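The frequency-weighted version can be checked with a short sketch. The pair layout (value, frequency) and the function name `weighted_mad` are assumptions made for this example; the data comes from the table above.

```python
# Each pair is (value X, frequency F) from the table.
table = [(1, 3), (2, 2), (3, 1), (8, 4), (7, 7), (4, 3)]

def weighted_mad(pairs):
    total_f = sum(f for _, f in pairs)
    mean = sum(x * f for x, f in pairs) / total_f  # sum(FX) / sum(F)
    return sum(f * abs(x - mean) for x, f in pairs) / total_f

print(round(weighted_mad(table), 3))  # 2.435
```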
The coefficient of variation (CV) is used to compare two or more surveys or test results that have different
units, measures, or scales. It expresses the standard deviation as a percentage of the mean, so results in
different units can be compared directly. It also shows the degree of variation of a set of data points.
Suppose you are comparing the results of two different tests. If sample X has a CV of 36% and sample Y has a
CV of 29%, then sample X has more variation relative to its mean.
Formula: CVar = (Standard deviation / Mean) * 100%
Example:
The mean number of magazine sales over a 6-month period is 172, and the standard deviation is 10. The mean
of the bonuses paid to employees is $6,000, and the standard deviation is $800. Compare the coefficients of
variation.
CVar for magazines = (10/172)*100% = 5.81%
CVar for bonuses = (800/6000)*100% = 13.33%
The coefficient of variation is larger for the employees' bonuses than for the magazine sales, so the
bonuses are more variable than the sales.
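The comparison can be sketched in Python using the figures given in the problem statement (mean 172 and standard deviation 10 for sales, mean $6,000 and standard deviation $800 for bonuses). The function name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

magazines = coefficient_of_variation(10, 172)
bonuses = coefficient_of_variation(800, 6000)

print(round(magazines, 2))  # 5.81
print(round(bonuses, 2))    # 13.33
print(bonuses > magazines)  # True
```

Because the result is a unitless percentage, a count of magazines and an amount in dollars can be compared on the same scale.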