Correlation in statistics

What is Correlation?

Correlation measures how two features are related to each other. What kind of relationship does it describe?

Suppose you have an independent feature X and a dependent feature Y. The question is whether Y increases or decreases as X increases. If Y tends to increase as X increases, the features are positively correlated; if Y tends to decrease as X increases, they are negatively correlated. These relationships play a very important role in data analysis. You can also measure the strength of the correlation, that is, how strongly positively or negatively the two features are related to each other.

What is Covariance?

Suppose you asked 5 students their grades in two math classes, Math 1 and Math 2.

Student	Math 1(X)	Math 2(Y)
1	96	71
2	78	55
3	100	88
4	63	48
5	83	63

Find the covariance between the marks in these two subjects.
Formula:
Cov(X,Y)=E(XY)-E(X)E(Y)
Here E means expected value.
First calculate the expected values E(X), E(Y) and E(XY).

Student	Math 1(X)	Math 2(Y)	Math 1* Math 2(XY)
1	96	71	6816
2	78	55	4290
3	100	88	8800
4	63	48	3024
5	83	63	5229
Mean	E(X)=84	E(Y)=65	E(XY)=5631.8

To find the expected values of X and Y, calculate the average of the X column and of the Y column. For E(XY), first multiply the X and Y columns row by row, then take the average of the products.
Now put these values into the formula:
Cov(X,Y)=E(XY)-E(X)*E(Y)=5631.8-(84*65)=5631.8-5460=171.8
The value is positive, which means that X and Y are positively related: students with higher Math 1 marks tend to have higher Math 2 marks.
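
The same calculation can be sketched in Python. This is a minimal illustration that assumes the expected value is the plain average of each column, as in the table above; the names `mean`, `covariance`, `math1` and `math2` are chosen here for illustration and are not from any library.

```python
# Marks of the five students from the table above.
math1 = [96, 78, 100, 63, 83]   # X
math2 = [71, 55, 88, 48, 63]    # Y


def mean(values):
    """Expected value of a column, taken here as the plain average."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Covariance via Cov(X, Y) = E(XY) - E(X) * E(Y)."""
    products = [x * y for x, y in zip(xs, ys)]
    return mean(products) - mean(xs) * mean(ys)


cov_xy = covariance(math1, math2)
print(round(cov_xy, 1))  # 171.8
```

Note that this divides by the number of students (5), which matches the formula used here. Some libraries divide by N-1 instead, so their result can differ.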

How to calculate correlation?

Formula: Correlation=Cov(X,Y)/√(V(X)*V(Y))
Here, V(X) and V(Y) are the variances of the X and Y columns, and Cov(X,Y) is the covariance of the X and Y columns. The formulas for V(X) and V(Y) are:
V(X)=E(X²)-E(X)²
V(Y)=E(Y²)-E(Y)²

We will continue the covariance example.
From the covariance example we know:
Cov(X,Y)=171.8
E(X)=84, E(X)²=84²=7056
E(Y)=65, E(Y)²=65²=4225
To calculate E(X²) and E(Y²) we need the X² and Y² columns.

Student	Math 1(X)	Math 2(Y)	Math 1(X²)	Math 2(Y²)
1	96	71	9216	5041
2	78	55	6084	3025
3	100	88	10000	7744
4	63	48	3969	2304
5	83	63	6889	3969
			E(X²)=7231.6	E(Y²)=4416.6

So, V(X)=7231.6-7056=175.6
V(Y)=4416.6-4225=191.6

Correlation=Cov(X,Y)/√(V(X)*V(Y))=171.8/√(175.6*191.6)=171.8/183.43≈0.94

The value is close to 1, so the marks in the two classes have a strong positive correlation.
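
As a cross-check, the variance and correlation formulas can be computed directly from the five pairs of marks. The sketch below assumes the same plain-average definition of expected value; the helper names are illustrative only.

```python
import math

math1 = [96, 78, 100, 63, 83]   # X
math2 = [71, 55, 88, 48, 63]    # Y


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """V(X) = E(X²) - E(X)²."""
    squares = [v * v for v in values]
    return mean(squares) - mean(values) ** 2


def covariance(xs, ys):
    """Cov(X, Y) = E(XY) - E(X) * E(Y)."""
    products = [x * y for x, y in zip(xs, ys)]
    return mean(products) - mean(xs) * mean(ys)


def correlation(xs, ys):
    """Correlation = Cov(X, Y) / sqrt(V(X) * V(Y))."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


var_x = variance(math1)
var_y = variance(math2)
r = correlation(math1, math2)
print(round(var_x, 1), round(var_y, 1))  # 175.6 191.6
print(round(r, 3))                       # 0.937
```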

Pearson Correlation Coefficient

Formula: ρ(x,y)=Cov(X,Y)/(std_x*std_y)

To get the Pearson correlation coefficient, divide the covariance of X and Y by the product of the standard deviation of x and the standard deviation of y. Covariance tells you only the direction of the relationship between features, positive or negative. Its magnitude depends on the units of the data, so it does not tell you how strongly the features are related. In short, with covariance you don't know the strength of a relationship. The Pearson correlation coefficient gives you both the direction and the strength of the relationship. Because the standard deviation is the square root of the variance, this is the same formula as in the previous section.

Strength means how closely the two features follow a straight-line relationship, whether positive or negative.

The Pearson correlation coefficient always lies between -1 and +1, inclusive.

If y increases as x increases and all the data points lie on a straight line, the Pearson correlation coefficient is 1.

If y decreases as x increases and all the data points lie on a straight line, the Pearson correlation coefficient is -1.

If there is no linear relationship, the Pearson correlation coefficient is close to 0. This can happen, for example, when the data are randomly scattered.

If y decreases as x increases but the data points do not lie on a straight line, the Pearson correlation coefficient is between -1 and 0. Likewise, if y increases as x increases but the points do not lie on a straight line, the value is between 0 and 1.
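
These cases can be illustrated with a short Python sketch. The data below are hypothetical values made up for this illustration, and `pearson` is an illustrative helper, not a library function. It assumes population standard deviations (dividing by N).

```python
import math


def pearson(xs, ys):
    """Pearson coefficient: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


x = [1, 2, 3, 4, 5]                      # hypothetical data
rising_line = [2 * v + 1 for v in x]     # y = 2x + 1, exactly on a line
falling_line = [10 - 3 * v for v in x]   # y = 10 - 3x, exactly on a line
falling_noisy = [9, 8, 4, 5, 1]          # mostly falls, but not on a line

r_up = pearson(x, rising_line)
r_down = pearson(x, falling_line)
r_noisy = pearson(x, falling_noisy)
print(round(r_up, 6), round(r_down, 6), round(r_noisy, 3))  # 1.0 -1.0 -0.936
```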

Spearman's Rank Correlation

When you have qualitative data that is difficult to express as numbers, such as dedication, beauty, effort, or honesty, you can use Spearman's rank correlation. It is useful whenever ranks are given or can be computed. Spearman's rank correlation is represented by capital R.

How to calculate Spearman's Rank Correlation
1. First, give ranks to the traits.
2. Rank both variables in the same order. If one is ranked in increasing order, the other should be ranked in increasing order too; if one is in decreasing order, the other should be too.
3. Find the difference between rank 1 and rank 2 for each observation. It is denoted by D.
4. Square each difference to get D², then add them up to get ΣD².
5. Find the count N, which is the number of paired observations.
6. Put the values of ΣD² and N into the formula.

Formula: R=1-(6*ΣD²)/(N³-N)
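
The steps above can be sketched as a small Python function. This is a minimal sketch that assumes there are no tied scores (ties need averaged ranks, which this simple formula does not handle). The names `to_ranks` and `spearman_r`, and the `effort` and `honesty` scores, are hypothetical and used only for illustration.

```python
def to_ranks(scores):
    """Step 1-2: highest score gets rank 1 (assumes no tied scores)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_r(scores_a, scores_b):
    rank_a = to_ranks(scores_a)
    rank_b = to_ranks(scores_b)
    # Step 3-4: rank differences D, squared and summed.
    sum_d_squared = sum((ra - rb) ** 2 for ra, rb in zip(rank_a, rank_b))
    # Step 5: N is the number of paired observations.
    n = len(scores_a)
    # Step 6: R = 1 - 6 * sum(D^2) / (N^3 - N).
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)


# Hypothetical scores for four people on two traits.
effort = [7, 9, 4, 6]
honesty = [8, 6, 3, 5]

r = spearman_r(effort, honesty)
print(to_ranks(effort), to_ranks(honesty))  # [2, 1, 4, 3] [1, 2, 4, 3]
print(round(r, 4))                          # 0.8
```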

Let's see an example:
You have a table from a beauty contest with 7 contestants, each given marks by two judges. Now you have to find the correlation between the two judges' marks.

Contestant	Judge 1	Judge 2
1	36	39
2	27	25
3	42	41
4	39	43
5	18	22
6	35	29
7	29	32

Now convert the judges' marks into ranks.

Contestant	Judge 1	Judge 2	Judge 1 rank	Judge 2 rank
1	36	39	3	3
2	27	25	6	6
3	42	41	1	2
4	39	43	2	1
5	18	22	7	7
6	35	29	4	5
7	29	32	5	4

To assign ranks, the highest mark gets rank 1, the second-highest gets rank 2, the third-highest gets rank 3, and so on. Apply the same process to both judges.
Now use the ranks for the calculation:

Contestant	Judge 1 rank	Judge 2 rank	D(R₁-R₂)	D²
1	3	3	0	0
2	6	6	0	0
3	1	2	-1	1
4	2	1	1	1
5	7	7	0	0
6	4	5	-1	1
7	5	4	1	1
				ΣD²=4

R=1-(6*4)/(7³-7)=1-24/336=0.9286

Now let's see what the R value from the calculation means.
R=0.9286
It means that there is a strong positive correlation between the marks given by the two judges.
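
The worked example can be checked with a few lines of Python. This sketch starts from the two rank columns of the table above; the variable names are illustrative.

```python
# Ranks given to the 7 contestants by each judge (from the rank table).
judge1_rank = [3, 6, 1, 2, 7, 4, 5]
judge2_rank = [3, 6, 2, 1, 7, 5, 4]

# D is the rank difference for each contestant.
d = [r1 - r2 for r1, r2 in zip(judge1_rank, judge2_rank)]
sum_d_squared = sum(v * v for v in d)
n = len(d)  # number of contestants (paired observations)

r = 1 - (6 * sum_d_squared) / (n ** 3 - n)
print(d)              # [0, 0, -1, 1, 0, -1, 1]
print(sum_d_squared)  # 4
print(round(r, 4))    # 0.9286
```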
