Correlation in statistics

What is Correlation?

Correlation measures how two features are related to each other.
What kind of relationship does it capture?

Suppose you have an independent feature X and a dependent feature Y. The question is whether Y increases or decreases as X increases. If Y increases when X increases, the two features are positively correlated; if Y decreases when X increases, they are negatively correlated. These relationships play an important role in data analysis. You can also measure the strength of the correlation, that is, how strongly positive or negative the relationship between the two features is.

What is Covariance?

Suppose you asked 5 students their grades in two math classes, Math 1 and Math 2.


Student   Math 1 (X)   Math 2 (Y)
1         96           71
2         78           55
3         100          88
4         63           48
5         83           63

Find the covariance between the marks in these two subjects.
Formula:
Cov(X, Y) = E(XY) - E(X)E(Y)
Here E means expected value.
First calculate the expected values E(X), E(Y) and E(XY).


Student   Math 1 (X)   Math 2 (Y)   Math 1 * Math 2 (XY)
1         96           71           6816
2         78           55           4290
3         100          88           8800
4         63           48           3024
5         83           63           5229
Mean      E(X)=84      E(Y)=65      E(XY)=5631.8

To find the expected values of X and Y, take the average of the X column and of the Y column. For E(XY), first multiply the X and Y columns row by row, then take the average of the products.
Now put these values into the formula:
Cov(X, Y) = E(XY) - E(X) * E(Y) = 5631.8 - (84 * 65) = 171.8
The result is positive, which means that X and Y are positively correlated.
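
The covariance calculation can be sketched in a few lines of Python. This is a minimal illustration using the five grades from the table; the helper names `mean` and `covariance` are illustrative choices, not library functions. Note that it divides by N (all 5 students), as the formula in the text does, whereas many libraries divide by N-1 by default and would therefore report a different number.

```python
# Grades from the table above
math1 = [96, 78, 100, 63, 83]   # X
math2 = [71, 55, 88, 48, 63]    # Y

def mean(values):
    """Average of a list; plays the role of E(...) in the text."""
    return sum(values) / len(values)

def covariance(x, y):
    """Covariance as E(XY) - E(X) * E(Y), dividing by N."""
    xy = [a * b for a, b in zip(x, y)]   # row-by-row products
    return mean(xy) - mean(x) * mean(y)

cov_xy = covariance(math1, math2)
print(round(cov_xy, 1))   # 171.8
```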

How to calculate correlation?

Formula: Correlation = Cov(X, Y) / √(V(X) * V(Y))
Here V(X) and V(Y) are the variances of the X and Y columns, and Cov(X, Y) is the covariance of the X and Y columns. The variances are calculated as
V(X) = E(X^2) - [E(X)]^2
V(Y) = E(Y^2) - [E(Y)]^2

We continue with the covariance example.
From that example we know:
Cov(X, Y) = 171.8
E(X) = 84, so [E(X)]^2 = 84^2 = 7056
E(Y) = 65, so [E(Y)]^2 = 65^2 = 4225
To calculate E(X^2) and E(Y^2) we need the X^2 and Y^2 columns.


Student   Math 1 (X)   Math 2 (Y)   X^2             Y^2
1         96           71           9216            5041
2         78           55           6084            3025
3         100          88           10000           7744
4         63           48           3969            2304
5         83           63           6889            3969
Mean                                E(X^2)=7231.6   E(Y^2)=4416.6

So, V(X) = 7231.6 - 7056 = 175.6
V(Y) = 4416.6 - 4225 = 191.6

Correlation = Cov(X, Y) / √(V(X) * V(Y)) = 171.8 / √(175.6 * 191.6) ≈ 0.94
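
The same calculation can be checked with a short Python sketch. It assumes the variance formula E(X^2) - [E(X)]^2 given above and divides by N throughout; the helper names are illustrative and not taken from any library.

```python
import math

math1 = [96, 78, 100, 63, 83]   # X
math2 = [71, 55, 88, 48, 63]    # Y

def mean(values):
    """Average of a list; plays the role of E(...) in the text."""
    return sum(values) / len(values)

def variance(values):
    """Variance as E(X^2) - [E(X)]^2."""
    return mean([v * v for v in values]) - mean(values) ** 2

def covariance(x, y):
    """Covariance as E(XY) - E(X) * E(Y)."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

def correlation(x, y):
    """Covariance divided by the square root of the product of variances."""
    return covariance(x, y) / math.sqrt(variance(x) * variance(y))

r = correlation(math1, math2)
print(round(r, 3))
```

Computing the squares in code rather than by hand is a convenient way to catch arithmetic slips in the X^2 and Y^2 columns.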

Pearson Correlation Coefficient

Formula: ρ(X, Y) = Cov(X, Y) / (std(X) * std(Y))

To get the Pearson correlation coefficient, divide the covariance of X and Y by the product of the standard deviation of X and the standard deviation of Y. Because the standard deviation is the square root of the variance, this is the same formula as in the previous section. Covariance tells you only the direction of the relationship between features, positive or negative. Its magnitude depends on the units of the features, so it does not tell you how strong the relationship is. The Pearson correlation coefficient gives you both the direction and the strength of the relationship.

Strength means how closely the features follow a positive or negative linear relationship.

Always remember that the Pearson correlation coefficient lies between -1 and +1, inclusive.

If y increases as x increases and all the data points lie exactly on a straight line, the Pearson correlation coefficient is +1.

If y decreases as x increases and all the data points lie exactly on a straight line, the Pearson correlation coefficient is -1.

If there is no linear relationship, the Pearson correlation coefficient is 0. This is the case, for example, when the data points are randomly scattered.

If y decreases as x increases but the data points do not lie exactly on a straight line, the Pearson correlation coefficient is between -1 and 0. Likewise, if y increases as x increases but the points are not exactly on a line, the value is between 0 and +1.
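
These cases can be tried out with a small Python sketch. The data sets below are hypothetical and chosen only to illustrate each case; `pearson` is an illustrative helper built on the standard library's `statistics` module, using population standard deviations.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical data, one y series per case described above
x = [1, 2, 3, 4, 5]
cases = {
    "perfect increasing line": [3, 5, 7, 9, 11],
    "perfect decreasing line": [11, 9, 7, 5, 3],
    "decreasing, not on a line": [10, 9, 5, 6, 2],
    "no linear relationship": [4, 1, 0, 1, 4],
}

for name, y in cases.items():
    print(f"{name}: {pearson(x, y):+.3f}")
```

The last series follows a curve rather than being randomly scattered, which shows that a coefficient of 0 rules out only a linear relationship.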

Spearman's Rank Correlation

Use Spearman's rank correlation when you have qualitative data that is difficult to express as numbers, such as dedication, beauty, effort or honesty. It also applies to other variables that cannot be quantified directly. Spearman's rank correlation is useful when ranks are given or can be computed. It is represented by a capital R.

How to calculate Spearman's Rank Correlation
1. First, give ranks to the traits.
2. Rank both variables in the same direction: if one is ranked in increasing order, the other must be ranked in increasing order too, and likewise for decreasing order.
3. Find the difference between rank 1 and rank 2 for each observation. It is denoted by D.
4. Square each difference to get D^2 and add them up to get ΣD^2.
5. Find the count N, the number of paired observations.
6. Now put the values of ΣD^2 and N into the formula.

Formula: R = 1 - (6 * ΣD^2) / (N^3 - N)
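
The steps can be sketched in Python as follows. This is a minimal illustration that assumes no two scores are tied; the function names are illustrative and the sample scores are hypothetical.

```python
def ranks_descending(scores):
    """Steps 1-2: rank 1 goes to the highest score. Assumes no ties."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def spearman_r(scores_a, scores_b):
    """Steps 3-6: R = 1 - 6 * sum(D^2) / (N^3 - N)."""
    rank_a = ranks_descending(scores_a)
    rank_b = ranks_descending(scores_b)
    d_squared = [(ra - rb) ** 2 for ra, rb in zip(rank_a, rank_b)]
    n = len(scores_a)   # number of paired observations
    return 1 - 6 * sum(d_squared) / (n ** 3 - n)

# Hypothetical scores for illustration only
print(spearman_r([10, 20, 30, 40], [1, 2, 3, 4]))   # same order: 1.0
print(spearman_r([10, 20, 30, 40], [4, 3, 2, 1]))   # reversed order: -1.0
```

Both lists are ranked with the same helper, which guarantees that the ranking direction is the same for the two variables.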

Let's see an example:
You have the results of a beauty contest with 7 contestants, each scored by two judges. You have to find the correlation between the two judges' marks.


Contestant   Judge 1   Judge 2
1            36        39
2            27        25
3            42        41
4            39        43
5            18        22
6            35        29
7            29        32

Now convert the judges' scores into ranks.


Contestant   Judge 1   Judge 2   Judge 1 rank   Judge 2 rank
1            36        39        3              3
2            27        25        6              6
3            42        41        1              2
4            39        43        2              1
5            18        22        7              7
6            35        29        4              5
7            29        32        5              4

To assign ranks, give rank 1 to the highest score, rank 2 to the second-highest, rank 3 to the third-highest, and so on. Apply the same process to both judges.
Now use the ranks to do the calculation:


Contestant   Judge 1 rank (R1)   Judge 2 rank (R2)   D (R1-R2)   D^2
1            3                   3                   0           0
2            6                   6                   0           0
3            1                   2                   -1          1
4            2                   1                   1           1
5            7                   7                   0           0
6            4                   5                   -1          1
7            5                   4                   1           1
                                                                 ΣD^2=4

Formula: R = 1 - (6 * 4) / (7^3 - 7) = 1 - 24/336 = 0.9286

Now let's see what the R value from the calculation means.
R = 0.9286
This is close to +1, so there is a strong positive correlation between the marks given by the two judges.
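
As a cross-check, the sketch below recomputes R from the two rank columns and compares it with the Pearson correlation coefficient calculated directly on the ranks. When there are no tied ranks, as here, the two values agree. The variable names are illustrative.

```python
from statistics import mean, pstdev

# Rank columns from the table above
judge1_rank = [3, 6, 1, 2, 7, 4, 5]
judge2_rank = [3, 6, 2, 1, 7, 5, 4]

# Spearman's formula: R = 1 - 6 * sum(D^2) / (N^3 - N)
n = len(judge1_rank)
sum_d2 = sum((a - b) ** 2 for a, b in zip(judge1_rank, judge2_rank))
r_formula = 1 - 6 * sum_d2 / (n ** 3 - n)

# Pearson correlation applied to the ranks themselves
m1, m2 = mean(judge1_rank), mean(judge2_rank)
cov = mean([(a - m1) * (b - m2) for a, b in zip(judge1_rank, judge2_rank)])
r_pearson = cov / (pstdev(judge1_rank) * pstdev(judge2_rank))

print(sum_d2, round(r_formula, 4), round(r_pearson, 4))   # 4 0.9286 0.9286
```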
