Correlation describes how two features are related to each other.
What kind of relationship are we looking for?
Suppose you have an independent feature X and a dependent feature Y. You want to know whether Y tends to
increase when X increases, or whether Y tends to decrease when X increases. These relationships play a very
important role in data analysis. If Y increases along with X, the two features are positively correlated; if
Y decreases as X increases, they are negatively correlated. You can also measure the strength of the
correlation, that is, how strongly positive or negative the relationship between the two features is.
Suppose you asked 5 students their grades in two math classes, Math 1 and Math 2.
Student | Math 1(X) | Math 2(Y) |
---|---|---|
1 | 96 | 71 |
2 | 78 | 55 |
3 | 100 | 88 |
4 | 63 | 48 |
5 | 83 | 63 |
To find the correlation between the marks in these two subjects, start with the covariance.
Formula:
Cov(X,Y)=E(XY)-E(X)E(Y)
Here E means expected value.
First calculate E(X), E(Y) and E(XY):
Student | Math 1(X) | Math 2(Y) | Math 1* Math 2(XY) |
---|---|---|---|
1 | 96 | 71 | 6816 |
2 | 78 | 55 | 4290 |
3 | 100 | 88 | 8800 |
4 | 63 | 48 | 3024 |
5 | 83 | 63 | 5229 |
Mean (N=5) | E(X)=84 | E(Y)=65 | E(XY)=5631.8 |
To find the expected values of X and Y, take the average of the X column and of the Y column. For E(XY),
first multiply the X and Y values row by row, then average the products.
Now put these values into the formula:
Covariance(X,Y)=E(XY)-E(X)*E(Y)=5631.8-(84*65)=171.8
The result is positive, which means that X and Y are positively correlated.
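The calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `mean` and `covariance` are chosen here for illustration. It divides by N (population covariance), as the formula above does. Library functions often divide by N-1 instead, so they may return a slightly larger value.

```python
# Grades from the table above.
math1 = [96, 78, 100, 63, 83]   # X
math2 = [71, 55, 88, 48, 63]    # Y


def mean(values):
    """Average of a list of numbers (the expected value E in this example)."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance computed as E(XY) - E(X) * E(Y)."""
    products = [x * y for x, y in zip(xs, ys)]
    return mean(products) - mean(xs) * mean(ys)


cov_xy = covariance(math1, math2)
print(round(cov_xy, 1))  # 171.8
```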
Formula: Correlation = Cov(X,Y) / √(V(X)·V(Y))
Here, V(X) and V(Y) are the variances of the X and Y columns, and Cov(X,Y) is the covariance of the X and Y
columns. The variances are calculated as:
V(X) = E(X²) - [E(X)]²
V(Y) = E(Y²) - [E(Y)]²
Let's continue the covariance example.
From the covariance example we know:
Covariance=171.8
E(X)=84, so [E(X)]²=84²=7056
E(Y)=65, so [E(Y)]²=65²=4225
To calculate E(X²) and E(Y²) we need X² and Y².
Student | Math 1(X) | Math 2(Y) | Math 1(X²) | Math 2(Y²) |
---|---|---|---|---|
1 | 96 | 71 | 9216 | 5041 |
2 | 78 | 55 | 6084 | 3025 |
3 | 100 | 88 | 10000 | 7744 |
4 | 63 | 48 | 3969 | 2304 |
5 | 83 | 63 | 6889 | 3969 |
Mean | | | E(X²)=7231.6 | E(Y²)=4416.6 |
So, V(X)=7231.6-7056=175.6
V(Y)=4416.6-4225=191.6
Correlation = Cov(X,Y) / √(V(X)·V(Y)) = 171.8/√(175.6*191.6) ≈ 0.94
A value this close to +1 indicates a strong positive correlation between the two sets of grades.
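As a cross-check, the sketch below recomputes the two variances and the correlation directly from the five grade pairs, using the E(X²) - [E(X)]² form of the variance. The function names are illustrative, and everything is computed as population statistics (dividing by N).

```python
import math

math1 = [96, 78, 100, 63, 83]   # X
math2 = [71, 55, 88, 48, 63]    # Y


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance computed as E(X^2) - E(X)^2."""
    return mean([v * v for v in values]) - mean(values) ** 2


def correlation(xs, ys):
    """Cov(X, Y) / sqrt(V(X) * V(Y))."""
    cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
    return cov / math.sqrt(variance(xs) * variance(ys))


print(round(variance(math1), 1))            # 175.6
print(round(variance(math2), 1))            # 191.6
print(round(correlation(math1, math2), 3))  # 0.937
```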
Formula: ρ(X,Y) = Cov(X,Y) / (std(X) * std(Y))
To get the Pearson correlation coefficient, divide the covariance of X and Y by the product of the standard
deviation of X and the standard deviation of Y. Covariance tells you only the direction of the relationship
between features, that is, whether it is positive or negative. Its magnitude depends on the units of the
features, so it does not tell you how strongly they are related. In short, covariance does not give you the
strength of a relationship. The Pearson correlation coefficient gives you both the direction and the strength
of the relationship. Strength means how closely the features follow a straight-line relationship.
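A small experiment illustrates why covariance alone does not show strength. In the sketch below, the Math 1 grades are rescaled from a 0-100 scale to a 0-1 scale (a rescaling chosen purely for illustration). The covariance shrinks by a factor of 100, while the Pearson coefficient stays the same.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: E(XY) - E(X) * E(Y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


math1 = [96, 78, 100, 63, 83]
math2 = [71, 55, 88, 48, 63]
math1_scaled = [x / 100 for x in math1]  # same grades on a 0-1 scale

# Covariance changes with the units of X ...
print(round(covariance(math1, math2), 1), round(covariance(math1_scaled, math2), 3))
# 171.8 1.718

# ... but the Pearson coefficient does not.
print(round(pearson(math1, math2), 4), round(pearson(math1_scaled, math2), 4))
# 0.9366 0.9366
```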
The Pearson correlation coefficient always lies between -1 and +1, inclusive.
If y increases as x increases and all the data points lie exactly on a straight line, the Pearson correlation
coefficient is +1.
If y decreases as x increases and all the data points lie exactly on a straight line, the Pearson correlation
coefficient is -1.
If there is no linear relationship, the Pearson correlation coefficient is 0 or close to 0. A typical
scenario is data that are randomly scattered.
If y decreases as x increases but the data points do not lie exactly on a straight line, the Pearson
correlation coefficient lies between -1 and 0. Likewise, if y increases as x increases but not along a
perfect line, it lies between 0 and +1.
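These cases can be tried out on small made-up datasets. The sketch below uses invented numbers, chosen only to illustrate each case, and a hand-written `pearson` helper. The last dataset is U-shaped, so its Pearson coefficient is 0 even though y clearly depends on x. This is a reminder that the coefficient measures only linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation using population (divide-by-N) statistics."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


x = [1, 2, 3, 4, 5]
rising_line = [2 * v + 1 for v in x]     # exactly on an increasing line
falling_line = [10 - 3 * v for v in x]   # exactly on a decreasing line
noisy_falling = [9, 8, 8, 5, 4]          # decreasing, but not on a line
u_shaped = [(v - 3) ** 2 for v in x]     # no linear trend at all

print(round(pearson(x, rising_line), 2))    # 1.0
print(round(pearson(x, falling_line), 2))   # -1.0
print(round(pearson(x, noisy_falling), 2))  # -0.95
print(round(pearson(x, u_shaped), 2))       # 0.0
```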
When you have qualitative data that is difficult to express as numbers, such as dedication, beauty, effort,
or honesty, you can use Spearman's rank correlation. It is useful whenever ranks are given or can be
computed, including for variables that cannot be quantified directly. Here Spearman's rank correlation is
represented by a capital R.
How to calculate Spearman's rank correlation
1. First, rank the observations for each of the two traits.
2. Use the same ranking order for both. If one is ranked in increasing order, the other must be ranked in
increasing order too; if one is ranked in decreasing order, so must the other.
3. Find the difference between rank 1 and rank 2 for each observation. It is denoted by D.
4. Square each difference to get D², then add them all up to get ΣD².
5. Find the count N, which is the number of paired observations.
6. Put the values of ΣD² and N into the formula.
Formula: R = 1 - (6ΣD²)/(N³-N)
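The six steps can be sketched as a short Python function. This is a minimal illustration, not a library implementation: the function names are made up, the scores are hypothetical, and the code assumes there are no tied scores, which is also what this form of the formula assumes.

```python
def rank_descending(scores):
    """Rank 1 goes to the highest score. Assumes no tied scores."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_rank_correlation(xs, ys):
    """R = 1 - 6 * sum(D^2) / (N^3 - N), valid when there are no ties."""
    ranks_x = rank_descending(xs)   # steps 1-2: rank both in the same order
    ranks_y = rank_descending(ys)
    # Steps 3-4: difference of ranks, squared, for every pair.
    d_squared = [(rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y)]
    n = len(xs)                     # step 5: number of paired observations
    return 1 - 6 * sum(d_squared) / (n ** 3 - n)   # step 6


# Hypothetical scores, invented for illustration only.
effort = [7, 9, 4, 6, 8]
honesty = [6, 9, 5, 7, 8]
print(spearman_rank_correlation(effort, honesty))  # 0.9
```

When scores can be tied, tied values are usually given averaged ranks and this shortcut formula no longer applies exactly, so a statistics library is the safer choice in that case.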
Let's See an example:
You have the results of a beauty contest with 7 contestants, each scored by two judges. Now you have to find
the correlation between the two judges' scores.
Contestant | Judge 1 | Judge 2 |
---|---|---|
1 | 36 | 39 |
2 | 27 | 25 |
3 | 42 | 41 |
4 | 39 | 43 |
5 | 18 | 22 |
6 | 35 | 29 |
7 | 29 | 32 |
Now convert each judge's scores into ranks.
Contestant | Judge 1 | Judge 2 | Judge 1 rank | Judge 2 rank |
---|---|---|---|---|
1 | 36 | 39 | 3 | 3 |
2 | 27 | 25 | 6 | 6 |
3 | 42 | 41 | 1 | 2 |
4 | 39 | 43 | 2 | 1 |
5 | 18 | 22 | 7 | 7 |
6 | 35 | 29 | 4 | 5 |
7 | 29 | 32 | 5 | 4 |
To assign ranks, the highest score gets rank 1, the second-highest gets rank 2, the third-highest gets rank 3,
and so on. Apply the same process to both judges.
Now use the ranks to do the calculation:
Contestant | Judge 1 rank | Judge 2 rank | D (R1-R2) | D² |
---|---|---|---|---|
1 | 3 | 3 | 0 | 0 |
2 | 6 | 6 | 0 | 0 |
3 | 1 | 2 | -1 | 1 |
4 | 2 | 1 | 1 | 1 |
5 | 7 | 7 | 0 | 0 |
6 | 4 | 5 | -1 | 1 |
7 | 5 | 4 | 1 | 1 |
Total | | | | ΣD²=4 |
Putting ΣD²=4 and N=7 into the formula: R = 1 - (6*4)/(7³-7) = 1 - 24/336 = 0.9286
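The arithmetic can be verified quickly in Python. The sketch below starts from the two rank columns in the table above and repeats the D, D², and R steps. The variable names are chosen here for illustration.

```python
# Ranks of contestants 1-7, copied from the table above.
judge1_rank = [3, 6, 1, 2, 7, 4, 5]
judge2_rank = [3, 6, 2, 1, 7, 5, 4]

# D = R1 - R2 for each contestant, then the sum of the squared differences.
differences = [r1 - r2 for r1, r2 in zip(judge1_rank, judge2_rank)]
sum_d_squared = sum(d * d for d in differences)

n = len(judge1_rank)  # number of contestants (paired observations)
r = 1 - 6 * sum_d_squared / (n ** 3 - n)

print(differences)     # [0, 0, -1, 1, 0, -1, 1]
print(sum_d_squared)   # 4
print(round(r, 4))     # 0.9286
```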
Now let's see what the R-value from the calculation means.
R=0.9286
The value is close to +1, which means there is a strong positive correlation between the marks given by the
two judges: they ranked the contestants in almost the same order.