Linear (Straight Line) Correlation

Often we are interested in measuring the strength and direction of a linear relationship between two variables x and y. If y tends to increase when x increases and to decrease when x decreases, we say the direction is positive. If y tends to decrease when x increases (and to increase when x decreases), we say the direction is negative. We measure the strength and direction of the relationship by calculating a number between -1 and +1 called the coefficient of correlation, r (that is, -1 ≤ r ≤ 1). The correlation coefficient r measures how close to a straight line the set of points (x, y) would fall if plotted. The closer r is to zero, the less the points fall along a straight line.



The closer the correlation coefficient is to +1, the more closely the points line up along a straight line sloping from lower left to upper right. The closer it is to -1, the more closely the points line up along a straight line sloping from upper left to lower right.



Calculating The Correlation Coefficient

1. Calculate the mean of x and the mean of y.



2. Calculate the standard deviation of x and the standard deviation of y.

3. Calculate the covariance between x and y.

4. Calculate the correlation coefficient.
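
The formulas for these four steps are not reproduced in the text. A standard form is sketched below, assuming the sample convention of dividing by n - 1, which matches the division by 4 in the covariance of the worked example that follows. The symbols s_x, s_y and s_xy are notation introduced here for the standard deviations and the covariance.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i \\
s_x &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2}, &
s_y &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i-\bar{y})^2} \\
s_{xy} &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}) \\
r &= \frac{s_{xy}}{s_x\, s_y}
\end{align*}
```

Dividing by n instead of n - 1 throughout gives the same value of r, because the factor cancels between the covariance and the two standard deviations.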



Example: Calculate the coefficient of correlation for the two variables (x, y):
(2,2),(3,3),(3,2),(3,1),(4,2)

1. mean of x = 3, mean of y = 2

2. std dev x = sqrt(2/4) = 0.707, std dev y = sqrt(2/4) = 0.707 (dividing by n - 1 = 4, as in the covariance)

3. covariance = [(2-3)(2-2)+(3-3)(3-2)+(3-3)(2-2)+(3-3)(1-2)+(4-3)(2-2)]/4 = 0

4. r = 0/[(0.707)(0.707)] = 0
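
The same four steps can be checked with a short program. The following is a minimal sketch in plain Python, assuming the n - 1 divisor used for the covariance in the example. The function name correlation is chosen here for illustration and does not come from any library.

```python
import math

def correlation(xs, ys):
    """Return the correlation coefficient r of two equal-length lists."""
    n = len(xs)
    # Step 1: means of x and y
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: standard deviations (assumption: divide by n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Step 3: covariance, with the same n - 1 divisor
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Step 4: correlation coefficient
    return cov / (sd_x * sd_y)

# The five points from the example
points = [(2, 2), (3, 3), (3, 2), (3, 1), (4, 2)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

r = correlation(xs, ys)
print(r)  # 0.0
```

The function fails with a division by zero if either variable is constant, since its standard deviation is then zero and r is undefined.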

The Question of Causation

A strong correlation between two variables does not necessarily mean that changes in one variable cause changes in the other. There is a strong positive correlation between the number of priests in Boston and the number of murders in Boston. Does this mean that priests are going around murdering people in Boston? Of course not. Presumably both numbers simply rise as the city's population grows. This is sometimes called a "nonsense correlation". There is, however, a high positive correlation between smoking and cancer. Does this mean that smoking causes cancer? The government and medical science think so. What do you think?

The picture below shows in outline form how a variety of underlying links between variables can explain association.



The first diagram shows causation: x causes y, indicated by an arrow running from x to y. The second diagram illustrates common response. The observed association between x and y is explained by a lurking variable z, and both x and y change in response to z. This creates an association between x and y even though there may be no direct causal link between them. The third diagram illustrates confounding. Both x and the lurking variable z may influence the response variable y. Because x and z are themselves associated, we cannot distinguish the influence of x from the influence of z, and so we cannot say how strong the direct effect of x on y is.
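
Common response can be illustrated with a small simulation. The sketch below is an illustration under invented assumptions, not data from any study: a lurking variable z is drawn at random, and x and y are each set to z plus independent noise, so neither one influences the other. The noise level of 0.3, the sample size and the random seed are arbitrary choices.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient r, using the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sd_x * sd_y)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# Lurking variable z; x and y each respond to z but not to each other
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.3) for zi in z]
y = [zi + random.gauss(0, 0.3) for zi in z]

r_xy = correlation(x, y)
print(round(r_xy, 2))  # strongly positive, although x does not cause y
```

The strong correlation between x and y here comes entirely from their shared dependence on z.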

Correlation Matrix

Often several variables are involved in a study and we would like to show all of their mutual correlations. A convenient way of summarizing a large number of correlation coefficients is to arrange them in a square array called a correlation matrix. The matrix below shows the mutual correlations of four variables, X1, X2, X3, and X4.


One can see from the table that each variable has a perfect positive correlation of 1 with itself (the diagonal running from upper left to lower right). The correlation coefficient between X1 and X3 is 0.9239, while the correlation coefficient between X4 and X3 is -0.9009.
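
A correlation matrix can be built by applying the correlation calculation to every pair of variables. The sketch below uses plain Python and three made-up variables. These values are invented for illustration and are not the X1 to X4 data behind the table above. The function names correlation and correlation_matrix are likewise chosen here.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, using the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sd_x * sd_y)

def correlation_matrix(variables):
    """Return a square list of lists holding r for every pair of variables."""
    return [[correlation(a, b) for b in variables] for a in variables]

# Hypothetical data: three variables observed on the same five cases
data = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 5, 4, 5],
    "C": [5, 4, 3, 2, 1],
}

names = list(data)
matrix = correlation_matrix([data[name] for name in names])

# Print the matrix with row and column labels
print("     " + "".join(f"{name:>9}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:>5}" + "".join(f"{value:9.4f}" for value in row))
```

The matrix is symmetric, since the correlation of A with B equals the correlation of B with A, so in practice only the entries on one side of the diagonal need to be read.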