Introduction to correlation

04/28/2020


Outline

  • The following topics will be covered in this lecture:
    • Scatter plots
    • Correlation
    • Linear correlation coefficient
    • Computing correlation

Motivation

Scatter plot of chocolate consumption versus number of Nobel laureates per country.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Intuitively, we think of correlation or anti-correlation as the behavior of two variables varying together or varying oppositely.
  • To the left, we see the number of Nobel Laureates per million persons in a country plotted in a scatter plot versus the number of kg of chocolate consumed per capita.
  • At a glance, we can see that the two variables tend to vary together, but not in an exact, deterministic way;
    • i.e., a \( 1 \) unit increase in chocolate consumption doesn’t automatically correspond identically to a \( 2.5 \) unit increase in the number of Nobel Laureates.
  • Also, there is no reason to believe that the increase in one causes an increase in the other;
    • i.e., eating more chocolate doesn’t produce more Nobel-prize-winning scientists.
  • Indeed, a more rational explanation is that these values tend to vary together because Nobel Prizes usually go to countries with highly developed academic, cultural and industrial infrastructure.
  • Likewise, inhabitants of these countries can better afford luxury goods like chocolate.
  • Correlation should never be mistaken for causation; rather, it indicates that two measurements tend to have a systematic association, which may be better explained by other latent variables.
  • We may see a similar association when plotting either of the two above variables versus a third variable that acts as an economic indicator.

Motivation continued

Table of chocolate consumption versus number of Nobel laureates per country.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Understanding the limitations of correlation for explanatory power, we can use correlation as a powerful research tool for its intended purpose.
  • Correlation is a statistic that we will compute from pairs of measurements in a single sample.
  • In the last example, we had a sample consisting of individual countries.
  • Each observation (country) corresponded to two distinct measurements:
    1. The number of kg of chocolate consumed per capita.
    2. The number of Nobel Laureates per million inhabitants.
  • This will always be a feature of computing correlation – we need observations which each have two measurements.
    • We will compute the correlation coefficient between these variables.
  • Usually, we will use a scatter plot as a first check for a systematic pattern, and then we will compute the statistic.
    • Being a statistic, the correlation coefficient is subject to sampling error;
    • we will also need to test for significance to quantify the uncertainty of the value in relation to the population parameter.

Motivation continued

Scatter plot of variables correlated with a value equal to 0.859.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Consider the following: as a quick example would you say that the pair of variables \( x \) and \( y \) to the left are:
    1. Positively correlated? I.e., do they vary together?
    2. Negatively or anti-correlated? I.e., do they vary oppositely?
    3. Or uncorrelated? I.e., there is no systematic pattern?
  • Why would you say this?
    • The plot above exhibits positive correlation.
    • This is because an increase of the value of \( x \) generally corresponds to an increase in the value of \( y \).

Motivation continued

Scatter plot of variables correlated with a value equal to -0.971.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Consider the following: as a quick example would you say that the pair of variables \( x \) and \( y \) to the left are:
    1. Positively correlated? I.e., do they vary together?
    2. Negatively or anti-correlated? I.e., do they vary oppositely?
    3. Or uncorrelated? I.e., there is no systematic pattern?
  • Why would you say this?
    • The plot above exhibits negative or anti-correlation.
    • This is because an increase of the value of \( x \) generally corresponds to a decrease in the value of \( y \).

Motivation continued

Scatter plot of variables correlated with a value equal to 0.074.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Consider the following: as a quick example would you say that the pair of variables \( x \) and \( y \) to the left are:
    1. Positively correlated? I.e., do they vary together?
    2. Negatively or anti-correlated? I.e., do they vary oppositely?
    3. Or uncorrelated? I.e., there is no systematic pattern?
  • Why would you say this?
    • The plot above exhibits uncorrelated variables.
    • This is because an increase of the value of \( x \) generally doesn’t correspond to any pattern in the value of \( y \).

Motivation continued

Scatter plot of variables in a nonlinear relationship.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Consider the following: as a quick example would you say that the pair of variables \( x \) and \( y \) to the left are:
    1. Positively correlated? I.e., do they vary together?
    2. Negatively or anti-correlated? I.e., do they vary oppositely?
    3. Or uncorrelated? I.e., there is no systematic pattern?
  • Why would you say this?
    • The plot above exhibits slight positive correlation, but correlation isn’t a good measure of the relationship.
    • In some areas of \( x \), there is a positive trend in \( y \) when increasing \( x \), and in other areas there is a negative trend in \( y \) when increasing \( x \).
    • This is another key concept in that correlation describes linear relationships;
      • in the previous examples of positive and negative correlation, we could draw a straight line to show the approximate relationship between \( x \) and \( y \).
    • Here a straight line isn’t a very good fit between the variables, and correlation is not a useful description of the relationship.

Linear correlation coefficient

  • Formally, we will discuss the linear correlation coefficient as follows:
    • Let’s suppose that we have \( n \) observations which have pairs of measurements \( (x,y) \) collected by simple random sampling.
    • Let’s suppose that there are not any extreme or outlier observations among the \( (x,y) \) measurements.
    • Let’s suppose that upon inspection with a scatter plot, we do not see a nonlinear pattern between the variables like in the last example.
    • Then, if \( z_{x_i} \) is the z score of the measurement \( x_i \) and \( z_{y_i} \) is the z score of the measurement \( y_i \),
    • the linear correlation coefficient \( r \) between the variables \( x \) and \( y \) is given as \[ \begin{align} r = \frac{\sum_{i=1}^n \left(z_{x_i} \times z_{y_i}\right)}{n-1} \end{align} \]
    • Actually, we can always compute the above value \( r \) for any set of paired measurement data.
      • However, the above conditions are necessary to make the correlation coefficient meaningful in making any conclusions about the existence of a systematic, linear pattern between \( x \) and \( y \).
    • Using some algebra, it can be shown that the linear correlation coefficient can be computed as \[ \begin{align} r &= \frac{n\times \sum_{i=1}^n \left(x_i\times y_i\right) - \left(\sum_{i=1}^n x_i\right) \times \left(\sum_{i=1}^n y_i\right)}{\sqrt{n\times \left(\sum_{i=1}^n x_i^2\right) - \left(\sum_{i=1}^n x_i\right)^2}\times \sqrt{n\times \left(\sum_{i=1}^n y_i^2\right) - \left(\sum_{i=1}^n y_i\right)^2}} = \frac{s_{xy}}{s_x \times s_y}, \end{align} \] where \( s_{xy} \) is the sample covariance of the \( x \) and \( y \) measurements.
    • There are important reasons why we would want to consider the alternative forms above, but these go beyond the scope of the course;
      • we will only focus on understanding the basic meaning and use of the coefficient, and computing it in StatCrunch; a supplementary numerical check follows this list.
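
As a supplementary numerical check (the course itself computes \( r \) in StatCrunch), here is a minimal Python sketch that evaluates both forms of the formula above on hypothetical made-up data and compares them against NumPy’s built-in correlation function.

```python
import numpy as np

def r_from_z_scores(x, y):
    """Linear correlation via the z score form: r = sum(z_x * z_y) / (n - 1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    z_x = (x - x.mean()) / x.std(ddof=1)  # sample z scores (n - 1 in the denominator)
    z_y = (y - y.mean()) / y.std(ddof=1)
    return np.sum(z_x * z_y) / (n - 1)

def r_shortcut(x, y):
    """Equivalent computational form using only raw sums."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = (np.sqrt(n * np.sum(x**2) - np.sum(x)**2)
           * np.sqrt(n * np.sum(y**2) - np.sum(y)**2))
    return num / den

# Hypothetical paired measurements, made up purely for illustration
x = [5.5, 5.9, 6.8, 8.5, 9.0]
y = [3.6, 11.4, 25.3, 26.3, 31.9]

print(r_from_z_scores(x, y))    # both forms agree...
print(r_shortcut(x, y))
print(np.corrcoef(x, y)[0, 1])  # ...and match NumPy’s built-in value
```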

Properties of the linear correlation coefficient

  • Because we will focus on how to use the linear correlation coefficient with statistical software, we will focus on the form \[ \begin{align} r = \frac{\sum_{i=1}^n \left(z_{x_i} \times z_{y_i}\right)}{n-1}. \end{align} \]
  • We should go over some of the fundamental properties of the linear correlation coefficient here:
    1. The values of \( r \) are bounded as, \( -1 \leq r \leq 1, \) where \( r \) close-to \( 1 \) loosely means strong positive correlation and \( r \) close-to \( -1 \) loosely means strong negative correlation.
      • However, the interpretation of the linear correlation coefficient will depend on the statistical significance.
    2. When we compute \( r \) on pairs of measurements, we have rescaled the values to z scores as above – therefore, the linear correlation coefficient won’t change if we change the scale of the original measurements.
      • E.g., if we take measurements in \( x \) inches and \( y \) pounds, but convert to \( \tilde{x} \) millimeters and \( \tilde{y} \) kilograms, the value of \( r \) will remain the same with either choice of units.
    3. Because the formula above is symmetric in the z scores for \( x \) and \( y \), we can switch the order in which we enter \( x \) and \( y \) and the value won’t change.
    4. The linear correlation coefficient isn’t good for interpreting nonlinear relationships, and thus we have to check for patterns that are not well described by straight lines in the data.
    5. The linear correlation coefficient is very sensitive to outliers, and a single outlier measurement could radically change the value.
      • Therefore, we should perform visual inspection for outliers to determine if there are such points, and whether they are erroneous or truthful measurements; several of these properties are illustrated in the sketch below.
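
The following minimal Python sketch, using synthetic data invented for illustration, demonstrates properties 2, 3, and 5: invariance of \( r \) under a change of units, symmetry in the two variables, and sensitivity to a single outlier.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(70, 3, size=20)           # hypothetical heights in inches
y = 2.2 * x + rng.normal(0, 4, size=20)  # hypothetical weights in pounds

r = np.corrcoef(x, y)[0, 1]

# Property 2: rescaling the units (inches -> millimeters, pounds -> kilograms)
# leaves r unchanged.
r_rescaled = np.corrcoef(x * 25.4, y * 0.4536)[0, 1]

# Property 3: swapping the order of x and y leaves r unchanged.
r_swapped = np.corrcoef(y, x)[0, 1]

print(np.isclose(r, r_rescaled), np.isclose(r, r_swapped))  # True True

# Property 5: a single outlier can radically change the value of r.
x_out = np.append(x, 100.0)
y_out = np.append(y, 0.0)
print(r, np.corrcoef(x_out, y_out)[0, 1])
```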

Understanding the linear correlation coefficient formulation

Scatter plot of chocolate consumption versus number of Nobel laureates per country.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Recall the form of the linear correlation coefficient, \[ \begin{align} r = \frac{\sum_{i=1}^n \left(z_{x_i} \times z_{y_i}\right)}{n-1}. \end{align} \]
  • Using the above form, we can easily interpret why we have:
    1. a value of \( r=1 \) associated with a positive relationship between the variables \( x \) and \( y \); and
    2. a value of \( r=-1 \) associated with a negative relationship between the variables \( x \) and \( y \).
  • Suppose we plot the z score for the observations as on the left, separated out by quadrant.
  • If the linear correlation coefficient is positive, this corresponds to many positive terms in the sum above.
    • Specifically, the sign of pairs of \( z_{x_i} \) and \( z_{y_i} \) need to match for many observations – this corresponds to quadrant 1 and quadrant 3.
  • Similarly, if the linear correlation coefficient is negative, this corresponds to many negative terms in the sum above.
    • Specifically, the signs of the pairs \( z_{x_i} \) and \( z_{y_i} \) need to be opposite for many observations – this corresponds to quadrant 2 and quadrant 4 (the sketch below illustrates this sign-counting argument).
  • A line that passes through quadrant 1 and 3 has a positive slope, while a line that passes through quadrants 2 and 4 has a negative slope.
  • However, the strength of this relationship will be judged in terms of the statistical significance.
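
As a numerical illustration of the quadrant argument, the following sketch (with synthetic, positively associated data) counts how many z score products are positive (quadrants 1 and 3) versus negative (quadrants 2 and 4) before summing them into \( r \).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)  # positively associated by construction

z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
products = z_x * z_y

# Quadrants 1 and 3: the signs of z_x and z_y match, so the product is positive.
# Quadrants 2 and 4: the signs differ, so the product is negative.
print("positive products (quadrants 1 and 3):", np.sum(products > 0))
print("negative products (quadrants 2 and 4):", np.sum(products < 0))
print("r =", np.sum(products) / (len(x) - 1))
```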

Critical values for correlation

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In order to say how close to \( \pm 1 \) is close enough for \( r \) to indicate correlation, we will generally use p values, or sometimes critical values.
  • On the left are measurements for the shoe print length and height of five individuals.
  • The linear correlation coefficient in this case \( r\approx0.591 \) isn’t close to \( \pm 1 \) or \( 0 \).
  • Notice in the figure, we also list the critical values for \( r \).
  • If the linear correlation coefficient \( r \) is at least as extreme as the critical values, we can conclude that there is statistical significance.
  • The critical values above are \( \approx \pm 0.878 \), but \( r\approx 0.591 \) is less extreme than these critical values.
  • Therefore we fail to reject the null hypothesis.
  • Consider the following: is this a one-sided or two-sided test of significance? What is the null hypothesis in this case? What is the alternative hypothesis?
    • The critical values \( \pm 0.878 \) measure extremeness in distance away from the center, corresponding to a two-sided test of significance.
    • For a two-sided test of significance, if \( \rho \) is the population parameter, the null and alternative hypotheses take the form \[ \begin{align} H_0 : \rho= 0 & & H_1: \rho\neq 0; \end{align} \] a sketch of how the critical values and p values can be computed follows below.
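
For background on where these critical values come from, the sketch below uses the standard relationship between \( r \) and the t distribution with \( n-2 \) degrees of freedom; in this course we read the values from a table or StatCrunch output, so this is only a supplementary illustration.

```python
import numpy as np
from scipy import stats

def r_critical(n, alpha=0.05):
    """Two-sided critical value for r at significance level alpha,
    via the t distribution with n - 2 degrees of freedom."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(t_crit**2 + n - 2)

def r_p_value(r, n):
    """Two-sided p value for H0: rho = 0, using t = r * sqrt((n-2)/(1-r^2))."""
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(r_critical(5))        # ~0.878, matching the value on this slide
print(r_critical(7))        # ~0.754
print(r_critical(23))       # ~0.413
print(r_p_value(0.591, 5))  # ~0.29, so we fail to reject H0 at the 5% level
```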

Critical values for correlation continued

List of critical values for correlation.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • We can consider graphically “how extreme” the linear correlation coefficient is compared to the critical value.
  • For each number of pairs of measurements, there is an associated critical value that will determine the significance of the correlation.
  • We measured five individuals, giving five pairs of measurements and the corresponding critical value.
Critical value window diagram.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • This critical value corresponds to the inner window of “no correlation” in the diagram on the right-hand-side.
  • For any linear correlation coefficient (computed on 5 pairs of measurements) that isn’t at least as extreme as \( \pm 0.878 \),
    • we fail to reject the null hypothesis that the variables are uncorrelated.
  • If the linear correlation coefficient (computed on 5 pairs of measurements) lies in either \( [-1,-0.878] \) or \( [0.878,1] \),
    • we reject the null hypothesis at the \( 5\% \) significance level, and we say that the variables are correlated.

Critical values for correlation continued

List of critical values for correlation.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Car weight and fuel consumption data table.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Above, we have paired measurements of seven different cars' weights and their highway fuel consumption in miles per gallon (MPG).
  • Suppose we use software to compute that the linear correlation coefficient is \( r \approx -0.987 \).
  • Consider the following: using the table to the left of the critical values, can you determine if we would call the car weight and highway miles per gallon fuel consumption (anti)-correlated? If so, what does this relationship signify?
  • We note that there are 7 pairs of measurements, so the corresponding critical value is \( 0.754 \).
    • The linear correlation coefficient \( -0.987 \) is more extreme than \( -0.754 \), so we say the variables are correlated.
  • We recall, the negative sign for the correlation coefficient means that the variables of weight and highway MPG vary oppositely;
    • i.e., as the weight goes up, the highway MPG goes down.

Critical values for correlation continued

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Note: the linear correlation coefficient depends on the sample data.
    • If we take measurements of the height and shoe print of five new people, we will quite likely get a different linear correlation coefficient.
  • Also, the critical values and p values depend on the number of observations in the sample.
  • In the plot to the left, there are 40 total subjects for whom we have pairs of measurements.
  • Consider the following: does the relationship between height and shoe print length show more evidence of correlation now? Do you think the linear correlation coefficient will be close to \( 1 \), \( -1 \) or to \( 0 \)?
  • In this case, the linear correlation coefficient is \( \approx 0.813 \), suggesting that the variables are positively correlated.
  • Note: \( r \) is not as extreme as the critical value of \( 0.878 \) from before – however, this critical value was for \( 5 \) samples only.
  • The critical value for \( 40 \) samples is approximately \( 0.312 \), so a coefficient of \( 0.813 \) is much more extreme.
  • More typically, we will use the p value for the linear correlation coefficient directly, as in the following example.

Computing the linear correlation coefficient manually example

Computation of the linear correlation coefficient.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • We will return to only \( n=5 \) cases of the chocolate/Nobel Laureate data.
  • On the left, we see the exact values of the paired measurements from \( 5 \) countries in the original units – laureates per million and kg of chocolate per capita.
    • In the middle two columns, we have the corresponding z score for each measurement.
  • In the final column, there are the values of the products of the corresponding z scores for each pair of measurements from the same country.
  • The sum of these values is at the bottom of the right column, which divided by \( n-1=4 \) gives the linear correlation coefficient \( r\approx 0.795 \).
  • We can recall from the last example that the critical value for \( n=5 \) is \( 0.878 \), so we fail to reject the null hypothesis that the variables are uncorrelated.
  • We can verify this computation by entering the pairs of values manually and computing the correlation coefficient in StatCrunch.
  • We will also compute the p value of this linear correlation coefficient to verify the hypothesis test; a Python sketch of the same check follows below.
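
For readers who want to replicate this check outside of StatCrunch, the sketch below applies SciPy’s pearsonr to made-up stand-in values (the actual measurements are in the table pictured above; run on those, it would reproduce \( r\approx 0.795 \) and the corresponding p value).

```python
from scipy import stats

# Hypothetical stand-ins for the five countries' paired measurements:
# chocolate consumption (kg per capita) and Nobel Laureates (per million).
chocolate = [4.5, 6.3, 8.5, 10.2, 11.9]
laureates = [5.5, 9.0, 24.3, 25.3, 31.9]

r, p = stats.pearsonr(chocolate, laureates)
print(f"r = {r:.3f}, p = {p:.3f}")

# Decision at the 5% level: reject H0 (rho = 0) only if p < 0.05;
# equivalently, only if |r| exceeds the n = 5 critical value of 0.878.
```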

Computing the linear correlation coefficient from a data set

Table of chocolate consumption versus number of Nobel laureates per country.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • We saw that the threshold for calling a linear correlation coefficient \( r \) significant depends on the number of observations \( n \).
  • With only a few (\( n=5 \)) observations, it takes a much larger linear correlation coefficient to reject the null hypothesis with significance than with many observations (\( n=23 \)).
    • It is much easier for a small number of observations to appear to have a linear relationship just by random chance than for a large number of observations.
  • We will now consider the full data set and compute both the linear correlation coefficient and the p value for this statistic in StatCrunch.
  • Notice, in this case the linear correlation coefficient \( r \) was closer to one, but the p value was extremely small.
  • This corresponds with the fact that the linear relationship has held over a much larger number of pairs of measurements.
  • Graphically, using a two-sided test, the critical values for \( n=23 \) observations are \( \pm 0.413 \), so the window of no correlation corresponds to \( [-0.413,0.413] \); the sketch below shows how the p value shrinks with \( n \) for the same coefficient.
Critical values for 23 observations.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition
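
To see concretely how significance depends on the number of observations, the following sketch computes the two-sided p value for the same coefficient at \( n=5 \) and \( n=23 \), using the t statistic relationship shown earlier.

```python
import numpy as np
from scipy import stats

def r_p_value(r, n):
    """Two-sided p value for H0: rho = 0 (same formula as the earlier sketch)."""
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same r of 0.795 is not significant with 5 pairs (p ~ 0.11),
# but is overwhelmingly significant with 23 pairs (p near zero).
for n in (5, 23):
    print(n, r_p_value(0.795, n))
```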

Computing the linear correlation coefficient for the body data

  • We will now go through one example of computing the linear correlation coefficient with the Body Data in StatCrunch.
  • This kind of example should be representative of a quiz or exam question, and you are encouraged to study this type of example on your own for understanding and practice.
  • We will show this as follows.