Introduction to regression

04/30/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward, respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:
    • Review of Correlation
    • Linear trends in data
    • Regression
    • Computing and graphing regression lines
    • Explained variation
    • Making predictions
    • Residuals

Review of correlation

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Recall that in the last lecture, we studied the linear correlation coefficient as a way to quantify a systematic, linear association between pairs of measurements in a sample of data.
  • In the case where:
    1. the observations were collected by simple random sampling;
    2. upon inspection with a scatter plot, there was not a strong nonlinear pattern; and
    3. there were no outliers in either of the \( x \) or \( y \) measurements;
  • We used the linear correlation coefficient, \[ \begin{align} r = \frac{\sum_{i=1}^n \left(z_{x_i} \times z_{y_i}\right)}{n-1} \end{align} \] to quantify the strength of this relationship.
  • Values of \( r \) close to \( 1 \) could loosely be interpreted as exhibiting a positive relationship between the variables \( x \) and \( y \).
  • Values of \( r \) close to \( -1 \) could loosely be interpreted as exhibiting a negative relationship between the variables \( x \) and \( y \).
  • However, \( r \) is a sample statistic, so we must quantify its uncertainty due to sampling error;
  • that is, for the population-level correlation coefficient parameter \( \rho \), we will usually test the hypothesis, \[ \begin{align} H_0: \rho =0 & & H_1: \rho \neq 0. \end{align} \]
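
  • As an illustration, a minimal Python sketch of the computation of \( r \) appears below; the paired measurements here are made-up placeholders, not the textbook data.

```python
import numpy as np

# Hypothetical paired measurements, for illustration only; substitute
# real data such as shoe print lengths (x) and heights (y).
x = np.array([27.6, 28.1, 29.7, 31.4, 31.8])
y = np.array([172.7, 174.0, 175.3, 177.8, 185.4])

n = len(x)
# Convert each measurement to a z-score, using the sample standard
# deviation (ddof=1 gives the n - 1 denominator).
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)

# Linear correlation coefficient: sum of z-score products over n - 1.
r = np.sum(z_x * z_y) / (n - 1)
print(r)  # agrees with np.corrcoef(x, y)[0, 1]
```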

Review of correlation continued

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Recall that critical values and p-values depend on the number of observations in the sample.
  • A small number of measurements can easily exhibit a linear pattern when there is no correlation between the variables (\( \rho=0 \)) just by random sampling error;
  • however, this is more difficult for a large number of measurements.
  • By the usual convention, if the p-value for the hypothesis test is less than \( \alpha =0.05 \), we will reject the null hypothesis with \( 5\% \) significance and claim that there exists some systematic linear relationship between the variables.
  • Note: this systematic relationship does not mean one causes the other.
  • Correlation or anti-correlation simply means that the values tend to vary together or oppositely;
    • often, additional latent variables that are not measured can provide a better explanation of cause-and-effect.
  • However, if we understand the purpose and limitations of correlation, this can be a powerful research tool to describe when certain measurements tend to vary together or oppositely as described above.
  • A natural extension of this idea is that, if there is a linear trend, we can model this trend with a line.

Regression

Scatter plot for height versus shoe print length with regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Noting that there was correlation in height and shoe print length, we could say,
    “On average, an increase in someone’s height usually goes along with an increase in their shoe print length (and vice versa).”
  • A regression line (or best fit line) quantifies what this trend looks like.
  • Recall the equation for a line, \[ y = {\color{red} a} +{\color{blue} b}x \]
    • The coefficient \( a \) is the intercept.
      • When the quantity \( x \) is zero, then \( y = {\color{red} a} \).
    • The coefficient \( b \) is the slope.
      • An increase of 1 unit of variable \( x \) corresponds to \( {\color{blue} b} \) units of increase in \( y \).
  • In statistics, we deal with random quantities that are subject to non-deterministic variation.
  • In this case, a best fit line doesn’t mean that we have a direct input-output relationship as in the above equation.
  • Rather, this should be viewed as, if we were to average over the variation we see in the observations, the mean increase in \( y \) with a one unit increase in \( x \) would be \( b \).
  • This interpretation can then be used to predict the mean value of, e.g., the height, given some value of shoe print length.

Regression continued

Scatter plot for height versus shoe print length with regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In regression, we write the equation for a line in a special form: \[ \hat{y} = {\color{red} {b_0} } + {\color{blue} {b_1} } x, \] where we re-name the variables as:
    • \( y \) – this is called the response;
    • \( x \) – this is called the predictor;
    • \( {\color{red} a} \) – this is re-named \( {\color{red} {b_0} } \); and
    • \( {\color{blue} b} \) – this is re-named \( {\color{blue} {b_1} } \).
  • In our example, we could use software to compute
    • \( {\color{red} {b_0} \approx 80.9} \); and
    • \( {\color{blue} {b_1} \approx 3.22} \).
  • The regression equation would then be read, \[ \hat{y}_\text{(Height)} = {\color{red} {80.9} } + {\color{blue} {3.22} } x_\text{(Shoe print length)}. \]
  • We should remember that there are limitations to this equation, and we cannot take the relationship this describes literally if we use values that go far beyond the scope of our observations.
  • For example, we have no reason to believe that the trend will be accurate for a shoe print length of 100 cm, with an associated mean height of 402.9 cm.
  • This lies far beyond our observations, and the model will generally fail whenever we stretch this interpretation so far.
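
  • As a sketch of the “use software to compute” step above, the snippet below obtains the coefficients with scipy.stats.linregress; the arrays are hypothetical stand-ins, not the textbook data that yields \( b_0 \approx 80.9 \) and \( b_1 \approx 3.22 \).

```python
import numpy as np
from scipy import stats

# Hypothetical shoe print lengths (cm) and heights (cm).
x = np.array([27.6, 28.1, 29.7, 31.4, 31.8])
y = np.array([172.7, 174.0, 175.3, 177.8, 185.4])

fit = stats.linregress(x, y)
b0, b1 = fit.intercept, fit.slope
print(f"y-hat = {b0:.1f} + {b1:.2f} x")

# Predicted mean height for a 29 cm shoe print -- a value within
# the scope of the observed x values, as the slide cautions.
print(b0 + b1 * 29.0)
```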

Requirements for regression

Scatter plot for height versus shoe print length with regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Semi-formally, we will introduce regression as follows:
    • Suppose that we have \( n \) observations in a simple random sample with paired measurements \( (x,y) \).
    • Suppose that when plotting \( y \) versus \( x \) in a scatter plot, we see mostly a straight line pattern, and no signs of a nonlinear pattern.
      • For any fixed value \( x_i \), all associated values of \( y_{i,j} \) plotted above \( x_i \) should be roughly normally distributed, and the mean of the \( y_{i,j} \) should lie roughly on the straight line pattern.
    • Suppose that for each fixed value \( x_i \), the standard deviation of all \( y_{i,j} \) lying above \( x_i \) should be approximately the same as for each other \( y_{k,j} \) lying above \( x_k \) where \( k\neq i \).
    • Additionally, suppose there aren’t any extreme outliers in the observations.
  • Under the loose conditions described above, we can effectively estimate a “best-fit” regression line \[ \hat{y} = b_0 + b_1 x; \]
    • up to some uncertainty, due to sampling error, this regression line estimates the mean value of the \( y_{i,j} \) distributed above \( x_i \) for any given value \( x_i \) in the scope of the model.
  • Actually, because \( b_0 \) and \( b_1 \) are computed from samples, these are statistics estimating unknown population parameters which describe the above relationship.

The regression equation

Table of regression parameter versus statistic values.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • For the regression line, we are assuming that there exist actual population parameters that describe the mean of the \( y \) values above \( x \) values, as \[ y = {\color{#1b9e77} {\beta_0} }+ {\color{#1b9e77} {\beta_1} } x. \]
  • The estimates \( b_0 \) and \( b_1 \) for the intercept and slope of this line are computed as \[ \begin{align} b_1 = r\times \frac{s_y}{s_x} & & b_0 = \overline{y} - b_1 \overline{x} \end{align} \]
  • where:
    1. \( r \) is the linear correlation coefficient which we discussed last lecture;
    2. \( s_x \) is the sample standard deviation of the \( x \) values and \( s_y \) is the sample standard deviation of the \( y \) values;
    3. \( \overline{x} \) is the sample mean of the \( x \) values and \( \overline{y} \) is the sample mean of the \( y \) values.
  • Once again, this gives sample-based estimates for population parameters, so we will want to quantify our uncertainty about these values using our methods of confidence intervals and hypothesis tests.
  • We will also want to make sure the assumptions we discussed loosely on the last slide are accurate (for the most part) – methods for checking these assumptions are called diagnostics.
    • We note that, with a large number of observations, the assumptions of normal distributions and equal standard deviations don’t have to hold exactly, and we can still obtain reasonable results.
  • However, we should always be conscious of the effects of outliers, because they can dramatically change the way the regression line behaves.
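
  • A minimal sketch of the two estimate formulas above, assuming the paired measurements are available as numpy arrays (the data values are hypothetical):

```python
import numpy as np

def regression_coefficients(x, y):
    """Estimate the intercept and slope via the formulas
    b1 = r * s_y / s_x and b0 = ybar - b1 * xbar."""
    r = np.corrcoef(x, y)[0, 1]        # linear correlation coefficient
    s_x = np.std(x, ddof=1)            # sample standard deviations
    s_y = np.std(y, ddof=1)
    b1 = r * s_y / s_x                 # slope estimate
    b0 = np.mean(y) - b1 * np.mean(x)  # intercept estimate
    return b0, b1

# Hypothetical paired data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(regression_coefficients(x, y))
```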

Computing the regression equation

Table of regression parameter versus statistic values.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Performing regression without technology is very impractical, and we will not emphasize this whatsoever.
  • However, just for sake of viewing the estimated regression equation laid out formally for our running example, we will show this here.
  • For the example of Nobel Laureates versus chocolate consumption, recall that in the last lecture we found that:
    1. there were no extreme outliers and no nonlinear patterns in the data;
    2. and the correlation coefficient was given as \( r\approx 0.801 \), rejecting the null hypothesis \( \rho=0 \) with \( 5\% \) significance.
  • Informally, this makes it reasonable to consider a regression plot as pictured above.
  • The sample standard deviations and sample means of the \( x \) and \( y \) values are given as, \[ \begin{align} \overline{x} \approx 5.8043 & & s_x \approx 3.2792 \\ \overline{y} \approx 11.1043 & & s_y \approx 10.2116 \end{align} \]
  • Together this gives our estimated regression coefficients as \[ \begin{align} b_1 = 0.801 \times \frac{10.2116}{3.2792} \approx 2.4931 \\ b_0 = 11.1043 - 5.8043 \times 2.4931 \approx -3.3667 \end{align} \]
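
  • This arithmetic can be reproduced directly from the reported summary statistics, as sketched below; small differences in the later decimal places come from using the rounded value \( r \approx 0.801 \).

```python
# Reproducing the slide's arithmetic from the reported summary
# statistics for the Nobel Laureates / chocolate consumption data.
r = 0.801
x_bar, s_x = 5.8043, 3.2792
y_bar, s_y = 11.1043, 10.2116

b1 = r * s_y / s_x       # ~ 2.49
b0 = y_bar - b1 * x_bar  # ~ -3.37
print(round(b1, 2), round(b0, 2))
```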

Computing the regression equation continued

Table of regression parameter versus statistic values.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • From the calculations on the last slide, we have \[ \begin{align} b_1 \approx 2.49 \\ b_0 \approx -3.37 \end{align} \]
  • Therefore, our “best-fit” regression equation is given as, \[ \hat{y}_\text{Nobel per million} = -3.37 + 2.49 \times x_\text{kg of chocolate per capita}. \]
  • Notice that if \( x=0 \), the regression line would say the mean number of Nobel Laureates would be negative;
    • however, this lies outside of the scope of the model as the smallest \( x \) value is \( 0.7 \), and we should be careful in the interpretation of the smallest and largest values in \( x \).
  • Crucially, we have not yet performed any uncertainty quantification of the model.
  • The correlation in the variables with \( 5\% \) significance is a good indicator for the hypothesis test \[ \begin{align} H_0: \beta_1 = 0 & & H_1 : \beta_1 \neq 0 \end{align} \] because \( b_1 = r \times \frac{s_y}{s_x} \).
  • This hypothesis test is whether we believe there is a non-zero slope to the line, i.e., a non-trivial linear relationship between \( x \) and \( y \).
  • If we fail to reject the null, we do not have a strong indication that a change in \( x \) corresponds to any particular change in \( y \).
  • We will perform this test directly in StatCrunch; a small software sketch follows below.
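
  • As an illustration only (the course uses StatCrunch), scipy.stats.linregress reports the two-sided p-value for exactly this null hypothesis; the data below are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical chocolate consumption (x) and Laureates (y) values.
x = np.array([0.7, 2.0, 4.5, 5.5, 6.5, 9.0, 11.5])
y = np.array([1.5, 3.0, 11.0, 10.0, 25.0, 26.0, 32.0])

fit = stats.linregress(x, y)
# fit.pvalue tests H0: beta_1 = 0 against H1: beta_1 != 0.
if fit.pvalue < 0.05:
    print(f"Reject H0: b1 = {fit.slope:.2f}, p = {fit.pvalue:.4f}")
else:
    print("Fail to reject H0: no strong evidence of a linear trend.")
```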

Explained variation

Nobel Prize Laureate versus chocolate consumption regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • There is another important connection between the correlation coefficient \( r \) and the regression line which we will discuss as follows.
  • The square of the correlation coefficient, \( r^2 \), is known as the “percent of explained variation” in the value \( y \) by the regression line in terms of \( x \).
  • The percent of explained variation measures how much of the pattern in \( y \) is described by the regression line with a non-zero slope;
    • alternatively, if there was a zero slope, i.e., if the null hypothesis \[ H_0: \beta_1 =0, \] is true, the variation in \( y \) would be better described simply by the best fit horizontal line.
  • A horizontal line in the plot of \( y \) versus \( x \) is written as \( \hat{y}=b_0 \) for some best choice of \( b_0 \).
  • But we saw that \[ b_0 = \overline{y} - b_1 \overline{x} \] so that if \( b_1 \) should actually be equal to \( 0 \) like the true parameter, then we should write \[ \begin{align} &b_0 = \overline{y} & &\Rightarrow & & \hat{y} = \overline{y} \end{align} \]
  • The values of \( r^2 \) range from \( 0 \) to \( 1 \), with \( r^2 \) close to \( 1 \) telling us that the regression line gives a better fit than the horizontal line \( \hat{y}=\overline{y} \).
  • In this case, \( r=0.801 \) so that \( r^2 \approx 0.642 \); this loosely says that \( 64.2\% \) of the trend in \( y \) is explained by the regression line;
  • this is a strong improvement over the horizontal line \( \hat{y} = 11.1043 \), which we can see just by visual inspection.
  • In the case that we have a well-fitting line, we will next discuss how we can use this to make predictions.
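
  • A small sketch (with hypothetical data) showing that the proportion of variation explained, one minus the ratio of the regression line’s squared residuals to the horizontal line’s squared residuals, agrees with \( r^2 \):

```python
import numpy as np

# Hypothetical paired data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.5, 4.1, 4.4, 6.2, 6.0])

r = np.corrcoef(x, y)[0, 1]
b1 = r * np.std(y, ddof=1) / np.std(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

ss_line = np.sum((y - (b0 + b1 * x)) ** 2)  # around the fitted line
ss_mean = np.sum((y - y.mean()) ** 2)       # around y-hat = y-bar

# The explained proportion equals r squared (up to rounding).
print(1 - ss_line / ss_mean, r ** 2)
```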

Predictions

Predicting values of y flow chart.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • We saw how the regression line can give us a numerical value for the expected value of the response variable \( y \) given some value \( x \).
  • For a given choice of \( x_i \) this is given precisely by, \[ \hat{y}_i = b_0 + b_1 x_i, \] but \( x_i \) does not have to be a value we have already observed.
    • This is what is meant by prediction with the regression function – we can provide an expected value for a new case of \( x \) that has not been observed yet.
  • However, there are a number of considerations and limitations of these predictions which we will outline as follows.
  • The first consideration should be if the regression equation is appropriate at all:
    1. Visually, we should see that the data follows the regression line well in a scatter plot, and that it is not strongly affected by outliers.
    2. The value of \( r \) should reject the null hypothesis of \( H_0:\rho=0 \) with significance.
      • If this is not the case, we cannot reject the possibility that the slope of the regression line equals zero, since \( b_1 = r \times \frac{s_y}{s_x} \);
      • if we can’t reject \( H_0:\rho=0 \), we should just use \( \hat{y}= \overline{y} \) instead.
    3. The value of \( x \) shouldn’t go beyond the scope of the observations; otherwise, the predictions will become unreliable.

Predictions continued

Predicting values of y flow chart.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In the flow chart to the left, we can see the three conditions from before – suppose that these conditions are satisfied.
  • If these are satisfied, we can obtain the prediction directly as discussed before;
    • let us suppose that \( x_i \) is some value of \( x \) for which we want to predict the expected value of \( y \).
  • Then the predicted value is given by substituting this value into the regression equation, \[ \hat{y}_i = b_0 + b_1 x_i, \] rounded to the appropriate decimal place for the problem and scale of interest.
  • However, if any of the above conditions fail, we should instead use the mean of the \( y \) values directly as the prediction, i.e., \( \hat{y}=\overline{y} \) as discussed earlier.
  • In the quiz and the final exam, you should be able to examine a real data set for the above conditions, and find the appropriate prediction based on these conditions.
  • Particularly, you should be able to identify which of the above conditions has failed, if any, and then compute the appropriate value for \( \hat{y} \).
  • We will look at two such examples in the body data in StatCrunch;
  • however, even though we will only look at two examples, you should be able to check all of the conditions above in the different data sets included with the book online.
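
  • The flow chart logic can be summarized in a small sketch like the one below; the condition flags and the scope bound x_max are hypothetical inputs that you would determine from the data, not textbook values.

```python
def predict(x_new, b0, b1, y_bar, r_significant, good_visual_fit,
            x_min, x_max):
    """Follow the flow chart: use the regression line only when all
    conditions hold; otherwise fall back to the sample mean y-bar."""
    in_scope = x_min <= x_new <= x_max
    if good_visual_fit and r_significant and in_scope:
        return b0 + b1 * x_new  # regression prediction
    return y_bar                # prediction without a usable line

# Numbers from the Nobel/chocolate example; x_max is hypothetical.
print(predict(4.0, b0=-3.37, b1=2.49, y_bar=11.1043,
              r_significant=True, good_visual_fit=True,
              x_min=0.7, x_max=12.0))
```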

Residuals

Residuals to the regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • There are good reasons we might want to consider the difference between an observed value and the value that the model predicts.
  • Suppose that \( x_i \) is actually some observed measurement paired with the observed value \( y_i \).
  • In general, we should not believe that \( y_i \) is actually equal to \[ \hat{y}_i = b_0 + b_1 x_i , \] as there is random variation in every piece of sample data \( (x_i,y_i) \).
  • Even though \( \hat{y}_i \) is our estimate for the expected value of \( y_i \), these observations are randomly distributed around the true expected value;
    • therefore, there is generally a mismatch even if \( \hat{y} \) is a perfectly accurate estimate of the expected value.
  • There is a special name for this mismatch \( y_i - \hat{y}_i \), the residual.
  • In the figure to the left, we can see a diagram of what a residual would look like in terms of the vertical lines.
  • The regression line is called a “best-fit” line because it is actually the line that minimizes the total of all squared residuals.
    • Concretely, this is a minimum distance line between the predicted and the observed values of \( y \) when written in terms of \( x \).
  • This is generally a very desirable property for the predictions, but it is also a weakness when the data has extreme outliers, particularly in the \( x \) values…
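
  • A minimal sketch (hypothetical data) of this least-squares property: the line built from the earlier formulas has a smaller total of squared residuals than nearby alternative lines.

```python
import numpy as np

# Hypothetical paired data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 3.1, 3.9, 5.2, 5.8])

def ss_residuals(b0, b1):
    """Total of the squared residuals y_i - y-hat_i for a line."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Least-squares estimates via the formulas from earlier slides.
r = np.corrcoef(x, y)[0, 1]
b1 = r * np.std(y, ddof=1) / np.std(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# Any perturbed line has a larger total of squared residuals.
print(ss_residuals(b0, b1))
print(ss_residuals(b0 + 0.5, b1), ss_residuals(b0, b1 + 0.2))
```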

Outliers and influential points

Residuals to the regression line.
Residuals to the regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • When there is an extreme outlier in the \( x \) values, the regression line will try to fit the outlier at the cost of the other observations.
  • Including such a point, when it would otherwise lie far off the regression line, will dramatically change the fit.
  • Consider the plots to the left.
  • In the left plot, we have the regression from the original Nobel/ Chocolate data where we have a reasonable fit to the data.
  • In the right plot, the book’s author has included an influential point, in that it is extreme in the \( x \) values, and lies away from the original regression line.
  • The regression line in the right plot has tried to minimize the squared residuals for all observations, but has placed special importance on this point at the cost of the others.
  • This is one of the main reasons we need to be conscious of outlier points in the data, and to determine whether these measurements are truthful or erroneous.
  • Their presence can dramatically change our understanding of the relationship between \( x \) and \( y \), and the predictions we would obtain therein.

Residual plots

Residuals to the regression line.
Residuals to the regression line.
Residuals to the regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • When performing regression in practice, one of the most important steps is to verify whether the modeling assumptions hold.
    • This is most typically performed by plotting the residuals above the \( x \) value as in the above figures.
  • In particular, the residuals should look like white noise, with no particular pattern or structure.
  • On the left, this is a good case, where there isn’t any particular structure.
  • In the middle, there is a nonlinear structure, meaning the model is not a good fit.
  • On the right, the residuals widen as \( x \) increases, meaning that the equal standard deviation assumption no longer holds.
  • This is just a quick preview of the kinds of ideas you would see in a future statistical modelling class, but will not be on the exams.
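
  • As a closing illustration, a minimal matplotlib sketch of the “good case” residual plot, using simulated data rather than real measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: a linear trend plus white noise, so the residual
# plot should show no particular pattern or structure.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
y = -3.37 + 2.49 * x + rng.normal(0, 2, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.scatter(x, residuals)
plt.axhline(0, color="red")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```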