Analysis of variance approach to simple linear regression

09/14/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:

    • Analysis of variance approach to regression
    • Decomposing the variation
    • Degrees of Freedom
    • Mean squares and the ANOVA table

Analysis of variance approach

  • We have now seen one approach to regression analysis, which will form the basic framework in which we consider these linear models.
  • However, there are additional ways to approach the regression model, among which is the Analysis of Variance, or ANOVA.
  • This approach, which we will introduce in the following, seeks to partition the variation in the signal into different components for creating hypothesis tests.
  • We will introduce the main concepts here, which will underpin a number of the techniques which we will introduce in full generality in multiple regression.

Total sum of squares

  • We note that there are several forms of variation in our regression analysis.
  • Among these is the variation of the response variable around its empirical, sample-based mean, \[ Y_i - \overline{Y} \]
  • Analogously to how we earlier defined the RSS in terms of the squared-deviations of \( Y_i \) from the regression-estimated mean response, \[ RSS = \sum_{i=1}^n \hat{\epsilon}_i^2; \]
  • we will define the Total Sum of Squares (TSS) in terms of the squared-deviations of \( Y_i \) from the sample-based mean of the response: \[ TSS\triangleq \sum_{i=1}^n \left( Y_i - \overline{Y}\right)^2. \]
  • Q: if all observations of the response variable have the same value, then what value does the TSS attain?
  • A: the TSS must equal zero, as \( Y_i = \overline{Y} \) for all \( i \).
  • In this regard, the greater the overall variation in the response variable across all cases, the greater the TSS.
  • The TSS represents the variation around a null model, in which we would consider the variation present in the response to be random variation around its sample-based mean, irrespective of the explanatory variable \( X \).
  • In general, the RSS does not equal the TSS, for the reason described above;
    • in particular, if there is a signal in the data, we expect there to be less variation in the RSS than in the TSS, as in the sketch at the end of this slide.
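  • A minimal R sketch of these two quantities follows; the data and the fitted model here are simulated for illustration only (they are not from the lecture), and this simulated example is reused in the sketches on later slides.

      ## Simulated data and a least squares fit, for illustration only.
      set.seed(1)
      n <- 50
      x <- runif(n, 0, 10)
      y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # response with a genuine signal
      fit <- lm(y ~ x)

      TSS <- sum((y - mean(y))^2)           # variation about the sample-based mean (null model)
      RSS <- sum(residuals(fit)^2)          # variation about the fitted regression line
      c(TSS = TSS, RSS = RSS)               # with a signal present, RSS is noticeably smaller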

Explained sum of squares

  • While we can consider the TSS a measure of the total variation around the null model of random variation around the mean, we can also consider how much of this variation is “explained”.
  • Particularly, consider the quantity, the Explained Sum of Squares (ESS) \[ ESS = \sum_{i=1}^n\left(\hat{Y}_i - \overline{Y}\right)^2; \]
  • This represents how much variation in the signal is explained by our regression model;
    • if our regression model is the null model, i.e., the \( i \)-th fitted value is just the sample-based mean of the observed responses, \( \hat{Y}_i =\overline{Y} \), then \( ESS=0 \).
  • Therefore, as we will show in the following, we can generally consider a larger \( ESS \) as corresponding to a regression model with better performance.

Partitioning the errors

  • To demonstrate the meaning of the ESS corresponding to a better performance, we consider the following partition of the variation in the response, \[ \underbrace{Y_i - \overline{Y}}_{TSS} = \underbrace{\hat{Y}_i - \overline{Y}}_{ESS} +\underbrace{ Y_i - \hat{Y}_i}_{RSS}, \] where we say each term loosely corresponds to the TSS, ESS, or RSS as above.
  • This corresponds in a loose sense to decomposing the total deviation of the response around the mean into:
    1. the deviation of the fitted values around the mean (ESS), plus
    2. the deviation of the observed values from the fitted values (RSS)
  • Q: how do we obtain the equality of the right-hand-side with the left-hand-side above?
  • A: we can always add zero to one side of an equation without changing the equality, i.e., \[ Y_i - \overline{Y} = Y_i - \overline{Y} + \left(\hat{Y}_i - \hat{Y}_i \right). \]
  • Re-arranging terms recovers the decomposition as above.

Partitioning the errors – continued

  • While we have motivated the decomposition of the TSS, we haven’t actually shown the decomposition.
  • Specifically, we need to demonstrate that, \[ \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2 = \sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)^2 + \sum_{i=1}^n \left(Y_i - \hat{Y}_i \right)^2, \] which is non-trivial, and is a consequence of the choice of the estimation by least squares.
  • We will begin by adding zero and expanding terms, \[ \begin{align} TSS&= \sum_{i=1}^n \left[Y_i - \overline{Y}\right]^2 \\ & = \sum_{i=1}^n \left[ \left(\hat{Y}_i - \overline{Y}\right) + \left(Y_i - \hat{Y}_i \right)\right]^2\\ &= \sum_{i=1}^n \left[ \left(\hat{Y}_i - \overline{Y}\right)^2 + \left(Y_i - \hat{Y}_i \right)^2 + 2 \left(\hat{Y}_i - \overline{Y}\right)\left(Y_i - \hat{Y}_i \right)\right]\\ &= ESS + RSS + 2\sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)\left(Y_i - \hat{Y}_i \right). \end{align} \]
  • Therefore, we need to demonstrate that the sum of cross terms vanishes to prove the partition.

Partitioning the errors – continued

  • It will be sufficient to show that \[ \sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)\left(Y_i - \hat{Y}_i \right) =0 \]
  • We will study this property in the class activity, proving that \[ TSS = ESS + RSS; \] a numerical check is also sketched at the end of this slide.
  • In the above form, we see the tradeoff between the two terms in the \( TSS \), particularly,
    1. When the \( RSS \) is large, this says:
      • the squared-distance between the fitted and the observed values, \( RSS= \sum_{i=1}^n\hat{\epsilon}_i^2 \), is large;
      • particularly, the \( ESS \) (explained variation) is small and the fit is close to the null model.
    2. When the \( ESS \) is large, this says:
      • the squared-distance between the fitted and the empirical, sample-based mean, \( ESS=\sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)^2 \), is large;
      • particularly, the \( RSS \) is small, implying a close fit between the predicted and the observed values.
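  • As a numerical check of the partition \( TSS = ESS + RSS \), the following minimal R sketch reuses the simulated data and fit from the earlier sketch.

      ## Verify TSS = ESS + RSS and that the cross term vanishes (up to rounding).
      TSS <- sum((y - mean(y))^2)
      ESS <- sum((fitted(fit) - mean(y))^2)
      RSS <- sum(residuals(fit)^2)

      all.equal(TSS, ESS + RSS)                           # TRUE up to floating-point error
      sum((fitted(fit) - mean(y)) * residuals(fit))       # cross term, approximately zero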

Goodness of fit

  • With the last discussion as a motivation, we can introduce our first metric for the “goodness of fit” of a regression model.
  • A common choice to examine how well the regression model actually fits the data is called the “coefficient of determination” or “the percentage of variance explained”.
  • For short, we define, \[ R^2 = 1 - \frac{\sum_{i=1}^n \left( Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2} = 1 - \frac{RSS}{TSS} \]
  • Q: recalling the relationship \( TSS = ESS + RSS \), what is the possible range of \( R^2 \), and what do the maximal and minimal values correspond to?
  • A: if \( RSS=TSS \), then we have a value of \( R^2=0 \), corresponding to a null model, i.e., simply random variation about the sample-based mean.
  • The smallest value \( RSS \) can attain is \( 0 \), in which case \( R^2=1 \), corresponding to the case where every fitted value equals its observed value.
  • Generally, we consider a model with \( R^2 \) close to one a “good” fit, and \( R^2 \) close to zero a bad fit.
    • Note: this metric has a number of flaws, which we will discuss further in the course.
    • However, this metric is commonly used enough and is of great enough historical importance that we should understand it.
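  • The definition can be checked directly in R; the following minimal sketch reuses the simulated fit from the earlier sketches and compares the hand computation with the value reported by summary().

      ## R^2 computed from its definition versus the value reported by summary().
      RSS <- sum(residuals(fit)^2)
      TSS <- sum((y - mean(y))^2)
      c(manual = 1 - RSS / TSS, from_summary = summary(fit)$r.squared)   # the two agree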

A visual representation

Visualization of the total variation of data points which is greater than the variation of the data points around the regression function.

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

  • In the case of simple linear regression, we can visualize the meaning of \( R^2 \) directly in terms of the variation of the observations around the regression function.
    • The solid arrow represents the variance of the data about the sample-based mean of the response.
    • The dashed arrow represents the variance of the data about the least squares predicted mean response.
    • \( R^2 \) is defined by one minus the ratio of these two variances.
  • Intuitively, by the “picture-proof”, we want the variation of the cases about the predicted mean response to be much smaller than the variation around the empirical mean.
  • This corresponds, intuitively, to the idea that the response varies tightly with respect to the regression function, and there is indeed structure to the signal.
  • If we had a null model, where the response is flat with respect to the change in the predictor \( X \), then the \( RSS \) and the \( TSS \) would be the same.

Computing \( R^2 \) in the R language

  • Our definition of \( R^2 \) is the same as the one used by the R language, so we emphasize it here.
    • However, this definition assumes that there is an intercept term for the model.
  • If there is no intercept term, i.e., \( \beta_0 = 0 \), then \( R^2 \) should instead be computed as the squared correlation of the fitted values with the observed values: \[ \begin{align} R^2 & = cor^2 \left(\hat{Y}, Y\right) \end{align} \]
  • The value of \( R^2 \) should be computed from this definition whenever the model is fitted without an intercept.
  • If we do not take care to do it this way, the coefficient of determination will be misleadingly high.
  • Note: for this reason, we must take care when using the model summary command in the R language for a model without an intercept, as illustrated in the sketch at the end of this slide.
  • Defining a model without an intercept, i.e., \( \beta_0=0 \), is an uncommon assumption, and it is made only when there is a good “physical” meaning for a model without an intercept.
  • For example, if we are forming a model for a population size based on the food supply as the predictor, there is a clear “physical” meaning for \( \beta_0=0 \).
    • It would be reasonable to require that if there is zero food supply, the population should be zero or randomly fluctuate around zero due to migrations through the study area.
  • In general, however, \( \beta_0 \) is used simply as the intercept and may not have a real meaning for the relationship, other than setting a base level for the response, appropriate for the scope of the model.
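  • To see the issue concretely, the following minimal sketch fits a no-intercept model to the simulated data from the earlier sketches; the particular data are illustrative only.

      ## With the intercept dropped, summary() measures variation about zero rather than
      ## about the sample-based mean, so its reported R^2 can be misleadingly high.
      fit0 <- lm(y ~ x - 1)                 # model forced through the origin

      summary(fit0)$r.squared               # typically inflated for a no-intercept fit
      cor(fitted(fit0), y)^2                # squared correlation of fitted and observed values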

What are “good” values of \( R^2 \)?

  • There is no universal “good” value for \( R^2 \) to attain in practice.
  • For physics and engineering applications, data will be produced in tightly controlled experiments.
    • Measurement noise will typically be low, and there are strong correlation and causality relationships in these settings.
    • In that case, we expect \( R^2 \) to be close to one in order to say the fitted values model the observations well.
  • In the social sciences, there is much more variability, typically causal relationships (if they exist) are not well understood and correlations are weaker.
    • In this case, we will typically expect a “good fit” to have a much lower \( R^2 \) score.

Simulated examples of \( R^2\approx.65 \)

Figure of different arrangements of data points, all with \( R^2 \) approximately .65.  Descriptions are in the text

Courtesy of: Faraway, J. Linear Models with R. 2nd Edition

  • On the left, we see how different configurations of data can all result in the same \( R^2 \) score.
    • Upper left: the plot is well-behaved for \( R^2 \) – there is a clear trend with some variation.
    • Upper right: the residual variation is smaller than the first plot but the variation in the observations is also smaller, so the \( R^2 \) score is about the same.
    • Lower left: the fit looks strong except for a couple of outliers, which lower the overall score.
    • Lower right: the relationship is quadratic, leading to some irregularity in the fit with a straight line.

Breakdown of degrees of freedom

  • We have used the notion of the “degrees of freedom” loosely up to this point, and we want to formalize this analysis.
  • The degrees of freedom refer to the number of values that are free to vary (the number of free parameters or independent variables) in the computation of some statistic.
  • In particular, this can be considered geometrically for a set of \( n \) observations of the response, \( \left\{Y_i\right\}_{i=1}^n \);
    • If we identify the \( n \) observations as an \( n \)-dimensional vector \[ \mathbf{Y} = \begin{pmatrix} Y_1,& \cdots, &Y_n\end{pmatrix}^\mathrm{T} , \] we say that as a random vector, it can attain a value in any subspace of the \( n \)-dimensional space \( \mathbb{R}^n \).
    • Suppose we have the sample-based mean defined as before as \( \overline{Y} = \frac{1}{n}\sum_{i=1}^n Y_i \).
    • Then, we can re-write the random vector \( \mathbf{Y} \) in terms of two objects, one which lives in \( 1 \)-dimensional space and one which lives in \( n-1 \)-dimensional space: \[ \mathbf{Y} = \overline{Y} \begin{pmatrix}1 & \cdots & 1\end{pmatrix}^\mathrm{T} + \begin{pmatrix} Y_1 - \overline{Y} & \cdots & Y_n - \overline{Y}\end{pmatrix}^\mathrm{T} \]
    • The first quantity on the right-hand-side is constrained to live in the \( 1 \)-dimensional subspace that is spanned by the vector \( \begin{pmatrix}1 & \cdots & 1\end{pmatrix}^\mathrm{T} \); here the only “free” parameter is the value of \( \overline{Y} \).

Breakdown of degrees of freedom – continued

  • Continuing our analysis: \[ \mathbf{Y} = \overline{Y} \begin{pmatrix}1 & \cdots & 1\end{pmatrix}^\mathrm{T} + \begin{pmatrix} Y_1 - \overline{Y} & \cdots & Y_n - \overline{Y}\end{pmatrix}^\mathrm{T} \]
    • The second quantity on the right-hand-side may appear to have \( n \) dimensions of possible values, but there is a constraint implied by the used degree of freedom: \[ \sum_{i=1}^n \left(Y_i - \overline{Y}\right) =0, \] such that there are only \( n-1 \) degrees of freedom left over.
    • Particularly, the second quantity (sometimes referred to as the anomalies) lives in an \( (n-1) \)-dimensional subspace of the full \( \mathbb{R}^n \) space.
  • Therefore, when we compute the unbiased sample-based variance, the normalization by \( n-1 \) makes sense by the fact that the quantity \[ \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2, \] has only \( n-1 \) degrees of freedom, or values that are not yet determined.
  • This is analogous to the earlier lecture, when we discussed the over-constrained / under-constrained / uniquely determined solution to finding a line through data points in the plane.
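  • The decomposition above is easy to illustrate numerically; the following minimal R sketch reuses the simulated response from the earlier sketches.

      ## Split the response vector into its mean component (1 degree of freedom)
      ## and its anomalies (n - 1 degrees of freedom).
      mean_part <- rep(mean(y), length(y))  # lies in the span of (1, ..., 1)^T
      anomalies <- y - mean(y)              # constrained to sum to zero

      sum(anomalies)                        # approximately 0: one degree of freedom is used
      all.equal(y, mean_part + anomalies)   # the decomposition recovers y exactly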

Breakdown of degrees of freedom – continued

  • In the case of estimating the regression function, \[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X, \] we similarly estimate two parameters as linear combinations of the observed cases \( (X_i,Y_i) \).
  • We consider the \( Y_i \) to be the free values here, while the two normal equations provide two constraints to the estimated regression function.
  • Correspondingly, as we introduce more parameters \( p \) in the model, we will use more degrees of freedom, solving a system of equations for \( p \) parameters;
  • equivalently, we will put more constraints on the system until it becomes uniquely (or eventually over-) constrained, with \( n-p \) degrees of freedom available to determine the regression function.
  • Q: We said that the definition of the unbiased estimate for \( \sigma^2 \) generalizes to multiple regression, simply by increasing the number of parameters \( p \) to account for the additional estimated quantities in our regression.
  • Suppose \( n=p \), what is our unbiased estimate of the variance \( \sigma^2 \)?

Breakdown of degrees of freedom – continued

  • A: recall the definition, \[ \hat{\sigma}^2 \triangleq \frac{RSS}{n-p}, \] such that if \( n-p=0 \), the equation is undefined.
  • Indeed, if \( n-p=0 \) this is a completely constrained system, with a unique value for the regression function; this is actually a serious issue of overfitting, which we will return to later.
  • Particularly, one issue we can see already is that we do not have a means of uncertainty quantification for our estimates.
  • If \( n-p<0 \), we have an overconstrained or “super-saturated” model for which different techniques entirely are needed for the analysis.

Degrees of freedom of TSS and RSS

  • By the earlier discussion, we say that the TSS, \[ TSS = \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2 \] has \( n-1 \) degrees of freedom.
    • This corresponds to the fact that there are \( n \) values that the observations can attain, with one constraint from the sample-based mean.
  • Similarly, we find that the RSS, \[ \begin{align} RSS &= \sum_{i=1}^n \left( Y_i - \hat{Y}_i \right)^2\\ &= \sum_{i=1}^n \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1X_i \right)^2 \end{align} \] has \( n-p \) degrees of freedom \( (p=2) \), because there are \( p \) constraints on this relationship given any \( n \) possible values that \( Y_i \) attain.
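  • These counts can be read off a fitted model in R; the following minimal sketch reuses the simulated fit from the earlier sketches.

      ## Degrees of freedom associated with the RSS and the TSS (here p = 2).
      df.residual(fit)                      # n - p, the degrees of freedom of the RSS
      length(y) - 1                         # n - 1, the degrees of freedom of the TSS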

Degrees of freedom of ESS

  • Let us derive the number of degrees of freedom of the explained sum of squares, \[ ESS = \sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)^2 \]
  • Q: we will use the property that the mean of the fitted values is equal to the mean of the observed values, i.e., \[ \frac{1}{n}\sum_{i=1}^n \hat{Y}_i = \overline{Y}; \] using any of the properties we have proven already about the regression function, can you show why this is?
  • A: one useful property we have shown with the normal equations is that the sum of the residuals is zero, i.e., \[ \sum_{i=1}^n \hat{\epsilon}_i = 0. \]
  • Therefore, we can consider that \[ \overline{Y} =\frac{1}{n} \sum_{i=1}^n Y_i = \frac{1}{n}\sum_{i=1}^n\left( \hat{Y}_i + \hat{\epsilon}_i\right) = \frac{1}{n}\sum_{i=1}^n \hat{Y}_i \]
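  • Both identities are easy to verify numerically; the following minimal R sketch reuses the simulated fit from the earlier sketches.

      ## The residuals of a least squares fit with an intercept sum to zero, so the
      ## fitted values share the sample-based mean of the observations.
      sum(residuals(fit))                   # approximately 0
      c(mean(fitted(fit)), mean(y))         # the two means agree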

Degrees of freedom of ESS – continued

  • Recalling the form for the explained sum of squares, \[ \begin{align} ESS &= \sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)^2 \\ &=\sum_{i=1}^n \left[\hat{\beta}_0 + \hat{\beta}_1X_i - \left(\frac{1}{n}\sum_{j=1}^n \hat{Y}_j \right) \right]^2 \end{align} \] where the above used the relationship we just proved.
  • Q: in what way can we simplify the above expression?
  • A: one way is to substitute the definition of the fitted value \( \hat{Y}_i \) once again and cancel terms, \[ \begin{align} ESS &=\sum_{i=1}^n \left\{ \hat{\beta}_0 + \hat{\beta}_1X_i - \left[\frac{1}{n}\sum_{j=1}^n \left(\hat{\beta}_0 + \hat{\beta}_1 X_j \right)\right] \right\}^2 \\ &=\sum_{i=1}^n \left\{ \hat{\beta}_0 + \hat{\beta}_1X_i - \hat{\beta}_0 - \hat{\beta}_1 \overline{X} \right\}^2\\ &= \hat{\beta}_1^2 \sum_{i=1}^n \left(X_i - \overline{X}\right)^2 \end{align} \]
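  • As a numerical confirmation of this identity, the following minimal R sketch reuses the simulated data and fit from the earlier sketches.

      ## Check ESS = beta1_hat^2 * sum((x_i - x_bar)^2).
      b1  <- coef(fit)["x"]                 # estimated slope
      Sxx <- sum((x - mean(x))^2)
      ESS <- sum((fitted(fit) - mean(y))^2)

      all.equal(unname(b1^2 * Sxx), ESS)    # TRUE up to floating-point error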

Degrees of freedom decomposition

  • From the last derivation, we have that \[ \begin{align} ESS &= \hat{\beta}_1^2 \sum_{i=1}^n \left(X_i - \overline{X}\right)^2 \end{align} \]
  • Although the \( ESS \) is computed from \( n \) deviations, they are all derived from the same regression line.
  • If we suppose the regression line is the free value in this case, it has two degrees of freedom described by its slope \( \hat{\beta}_1 \) and its intercept \( \hat{\beta}_0 \).
  • However, as we saw earlier, we cancel the terms with the intercept such that \( \hat{\beta}_1 \) is the only degree of freedom (free parameter) in the \( ESS \).
  • Therefore, we say that the \( ESS \) has one degree of freedom.
  • An important consequence of this for the analysis of variance approach is that the degrees of freedom, like the total variation, are additive: \[ \underbrace{TSS}_{n-1} = \underbrace{ESS}_{p - 1} + \underbrace{RSS}_{n-p}, \] where \( p=2 \) in simple regression.
  • The above concept and the geometry likewise generalize to multiple regression, which we will come to shortly.

Mean Squares

  • A sum of squares, such as the \( TSS \), \( ESS \), or \( RSS \), when divided by its associated degrees of freedom, is referred to as a mean square.
  • Therefore, we will identify the following quantities:
    1. the regression mean square: \( \frac{ESS}{p-1} \);
    2. the residual mean square error: \( \frac{RSS}{n-p} \);
  • Q: we have mentioned once before that one of the above is an unbiased estimator — can you recall what is the value of, \[ \mathbb{E}\left[\frac{RSS}{n-p}\right]? \]
  • A: the residual mean square error, denoted \( \hat{\sigma}^2 \), is an unbiased estimator for \( \sigma^2 \); therefore, \[ \mathbb{E}\left[\frac{RSS}{n-p}\right] = \sigma^2 \]
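  • In R, the square of the residual standard error reported by summary() is exactly this residual mean square; the following minimal sketch reuses the simulated fit from the earlier sketches.

      ## Residual mean square RSS / (n - p), with p = 2 for simple regression.
      RSS <- sum(residuals(fit)^2)
      c(RSS / (length(y) - 2), summary(fit)$sigma^2)   # the two estimates of sigma^2 agree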

Mean Squares – continued

  • It can be shown that similarly, the regression mean square has an expectation, \[ \mathbb{E}\left[\frac{ESS}{1} \right]= \sigma^2 + \beta_1^2 \sum_{i=1}^n \left(X_i - \overline{X}\right)^2 \]
    • Note, however, that while the residual mean square error takes the same form in higher dimensions, the above form of the regression mean square does not.
  • Q: suppose \( \beta_1\neq 0 \), which is larger, the expected regression mean square or the expected residual mean square error?
  • A: provided all cases don’t correspond to the same value \( X_i \), the sum of squares \( \sum_{i=1}^n\left(X_i - \overline{X}\right)^2 \) is positive.
  • Therefore, comparing the two values of the regression mean square and the residual mean square error provides some means to determine “how likely is it that \( \beta_1=0 \)?”
  • Particularly, the expected value of a mean square gives the mean around which the sample-based estimate will vary;
      • if \( \beta_1 \neq 0 \), we expect the regression mean square to attain a value greater than the residual mean square error.
  • This type of comparison will underpin our hypothesis tests, which we will introduce shortly in multiple regression.
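  • The expectation above can be illustrated by simulation; the following minimal R sketch repeatedly regenerates the response for a fixed design and averages the two mean squares (all parameter values here are illustrative assumptions).

      ## Monte Carlo illustration: the regression mean square varies around
      ## sigma^2 + beta_1^2 * sum((x_i - x_bar)^2), the residual mean square around sigma^2.
      set.seed(2)
      m <- 50; beta0 <- 2; beta1 <- 0.5; sigma <- 1
      xs  <- runif(m, 0, 10)
      Sxx <- sum((xs - mean(xs))^2)

      one_rep <- function() {
        ys <- beta0 + beta1 * xs + rnorm(m, sd = sigma)
        f  <- lm(ys ~ xs)
        c(reg_ms = sum((fitted(f) - mean(ys))^2),      # ESS / 1
          res_ms = sum(residuals(f)^2) / (m - 2))      # RSS / (n - p)
      }
      sims <- replicate(5000, one_rep())

      rowMeans(sims)                        # Monte Carlo means of the two mean squares
      c(sigma^2 + beta1^2 * Sxx, sigma^2)   # their theoretical expectations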

The ANOVA table

  • Collecting all the information we have developed so far in the analysis of variance framework, we arrive at the ANOVA table.
    • A sample ANOVA table:
      Source       Degrees of freedom   Sum of squares   Mean square               F-statistic
      Regression   \( p-1 \)            \( ESS \)        \( \frac{ESS}{p-1} \)     \( F \)
      Residual     \( n-p \)            \( RSS \)        \( \frac{RSS}{n-p} \)
      Total        \( n-1 \)            \( TSS \)        \( \frac{TSS}{n-1} \)
  • The R language will provide an ANOVA table that arranges the information we have discussed, similarly to the above.
    • The piece of data we haven’t discussed so far is the one we have been alluding to — the value of the F-statistic for hypothesis testing.
  • It is not strictly necessary to compute all the elements of the table; as Fisher, the originator of the table, said in 1931, it is “nothing but a convenient way of arranging the arithmetic.”
  • When we introduce multiple regression, we will return to this table to interpret our results in terms of hypothesis testing versus the null model.
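  • In R, the anova() function prints the regression and residual rows of such a table for a fitted model; the following minimal sketch reuses the simulated fit from the earlier sketches.

      ## The ANOVA table for the simple regression fit; the F statistic is the ratio
      ## of the regression mean square to the residual mean square.
      anova(fit)

      ESS <- sum((fitted(fit) - mean(y))^2)
      RSS <- sum(residuals(fit)^2)
      (ESS / 1) / (RSS / (length(y) - 2))   # matches the F value reported by anova(fit)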