Diagnostics part 1: Gaussian error assumption

10/19/2020


Outline

  • The following topics will be covered in this lecture:
    • The iterative process of regression – a review
    • Diagnostics of four different assumptions
    • Diagnostics for the Gaussian error assumption
    • Q-Q plots and visual diagnostics
    • Hypothesis testing for Gaussian distributions

Diagnostics

  • Our analysis and methodology for:
    1. fitting a linear model;
    2. forecasting predictions from a linear model;
    3. explaining the relationships between variables in a linear model; and
    4. quantifying the uncertainty of a linear model
  • all rely on several assumptions, e.g., the conditions of the Gauss-Markov theorem and usually Gaussianity of the errors.

  • Methods for checking and validating these assumptions are known as diagnostics.

  • Typically, we will start with one model as a best first guess.

  • Performing diagnostics will reveal issues in the model, and suggest ways for improvement.

  • Building a model is thus usually an interactive, iterative process, where we will create and perform diagnostics over a succession of models.

Constructing a regression model – an iterative process

Flow chart from start of the regression analysis, to exploratory data analysis, to developing one or more possible models.

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • This iterative process is represented by the flow chart on the left.
  • We must clean the data and perform exploratory analysis to determine relationships between the response and possible predictors, and between the predictors themselves.
    • This step, which we have largely suppressed so far, can actually be one of the larger pieces of work in regression analysis.
    • This will be introduced to a certain extent in the midterm and final projects.
    • It will be critical to understand the relationships between the variables to identify correlations / multicollinearity, as it affects our framework.
    • Understanding the spread of the data will affect our inferences as it relates to extrapolation.
    • Being aware of nonlinear scaling in the data (e.g., suggesting a log or square root transform) will also affect our analysis.
    • Similarly, we should be aware of multi-modality in the data, as well as aspects of skewness in distributions.
    • Deciding the appropriate specification of categorical variables in the model is another way exploratory analysis affects the model selection.
    • Finally, identifying unusual observations, as well as erroneous and missing data, is a key aspect of a realistic analysis.
  • This exploratory analysis helps us to develop one or more possible regression models for the study.

Constructing a regression model – an iterative process

Flow chart from developing one or more possible models, iterating if the models aren't appropriate for the data at hand, then identifying the most suitable model and making inferences.

Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition

  • From the point of the last diagram, we can take our tentative models and evaluate if they are suitable for the problem and data.
  • If not, we revise and repeat, possibly applying a different framework altogether.
  • If so, we compare the competing models and their explanations, favoring those that are most robust to the uncertainty in accurately identifying a statistical relationship and drawing conclusions from it.
  • We will test the models in various ways to see that our framework’s assumptions are well satisfied, that the models are not overfit to the training data, and that we can draw the same conclusions in the presence of confounding variables.
    • Note, even when we feel quite confident in the model, it is key to qualify our inferences by the uncertainty of the model, e.g., providing confidence intervals.
  • The first two parts of this course have focused on building the theory and studying the application of regression in the situation in which the GM assumptions are well satisfied and there exists a good linear signal in the data.

Diagnostics – continued

  • In the next part of the course, we will consider 4 categories of potential issues with the model:
    1. Issues with the Gaussian error assumption: the hypothesis testing we have studied thus far relies on the Gaussian error assumption.
      • Thanks to the central limit theorem, this assumption will not cause huge issues if it is not satisfied exactly; however, we should be cautious about the use of some hypothesis tests, and careful in judging whether the resulting inferences will be accurate when this assumption is not satisfied.
      • Other courses, or further reading in the book, contain more information on how to handle significant departures from Gaussianity; this is also the least critical of the assumptions, as the least squares estimate will still be the BLUE without it.
    2. Issues with the form of the hypothetical covariance: we have assumed that \[ \mathrm{cov}\left(\mathbf{Y}\right) = \sigma^2\mathbf{I} \] but in many cases this will not be true.
      • Handling deviations from this assumption will lead to various techniques of transformation, including generalized least squares.
    3. Issues with unusual observations: some of the observations may not look like others, and they might change the choice and the fit of the model.
      • It is thus important to understand how to identify, interpret, and treat the effects of influential and unusual observations.
    4. Issues with the systematic part of the model: we have assumed that there is an actual signal in the data of the form \[ \begin{align} \mathbb{E}[\mathbf{Y}] = \mathbf{X} \boldsymbol{\beta}, \end{align} \] which may not be valid.
      • Additionally, this doesn’t address the multiple competing models we may choose, nor the issue of overfitting in their selection.

Reviewing the Gaussian error assumption

  • We will begin our discussion of diagnostics with the Gaussian error assumption.
  • Although the least squares estimator may still be the BLUE without this assumption, we will not be able to make use of certain tests without it.
  • The hypothesis testing and confidence interval procedures we have learned are based upon assumptions of Gaussian errors.
  • Because we need to quantify the uncertainty of our model, its explanations, and its predictions, we should be aware of when it is or is not appropriate to draw conclusions based on these assumptions.
  • The central limit theorem assures that asymptotically in the sample size \( n \), \( \hat{\boldsymbol{\beta}} \) will be Gaussian distributed about the true parameter values, regardless of the underlying \( \boldsymbol{\epsilon} \).
  • There is no universal number \( n_0 \) for the minimum sample size needed to apply the central limit theorem; for close-to-normal data the sample size can be quite small, and it is often taken as \( n_0\approx 30 \).
  • In the case of strong departures from the Gaussian error assumption, however, \( n \) may need to be quite large, on the order of hundreds of observations.
  • Therefore, to gain more corroborating evidence that we can trust our confidence intervals and hypothesis tests, we will often want to quantify whether there are strong departures from Gaussian distributions, especially in the context of smaller sample sizes.
  • Let us recall, if \( \boldsymbol{\epsilon} \) is Gaussian distributed, \( \hat{\boldsymbol{\epsilon}} \) will be as well.
  • Therefore, we can take \[ \begin{align} H_0: \boldsymbol{\epsilon} \sim N(\boldsymbol{0},\sigma^2 \mathbf{I}) & & H_1 : \boldsymbol{\epsilon} \text{ non Gaussian} \end{align} \] and study the residuals to determine if we have significant departures from the null hypothesis.
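  • As a concrete illustration (a minimal sketch, with hypothetical simulated data), we can fit a least squares model and apply a formal normality test to the residuals; one standard choice is the Shapiro-Wilk test, available in SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate a linear model with Gaussian errors (hypothetical example)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta = np.array([1.0, 2.0])                    # "true" parameters
y = X @ beta + rng.normal(scale=1.5, size=n)   # Gaussian errors

# Least squares fit and residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Shapiro-Wilk test of H0: the residuals are Gaussian
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.3f}")
# A small p-value would indicate a significant departure from the null
```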

Tests for Gaussianity

Cumulative probability distributions of Gaussians

Courtesy of Inductiveload via Wikimedia Commons

  • An extremely useful tool to assess the Gaussianity of the error is a quantile-quantile (Q-Q) plot.
  • To define a Q-Q plot, we will consider the theoretical versus the empirical values for such a plot.
  • Suppose we have two probability measures \( P_1 \) and \( P_2 \).
  • Let each probability have a corresponding cumulative distribution function (CDF) given by \( C_1 \) and \( C_2 \) respectively.
  • Recall, the CDF of a probability measure \( P \) is defined as \[ \begin{align} C(x) \triangleq& P(X \leq x )\\ \equiv& \int_{-\infty}^x f(t) dt, \end{align} \] whenever \( P \) has a continuous probability density function \( f(t) \).
  • On the left, we illustrate this with the CDF of the normal distribution with several different parameters.
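  • For instance, these CDFs can be evaluated directly with SciPy (a minimal sketch):

```python
from scipy import stats

# C(x) = P(X <= x) for a standard normal and for N(mu=1, sigma^2=4)
print(stats.norm.cdf(0.0))                      # 0.5: half the mass lies below the mean
print(stats.norm.cdf(1.0, loc=1.0, scale=2.0))  # also 0.5, by symmetry about the mean
print(stats.norm.cdf(1.96))                     # ~0.975 for the standard normal
```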

Quantile-Quantile plots

  • Given the two probability measures \( P_1,P_2 \), and their respective CDFs, we can define their theoretical Q-Q plot as the graph of the \( \mathbb{R}^2 \)-valued function,

    \[ \begin{align} G: [0,1] &\to \mathbb{R}^2 \\ p &\mapsto \left(C_1^{-1}(p), C_2^{-1}(p)\right) \end{align} \]

    • Suppose \( p \) is a particular quantile, e.g., \( p=50\% \). Then the map \( G \) takes \( p \) to the point in the plane, \( (x_1, x_2) \), for which \[ \begin{align} P_1(X \leq x_1) &= 50\% \\ P_2(X \leq x_2) &= 50\% \end{align} \]
  • Q: in the above, what does the point \( (x_1,x_2) \) correspond to?

    • A: The point \( (x_1,x_2) \) is just the plot of the medians of the two distributions in the plane.
  • Q: What kind of shape do we expect for the plot when the two CDFs are equal?

    • A: If the two CDFs are equal, \( C_1 = C_2 \), the Q-Q plot of \( G \) is just the straight line with slope one through the origin.
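  • As a minimal sketch, we can trace the map \( G \) numerically with SciPy's `ppf` (the inverse CDF), here for two hypothetical Gaussians:

```python
import numpy as np
from scipy import stats

# Quantile levels, excluding 0 and 1 where the inverse CDF may be infinite
p = np.linspace(0.01, 0.99, 99)

# Theoretical Q-Q curve G(p) = (C1^{-1}(p), C2^{-1}(p))
x1 = stats.norm.ppf(p)                       # standard normal quantiles
x2 = stats.norm.ppf(p, loc=2.0, scale=3.0)   # N(2, 3^2) quantiles

# At p = 0.5 (index 49) the point is the pair of medians: (0.0, 2.0)
print(x1[49], x2[49])
```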

Quantile-Quantile plots

  • Suppose that \( P_1,P_2 \) represent the same family of probability measures, such that their CDFs differ only by location and scale, i.e.,

    • suppose \( C_2(x) = C_1\left(\frac{x - \mu}{\sigma}\right) \).
    • Then \( C_2^{-1}(p) \equiv \mu + \sigma C_1^{-1}(p) \), and there is a linear relationship between the quantiles of the two measures.
  • Therefore, if two distributions differ only in location and scale,

    • e.g., \( X_1 \sim N(0,1) \) and \( X_2 \sim N(\mu, \sigma^2) \),
  • the Q-Q plot is just a straight line with slope \( \sigma \) and intercept \( \mu \).

  • For this reason, when making a Q-Q plot, the CDFs (and/or the data) are typically standardized to be mean zero and variance one.

  • By doing so, when measuring two distributions in the same family (as above), the Q-Q plot will just be the central diagonal of the plane.
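  • A quick numerical check of this linear relationship between quantiles (a minimal sketch):

```python
import numpy as np
from scipy import stats

mu, sigma = 2.0, 3.0
p = np.linspace(0.05, 0.95, 19)

q1 = stats.norm.ppf(p)                        # quantiles of N(0, 1)
q2 = stats.norm.ppf(p, loc=mu, scale=sigma)   # quantiles of N(mu, sigma^2)

# The quantiles are related linearly: q2 = mu + sigma * q1
assert np.allclose(q2, mu + sigma * q1)
```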

Quantile-Quantile plots

  • Typically, we will use the Q-Q plot to make a visual inspection of the empirical quantiles of our data versus the theoretical quantiles of a target distribution.
  • Let’s suppose that we have a sample \( \left\{X_i\right\}_{i=1}^n \) distributed according to \( C_2 \) where \( C_2 \) is an unknown CDF.
  • Suppose that we hypothesize \( C_2 = C_1 \) for some known, theoretical distribution \( C_1 \). That is, \[ \begin{align} H_0: C_1 = C_2 \\ H_1: C_1 \neq C_2 \end{align} \]
  • We will make a hypothesis test as above where,
    1. if the Q-Q plot of the empirical versus theoretical CDF is “almost” the central diagonal,
    2. we will tentatively fail to reject the null hypothesis.
  • Recall, that the inverse of the CDF applied to \( 50\% \), \( C^{-1}\left(50\%\right) \), is just the median.
  • Q: Given observations \( \left\{x_i\right\}_{i=1}^n \) of the random sample above, how do we compute the empirical inverse CDF?
  • A: For the observations of the sample \( \left\{x_i\right\}_{i=1}^n \), the inverse of the empirical CDF is simply the re-ordered values such that \( x_i \leq x_{i+1} \) for all \( i=1,\cdots, n-1 \).
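  • In code, the inverse of the empirical CDF is simply the sorted sample (a minimal sketch with hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(size=10)   # hypothetical observations

# Sorting gives the inverse of the empirical CDF: the i-th sorted value
# estimates the quantile at level i / (n + 1)
empirical_quantiles = np.sort(sample)
print(empirical_quantiles)
```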

Quantile-Quantile plots

  • With sample size \( n \), we can make \( n \) plotting positions for the inverse of the empirical and theoretical CDF, i.e., we can plot the points,

    \[ \begin{align} \left\{\left(C_1^{-1}\left(\frac{i}{n+1}\right), x_i\right) \in \mathbb{R}^2: i = 1,\cdots,n \right\} \end{align} \]

    • Q: Notice, we do not compute \( C_1^{-1}\left(1\right) \), what would be the possible issue with this plot?
    • A: If, for instance, \( X_1 \sim N(\mu, \sigma^2) \), then \( C_1^{-1}(1)= \infty \).
    • Generally, some adjustment must be made in the above equation, e.g., using \( n+1 \) rather than \( n \) as the denominator.
    • Several different approximations can be made, but some adjustment is unavoidable.
    • For a sufficiently large, independent, identically distributed sample, the approximation won't make a very large difference, but will keep the plot finite.
  • Note: evaluating the hypothesis test,

    \[ \begin{align} H_0: C_1 = C_2 \\ H_1: C_1 \neq C_2 \end{align} \]

    using the empirical versus theoretical CDF in the Q-Q plot is very close to the Kolmogorov-Smirnov test.

  • The Kolmogorov-Smirnov test follows a similar principle and can be used generally to evaluate the divergence of the empirical CDF of a sample of observations from a hypothesized CDF.
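  • As a minimal sketch (assuming a standard normal null hypothesis and hypothetical data), we can build these plotting positions by hand and run the Kolmogorov-Smirnov test with SciPy's `stats.kstest`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100
sample = rng.normal(size=n)          # hypothetical sample

# Hand-built Q-Q plotting positions against the standard normal
p = np.arange(1, n + 1) / (n + 1)    # avoids p = 0 and p = 1
theoretical = stats.norm.ppf(p)      # C_1^{-1}(i / (n + 1))
empirical = np.sort(sample)          # ordered observations

# Kolmogorov-Smirnov test of H0: the sample is drawn from N(0, 1)
stat, p_value = stats.kstest(sample, "norm")
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
```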

Reading Q-Q plots

Normal data q-q plot.