We have now seen how to construct a model, and assess some of its uncertainty and significance versus the null hypothesis.
One of the main points of constructing such a model is to produce new predicted values for the response based upon new values for the explanatory data.
Consider values of the predictors which are not observed in our data set, but fall within the scope of the model;
We can construct an arbitrary vector of values for the predictors, \( \mathbf{x}_0 \), and define the prediction based on our least-squares model as, \[ \begin{align} \hat{Y}_0 = \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \end{align} \] where \( \mathbf{x}_0\in \mathbb{R}^p \).
If \( \mathbf{x}_0 \) were an observed case, this would be precisely the fitted value.
Among the primary issues is quantifying the uncertainty of the prediction.
Concretely, we will differentiate between the predicted mean response and the prediction of a future observation.
Suppose we build a model that predicts the rental price of houses in a given area based on: (i) the number of bedrooms, (ii) the number of bathrooms; and (iii) distance to a major highway.
A future observation of the response is defined to be,
\[ \begin{align} Y_0 = \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} + \epsilon_0; \end{align} \]
where the variation, \( \epsilon_0 \), around the signal, \( \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \), for this future observation is unknown.
Given some new set of values for the explanatory variables \( \mathbf{x}_0 \), we may thus wish to predict the response that will actually be observed, for some particular realization of \( \epsilon_0 \) in this case.
Assuming the future variation \( \epsilon_0 \) has mean zero, our “central” point prediction is given by the predicted mean, \( \hat{Y}_0 = \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \).
We will derive the variance of the predicted observation – this allows us to quantify the uncertainty of the point prediction and to derive confidence intervals.
Consider,
\[ \begin{align} \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} & = \mathbf{x}_0^\mathrm{T}\left( \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{Y}\right) \\ & = \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \left(\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\right) \\ &= \mathbf{x}_0^\mathrm{T} \boldsymbol{\beta} + \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \boldsymbol{\epsilon} \end{align} \]
such that
\[ \begin{align} \mathbb{E}\left[\mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \right]& = \mathbf{x}_0^\mathrm{T} \boldsymbol{\beta} \end{align} \] the prediction is indeed unbiased.
Therefore, the deviation from the mean is given by,
\[ \begin{align} \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}\boldsymbol{\epsilon} \end{align} \] which is a scalar.
Because the above equation is a scalar, the transpose is equal to itself, i.e.,
\[ \begin{align} \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}\boldsymbol{\epsilon}&= \left(\mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}\boldsymbol{\epsilon}\right)^\mathrm{T}\\ & = \boldsymbol{\epsilon}^\mathrm{T}\mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{x}_0 \end{align} \]
From the last slide, we compute the variance of \( \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \) as
\[ \begin{align} var\left[\mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \right]& = \mathbb{E}\left[ \left( \mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} - \mathbf{x}_0^\mathrm{T}\boldsymbol{\beta}\right)^2\right] \\ &=\mathbb{E} \left[ \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{x}_0\right] . \end{align} \]
Passing the expectation linearly across terms, we have
\[ \begin{align} var\left[\mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \right]& = \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}\mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\mathrm{T}\right]\mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{x}_0 \\ &= \mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T}\left( \sigma^2 \mathbf{I}\right)\mathbf{X}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{x}_0. \end{align} \]
With a final cancellation of terms, we recover,
\[ \begin{align} var\left[\mathbf{x}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \right] &= \sigma^2\mathbf{x}_0^\mathrm{T} \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \mathbf{x}_0, \end{align} \] so that the uncertainty is given by a weighted inner product of the new predictors.
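As a quick numerical check of this formula, the following sketch uses simulated data (all variable names and values here are illustrative, not from the class example) and compares the hand-computed standard error, with \( \sigma \) replaced by its estimate \( \hat{\sigma} \), against the standard error reported by R's predict():
set.seed(1)                                     # simulated data for illustration only
n <- 50
X <- cbind(1, runif(n), runif(n))               # toy design matrix with intercept
beta <- c(1, 2, -1)
y <- X %*% beta + rnorm(n, sd = 0.5)
df <- data.frame(y = y, x1 = X[, 2], x2 = X[, 3])
fit <- lm(y ~ x1 + x2, data = df)
x0 <- c(1, 0.5, 0.25)                           # a new predictor vector
sigma_hat <- summary(fit)$sigma
se_formula <- sigma_hat * sqrt(t(x0) %*% solve(t(X) %*% X) %*% x0)
se_predict <- predict(fit, newdata = data.frame(x1 = 0.5, x2 = 0.25),
                      se.fit = TRUE)$se.fit
c(se_formula, se_predict)                       # the two values should agree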
Recall how we defined the anomalies of the mean earlier.
Let \( \overline{\mathbf{X}}^{(i)} \) be the mean of column \( i \) of the design matrix, i.e., \[ \overline{\mathbf{X}}^{(i)} \triangleq \frac{1}{n} \sum_{k=1}^n X_{k,i}, \] summing over the matrix entries \( X_{k,i} \) along the rows \( k=1,\cdots,n \).
We will then define the \( (k,i) \)-th anomaly once again as \[ a_{(k,i)} = X_{k,i} - \overline{\mathbf{X}}^{(i)}, \] such that the matrix \( \mathbf{A} \) is defined column-wise as \[ \begin{align} \mathbf{A}^{(i)} &\triangleq \mathbf{X}^{(i)} - \frac{1}{n} \boldsymbol{1}\boldsymbol{1}^\mathrm{T}\mathbf{X}^{(i)} \end{align} \] where \( \boldsymbol{1} \) is the vector of ones, \[ \begin{align} \boldsymbol{1}^\mathrm{T} \triangleq \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}. \end{align} \]
The sample-based correlation coefficient of the variables \( X_i \) and \( X_j \) can be written once again as \[ \begin{align} cor(X_i,X_j)\triangleq \frac{\left(\mathbf{A}^{(i)}\right)^\mathrm{T} \mathbf{A}^{(j)}}{\sqrt{\left[\left(\mathbf{A}^{(i)}\right)^\mathrm{T}\mathbf{A}^{(i)}\right] \left[\left(\mathbf{A}^{(j)}\right)^\mathrm{T}\mathbf{A}^{(j)}\right]}}. \end{align} \]
Note that we can also write the covariance directly as, \[ \mathrm{cov}\left(\mathbf{X}\right) = \frac{1}{n-1}\mathbf{A}^\mathrm{T} \mathbf{A} \]
dividing by the degrees of freedom.
This suggests that it can be useful to also consider the normalized anomalies,
\[ \tilde{\mathbf{X}}^{(i)} = \frac{1}{\sqrt{n-1}}\mathbf{A}^{(i)}, \]
as this means,
\[ \mathrm{cov}\left(\mathbf{X}\right) = \tilde{\mathbf{X}}^\mathrm{T} \tilde{\mathbf{X}}. \]
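As a small sanity check (a sketch on a simulated matrix; the names here are illustrative), we can verify that the anomaly construction above reproduces R's built-in cov(), and that the normalized anomalies satisfy the identity above:
set.seed(2)                                      # toy matrix for illustration only
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
ones <- rep(1, nrow(X))
A <- X - (1 / nrow(X)) * ones %*% t(ones) %*% X  # anomalies via the centering operator
X_tilde <- A / sqrt(nrow(X) - 1)                 # normalized anomalies
all.equal((1 / (nrow(X) - 1)) * t(A) %*% A, cov(X))   # should be TRUE
all.equal(t(X_tilde) %*% X_tilde, cov(X))             # should be TRUE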
If we instead take the predictors to be \( \tilde{\mathbf{X}}^{(i)} \), the uncertainty of the prediction can be written as
\[ \begin{align} \mathrm{var}\left[\tilde{\mathbf{x}}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \right] &= \sigma^2\tilde{\mathbf{x}}_0^\mathrm{T} \mathrm{cov}\left(\mathbf{X}\right)^{-1} \tilde{\mathbf{x}}_0, \end{align} \] i.e., as a ratio of the spread in the observations \( \sigma^2 \) and a weighted distance of the new case from the mean;
Using the assumption of the Gaussian variation, we develop two separate intervals for uncertainty quantification:
a confidence interval for the mean response, \[ \begin{align} \mathbf{x}_0^\mathrm{T}\hat{\boldsymbol{\beta}} \pm t^{(\alpha/2)}_{n-p} \hat{\sigma} \sqrt{\mathbf{x}_0^\mathrm{T}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{x}_0}; \end{align} \]
and a prediction interval for a future observation, \[ \begin{align} \mathbf{x}_0^\mathrm{T}\hat{\boldsymbol{\beta}} \pm t^{(\alpha/2)}_{n-p} \hat{\sigma} \sqrt{1 + \mathbf{x}_0^\mathrm{T}\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{x}_0}. \end{align} \]
Due to the extra term in the prediction interval, this is generally significantly wider than the range in which we believe the mean to lie.
We should always be careful about which of the prediction interval and the confidence interval for the mean response we intend to use.
As an example, consider again predicting the high water line for a river in a flood plain.
This is useful for policy makers to understand the range of possible extreme events, where there could be extreme flooding due to variation around the signal.
Bones and muscle are denser than fat, so an individual's body density can be used to estimate their percentage body fat.
Measuring someone's weight is easy, but estimating their body density also requires a complicated measure of their volume.
The volume of someone's body can be estimated by Archimedes' principle, but this is complicated and time consuming with a human being or other living subjects.
To simplify the process, researchers recorded age, weight, height, and 10 body circumference measurements for 252 men.
These test subjects were given an accurate measurement of their body fat by an underwater weighing technique.
The task is to determine if we can predict body fat using easier to measure quantities.
library(faraway)
lmod <- lm(brozek ~ age + weight + height + neck + chest + abdom +
             hip + thigh + knee + ankle + biceps + forearm + wrist, data=fat)
head(fat)
brozek siri density age weight height adipos free neck chest abdom hip
1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5
2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7
3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2
4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2
5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9
6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8
thigh knee ankle biceps forearm wrist
1 59.0 37.3 21.9 32.0 27.4 17.1
2 58.7 37.3 23.4 30.5 28.9 18.2
3 59.6 38.9 24.0 28.8 25.2 16.6
4 60.1 37.3 22.8 32.4 29.4 18.2
5 63.2 42.2 24.0 32.2 27.7 17.7
6 66.0 42.0 25.6 35.7 30.6 18.8
We will make a prediction at the median values of the explanatory variables.
We compute the median of each column of the model matrix:
x <- model.matrix(lmod)
(x0 <- apply(x,2,median))
(Intercept) age weight height neck chest
1.00 43.00 176.50 70.00 38.00 99.65
abdom hip thigh knee ankle biceps
90.95 99.30 59.00 38.50 22.80 32.05
forearm wrist
28.70 18.30
(y0 <- sum(x0*coef(lmod)))
[1] 17.49322
predict(lmod,new=data.frame(t(x0)))
1
17.49322
med_obs <- predict(lmod,new=data.frame(t(x0)),interval="prediction")
med_mean <- predict(lmod,new=data.frame(t(x0)),interval="confidence")
med_obs
fit lwr upr
1 17.49322 9.61783 25.36861
med_mean
fit lwr upr
1 17.49322 16.94426 18.04219
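As an optional cross-check (a sketch using only objects already defined above), we can reproduce the confidence interval for the mean response by hand from the formulas given earlier, using a \( t \)-quantile with \( n-p \) degrees of freedom; the result should match med_mean up to rounding:
sigma_hat <- summary(lmod)$sigma                                 # estimate of sigma
se_mean <- sigma_hat * sqrt(t(x0) %*% solve(t(x) %*% x) %*% x0)  # s.e. of the mean response
y0 + qt(c(0.025, 0.975), df = nrow(x) - ncol(x)) * c(se_mean)    # lower and upper bounds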
Q: Is there a practical difference between the two intervals above?
A: In this case, despite the tight interval around the mean response, there is a large practical difference between the lower and upper bounds of the \( 95\% \) prediction interval:
med_obs_width <- med_obs[3] - med_obs[2]
med_obs_width / med_obs[1]
[1] 0.9003934
med_mean_width <- med_mean[3] - med_mean[2]
med_mean_width / med_mean[1]
[1] 0.06276317
Particularly, the width of the \( 95\% \) prediction interval for an observation with median characteristics is on the order of the prediction itself.
On the other hand, the width of the \( 95\% \) confidence interval for the mean response is on the order of \( 6\% \) of the prediction.
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.292549 16.069921 -0.9516 0.342252
age 0.056786 0.029965 1.8951 0.059290
weight -0.080310 0.049581 -1.6198 0.106602
height -0.064600 0.088930 -0.7264 0.468299
neck -0.437541 0.215334 -2.0319 0.043273
chest -0.023603 0.091839 -0.2570 0.797396
abdom 0.885429 0.080077 11.0572 < 2.2e-16
hip -0.198419 0.135156 -1.4681 0.143406
thigh 0.231895 0.133718 1.7342 0.084175
knee -0.011677 0.224143 -0.0521 0.958496
ankle 0.163536 0.205143 0.7972 0.426142
biceps 0.152799 0.158513 0.9640 0.336048
forearm 0.430489 0.184452 2.3339 0.020436
wrist -1.476537 0.495519 -2.9798 0.003183
n = 252, p = 14, Residual SE = 3.98797, R-Squared = 0.75
Suppose we want to make a prediction given a set of observed explanatory variables that lie far from the previously observed data.
Recall the form of the variance of the prediction when we use the normalized anomalies as the predictors,
\[ \begin{align} \mathrm{var}\left[\tilde{\mathbf{x}}_0^\mathrm{T} \hat{\boldsymbol{\beta}} \right] &= \sigma^2\tilde{\mathbf{x}}_0^\mathrm{T} \mathrm{cov}\left(\mathbf{X}\right)^{-1} \tilde{\mathbf{x}}_0, \end{align} \]
Q: Given the above calculation of the variance, do we expect the variance of the prediction to grow or shrink as we move away from the center of mass of the previous observations?
A: When the weighted distance from the mean is large (i.e., the normalized anomalies are large), the variance will be large.
We can consider the case where \( \tilde{\mathbf{x}}_0 \) is an eigenvector of \( \mathrm{cov}(\mathbf{X}) \).
In this case, we have
\[ \begin{align} & \mathrm{cov}(\mathbf{X})\tilde{\mathbf{x}}_0 = \lambda \tilde{\mathbf{x}}_0 \\ \Leftrightarrow & \frac{1}{\lambda} \tilde{\mathbf{x}}_0 = \mathrm{cov}(\mathbf{X})^{-1}\tilde{\mathbf{x}}_0 \\ \Rightarrow & \frac{1}{\lambda} \parallel \tilde{\mathbf{x}}_0\parallel^2 = \tilde{\mathbf{x}}_0^\mathrm{T} \mathrm{cov}(\mathbf{X})^{-1}\tilde{\mathbf{x}}_0. \end{align} \]
Recalling from the last slide, for an eigenvector \( \tilde{\mathbf{x}}_0 \) we have,
\[ \begin{align} \frac{1}{\lambda} \parallel \tilde{\mathbf{x}}_0\parallel^2 = \tilde{\mathbf{x}}_0^\mathrm{T} \mathrm{cov}(\mathbf{X})^{-1}\tilde{\mathbf{x}}_0. \end{align} \]
Because the covariance is symmetric by construction, we know it has an orthonormal basis of eigenvectors by the Spectral Theorem.
We can decompose any normalized anomaly \( \tilde{\mathbf{x}}_0 \) into such a basis and recover a similar statement, term by term, in the orthogonal eigenbasis.
Therefore, we see that the variance scales like the inverse eigenvalues of \( \mathrm{cov}(\mathbf{X}) \) times the norm-square of the anomaly and the variance of the observations.
The directions of greatest spread in the observed predictors correspond to larger \( \lambda \); therefore, anomalies in the directions of greatest spread have relatively lower variance in prediction.
Generally speaking, this says that when we try to predict for a new case that lies far away from the center of the predictor space, centered by the mean and weighted by the inverse covariance, we leave our realm of experience, and this results in high variance in the predictions.
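A small numerical sketch of this scaling (simulated matrix, names chosen for illustration): for a unit eigenvector of the covariance, the weighting \( \tilde{\mathbf{x}}_0^\mathrm{T}\mathrm{cov}(\mathbf{X})^{-1}\tilde{\mathbf{x}}_0 \) equals \( 1/\lambda \), so the direction of greatest spread receives the smallest weight:
set.seed(3)                                  # toy matrix for illustration only
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
S <- cov(X)
e <- eigen(S)                                # eigenvalues in decreasing order
v_max <- e$vectors[, 1]                      # direction of greatest spread
v_min <- e$vectors[, 3]                      # direction of least spread
c(t(v_max) %*% solve(S) %*% v_max,           # = 1 / largest eigenvalue (small weight)
  t(v_min) %*% solve(S) %*% v_min)           # = 1 / smallest eigenvalue (large weight)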
Predicting the response for values \( \mathbf{x}_0 \) of the explanatory variables that lie beyond the previously observed quantities is known as extrapolation.
Intuitively, we should expect greater uncertainty in our predictions the less the situation looks like our previous experience.
We see this concretely reflected in the \( 95\% \) prediction interval and confidence interval of the mean response.
Suppose we take the \( 95 \)-th percentile of each of the observed explanatory variables and compute these intervals.
x1 <- apply(x,2,function(x) quantile(x,0.95))
nine_five_obs <- predict(lmod, new=data.frame(t(x1)), interval="prediction")
nine_five_mean <- predict(lmod, new=data.frame(t(x1)), interval="confidence")
(nine_five_obs[3] - nine_five_obs[2])/ med_obs_width
[1] 1.027756
(nine_five_mean[3] - nine_five_mean[2])/ med_mean_width
[1] 3.547266
Here we see only a slight increase in the width of the \( 95\% \) prediction interval; however, the \( 95\% \) confidence interval for the mean response is roughly \( 3.5 \) times wider than it was for the median case.
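As a further exploration one might try (a sketch only; the \( 99 \)-th percentile case is not part of the original analysis, and it reuses the x and lmod objects from above), we can push the new case even further from the center of the data, where we would typically expect both intervals to widen further still:
x2 <- apply(x, 2, function(v) quantile(v, 0.99))   # 99th percentile of each predictor
predict(lmod, new=data.frame(t(x2)), interval="prediction")
predict(lmod, new=data.frame(t(x2)), interval="confidence")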
Unfortunately, for the above example, we have substantial uncertainty about the form that the model should take.
While we are able to quantify the uncertainty of the parameters, e.g., with hypothesis testing and confidence intervals, we haven't dealt with the uncertainty of the model when we are unsure if a linear model is appropriate in the first place.
This aspect is to be continued…
In many data sets, we will be interested in how a value changes in time, often as a function of the previous values of the response itself.
Examples include many common time series, such as the monthly airline passenger counts considered below.
The principle for using such a model is that there is strong correlation in time between the observations of the response.
In the following, we will be interested in modeling the future response as a linear model in terms of the previously observed responses.
By intuition, or expert prior knowledge, we might figure that the number of passengers in a given month would strongly depend on: (i) the number of passengers in the previous month; and (ii) the seasonal pattern, i.e., the numbers observed around the same month one year earlier.
In this case, our best model for the seasonal variation might be the seasonal variation we noticed one year prior.
However, if we wish to capture the seasonal variation itself, we need the difference between neighboring months one year in the past.
The above reasoning suggests a model might look like the following:
\[ \begin{align} y_t = \beta_0 + \beta_1 y_{t-1} + \beta_{12}y_{t-12} + \beta_{13} y_{t-13} + \epsilon_t \end{align} \]
Here, \( y_{t} \) refers to the response (number of passengers) at time \( t \).
Therefore, \( y_{t-12} \) represents the response one year in the past;
\( y_{t-13} \) represents the response one year and a month in the past;
and \( \epsilon_t \) represents the random variation at the current time.
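One way to connect this model to the month-over-month difference discussed above is the following algebraic regrouping (the same model, simply rewritten):
\[ \begin{align} y_t = \beta_0 + \beta_1 y_{t-1} + \beta_{12}\left(y_{t-12} - y_{t-13}\right) + \left(\beta_{12} + \beta_{13}\right) y_{t-13} + \epsilon_t, \end{align} \]
so that when \( \beta_{13} \) is negative with magnitude comparable to \( \beta_{12} \), the seasonal terms act largely through this difference.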
In general, we refer to the \( y_{t-i} \) for \( i>0 \) as the lagged variables.
In R, we will use the “embed” function to create a matrix of the lagged variables.
As an example, let the vector \( x \) represent a time series from time 1 to time 10:
x = c(1:10)
print(x)
[1] 1 2 3 4 5 6 7 8 9 10
embed(x, dimension=3)
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 4 3 2
[3,] 5 4 3
[4,] 6 5 4
[5,] 7 6 5
[6,] 8 7 6
[7,] 9 8 7
[8,] 10 9 8
In the following, we use the same function to create 13 lagged variables for each row;
we rename the columns of the new data frame correspondingly;
and we create a linear model in terms of the lagged variables as demonstrated before:
lagdf <- embed(log(airpass$pass),14)
colnames(lagdf) <- c("y",paste0("lag",1:13))
lagdf <- data.frame(lagdf)
armod <- lm(y ~ lag1 + lag12 + lag13, data=lagdf)
sumary(armod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.138485 0.053607 2.5833 0.01092
lag1 0.692308 0.061858 11.1919 < 2.2e-16
lag12 0.921518 0.034732 26.5321 < 2.2e-16
lag13 -0.632144 0.067679 -9.3403 4.156e-16
n = 131, p = 4, Residual SE = 0.04164, R-Squared = 0.99
An explicit time variable isn't necessary here because \( y_{t-1} \) plays the role of describing the trend.
Notice the alternating sign for the coefficients of the \( y_{t-12} \) and \( y_{t-13} \) lag variables, giving the response approximately as a weighted difference between these two months.
The \( R^2 \) is extremely good and each variable is strongly significant… let's see the result.
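As one optional visual check (a sketch using only the objects defined above; the first 13 months have no fitted value because of the lags), we can overlay the fitted values on the observed log-passenger series:
lpass <- log(airpass$pass)
plot(lpass, type="l", xlab="month index", ylab="log(passengers)")
lines((length(lpass) - length(fitted(armod)) + 1):length(lpass),
      fitted(armod), col="red", lty=2)           # fitted values from the lag model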
Suppose we wish to predict future values.
The last observation in the time series is given as:
lagdf[nrow(lagdf),]
y lag1 lag2 lag3 lag4 lag5 lag6 lag7
131 6.068426 5.966147 6.133398 6.230481 6.40688 6.43294 6.282267 6.156979
lag8 lag9 lag10 lag11 lag12 lag13
131 6.133398 6.037871 5.968708 6.033086 6.003887 5.891644
where the row entries correspond to the response at the current (final) time step and its 13 lagged values.
predict(armod, data.frame(lag1=6.0684, lag12=6.0331, lag13=6.0039),
        interval="prediction")
fit lwr upr
1 6.103972 6.020606 6.187338
By producing new predicted values, we can recursively compute further future values, using each prediction to supply the lagged variables needed for the next forecast.
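A minimal sketch of such a recursion follows (using only the armod and airpass objects from above; this gives point predictions only and ignores how uncertainty compounds over the recursion):
lpass <- log(airpass$pass)
for (step in 1:3) {
  n <- length(lpass)
  newcase <- data.frame(lag1=lpass[n], lag12=lpass[n - 11], lag13=lpass[n - 12])
  lpass <- c(lpass, predict(armod, newcase))     # append the forecast as a new "observation"
}
tail(lpass, 3)                                   # three recursively forecast values (log scale)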