Regression models are extremely important in describing relationships between variables.
Linear regression is a simple but powerful tool for investigating linear dependencies.
Nonparametric regression models are widely used because they require fewer assumptions about the data at hand.
At the beginning of an empirical analysis, it is often better to look at the data without imposing assumptions about the family of distributions.
Nonparametric techniques allow us to describe the observations and find suitable models, provided the sample is sufficiently large and representative of the true population.
Regression models aim to find the most likely values of a dependent variable \( Y \) for a set of possible values \( \{x_i\}_{i = 1}^n \) of the explanatory variable \( X \).
We write a proposal for how the variables \( Y \) and \( X \) vary together as
\[ \begin{align} Y = g(X) + \epsilon & & \epsilon \sim F_\epsilon , \end{align} \]
where \( g(x)= \mathbb{E}\left[Y \vert X =x \right] \) is an arbitrary function.
The \( g(X) \) is included in the model with the intention of capturing the mean of the process that corresponds to a particular value of \( X=x \).
The \( \epsilon \) is a random noise term, representing variation around the deterministic part of the relationship.
The natural aim is to keep the values of \( \epsilon \) as small as possible, so that the deterministic part \( g(X) \) captures most of the variation in \( Y \).
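To make this concrete, here is a minimal sketch in R, with a hypothetical choice \( g(x) = \sin(2\pi x) \) and Gaussian noise (both made up for illustration), showing observations scattered around the deterministic mean function:
set.seed(1)
n <- 100
x <- runif(n)                       # observed values of the explanatory variable X
g <- function(x) sin(2 * pi * x)    # hypothetical mean function g
y <- g(x) + rnorm(n, sd = 0.25)     # noise epsilon with mean zero

plot(x, y, pch = 19, col = "grey50")
curve(g, add = TRUE, lwd = 2)       # the deterministic part E[Y | X = x]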
Parametric models assume that the dependence of \( Y \) on \( X \) can be fully explained by a finite set of parameters and that \( F_\epsilon \) has a prespecified form with parameters to be estimated.
Nonparametric methods do not assume any prespecified form for \( g \) or for \( F_\epsilon \).
Because nonparametric techniques can be applied where parametric assumptions are inappropriate, they protect the user from employing a misspecified model.
These methods are particularly useful in fields like quantitative finance, where the underlying distribution is in fact unknown.
However, as fewer assumptions can be exploited, this flexibility comes with the need for more data.
Particularly, nonparametric methods can be of high variance in how they estimate the trend in the data.
[Figures courtesy of: Kutner, M. et al., Applied Linear Statistical Models, 5th Edition.]
Considering again the general regression model,
\[ \begin{align} Y = g(X) + \epsilon & & \epsilon \sim F_\epsilon . \end{align} \]
The simplest relationship we can propose between quantitative variables is the linear one; that is, the function \( g(\cdot) \) is linear.
It is of interest to study how \( Y \) varies with changes in the explanatory variables \( X_1 , \cdots , X_{p-1} \), using individual observed measurements \( \{y_i\}_{i=1}^n \) and \( \{x_{ij} \vert i=1,\cdots,n; j=1,\cdots,p-1\} \).
The model, which is assumed to hold in the population, is written as
\[ \begin{align} Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1} + \epsilon \end{align} \] for some true and deterministic, but unknown collection of parameters \( \{\beta_i\}_{i=0}^{p-1} \).
The variable \( \epsilon \) is called the error term and represents all factors other than \( X_1 ,\cdots , X_{p-1} \) that affect \( Y \).
Generically, it is assumed that \( \mathbb{E}\left[ \epsilon\right] = 0 \) so that the mean of the random variable \( Y \) is described conditionally as \( \mathbb{E}\left[Y \vert X_1, \cdots, X_{p-1} \right] = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1} \).
Let \( \mathbf{Y} = \{y_i\}_{i=1}^n \) be a vector of observations of the response variables and
\[ \begin{align} \mathbf{X} = \{ x_{ij}\vert i=1,\cdots, n; j=1,\cdots, p-1\} \end{align} \] be a matrix of tabular data, which we will associate with a data frame.
That is, we will consider the rows of the matrix \( \mathbf{X} \) to correspond to individual cases where \( Y \) is observed associated to corresponding values of the \( X_1, \cdots, X_{p-1} \).
The columns of the matrix \( \mathbf{X} \) will thus correspond to the distinct explanatory variables \( X_j \) that are observed in conjunction with the response variable.
In most cases, a constant is included by introducing a column of dummy variables \( x_{i0} = 1 \) for all rows \( i \) in this matrix.
The resulting data or design matrix, whose \( 0 \)-th column is the constant vector of ones, is thus of shape \( \mathbf{X}\in\mathbb{R}^{n\times p} \).
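As a brief illustration (with a hypothetical toy data frame, not part of the course data), R's model.matrix() builds exactly this \( n \times p \) design matrix, including the leading column of ones for the intercept:
df <- data.frame(y  = c(3.1, 4.9, 7.2, 8.8),
                 x1 = c(1, 2, 3, 4),
                 x2 = c(0.5, 0.1, 0.9, 0.3))
X <- model.matrix(y ~ x1 + x2, data = df)
X          # the "(Intercept)" column is the constant 1
dim(X)     # n x p = 4 x 3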
The aim is to find a good linear approximation of \( \mathbf{Y} \) using linear combinations of covariates
\[ \begin{align} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align} \] where the above represents a matrix transformation of a vector of parameters \( \boldsymbol{\beta} \) into the response variable, with error.
With the values of \( \{\beta_i\}_{i=0}^{p-1} \) unknown, we must choose estimates of them according to some criterion for optimality.
For a particular choice of parameter vector \( \hat{\boldsymbol{\beta}} \), we describe the estimated values of \( \mathbf{Y} \) by the matrix equation
\[ \begin{align} \hat{\mathbf{Y}} = \mathbf{X} \hat{\boldsymbol{\beta}} \end{align} \]
Let \( y_i \) be a particular observed value, in the \( i \)-th row of the vector \( \mathbf{Y} \).
By row-column multiplication, the above relationship describes the following estimate
\[ \begin{align} y_i \approx \hat{y}_i = \hat{\beta}_0 \times 1 + \hat{\beta}_1 \times x_{i,1} + \cdots + \hat{\beta}_{p-1} \times x_{i,p-1} \end{align} \]
This gives the estimated value of \( y_i \) precisely as a linear combination of the observed values of the predictors, plus a constant adjustment term \( \hat{\beta}_0 \) acting as the intercept, just as in the simple regression model.
We suppose that the true relationship for this case is described as,
\[ \begin{align} y_i = \beta_0 \times 1 + \beta_1 \times x_{i,1} + \cdots + \beta_{p-1} \times x_{i,p-1} + \epsilon_i \end{align} \] where \( \epsilon_i \) represents the particular variation that was observed in the process when we obtained the measurement \( y_i \).
Again, assuming that the random variable \( \epsilon_i \) satisfies \( \mathbb{E}\left[ \epsilon_i\right] = 0 \), it can be understood that
\[ \begin{align} \mathbb{E}\left[Y_i\vert X_{i,j}=x_{i,j}\right] = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{p-1} x_{i,p-1} \end{align} \]
To estimate the appropriate parameters \( \boldsymbol{\beta} \), the following least-squares optimization must be solved:
\[ \begin{align} \hat{\boldsymbol{\beta}} &= \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, f(\boldsymbol{\beta})\text{ where} \\ f(\boldsymbol{\beta}) &= \left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)^\mathrm{T} \left(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\right) = \left( \mathbf{Y} - \hat{\mathbf{Y}}\right)^\mathrm{T} \left(\mathbf{Y} - \hat{\mathbf{Y}}\right)= \hat{\boldsymbol{\epsilon}}^\mathrm{T}\hat{\boldsymbol{\epsilon}} \end{align} \]
In the above, like in optimization, we refer to
\[ \hat{\boldsymbol{\epsilon}} = \mathbf{r}(\boldsymbol{\beta}) = \mathbf{Y} - \hat{\mathbf{Y}} \] as the residual, measuring the difference between the observed reality and the prediction given the choice of the free variables.
We saw, when discussing unconstrained optimization, that this is a globally convex problem with a unique solution given by the normal equations:
\[ \begin{align} \hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\mathrm{T} \mathbf{X}\right)^{-1} \mathbf{X}^\mathrm{T} \mathbf{Y} \end{align} \] provided \( \mathbf{X} \) has full column rank.
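As a sketch of this computation (on hypothetical simulated data), we can solve the normal equations directly and confirm that lm() produces the same estimates; in practice lm() uses a QR decomposition rather than forming \( \mathbf{X}^\mathrm{T}\mathbf{X} \), which is numerically more stable:
set.seed(2)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with intercept column
beta_true <- c(2, -1, 0.5)
y <- drop(X %*% beta_true + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # (X^T X)^{-1} X^T Y via a linear solve
beta_hat
coef(lm(y ~ X[, 2] + X[, 3]))                # lm() gives the same values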
Here, similar but different to the optimization approach, we are treating the observed explanatory variables \( \mathbf{X} \) as the relationship that transmits the parameter values \( \boldsymbol{\beta} \) as free variables into the prediction \( \hat{\boldsymbol{Y}} \).
These approaches are deeply interrelated, and simply require slightly different interpretations based on the problem.
When the residual function, i.e. the relationship between the free parameters and the prediction, is nonlinear, we must again consider iterative optimization procedures like Gauss-Newton.
The solution to the linear least squares problem
\[ \begin{align} \hat{\boldsymbol{\beta}} &= \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, f(\boldsymbol{\beta})\text{ where} \\ f(\boldsymbol{\beta}) &= \left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)^\mathrm{T} \left(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\right) = \left( \mathbf{Y} - \hat{\mathbf{Y}}\right)^\mathrm{T} \left(\mathbf{Y} - \hat{\mathbf{Y}}\right)= \hat{\boldsymbol{\epsilon}}^\mathrm{T}\hat{\boldsymbol{\epsilon}} \end{align} \] is called the (ordinary) least squares (OLS) estimator.
Under the Gauss–Markov conditions (the errors have mean zero, share a common variance \( \sigma^2 \), and are uncorrelated across observations), the OLS estimator \( \hat{\boldsymbol{\beta}} \) is, by the Gauss–Markov theorem, the best linear unbiased estimator (BLUE):
By unbiased, we mean that \( \hat{\boldsymbol{\beta}} \) is a random vector, depending on the particular realization of \( \boldsymbol{\epsilon} \), whose expectation equals the true parameter vector, \( \mathbb{E}\left[\hat{\boldsymbol{\beta}}\right] = \boldsymbol{\beta} \).
Being the best estimator is in the sense that, for any other linear unbiased estimator \( \tilde{\boldsymbol{\beta}} \) with \( \mathbb{E}\left[\tilde{\boldsymbol{\beta}}\right]=\boldsymbol{\beta} \),
\[ \begin{align} \mathrm{var}\left(\hat{\beta}_i\right) \leq \mathrm{var}\left(\tilde{\beta}_i\right) \end{align} \] for each component \( i \), so that it is the minimum-variance estimator among all linear unbiased estimators.
Practically, this means that, with respect to resampling, \( \hat{\boldsymbol{\beta}} \) is the most precise estimate, that varies the least from the true values of \( \boldsymbol{\beta} \).
We noted that the least squares solution is the BLUE when the errors have mean zero, share a common variance \( \sigma^2 \), and are uncorrelated.
The covariance of the estimator \( \hat{\boldsymbol{\beta}} \) is shown to equal \[ \mathrm{cov}\left( \hat{\boldsymbol{\beta}}\right) = \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}, \] where \( \sigma \) is the standard deviation of the error \( \epsilon \).
Additional assumptions are required to develop further inferences about \( \hat{\boldsymbol{\beta}} \).
Under a normality assumption, \( \epsilon \sim N(0, \sigma^2) \), the estimator \( \hat{\boldsymbol{\beta}} \) has a normal distribution, i.e.
\[ \begin{align} \hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta}, \sigma^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\right). \end{align} \]
However, even without normality of the error, the central limit theorem tells us that this is the asymptotic distribution in large sample sizes \( n \).
In practice, the error variance \( \sigma^2 \) is often unknown, but can be estimated by
\[ \begin{align} \hat{\sigma}^2 = \frac{\hat{\boldsymbol{\epsilon}}^\mathrm{T} \hat{\boldsymbol{\epsilon}}}{n-p} \end{align} \]
It can be shown that this estimate is independent of the parameter estimates \( \hat{\boldsymbol{\beta}} \), and thus together
\[ \begin{align} \mathrm{cov}\left(\hat{\boldsymbol{\beta}}\right) \approx \hat{\sigma}^2 \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1} \end{align} \] is the estimated covariance matrix; the square roots of its diagonal entries are the standard errors, which describe how far \( \hat{\boldsymbol{\beta}} \) tends to vary from its true mean value \( \boldsymbol{\beta} \).
Also, under a normal error assumption or asymptotically with the central limit theorem, this allows us to make hypothesis tests for the estimated parameters \( \hat{\boldsymbol{\beta}} \).
In particular, we are interested in whether \( \beta_i \) may be equal to zero, in which case the mean response does not depend on the explanatory variable \( X_i \).
Using the (possibly asymptotic) normal distribution of \( \hat{\beta}_i \), we can create a t-test using the test statistic
\[ \begin{align} t_i = \frac{\hat{\beta}_i}{\hat{\sigma}_i} \sim t_{n-p} \end{align} \] under the null hypothesis, where \( \hat{\sigma}_i \) is the square root of the \( i \)-th diagonal entry of the estimated covariance matrix, i.e. the standard error of \( \hat{\beta}_i \).
The t-test above is a simple and effective means to identify the usefulness of a particular variable \( X_i \) in describing the trend in \( Y \) as \( X_i \) varies.
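As a minimal sketch (again on hypothetical simulated data), the quantities above can be computed by hand from a fitted model and checked against the coefficient table produced by summary():
set.seed(3)
n <- 40
X <- cbind(1, rnorm(n), rnorm(n))             # design matrix, p = 3 columns
p <- ncol(X)
y <- drop(X %*% c(1, 2, 0) + rnorm(n))

fit   <- lm(y ~ X[, 2] + X[, 3])
e_hat <- resid(fit)                           # residual vector

sigma2_hat <- sum(e_hat^2) / (n - p)          # hat{sigma}^2 = e'e / (n - p)
cov_hat    <- sigma2_hat * solve(t(X) %*% X)  # estimated covariance of beta-hat
se_hat     <- sqrt(diag(cov_hat))             # standard errors
t_stat     <- coef(fit) / se_hat              # t_i = beta_i-hat / sigma_i-hat
p_val      <- 2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)

cbind(t_stat, p_val)                          # matches summary(fit)$coefficients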
If we are interested in testing for several variables simultaneously, i.e.,
\[ \begin{align} H_0 :& \ \beta_{q} = \beta_{q+1}= \cdots = \beta_{p-1}=0 \\ H_1 :& \ \beta_{j} \neq 0 \text{ for at least one } j \in \{q, \cdots, p-1\} \end{align} \] we must use an F-test or an ANOVA to perform the simultaneous hypothesis test – the individual t-test p-values cannot be combined to perform this test.
In particular, we can derive a test statistic of the form
\[ \begin{align} F = \frac{\left(\hat{\boldsymbol{\epsilon}}^\mathrm{T}_\text{reduced model} \hat{\boldsymbol{\epsilon}}_\text{reduced model} - \hat{\boldsymbol{\epsilon}}^\mathrm{T}_\text{full model}\hat{\boldsymbol{\epsilon}}_\text{full model}\right)/\left(p-q\right)}{\hat{\boldsymbol{\epsilon}}^\mathrm{T}_\text{full model}\hat{\boldsymbol{\epsilon}}_\text{full model} / \left(n-p\right)} \end{align} \] which follows an \( F_{p-q,\, n-p} \) distribution under the null hypothesis.
We will look at an example of this with the lm function in R and the model summary.
library(faraway)
str(gala)
'data.frame': 30 obs. of 7 variables:
$ Species : num 58 31 3 25 2 18 24 10 8 2 ...
$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...
$ Area : num 25.09 1.24 0.21 0.1 0.05 ...
$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...
$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
Q: how many observations \( n \) do we have in this data set, and what is the largest \( p \) number of parameters can we put into a model?
A: there are \( n=30 \) observations; since one of the 7 variables is the response, the largest number of parameters is \( p=6+1=7 \) (6 slopes plus an intercept).
row.names(gala)
[1] "Baltra" "Bartolome" "Caldwell" "Champion" "Coamano"
[6] "Daphne.Major" "Daphne.Minor" "Darwin" "Eden" "Enderby"
[11] "Espanola" "Fernandina" "Gardner1" "Gardner2" "Genovesa"
[16] "Isabela" "Marchena" "Onslow" "Pinta" "Pinzon"
[21] "Las.Plazas" "Rabida" "SanCristobal" "SanSalvador" "SantaCruz"
[26] "SantaFe" "SantaMaria" "Seymour" "Tortuga" "Wolf"
summary(gala)
Species Endemics Area Elevation
Min. : 2.00 Min. : 0.00 Min. : 0.010 Min. : 25.00
1st Qu.: 13.00 1st Qu.: 7.25 1st Qu.: 0.258 1st Qu.: 97.75
Median : 42.00 Median :18.00 Median : 2.590 Median : 192.00
Mean : 85.23 Mean :26.10 Mean : 261.709 Mean : 368.03
3rd Qu.: 96.00 3rd Qu.:32.25 3rd Qu.: 59.237 3rd Qu.: 435.25
Max. :444.00 Max. :95.00 Max. :4669.320 Max. :1707.00
Nearest Scruz Adjacent
Min. : 0.20 Min. : 0.00 Min. : 0.03
1st Qu.: 0.80 1st Qu.: 11.03 1st Qu.: 0.52
Median : 3.05 Median : 46.65 Median : 2.59
Mean :10.06 Mean : 56.98 Mean : 261.10
3rd Qu.:10.03 3rd Qu.: 81.08 3rd Qu.: 59.24
Max. :47.40 Max. :290.20 Max. :4669.32
head(gala)
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
Champion 25 9 0.10 46 1.9 47.4 0.18
Coamano 2 1 0.05 77 1.9 1.9 903.82
Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
The meaning of the different variables can be found in the help file for the “gala” data using the command ?gala.
Species – number of species found on the island
Endemics – number of endemic species
Area – the area of the island in km²
Elevation – the highest elevation of the island
Nearest – the distance from the nearest island
Scruz – the distance from Santa Cruz island
Adjacent – the area of the adjacent island
We fit the model with the lm() function:
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
Notice the code syntax:
\[ \begin{align} \mathbf{Y}_{\mathrm{species}} = \beta_0 + \beta_{\mathrm{Area}}\mathbf{X}_{\mathrm{Area}} + \beta_{\mathrm{Elevation}}\mathbf{X}_{\mathrm{Elevation}} + \beta_{\mathrm{Nearest}}\mathbf{X}_{\mathrm{Nearest}} + \beta_{\mathrm{Scruz}}\mathbf{X}_{\mathrm{Scruz}} + \beta_{\mathrm{Adjacent}}\mathbf{X}_{\mathrm{Adjacent}} \end{align} \]
Like in the simple regression model, we can see exactly how (for all other variables held fixed) an increase in, e.g., \( \mathbf{X}_\mathrm{Area} \) by 1 \( \mathrm{km}^2 \) corresponds to an increase in the predicted mean function by \( \beta_{\mathrm{Area}} \).
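For instance, under the fitted model above, a hypothetical pair of islands identical except for a 1 \( \mathrm{km}^2 \) difference in Area (the other covariate values below are made up for illustration) differ in predicted mean species count by exactly the Area coefficient:
new_islands <- data.frame(Area = c(10, 11), Elevation = 100,
                          Nearest = 5, Scruz = 20, Adjacent = 1)
diff(predict(lmod, newdata = new_islands))   # equals coef(lmod)["Area"]
coef(lmod)["Area"]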
summary(lmod)
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.679 -34.898 -7.862 33.460 182.584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.369 0.715351
Area -0.023938 0.022422 -1.068 0.296318
Elevation 0.319465 0.053663 5.953 3.82e-06 ***
Nearest 0.009144 1.054136 0.009 0.993151
Scruz -0.240524 0.215402 -1.117 0.275208
Adjacent -0.074805 0.017700 -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
lm_simple <- lm(Species ~ Area, data = gala)
anova(lm_simple, lmod)
Analysis of Variance Table
Model 1: Species ~ Area
Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 235611
2 24 89231 4 146380 9.8427 7.362e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value for the F statistic is below \( \alpha=5\% \), we have reason to reject the null hypothesis that Elevation, Nearest, Scruz and Adjacent contribute nothing to the model beyond Area alone.
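As a check on the reasoning, the F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares; here the reduced model has \( q = 2 \) parameters and the full model has \( p = 6 \):
rss_reduced <- sum(resid(lm_simple)^2)       # residual sum of squares, reduced model
rss_full    <- sum(resid(lmod)^2)            # residual sum of squares, full model
n <- nrow(gala); p <- 6; q <- 2

F_stat <- ((rss_reduced - rss_full) / (p - q)) / (rss_full / (n - p))
F_stat                                                     # ~ 9.84, as in the table
pf(F_stat, df1 = p - q, df2 = n - p, lower.tail = FALSE)   # ~ 7.4e-05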
names(lmod)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
lmod$coefficients
(Intercept) Area Elevation Nearest Scruz Adjacent
7.068220709 -0.023938338 0.319464761 0.009143961 -0.240524230 -0.074804832
lmod$residuals
Baltra Bartolome Caldwell Champion Coamano Daphne.Major
-58.725946 38.273154 -26.330659 14.635734 38.383916 -25.087705
Daphne.Minor Darwin Eden Enderby Espanola Fernandina
-9.919668 19.018992 -20.314202 -28.785943 49.343513 -3.989598
Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
62.033276 -59.633796 40.497176 -39.403558 -37.694540 -2.037233
Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
-111.679486 -42.475375 -23.075807 -5.553122 73.048122 -40.676318
SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
182.583587 -23.376486 89.383371 -5.805095 -36.935732 -5.700573
predicted_Y <- lmod$fitted.values
print(predicted_Y)
Baltra Bartolome Caldwell Champion Coamano Daphne.Major
116.7259460 -7.2731544 29.3306594 10.3642660 -36.3839155 43.0877052
Daphne.Minor Darwin Eden Enderby Espanola Fernandina
33.9196678 -9.0189919 28.3142017 30.7859425 47.6564865 96.9895982
Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
-4.0332759 64.6337956 -0.4971756 386.4035578 88.6945404 4.0372328
Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
215.6794862 150.4753750 35.0758066 75.5531221 206.9518779 277.6763183
SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
261.4164131 85.3764857 195.6166286 49.8050946 52.9357316 26.7005735
Q: what appears to be at issue with our model?
A: in this example, some of the predicted values are unphysical, including negative numbers of species.
In practice, a great deal of the work lies in identifying when the assumptions of the Gauss-Markov theorem fail to hold.
Even in the case of well-fitted models, it is not an easy task to select the best model from a set of alternatives.
Often, a first consideration is the coefficient of determination \( R^2 \) or adjusted \( R^2 \).
These values measure the ‘goodness of fit’ of the regression equation.
They represent the proportion of the total variation in the data explained by the fitted linear model.
Consequently, higher values closer to 1 indicate a ‘better fit’ of the model, while low values closer to 0 may indicate a poor model.
It is important to know that \( R^2 \) always increases as explanatory variables are added to the model, even if they are irrelevant.
The adjusted \( R^2 \) is a modification using a penalty term on the number of explanatory variables used, and is given by
\[ \begin{align} R^2_a = 1 - \left(1 - R^2\right)\frac{n-1}{n-p} \end{align} \]
The penalty in the above helps guard against overfitting, because new parameters that do not contribute substantially to the fit will reduce the adjusted \( R^2 \).
summary(lmod)
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.679 -34.898 -7.862 33.460 182.584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.369 0.715351
Area -0.023938 0.022422 -1.068 0.296318
Elevation 0.319465 0.053663 5.953 3.82e-06 ***
Nearest 0.009144 1.054136 0.009 0.993151
Scruz -0.240524 0.215402 -1.117 0.275208
Adjacent -0.074805 0.017700 -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
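As a final sketch, both \( R^2 \) and the adjusted \( R^2 \) reported in the summary above can be recovered by hand from the residual and total sums of squares:
y   <- gala$Species
rss <- sum(resid(lmod)^2)                  # residual sum of squares
tss <- sum((y - mean(y))^2)                # total sum of squares
n <- nrow(gala); p <- length(coef(lmod))   # p = 6 parameters

r2  <- 1 - rss / tss                       # ~ 0.7658, multiple R-squared
r2a <- 1 - (1 - r2) * (n - 1) / (n - p)    # ~ 0.7171, adjusted R-squared
c(r2, r2a)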