08/31/2020
Use the left and right arrow keys to navigate the presentation forward and backward, respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
The following topics will be covered in this lecture:
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
We have already seen how to define a simple regression in terms of a line describing a tendency.
Using the equation for a line in the response \( y \) in terms of the explanatory variable \( x \), we arrive at the general form for simple linear regression:
\[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{align} \]
Consider the example of the year-end performance \( y \) regressed on the midyear performance \( x \), in terms of the regression equation:
\[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{align} \]
Q: what values are known in the above equation? What values are unknown?
A: \( Y_i \) and \( X_i \) are observed values in our data set and are thus known. The parameters \( \beta_0 \) and \( \beta_1 \), as well as the error terms \( \epsilon_i \), are unknown.
In this regard, the first goal of our regression analysis will be to determine \( \beta_0 \) and \( \beta_1 \), assuming that the above relationship is valid.
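As a preview of how this works in practice, the following is a minimal sketch of estimating \( \beta_0 \) and \( \beta_1 \) by ordinary least squares with NumPy. The midyear and year-end performance scores below are invented purely for illustration:

```python
import numpy as np

# Hypothetical midyear (x) and year-end (y) performance scores,
# made up purely for illustration.
x = np.array([70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
y = np.array([72.0, 74.0, 81.0, 83.0, 91.0, 94.0])

# Ordinary least squares estimates:
# beta_1 (slope) is the ratio of the sample covariance terms to the
# sum of squared deviations of x; beta_0 (intercept) follows from the means.
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_0, beta_1)
```

We will derive these least-squares formulas later; here they are used only to show that, given observed \( (X_i, Y_i) \) pairs, the unknown parameters can be estimated from the data.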
Let \( X \) be a random variable with a cumulative distribution function,
\[ F_X(x) \triangleq P(X \leq x) \]
We suppose that \( F_X \) defines a probability distribution \( p(x) \). We denote the expectation of \( X \)
\[ \begin{align} \mathbb{E}\left[ X\right] = \int_\mathbb{R} xp(x) \mathrm{d}x \end{align} \]
\[ \mathbb{E}\left[X\right] = \mu_X \]
Recall, the variance of a random variable \( \sigma^2 \) is defined by the expected value of the squared-deviation from the mean, i.e.,
\[ \sigma^2_X \triangleq \mathbb{E}\left[ \left( X - \mu_X\right)^2 \right]. \]
However, the cost of using the square is that the variance has units given by the units of \( X \) squared; for this reason, we often work with the standard deviation \( \sigma_X \triangleq \sqrt{\sigma^2_X} \), which has the same units as \( X \).
Notice, in the above we are describing the definition of the variance parameter for the random variable.
This differs of course from the definition of the sample variance statistic, which arises from a set of data points.
Let's suppose that \( x_1, x_2, \cdots, x_n \) are actual measurements, i.e., realizations of the random variable \( X \). We define the sample mean as,
\[ \begin{align} \overline{x} \triangleq \frac{1}{n}\sum_{i=1}^n x_i. \end{align} \]
Likewise, we can define the sample-based variance as the mean-square-deviation of each measurement from the sample mean,
\[ \begin{align} s^2_X \triangleq \frac{\sum_{i=1}^n \left( x_i - \overline{x}\right)^2}{n-1}; \end{align} \]
Notice that the denominator here is \( n - 1 \), rather than the \( n \) used for the sample-based mean. For now, we will summarize by saying that estimating the mean from the same data introduces a dependency among the deviations, leaving only \( n - 1 \) of them free to vary.
We will also typically define the sample-based standard deviation as,
\[ \begin{align} s_X \triangleq \sqrt{ \frac{ \sum_{i=1}^n \left( x_i - \overline{x}\right)^2 }{n-1}} \end{align} \]
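These sample statistics can be computed directly; the short sketch below uses NumPy on invented measurements, where the argument `ddof=1` selects the \( n - 1 \) denominator defined above:

```python
import numpy as np

# Invented measurements x_1, ..., x_n, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_bar = x.mean()        # sample mean
s2 = x.var(ddof=1)      # sample variance, denominator n - 1
s = x.std(ddof=1)       # sample standard deviation

print(x_bar, s2, s)
```

Note that NumPy's default is `ddof=0`, which divides by \( n \) instead; forgetting this is a common source of small discrepancies with textbook formulas.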
We will recall what an “estimator” is shortly, but it is important to note that the sample-based standard deviation is not an unbiased estimator of the parameter standard deviation \( \sigma \). This differs from the sample-based mean and the sample-based variance, which are unbiased estimators of the corresponding parameters. However, because the corrections needed to remove this bias are more complex than they are worth, we often just use the biased estimate for convenience.
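This bias can be seen in a small simulation sketch: drawing many small samples from a normal distribution with known \( \sigma \) and averaging the resulting values of \( s \). The sample size and seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0   # true parameter standard deviation
n = 5         # small samples make the bias visible

# Draw 100,000 samples of size n, compute s for each, and average.
samples = rng.normal(loc=0.0, scale=sigma, size=(100_000, n))
s_values = samples.std(ddof=1, axis=1)

print(s_values.mean())  # noticeably below sigma = 1.0
```

Even though \( s^2 \) is unbiased for \( \sigma^2 \), taking the square root pulls the average of \( s \) below \( \sigma \); for normal samples of size 5 the expected value of \( s \) is roughly \( 0.94\,\sigma \).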
If we have two random variables \( X_1,X_2 \) with means \( \mu_{X_1}, \mu_{X_2} \) respectively, we denote their covariance as,
\[ \sigma_{12}^2 = \mathrm{cov}(X_1, X_2) \triangleq \mathbb{E}\left[\left(X_1 - \mu_{X_1}\right)\left(X_2 - \mu_{X_2}\right)\right] \]
The correlation of the two variables \( X_1,X_2 \) is thus defined as,
\[ \rho_{12} = \mathrm{cor}\left(X_1,X_2\right)\triangleq \frac{\sigma_{12}^2}{\sigma_1 \sigma_2} \]
where \( \sigma_1 \) and \( \sigma_2 \) are the standard deviations of \( X_1 \) and \( X_2 \), respectively.
Like the sample-based variance statistic, we can also define the sample-based covariance;
\[ \begin{align} s^2_{XZ} \triangleq \frac{\sum_{i=1}^n \left( x_i - \overline{x}\right) \left( z_i - \overline{z}\right) }{n-1}. \end{align} \]
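The sample covariance and the corresponding sample correlation can be sketched in NumPy as follows, using invented paired measurements; `np.cov` and `np.corrcoef` use the same \( n - 1 \) denominator by default:

```python
import numpy as np

# Invented paired measurements (x_i, z_i), for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Sample covariance with denominator n - 1, matching the formula above:
s_xz = np.sum((x - x.mean()) * (z - z.mean())) / (len(x) - 1)

# np.cov returns the 2x2 sample covariance matrix; its off-diagonal
# entry is the same quantity.
assert abs(s_xz - np.cov(x, z)[0, 1]) < 1e-12

# Sample correlation, the covariance scaled by both standard deviations:
r_xz = np.corrcoef(x, z)[0, 1]
print(s_xz, r_xz)
```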
We note here an important distinction when we refer to the below model as “linear”,
\[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i. \end{align} \]
In describing “linearity,” we refer to the fact that the parameters enter the equation linearly.
There will be many cases, however, in which we will want to change the scale of the predictor variable \( X \).
Q: can the below be interpreted as a linear model?
\[ \begin{align} Y_i = \beta_0 + \beta_1 \log\left( X_i\right) + \epsilon_i. \end{align} \]
A: yes; in this case, the parameters enter linearly. Conceptually, we can always re-write
\[ \begin{align} Z_i &= \log(X_i)\\ Y_i &= \beta_0 + \beta_1 Z_i + \epsilon_i \end{align} \]
to find a model of the form we considered before in terms of the explanatory variable \( Z \).
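The same least-squares computation applies after the change of variable; a short sketch on invented positive predictor values, where we simply transform \( X \) to \( Z = \log(X) \) before fitting:

```python
import numpy as np

# Invented positive predictor values and responses, for illustration only.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([0.5, 1.4, 2.6, 3.5, 4.6])

# Transform the predictor; the model remains linear in beta_0, beta_1.
z = np.log(x)

# Ordinary least squares on (z, y), exactly as in the untransformed case:
beta_1 = np.sum((z - z.mean()) * (y - y.mean())) / np.sum((z - z.mean()) ** 2)
beta_0 = y.mean() - beta_1 * z.mean()

print(beta_0, beta_1)
```

The fitting procedure is unchanged; only the scale on which the predictor enters the model differs.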
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
Selecting any of the previous variables should weigh the extent to which the chosen variable contributes to reducing the remaining variation in the response.
However, this comparison should consider the contributions of other predictor variables, including possible variables we have not yet included in our model.
Other considerations include the importance of the variable as a causal agent in the process under analysis;
We may also consider the degree to which observations on the variable can be obtained more accurately, or quickly, or economically than on competing variables;
Ideally, we would also like variables that can be controlled as a “treatment” but this is not the case for all variables.
Once we have selected the variables, the functional form of the regression is also not automatically clear.
For example, it is usually not known a priori:
Likewise, we need to specify the scope of the model in terms of what value we will regress upon;
In the case of the college graduate incomes, it would be doubtful that we could predict incomes ten years after graduation with our model if our data are limited to five years.
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition