08/31/2020
Use the left and right arrow keys to navigate the presentation forward and backward, respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
The following topics will be covered in this lecture:
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
We have already seen how to define a simple regression in terms of a line describing a tendency.
Using the equation for a line in the response \( y \) in terms of the explanatory variable \( x \), we arrive at the general form for simple linear regression:
\[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{align} \]
Consider the example of the year-end performance \( y \) regressed on the midyear performance \( x \), in terms of the regression equation:
\[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{align} \]
Q: what values are known in the above equation? What values are unknown?
A: \( Y_i \) and \( X_i \) are observed values in our data set and are thus known. The parameters \( \beta_0 \) and \( \beta_1 \), as well as the error terms \( \epsilon_i \), are unknown.
In this regard, the first goal of our regression analysis will be to determine \( \beta_0 \) and \( \beta_1 \), assuming that the above relationship is valid.
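As a preview of how this works in practice, the following is a minimal sketch of estimating \( \beta_0 \) and \( \beta_1 \) by ordinary least squares with NumPy. The midyear and year-end performance scores below are invented purely for illustration:

```python
import numpy as np

# Hypothetical midyear (x) and year-end (y) performance scores,
# made up purely for illustration.
x = np.array([70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
y = np.array([72.0, 74.0, 81.0, 83.0, 91.0, 94.0])

# Ordinary least squares estimates:
# beta_1 (slope) is the ratio of the sample covariance terms to the
# sum of squared deviations of x; beta_0 (intercept) follows from the means.
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_0, beta_1)
```

We will derive these least-squares formulas later; here they are used only to show that, given observed \( (X_i, Y_i) \) pairs, the unknown parameters can be estimated from the data.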
Let \( X \) be a random variable with a cumulative distribution function,
\[ F_X(x) \triangleq P(X \leq x) \]
We suppose that \( F_X \) defines a probability distribution \( p(x) \). We denote the expectation of \( X \)
\[ \begin{align} \mathbb{E}\left[ X\right] = \int_\mathbb{R} xp(x) \mathrm{d}x \end{align} \]
\[ \mathbb{E}\left[X\right] = \mu_X \]
Recall, the variance of a random variable \( \sigma^2 \) is defined by the expected value of the squared-deviation from the mean, i.e.,
\[ \sigma^2_X \triangleq \mathbb{E}\left[ \left( X - \mu_X\right)^2 \right]. \]
However, the cost of using the square is that the variance has units given by the units of \( X \) squared; for this reason, we often work with the standard deviation \( \sigma_X \triangleq \sqrt{\sigma^2_X} \), which has the same units as \( X \).
Notice, in the above we are describing the definition of the variance parameter for the random variable.
This differs of course from the definition of the sample variance statistic, which arises from a set of data points.
Let's suppose that \( x_1, x_2, \cdots, x_n \) are actual measurements, i.e., realizations of the random variable \( X \). We define the sample mean as,
\[ \begin{align} \overline{x} \triangleq \frac{1}{n}\sum_{i=1}^n x_i. \end{align} \]
Likewise, we can define the sample-based variance as the mean-square-deviation of each measurement from the sample mean,
\[ \begin{align} s^2_X \triangleq \frac{\sum_{i=1}^n \left( x_i - \overline{x}\right)^2}{n-1}; \end{align} \]
Notice that the denominator here is \( n - 1 \), rather than the \( n \) used for the sample-based mean. For now, we will summarize by saying that estimating the mean from the same data introduces a dependency among the deviations, leaving only \( n - 1 \) of them free to vary.
We will also typically define the sample-based standard deviation as,
\[ \begin{align} s_X \triangleq \sqrt{ \frac{ \sum_{i=1}^n \left( x_i - \overline{x}\right)^2 }{n-1}} \end{align} \]
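These sample statistics can be computed directly; the short sketch below uses NumPy on invented measurements, where the argument `ddof=1` selects the \( n - 1 \) denominator defined above:

```python
import numpy as np

# Invented measurements x_1, ..., x_n, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_bar = x.mean()        # sample mean
s2 = x.var(ddof=1)      # sample variance, denominator n - 1
s = x.std(ddof=1)       # sample standard deviation

print(x_bar, s2, s)
```

Note that NumPy's default is `ddof=0`, which divides by \( n \) instead; forgetting this is a common source of small discrepancies with textbook formulas.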
We will recall what an “estimator” is shortly, but it is important to note that the sample-based standard deviation is not an unbiased estimator of the parameter standard deviation \( \sigma \). This differs from the sample-based mean and the sample-based variance, which are unbiased estimators of the corresponding parameters. However, because the corrections needed to remove this bias are more complex than they are worth, we often just use the biased estimate for convenience.
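This bias can be seen in a small simulation sketch: drawing many small samples from a normal distribution with known \( \sigma \) and averaging the resulting values of \( s \). The sample size and seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0   # true parameter standard deviation
n = 5         # small samples make the bias visible

# Draw 100,000 samples of size n, compute s for each, and average.
samples = rng.normal(loc=0.0, scale=sigma, size=(100_000, n))
s_values = samples.std(ddof=1, axis=1)

print(s_values.mean())  # noticeably below sigma = 1.0
```

Even though \( s^2 \) is unbiased for \( \sigma^2 \), taking the square root pulls the average of \( s \) below \( \sigma \); for normal samples of size 5 the expected value of \( s \) is roughly \( 0.94\,\sigma \).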
If we have two random variables \( X_1,X_2 \) with means \( \mu_{X_1}, \mu_{X_2} \) respectively, we denote their covariance as,
\[ \sigma_{12}^2 = \mathrm{cov}(X_1, X_2) \triangleq \mathbb{E}\left[\left(X_1 - \mu_{X_1}\right)\left(X_2 - \mu_{X_2}\right)\right] \]
The correlation of the two variables \( X_1,X_2 \) is thus defined as,
\[ \rho_{12} = \mathrm{cor}\left(X_1,X_2\right)\triangleq \frac{\sigma_{12}^2}{\sigma_1 \sigma_2} \]
where \( \sigma_1 \) and \( \sigma_2 \) are the standard deviations of \( X_1 \) and \( X_2 \), respectively.
Like the sample-based variance statistic, we can also define the sample-based covariance;
\[ \begin{align} s^2_{XZ} \triangleq \frac{\sum_{i=1}^n \left( x_i - \overline{x}\right) \left( z_i - \overline{z}\right) }{n-1}. \end{align} \]
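The sample covariance and the corresponding sample correlation can be sketched in NumPy as follows, using invented paired measurements; `np.cov` and `np.corrcoef` use the same \( n - 1 \) denominator by default:

```python
import numpy as np

# Invented paired measurements (x_i, z_i), for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Sample covariance with denominator n - 1, matching the formula above:
s_xz = np.sum((x - x.mean()) * (z - z.mean())) / (len(x) - 1)

# np.cov returns the 2x2 sample covariance matrix; its off-diagonal
# entry is the same quantity.
assert abs(s_xz - np.cov(x, z)[0, 1]) < 1e-12

# Sample correlation, the covariance scaled by both standard deviations:
r_xz = np.corrcoef(x, z)[0, 1]
print(s_xz, r_xz)
```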
We note here an important distinction when we refer to the below model as “linear”,
\[ \begin{align} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i. \end{align} \]
In describing “linearity,” we refer to the fact that the parameters enter the equation linearly.
There will be many cases, however, in which we will want to change the scale of the predictor variable \( X \).
Q: can the below be interpreted as a linear model?
\[ \begin{align} Y_i = \beta_0 + \beta_1 \log\left( X_i\right) + \epsilon_i. \end{align} \]
A: yes; in this case, the parameters enter linearly. Conceptually, we can always re-write
\[ \begin{align} Z_i &= \log(X_i)\\ Y_i &= \beta_0 + \beta_1 Z_i + \epsilon_i \end{align} \]
to find a model of the form we considered before in terms of the explanatory variable \( Z \).
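The same least-squares computation applies after the change of variable; a short sketch on invented positive predictor values, where we simply transform \( X \) to \( Z = \log(X) \) before fitting:

```python
import numpy as np

# Invented positive predictor values and responses, for illustration only.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([0.5, 1.4, 2.6, 3.5, 4.6])

# Transform the predictor; the model remains linear in beta_0, beta_1.
z = np.log(x)

# Ordinary least squares on (z, y), exactly as in the untransformed case:
beta_1 = np.sum((z - z.mean()) * (y - y.mean())) / np.sum((z - z.mean()) ** 2)
beta_0 = y.mean() - beta_1 * z.mean()

print(beta_0, beta_1)
```

The fitting procedure is unchanged; only the scale on which the predictor enters the model differs.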
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition
Selecting any of the previous variables should weigh the extent to which the chosen variable contributes to reducing the remaining variation in the response.
However, this comparison should consider the contributions of other predictor variables, including possible variables we have not yet included in our model.
Other considerations include the importance of the variable as a causal agent in the process under analysis;
We may also consider the degree to which observations on the variable can be obtained more accurately, or quickly, or economically than on competing variables;
Ideally, we would also like variables that can be controlled as a “treatment” but this is not the case for all variables.
Once we have selected the variables, the functional form of the regression is also not automatically clear.
For example, it is usually not known a priori:
Likewise, we need to specify the scope of the model in terms of what value we will regress upon;
In the case of the college graduate incomes, it would be doubtful that we could predict incomes ten years after graduation with our model if our data are limited to five years.
Courtesy of: Kutner, M. et al. Applied Linear Statistical Models 5th Edition