Instructor: Colin Grudzien

## Instructions:

We will work through the following series of activities as a group and hold small group work and discussions in Zoom Breakout Rooms. Follow the instructions in each sub-section when the instructor assigns you to a breakout room.

## Activities:

### Activity 1: refreshing statistical concepts

If we have two random variables $X_1,X_2$ with means $\mu_{X_1}, \mu_{X_2}$ respectively, we denote their covariance as,
$$\sigma_{12}^2 = \mathbb{E}\left[\left(X_1 - \mu_{X_1}\right)\left(X_2 - \mu_{X_2}\right)\right].$$
The correlation of the two variables $X_1,X_2$ is thus defined as,
$$\mathrm{cor}\left(X_1,X_2\right)\triangleq \frac{\sigma_{12}^2}{\sigma_1 \sigma_2},$$
where $\sigma_1$ and $\sigma_2$ are the standard deviations of $X_1$ and $X_2$ respectively. The covariance thus measures how much the two variables $X_1$ and $X_2$ co-vary, either together or oppositely. Correlation is a measure of how much the variables co-vary, where the range is "normalized" to $[-1,1]$.

#### Question 1:

How can we use the definition of variance,
$$\sigma^2_X = \mathbb{E}\left[ \left( X - \mu_X\right)^2 \right],$$
the definition of covariance,
$$\sigma_{12}^2 = \mathbb{E}\left[\left(X_1 - \mu_{X_1}\right)\left(X_2 - \mu_{X_2}\right)\right],$$
and the definition of correlation,
$$\mathrm{cor}\left(X_1,X_2\right)\triangleq \frac{\sigma_{12}^2}{\sigma_1 \sigma_2},$$
to show that $X_1$ always has correlation 1 with itself?

##### Answer to question 1:

Setting $X_2 = X_1$, the numerator $\mathbb{E}\left[\left(X_1 - \mu_{X_1}\right)^2\right]$ is equal to the variance $\sigma_1^2$ by definition, while the standard deviation appears as a squared term $\sigma_1 \sigma_1 = \sigma_1^2$ in the denominator, so the ratio reduces to one.

#### Question 2:

Which of the following models are linear in the parameters?

- $Y_i = \beta_0 + \beta_1 X_i^2 + \epsilon_i$;
- $Y_i = \beta_0 + \beta_1 \sqrt{X_i} + \epsilon_i$;
- $Y_i = e^{\beta_0} X_i^{\beta_1} \times \epsilon_i$;
- $Y_i = \beta_0 + \beta_1 X_i^{\beta_2} + \epsilon_i$.
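
Before turning to the answers, the covariance and correlation definitions from Activity 1, and the Question 1 identity, can be checked numerically. A minimal sketch using NumPy (the synthetic data and the sample-moment estimators are illustrative choices, not part of the activity):

```python
import numpy as np

# Synthetic draws for two correlated variables (illustrative values).
rng = np.random.default_rng(0)
x1 = rng.normal(loc=2.0, scale=1.5, size=100_000)
x2 = 0.5 * x1 + rng.normal(scale=1.0, size=100_000)

def cor(a, b):
    """Sample correlation: sample covariance divided by the standard deviations."""
    cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
    return cov_ab / (a.std() * b.std())

print(cor(x1, x2))  # a sample estimate, strictly inside [-1, 1]
print(cor(x1, x1))  # equals 1 up to floating point, as argued in Question 1
```

Note that `np.std` defaults to the population convention (dividing by $n$), matching the mean-based covariance above, which is why the self-correlation is exactly one up to floating-point error.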

##### Answer to question 2:

- Notice that if we define $Z_i = X^2_i$, the change of variables renders the first equation equivalent to the usual linear form, $$\begin{align} Y_i = \beta_0 + \beta_1 Z_i + \epsilon_i; \end{align}$$
- the same kind of trick is applicable if we define $Z_i = \sqrt{X_i}$.
- For the third model, suppose we take a log transform of the entire equation, i.e., $$\begin{align} &\log\left(Y_i\right) = \log\left(e^{ \beta_0} X_i^{\beta_1} \times \epsilon_i \right)\\ \Leftrightarrow &\log(Y_i) = \beta_0 + \beta_1 \log(X_i) + \log(\epsilon_i). \end{align}$$ Provided that the log transforms of the response, the predictor, and the variation $\epsilon_i$ make sense (they all have positive ranges), we can thus transform the variables to find a model linear in the parameters.
- For the fourth model, the equation $$\begin{align} Y_i = \beta_0 + \beta_1 X_i^{\beta_2} + \epsilon_i \end{align}$$ has no obvious transform of the variables that will make it linear in the parameters. Log transforms are useful for turning a relationship that exists at a multiplicative scale into an additive one. In the third model we could make the relationship entirely additive by the transformation because it was entirely multiplicative before; the fourth model mixes additive and multiplicative terms, which makes the relationship fundamentally nonlinear in the parameters.
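
The log-transform trick for the third model can be illustrated numerically: generate data from the multiplicative model and recover the parameters by a straight-line fit on the log-transformed variables. A minimal sketch (the parameter values, noise scale, and sample size are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 0.7, 2.0                            # illustrative "true" parameters
n = 50_000
x = rng.uniform(0.5, 3.0, size=n)                  # positive predictor
eps = rng.lognormal(mean=0.0, sigma=0.1, size=n)   # positive multiplicative noise
y = np.exp(beta0) * x**beta1 * eps                 # Y_i = e^{beta0} X_i^{beta1} * eps_i

# After the log transform the model is linear in the parameters:
#   log(Y_i) = beta0 + beta1 * log(X_i) + log(eps_i)
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
print(intercept, slope)  # close to beta0 = 0.7 and beta1 = 2.0
```

Here `lognormal` noise is used so that $\log(\epsilon_i)$ is mean-zero Gaussian, keeping the transformed model consistent with the usual error assumptions.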

Recall the assumptions we make on the variation, i.e., the error terms $\epsilon_i$, in the linear regression model:

- we suppose that the variation is of mean zero, $\mathbb{E}\left[\epsilon_i\right] = 0$;
- we suppose that the variation is constant across every case, i.e., $$ \mathbb{E}\left[\epsilon_i^2\right] = \mathbb{E}\left[\epsilon_j^2\right] = \sigma^2 $$ for every $i,j$;
- we suppose that the errors for every pair of distinct cases $i\neq j$ are uncorrelated, $$ \mathrm{cov}\left(\epsilon_i, \epsilon_j\right) = 0. $$
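
These three assumptions can be checked empirically for simulated errors. A minimal sketch, assuming i.i.d. Gaussian errors with an illustrative noise scale (many independent replications of two cases $i \neq j$, with sample moments standing in for the expectations):

```python
import numpy as np

# Simulate i.i.d. errors satisfying the three assumptions (sigma is illustrative).
rng = np.random.default_rng(2)
sigma = 1.3
n_cases, reps = 2, 200_000
# Each column is one case i; each row is an independent replication.
eps = rng.normal(loc=0.0, scale=sigma, size=(reps, n_cases))

print(eps.mean(axis=0))                # both near 0: mean-zero assumption
print((eps**2).mean(axis=0))           # both near sigma^2 = 1.69: constant variance
print(np.mean(eps[:, 0] * eps[:, 1]))  # near 0: distinct cases are uncorrelated
```

Since the errors are mean zero, the sample average of $\epsilon_i \epsilon_j$ estimates $\mathrm{cov}\left(\epsilon_i, \epsilon_j\right)$ directly, matching the third bullet above.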