# Activity 10/19/2020 ## STAT 757 -- Section 1001
Instructor: Colin Grudzien
## Instructions: We will work through the following series of activities as a group and hold small group work and discussions in Zoom Breakout Rooms. Follow the instructions in each sub-section when the instructor assigns you to a breakout room. ## Activities: ```{r} require("faraway") require("EnvStats") require("stats") ``` ### Activity 1: Testing the Gaussian error assumption Note, that when performing exploratory analysis, histograms should not be used to check the Gaussianity of various distributions. One of the central weaknesses about histograms is that they do not have a unique choice of bin-sizes, and generally the convergence of a histogram to the true density function $f$ of some distribution is very slow in the sample size $n$. Q-Q plots are more robust than most other visualization for testing Gaussianity and these are preferred when testing versus this null hypothesis. However, for exploratory purposes, we can improve the convergence of histograms by using a kernel density estimator. This type of plot has the syntax ```{r eval=FALSE} plot(density(some_data, kernel= "some_kernel")) ``` where different kernels give different kinds of weights to the nearby points to observations, unlike histograms which give even weight within specified bins. #### Question 1: Using the `gala` data set, perform exploratory analysis on each of the explanatory variables and the response variable in the model: use the histogram, density plot and Q-Q plot to observe the empirical distribtuion of the sample. Comment on the differences you notice between the histogram of the various variables and the density plots when using the "gaussian" kernel. Do the variable appear to Gaussian distributed? Verify your intuition numerically with the Shapiro-Wilk test. ##### Solution: In the below, we simply loop over the variables in the `gala` data and plot the Gaussian kernel density estimator, the histogram and print the results from the Shapiro-Wilk test. We notice that all variables in this data set evidence non-normal distributions, most being strongly right skewed and all non-negative. We should note that the smoothing of the Gaussian kernel approximates non-zero probability for negative values that should not be counted, but this can be adjusted with the `from=0` keyword argument below. Otherwise, the smoothed version of the density curve appears to give much better detail than the histogram in a small sample size $n=30$. ```{r} summary(gala) total_numb_variables <- length(gala[1,]) name_list <- names(gala) for (i in 1:total_numb_variables) { dens <- density(gala[,i], kernel = "gaussian", from=0) plot(dens, main=paste("Gaussian kernel density of", name_list[i])) hist(gala[,i], freq=FALSE, main=paste("Relative frequency of", name_list[i]), xlab=name_list[i]) qqnorm(scale(gala[,i]),ylab=name_list[i], main="Normal Q-Q plot") qqline(scale(gala[,i])) print(shapiro.test(gala[,i])) } ``` #### Question 2: Now, perform the same exploratory analysis on the residuals of the model ```{r} lmod <- lm(Species ~ Area + Nearest + Scruz + Adjacent, gala) ``` Does it appear that the Gaussian error distribution is a viable assumption? Try systematically reducing the model above into one for which all of the predictors are significant. Does this model appear to satisfy the Gaussian error assumption? Can we get by with a Gaussian approximation by the central limit theorem? ##### Solution: We now perform the analysis of the residuals ```{r} dens <- density(lmod$residuals, kernel = "gaussian", from=0) plot(dens, main=paste("Gaussian kernel density of staturated model residuals")) hist(lmod$residuals, freq=FALSE, main="Relative frequency of saturated model residuals", xlab="Residual value") qqnorm(scale(lmod$residuals),ylab="Saturated model residuals", main="Normal Q-Q plot") qqline(scale(lmod$residuals)) print(shapiro.test(lmod$residuals)) ``` As evidenced by each of the measures, the residuals have significant departures from the Gaussian error assumption. We will attempt to re-fit the model as below: ```{r} summary(lmod) lmod_reduced <- update(lmod, . ~ . - Adjacent) summary(lmod_reduced) lmod_reduced <- update(lmod_reduced, . ~ . - Nearest) summary(lmod) lmod_reduced <- update(lmod_reduced, . ~ . - Scruz) summary(lmod_reduced) anova(lmod_reduced,lmod) ``` We fail to reject the null hypothesis of the simpler model, and we now perform the analysis of the residuals ```{r} dens <- density(lmod$residuals, kernel = "gaussian", from=0) plot(dens, main=paste("Gaussian kernel density of reduced model residuals")) hist(lmod$residuals, freq=FALSE, main="Relative frequency of reduced model residuals", xlab="Residual value") qqnorm(scale(lmod$residuals),ylab="Reduced model residuals", main="Normal Q-Q plot") qqline(scale(lmod$residuals)) print(shapiro.test(lmod$residuals)) ``` Neither the saturated nor the reduced model appear to satisfy the Gaussian error assumption, and for both models the residuals have strong departures from normality. Assuming Gaussian error distributions for the estimate of the mean value with a sample size of $n=30$, though a common rule of thumb for applying the central limit theorem, is blatantly incorrect in this case. This does not affect the fact that the least squares solution above is still the BLUE. However, our current hypothesis testing methods are not appropriate in this case and therefore the analysis of finding a reduced model above is inherently flawed, as well as any inferences that we might want to make with these models. Different techniques (including transformations of variables) should be tried instead. ### Activity 2: Testing the distance between two distributions #### Question 1: In the following we will use the data `globwarm` containing the average Northern Hemisphere temperature from 1856-2000 and eight climate proxies from 1000-2000AD. First we must remove the missing values before there are reliable temperature records. Then let's subset the data according to the following: 1. one subset contains the data from 1856 - 1940 2. one subset contains the data from 1941 - 2000 The $\chi^2$ test for variance can be applied to data assuming that it is sufficiently normal to find confidence intervals for the variance. Similarly, the F-test for variance can be used to compare the variances of two data subsets for equality, provided that the underlying original data is sufficiently normal. These two test in R are as, `varTest`for a single sample, from the `EnvStats` library and `var.test` for two samples in `stats` library. We would like to examine if we can study the variances individually tested, or using a two-sample test, seem to agree between the two data subsets. Use a Q-Q plot and the Shapiro-Wilk test and visual inspection to determine if it is appropriate to test the variance of each of these data subsets. ##### Solution: ```{r} temps_1856_1940 <- na.exclude(globwarm[globwarm$year <=1940,]) temps_1941_2000 <- na.exclude(globwarm[globwarm$year >1940,]) qqnorm(scale(temps_1856_1940),ylab="Nothern Hemisphere mean temp 1856-1940", main="Normal Q-Q plot") qqline(scale(temps_1856_1940)) shapiro.test(temps_1856_1940$nhtemp) plot(density(temps_1856_1940$nhtemp, kernel= "gaussian"), main=paste("Nothern Hemisphere mean temp 1856-1940")) qqnorm(scale(temps_1941_2000),ylab="Nothern Hemisphere mean temp 1941-2000", main="Normal Q-Q plot") qqline(scale(temps_1941_2000)) shapiro.test(temps_1941_2000$nhtemp) plot(density(temps_1941_2000$nhtemp, kernel= "gaussian"), main=paste("Nothern Hemisphere mean temp 1941-2000")) ``` In this case, we cannot test for the variance on the temperatures from 1941-2000 as there is a significant presence of non-normality in the data. Visual inspection and these tests suggest that the distributions may be different over these different time-periods. ##### Question 2: Visual inspection suggest that the mean temperature in the second half of the data is higher than in the first half, and that the distribution tends to be non-normally right skewed. To produce tests of the variances, we would like to have the two samples drawn from normal distributions. This is not a good test in this case. The Kolmogorov–Smirnov test measures the distance between two possible, arbitrary distributions. If the distance is large enough between the empirical CDFs of the two samples, the distributions are regarded as different. In R, the package stats implements the Kolmogorov–Smirnov test. The Kolmogorov–Smirnov test determines whether two distributions $F$ and $G$ are significantly different: $$\begin{align} H_0: F = G & & H1 : F\neq G. \end{align}$$ The underlying idea of this test is to measure the so-called Kolmogorov–Smirnov distance between two distributions, under the restriction that both functions are continuous. The measure for the distance is the supremum of the absolute difference between the two CDFs. Use the Kolmogorov–Smirnov test to determine if the two distributions of mean annual temperature appear to be significantly different at the $5\%$ significance level. ##### Solution: ```{r} ks.test(temps_1856_1940$nhtemp, temps_1941_2000$nhtem) ``` Here the mean annual temperatures are verified to be significantly different, as was already evidenced by the skewness and the location differences in the densities.