# Activity 09/27/2021

## STAT 445 / 645 -- Section 1001

Instructor: Colin Grudzien

## Instructions:

We will work through the following series of activities as a group, with small group work and discussions in Zoom Breakout Rooms. Follow the instructions in each sub-section when the instructor assigns you to a breakout room.

## Activities:

```{r}
require("RColorBrewer")
```

### Activity 1: The ecdf

Compare the convergence of the ecdf to the true cdf of the standard normal distribution as follows. First, use the `seq()` function to generate a grid of points on the interval $[-10,10]$ with a step size of $0.01$. The cdf of the standard normal can then be plotted by evaluating the cdf function (`pnorm()` in R) on this sequence.

Then, use a `for` loop to produce the following plots. We will set a random seed and then sample the standard normal with the following sample sizes:

```{r}
n = c(10, 20, 50, 100, 1000, 10000)
```

The loop should run over the sample sizes and, for each one, plot the ecdf of the sample generated within the loop.

### Activity 2: Mean and standard deviation from samples

We will use a `for` loop in the following to see how fast the sample mean from a set of

```{r}
n = seq(3, 1000, by=1)
```

observations converges to the true mean of the standard normal distribution, which is zero. Likewise, we will see how well the sample standard deviation approaches the true value of one.

To get started, we define the following helper variables to pre-allocate storage for the means and the standard deviations. We should fill these vectors by the loop index `i`, drawing a sample of size `n[i]` on each iteration.

```{r}
set.seed(1)
means <- matrix(ncol=length(n))
standard_deviations <- matrix(ncol=length(n))
```

### Activity 3: Kernel densities versus histograms

We will use the `gala` data from the `faraway` package in the following activity.

```{r}
require("faraway")
```

We want to examine the differences between the empirical density as estimated by a histogram and as estimated by a kernel density for low sample sizes. The `gala` data only has 30 observations, and we know that Sturges' formula for generating the bins of the histogram can perform poorly if there are fewer observations than this. Let's compare how the kernel densities of the variables in the `gala` data set differ from the histograms.

To get started, let's define some useful variables for the `for` loop later.

```{r}
total_numb_variables <- length(gala[1,])
cols <- brewer.pal(n=total_numb_variables, name="Dark2")
name_list <- names(gala)
```

We should loop from 1 to `total_numb_variables` and, for each variable, create a density plot and a histogram that share the same color. We should also label the plots with the corresponding values in `name_list`. The density should be computed with `kernel = "epanechnikov"`.

Then discuss: what are the differences between these two plots? Does the interpretation change when we use the `kernel = "gaussian"` option?
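
As a starting point for the discussion, the loop described above might be sketched as follows. This is only one possible layout, not a reference solution: the side-by-side panels from `par(mfrow = c(1, 2))`, the use of `freq = FALSE` to put the histogram on the density scale, and the helper names `x` and `d` are assumptions made for illustration.

```{r}
# Sketch only: assumes the faraway and RColorBrewer packages are installed.
library(faraway)       # provides the gala data frame
library(RColorBrewer)  # provides brewer.pal()

total_numb_variables <- length(gala[1,])
cols <- brewer.pal(n=total_numb_variables, name="Dark2")
name_list <- names(gala)

for (i in 1:total_numb_variables) {
  x <- gala[[i]]          # i-th variable as a numeric vector
  par(mfrow = c(1, 2))    # histogram on the left, density on the right

  # Histogram with the default (Sturges) bins, drawn on the density scale
  hist(x, freq = FALSE, col = cols[i],
       main = paste("Histogram:", name_list[i]), xlab = name_list[i])

  # Kernel density estimate with the Epanechnikov kernel
  d <- density(x, kernel = "epanechnikov")
  plot(d, col = cols[i], lwd = 2,
       main = paste("Density:", name_list[i]), xlab = name_list[i])
}
```

Replacing `"epanechnikov"` with `"gaussian"` in the call to `density()` reproduces the second comparison asked for in the discussion question.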