# Activity 12/01/2020 ## STAT 645 -- Section 1001
Instructor: Colin Grudzien
## Instructions: We will work through the following series of activities as a group and hold small group work and discussions in Zoom Breakout Rooms. Follow the instructions in each sub-section when the instructor assigns you to a breakout room. {r} require("numDeriv")  ## Activity 1: Gaussian maximum likelihood estimation We will use the BFGS approximation to Newton's descent algorithm to estimate the parameters of the Gaussian distribution from known data. A likelihood function is described as follows. Suppose we are given a generic probability density function $f_\boldsymbol{\xi}(x)$ that depends on some parameter values. For the one dimensional Gaussian / normal distribution, this density is given as, \begin{align} f_\xi(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma} \right)^2} \end{align} where 1. $x$ is some possible value that the normally distributed random variable $\xi$ might attain; 2. $\mu$ is the mean for the distribution $N(\mu, \sigma^2)$ that $\xi$ is drawn from; 3. $\sigma$ is the standard deviation for the distribution $N(\mu, \sigma^2)$ that $\xi$ is drawn from. For a collection of observations $\{x_i\}_{i=1}^n$ that are independent and identically distributed according to the parent distribution $N(\mu, \sigma^2)$, we can formulate the *likelihood* of observing these values for *unknown* $\mu$ and $\sigma$ as follows. For a single observation $x_i$, the likelihood of observing this value is a function of unknown $\mu$ and $\sigma$ values as free parameters, \begin{align} \mathcal{L}_{x_i}(\mu, \sigma) = f_\xi(x_i) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma} \right)^2} \end{align} This says that in the likelihood function $\mathcal{L}_{x_i}(\mu,\sigma)$ the value of $x_i$, the known observation, is held fixed while we allow $\mu$ and $\sigma$ to vary. We then obtain a conditional density for the unknown parameter values $\mu$ and $\sigma$ based on the density value for $\xi$ evaluated at $x_i$, assuming that $\xi \sim N(\mu, \sigma^2)$. The goal of maximum likelihood estimation given the above formulation is maximize the density for the known $x_i$ for some *choice* of $\mu$ and $\sigma$. This tells us which unknown distribution parameters $\mu$ and $\sigma$ are most likely to describe the unknown parent distribution for $\xi$, from which the observation $x_i$ is taken. When we have multiple observations, and they are independent and identically distributed, the joint likelihood function is defined by the product of the individual likelihoods, i.e., if $$\mathbf{x} =\begin{pmatrix} x_1 \\ \vdots \\ x_n\end{pmatrix}$$ \begin{align} \mathcal{L}_{\mathbf{x}}(\mu, \sigma) &= \prod_{i=1}^n f_\xi(x_i) \\ & = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma} \right)^2} \\ &=\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-\frac{1}{2}\sum_{i=1}^n\left(\frac{x_i - \mu}{\sigma} \right)^2} \end{align} We will perform a Gaussian maximum likelihood estimation in the following given a known sample, but assuming that we do not know from what distribution these are drawn, other than they are normally distributed for unknown $\mu$ and $\sigma$. ### Question 1: We will construct a likelihood function as our first step. We define the following independent, identically distributed sample of observations $\{x_i\}_{i=1}^{20}$ as follows: {r} set.seed(0) x <- rnorm(20) x  Each of the above entries $i=1,\cdots,20$ is an observation of the normally distributed random variable with $\mu=0$ and $\sigma=1$. We want to define the likelihood function gauss_like so that it takes a vector containing $$\mathrm{params} = \begin{pmatrix} \mu \\ \sigma \end{pmatrix}$$ and a vector of the observed values $$\mathrm{obs}= \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$ And so that it returns the likelihood of params given the vector of observations obs. Use this to compute the likelihood of the above sample for each of the following parameter combinations, in the columns of the following matrix {r} param_vals <- matrix(c(0, 1, 2, 2, -10, 10, -10, 0.5), nrow=2, ncol=4, byrow=FALSE) param_vals  What do you notice about the likelihood values? Where is the likelihood value the largest? Where is the likelihood value the smallest? How does this relate to the location and the spread of the Gaussian distribution? Repeat the above, but for a new sample of observations defined as, {r} set.seed(0) x <- rnorm(1000)  What do you notice about the values? What do you think the issue is for large sample sizes, $n \gg 1$? ### Question 2: For computational reasons, we often transform a likelihood function into a log-likelihood. This is defined as, \begin{align} L_{\mathbf{x}}(\mu,\sigma) = \log(\mathcal{L}_{\mathbf{x}}(\mu, \sigma) ) \end{align} This has the effect of turning a problem that is multiplicative in nature into one that is additive, providing important stability properties for us. Using the Gaussian likelihood function described above, write out mathematically what the log-likelihood function $L_\mathbf{x}(\mu,\sigma)$ is equal to, with a complete derivation of the steps. ### Question 3: Using the log-likelihood function $L_{\mathbf{x}}(\mu,\sigma)$ derived in question 2, let's use Newton's descent in the form of BFGS to find the maximum likelihood over a search region of possible values for $\mu$ and $\sigma$ simultaneously. Note, $\sigma >0$ by definition, so we need to specify bounds making this a *constrained optimization* problem. The package optimx includes both BFGS and constrained BFGS for this purpose. Recall, a *maximization* problem can be phrased as \begin{align} \text{given }f:&\mathbb{R}^n \rightarrow \mathbb{R} \\ \text{find }\mathbf{x}^\ast &\text{ such that:}\\ &f(\mathbf{x}) \leq f\left(\mathbf{x}^\ast\right) \end{align} for all $\mathbf{x}$ in a neighborhood of $\mathbf{x}^\ast$. Using this fact, a maximization problem can be turned into a minimization problem as follows. If we redefine $\tilde{f}(x) = - f(x)$, then note the minimization problem: \begin{align} \text{given }\tilde{f}:&\mathbb{R}^n \rightarrow \mathbb{R} \\ \text{find }\mathbf{x}^\ast &\text{ such that:}\\ &\tilde{f}(\mathbf{x}) \geq \tilde{f}\left(\mathbf{x}^\ast\right) \end{align} for all $\mathbf{x}$ in a neighborhood of $\mathbf{x}^\ast$, gives the solution to the maximization problem above. Specifically, if \begin{align} &\tilde{f}(\mathbf{x}) \geq \tilde{f}\left(\mathbf{x}^\ast\right) \\ \Leftrightarrow &-f(\mathbf{x}) \geq -f\left(\mathbf{x}^\ast\right) \\ \Leftrightarrow &f(\mathbf{x}) \leq f\left(\mathbf{x}^\ast\right) \\ \end{align} Use this fact above to find the maximum likelihood estimate for the following sample {r} set.seed(0) n <- 20 x <- rnorm(n)  Note, you will need to define your negative Gaussian log-likelihood function with the above code included directly in the function body as the first step in order to supply the sample to the method. Specifically this should be a function like: {r eval=FALSE} neg_gauss_log_like <- function(mu){ # for convenience, we will simply load the sample in the function itself set.seed(0) n <- 20 x <- rnorm(n) mu <- mu # define the negative log-likelihood of this sample in here and return the output below }  You will want to load the optimx operation as follows {r eval=FALSE} require("optimx") fBFGS = optimx(fn = neg_gauss_log_like, # define the objective function par = c(0,1), # define the initial first guess at the minimum method ="L-BFGS-B", # define the method of solution lower = c(-10, 0), # specify a lower bound upper = c(10, 10) # specify an upper bound ) print(fBFGS)