Descriptive statistics in R


FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:
    • An introduction to descriptive statistics in R
    • Graphical representations
    • Statistics of location and dispersion

An introduction to descriptive statistics in R

  • Let us consider an rv \( X \) following a distribution \( F_\theta \), i.e., \( X \sim F_\theta \).

  • To obtain a sample of \( n \) observations one first constructs \( n \) copies of the rv \( X \), i.e.

    • sample rvs \( X_1 , \cdots , X_n \sim F_\theta \) which follow the same distribution \( F_\theta \) as the original variable \( X \).
    • By way of analogy, this is how a binomial trial is constructed by replicating individual Bernoulli trials, where each random variable represents the outcome of an individual Bernoulli trial.
  • All \( X_1 , \cdots , X_n \) are often assumed to be independent, where “iid” refers to independent and identically distributed as above.

  • When the real-world outcome transpires, and each rv assigns a number to that outcome, we draw an observation \( x_i \) of the rv \( X_i \);

    • this results in a random sample \( x_1 , \cdots , x_n \).
  • Note that these lower-case values \( x_1 , \cdots , x_n \) are not rvs, as the outcome has already transpired, but instead just observations of the outcome.

    • in particular, these can be used to estimate the unknown parameters \( \theta \) (for the normal distribution these are \( \mu \) and \( \sigma \)).

An introduction to descriptive statistics in R

  • Note, because the observations are of a random process, the outcomes and thus the observations will change if we replicate the trial.

  • If we compute the sample mean of the random variables, i.e., \[ \begin{align} \overline{X} = \frac{1}{n}\sum_{i=1}^n X_i \end{align} \]

    this is also a random variable, as it is simply a combination of random variables.

  • However, computing this on the observations, after the outcome has transpired

    \[ \begin{align} \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i \end{align} \]

    is just a measure of the center of the cloud of the observations.
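  • As a minimal illustrative sketch (simulated data, not part of the lecture), each call to rnorm() below replicates the trial, so the observed sample mean changes from replication to replication:

set.seed(1)
x = rnorm(n = 10, mean = 5, sd = 2)  # one realization of X_1, ..., X_n
mean(x)                              # x-bar for this particular sample
y = rnorm(n = 10, mean = 5, sd = 2)  # replicating the trial gives new outcomes
mean(y)                              # generally a different observed sample mean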

Frequency tables

  • Suppose we have observations \( \{x_i \}_{i=1}^n \) of a rv \( X \).

    • Let \( \{a_j\}_{j=1}^k \) denote all different realizations of \( X \) in the sample.
  • We will define two types of frequencies as follows:

  • The absolute frequency of \( a_j \), denoted by \( n(a_j) \), is the number of occurrences of \( a_j \) in the sample \( \{x_i\}_{i=1}^n \).

  • The relative frequency of \( a_j \), denoted by \( h(a_j) \), is the ratio of the absolute frequencies of \( a_j \) and the sample size \( n \), \[ h(a_j) = n(a_j)/n. \]

    • Clearly

    \[ \sum_{j=1}^k h(a_j )= \sum_{j=1}^k \frac{n(a_j)}{n} = \frac{n}{n} = 1. \]

Frequency tables

  • The function table() returns all observed values of the data along with their absolute frequencies.

  • These can be used further to compute the relative frequencies by dividing by n.

  • Let us consider the dataset chickwts, a data frame with 71 observations of 2 variables:

    • weight, a numeric variable for the weight of the chicken, and
    • feed, a factor for the type of feed.
  • By using

table(chickwts$feed) 

   casein horsebean   linseed  meatmeal   soybean sunflower 
       12        10        12        11        14        12 

we get one line stating the possible chicken feeds, i.e. each possible observed value, with the absolute frequency of each type in the line below.

Frequency tables

  • To put this into relative frequency, we can write
table(chickwts$feed) / length(chickwts$feed)

   casein horsebean   linseed  meatmeal   soybean sunflower 
0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141 
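  • Equivalently, the base-R helper prop.table() converts a table of absolute frequencies into relative frequencies:
prop.table(table(chickwts$feed))  # same values as table(chickwts$feed) / length(chickwts$feed)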
  • There are several methods of visualising the frequencies graphically.

  • Depending on which type of data is at our disposal, the adequate approach needs to be chosen carefully:

    • Bar Plot, Bar Diagram or Pie Chart for qualitative or discrete variables, and
    • the histogram for continuous (or quasi-continuous) variables.
  • Since the variable feed in dataset chickwts has only 6 distinct categories of feed, the frequencies can be conveniently shown using a Bar Plot, a Bar Diagram or a Pie Chart.

Graphical frequency

  • Here we make a bar plot for the absolute frequency table.
par(cex.axis = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(table(chickwts$feed), ylab = expression(n(a[j])), xlab = "", cex.lab = 2)

[Figure: bar plot of the absolute feed frequencies]

Graphical frequency

  • Here we make a bar plot for the relative frequency table.
par(cex.axis = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
barplot(table(chickwts$feed)/length(chickwts$feed), space = 2, ylab = expression(h(a[j])), xlab = "", cex.lab = 2)
abline(h = 0)

[Figure: bar plot of the relative feed frequencies]

Graphical frequency

  • Here we make a pie chart for the relative frequency table.
par(cex = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
pie(table(chickwts$feed)/length(chickwts$feed))

[Figure: pie chart of the relative feed frequencies]

Empirical (Cumulative) Distribution Function

  • The empirical cumulative distribution function (ecdf) is denoted by \( \hat{F}(x) \) and describes the relative number of observations which are less than or equal to \( x \) in the sample.

  • We write \( \hat{F}(x) \) with a hat because it is an estimator of the true cdf \[ F(x) = P(X \leq x). \]

  • \( \hat{F}(x) \) converges almost surely to \( F(x) \) when \( n \), the sample size, goes to infinity due to the Strong Law of Large Numbers.

  • The ecdf is a non-decreasing step function,

    • i.e. it is a function which is constant except for jumps at a discrete set of points.
  • The points where the jumps occur are the realizations of the rv and thus illustrate a general property of the cumulative distribution function:

    • it is continuous from the right with existing limits from the left, a property which holds for any cdf \( F \) and hence also for \( \hat{F} \).
  • This step function is computed by ecdf(), which returns a function of class ecdf that can be plotted using plot().

  • Take for example the dataset Formaldehyde containing 6 observations on two variables:

    1. carbohydrate (car) and
    2. optical density (optden).

Empirical (Cumulative) Distribution Function

  • The ecdf for the variable Formaldehyde$car can be calculated as follows:
d = ecdf(Formaldehyde$car)
summary(d)
Empirical CDF:    6 unique values with summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1000  0.3500  0.5500  0.5167  0.6750  0.9000 
  • This allows us to see the empirical values for the CDF, as given by the ordered values in the sample.
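  • Because ecdf() returns a function, \( \hat{F} \) can also be evaluated directly at any point, for example
d(0.5)  # proportion of carbohydrate values <= 0.5; here 3 of the 6 observations, i.e. 0.5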

Empirical (Cumulative) Distribution Function

  • We can plot this as
par(cex = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(d, ylab = expression(hat(F)(x)), xlab = "x", main = "")
abline(v = 0.5, lty = 2, lwd = 2)
text(0.71, 0.51, expression(hat(F)(0.5)), cex = 1.2)

[Figure: ecdf of Formaldehyde$car, with a dashed vertical line marking \( \hat{F}(0.5) \)]

  • On the vertical axis, we have the relative frequency value defined by \( \hat{F}(x) \) on the sample, i.e., \[ \begin{align} \hat{F}(x) = \frac{\text{number of values less than or equal to }x\text{ in the sample}}{\text{total sample size }n} \end{align} \]

  • Correspondingly, the horizontal axis represents the value \( x \) in the functional argument.

Empirical (Cumulative) Distribution Function

  • We can then see the behavior of the step function at the six different measurements.
par(cex = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(d, ylab = expression(hat(F)(x)), xlab = "x", main = "")
abline(v = 0.5, lty = 2, lwd = 2)
text(0.71, 0.51, expression(hat(F)(0.5)), cex = 1.2)

[Figure: the same ecdf plot, showing the jumps at the six observed values]

  • The value of \( \hat{F}(x) \) is constant between observed values and jumps whenever we reach a value \( x_i \) that exists in the sample of observations.
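  • The locations of these jumps (the sorted unique observations) can be extracted from the ecdf object with knots():
knots(d)  # the carbohydrate values at which the step function jumps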

Histograms

  • The histogram is a common way of visualizing the data frequencies of continuous (real valued) variables.

  • In a histogram, bars are erected on distinct intervals, which constitute the so-called data classes or bins.

  • The y-axis represents the relative frequency of the classes, so that the total area of the bars is equal to one.

  • The width of the bars is given by the chosen intervals, and the height of each bar is equal to the empirical probability density of the corresponding interval.

  • If \( \{K_i\}_{i = 1}^s \) is a set of \( s \) disjoint classes, the histogram or empirical density function \( \hat{f}(x) \) is defined by

    \[ \begin{align} \hat{f}(x)= \frac{\text{relative frequency of observations lying in the class containing }x}{\text{the width of the interval that defines this class}} = \frac{h(K_i)}{\vert K_i\vert} \text{ for }x\in K_i \end{align} \]

  • In the above, we thus define \( h(K_i) \) to be the frequency of the observations lying in the interval \( K_i \) relative to the sample size \( n \), and

  • \( \vert K_i\vert \) to be the width of the associated interval.

Histograms

  • We write \( \hat{f}(x) \) with a “hat” also because it is a sample-based estimator of the true density function \( f(x) \), which describes the relative likelihood of the underlying variable to take on any given value.

  • \( \hat{f}(x) \) is a consistent estimator of \( f(x) \), since for every value of \( x \), \( \hat{f}(x) \) converges almost surely to \( f(x) \) when \( n \) goes to infinity, also due to the Strong Law of Large Numbers.

  • We can construct a histogram with the function hist().

  • By default, without specifying the arguments for hist(), R produces a histogram with the absolute frequencies of the classes on the y-axis.

  • Thus, to obtain a histogram according to our definition, one needs to set freq = FALSE.

  • The number of classes \( s \) is calculated by default using Sturges’ formula \[ s = \left\lceil \log_2 (n) \right\rceil + 1. \]

  • The ceiling brackets \( \lceil \cdot \rceil \) denote rounding up to the next integer, to avoid fractions of classes.

  • Note that this formula performs poorly for \( n < 30 \).

  • To specify the intervals manually, one can fill the argument breaks with a vector giving the breakpoints between the histogram cells, or simply the desired number of cells.

Histograms

  • For example, consider nhtemp, a sample of size \( n = 60 \) containing the mean annual temperature in degrees Fahrenheit in New Haven, Connecticut, from 1912 to 1971.
par(cex = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
hist(nhtemp, freq = F, main = "", ylab = expression(hat(f)(x)), xlab = expression(x %in% K[i]))

[Figure: histogram of nhtemp with the default classes]
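  • The default number of classes used here follows Sturges’ formula and can be checked directly; note that hist() passes this value to pretty(), so the drawn bins may differ slightly from the suggestion:
nclass.Sturges(nhtemp)             # ceiling(log2(60) + 1) = 7 classes for n = 60
ceiling(log2(length(nhtemp)) + 1)  # the same computation by hand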

Histograms

  • In the following example, breaks = seq(47, 55, 0.5) means that the histogram should range from 47 to 55 with a break every 0.5 step, i.e. \[ K_1 = [47, 47.5),\; K_2 = [47.5, 48), \ldots \]
par(cex = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
hist(nhtemp, freq = F, breaks = seq(from = 47, to = 55, by = 0.5), main = "", ylab = expression(hat(f)(x)), xlab = expression(x %in% K[i]))

[Figure: histogram of nhtemp with classes of width 0.5 from 47 to 55]
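  • The object returned by hist() also exposes the ingredients of the definition of \( \hat{f}(x) \); here is a minimal sketch verifying the formula for the classes above without drawing the plot:
h = hist(nhtemp, breaks = seq(from = 47, to = 55, by = 0.5), plot = FALSE)
rel_freq = h$counts / length(nhtemp)     # h(K_i), relative frequency of each class
widths   = diff(h$breaks)                # |K_i|, the class widths (all 0.5 here)
all.equal(h$density, rel_freq / widths)  # TRUE: the bar heights equal h(K_i) / |K_i|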

Kernel density estimators

  • The histogram is a density estimator with a relatively low rate of convergence to the true density.

  • A simple idea to improve the rate of convergence is to use a function that weights the observations in the vicinity of the point where we want to estimate the density, depending on how far away each such observation is from that point.

  • To this end, the kernel density estimator is defined as follows (a small R sketch is given after the kernel properties below): \[ \begin{align} \hat{f}_h (x) = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i) = \frac{1}{hn}\sum_{i=1}^n K\left(\frac{x-x_i}{h} \right) \end{align} \]

  • Here \( K_h(x - x_i) \) is the kernel, which is a symmetric, nonnegative, real-valued integrable function.

    • The expression above can be considered similar to the histogram, but using one of a class of possible weight functions \( K_h \);
    • the weight that \( K_h \) assigns depends on the distance of the argument \( x \) from the \( i \)-th observation \( x_i \), scaled by the “bandwidth” \( h \).
  • Furthermore, the kernel should have the following properties:

    1. \( \int_{-\infty}^\infty uK(u)\mathrm{d}u = 0 \);
    2. \( \int_{-\infty}^\infty K(u) \mathrm{d}u = 1 \).
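  • As a minimal sketch (not the lecture’s code), the estimator can be written out directly; the function epan() and the bandwidth h = 1 below are illustrative choices rather than values used in the class:
epan = function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)  # Epanechnikov kernel on [-1, 1]
kde = function(x, obs, h) {
  # f_h(x) = (1 / (n * h)) * sum_i K((x - x_i) / h), evaluated at each point of x
  sapply(x, function(x0) sum(epan((x0 - obs) / h)) / (length(obs) * h))
}
grid = seq(from = min(nhtemp) - 2, to = max(nhtemp) + 2, length.out = 200)
plot(grid, kde(grid, obs = nhtemp, h = 1), type = "l")  # compare with density(nhtemp)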

Kernel density estimators

Popular kernel densities.

Courtesy of Härdle, W.K. et al. Basic Elements of Computational Statistics. Springer International Publishing, 2017.

  • Some popular choices of kernels are pictured to the left.
  • In each, we can see how different choices give different weight to the argument \( x \) relative to the position of the observation \( x_i \).
  • For the uniform kernel, we resemble the histogram where all points in an interval are treated with equal weight in that class.
  • For the triangular, we see that the weight points receive decays linearly away from the position of \( x_i \).
  • For the Epanechnikov kernel, the weight decays more quadratically.
  • For the quartic kernel, the weight decays quartically.

Kernel density estimators

  • If we instead plot the kernel density estimator for the New Haven temperature data, we get the following:
par(cex = 1.75, mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
y = density(nhtemp, kernel = "epanechnikov") # density estimation for temperature
plot(y, main = "", lwd = 3)

[Figure: Epanechnikov kernel density estimate of nhtemp]

  • The result is a smooth (here quadratically smoothed) contour, rather than the boxy histogram.

  • Especially for a continuous variable like temperature, this is a more accurate way to represent the empirical density.

Statistics of location and dispersion

  • “Where are the data centered?”

  • “How are the data scattered around the center?”

  • “Are the data symmetric or skewed?”

  • These questions are often raised when it comes to a simple description of sample data.

  • Location parameters describe the center of a distribution through a numerical value.

  • They can be quantified in different ways and visualized particularly well by boxplots.

Arithmetic mean

  • The term arithmetic mean characterizes the average position of the observed realizations.

  • It is a good location measure for data from a symmetric distribution.

  • The sample (arithmetic) mean for a sample of \( n \) values, \( x_1 , x_2 , \cdots, x_n \) is defined by

    \[ \begin{align} \overline{x}_n = \frac{1}{n}\sum_{i=1}^n x_i \end{align} \]

  • Suppose that \( \{x_i\}_{i=1}^n \) are observations of \( X_i \) all with the same expected value \( \mathbb{E}\left[X_i\right] =\mu \).

  • For iid sample rvs, the sample means converge almost surely to \( \mu \) (also called the population mean), i.e.

    \[ \overline{x}_n \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \rightarrow \infty. \]

  • The arithmetic mean is the value that we compute when we use the mean() function in R.
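  • For the nhtemp data used earlier, for instance:
mean(nhtemp)  # 51.16, matching the Mean entry of summary(nhtemp) shown below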

Quantiles

  • Another type of location parameter is the quantile.

  • Quantiles are very robust in that they are not influenced by outliers, since they are determined by the ordered rank of the observations;

    • the quantiles are also estimates of the theoretical quantiles \( F^{-1}(p)= \xi_p \) defined earlier.
  • The p-quantile \( \tilde{x}_p \), for \( 0\leq p \leq 1 \), is a value such that at least \( 100 \times p\% \) of the observations are less than or equal to \( \tilde{x}_p \); and

  • at least \( 100 \times (1 - p)\% \) are greater than or equal to \( \tilde{x}_p \).

  • The number of observations which are less than or equal to \( \tilde{x}_p \) is then at least \( n\times p \).

  • If it is desired to have the usual location parameters, such as the median, mean and some quantiles at once, one can use the command summary()

summary(nhtemp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47.90   50.58   51.20   51.16   51.90   54.60 
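  • Individual p-quantiles can also be requested directly with quantile(), e.g. the quartiles:
quantile(nhtemp, probs = c(0.25, 0.5, 0.75))  # lower quartile, median, upper quartile, matching the summary above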

Mode

  • The mode is the most frequently occurring observation in a data set.

  • Together with the mean and median, one can use it as an indicator of the skewness of the data.

  • In general, the mode is not equal to either the mean or the median, and the difference can be huge if the data are strongly skewed.

  • The mode is defined by \[ x_{mode} = a_j, \] with \( n(x_{mode} ) \geq n(a_i ), \forall i \in \{1, \cdots, k\} \) where \( n(x) \) is the absolute frequency of \( x \) and \( a_i \) are the observed values in the sample.
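  • Base R has no built-in function for the statistical mode, but it can be read off the frequency table; a minimal sketch (for ties this returns only the first maximum):
names(which.max(table(chickwts$feed)))  # "soybean", the most frequent feed (14 occurrences)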

Variance

  • The variance is one of the most widely used measures of dispersion.

  • However, the variance is sensitive to outliers and is only reasonable for symmetric data.

  • The sample variance for \( n \) values \( x_1 , x_2 , \cdots , x_n \) is the average of the squared deviations from the sample mean \( \overline{x}_n \):

    \[ s^2 = \frac{1}{n-1} \sum_{i=1}^n \left(x_i - \overline{x}_n\right)^2 \]

    • Note that we use the denominator \( n-1 \), rather than \( n \), for theoretical reasons (the degrees of freedom); this makes the estimator unbiased, as shown below.
  • Having copies \( X_1 , \cdots , X_n \) of a rv \( X \sim F_{\mu, \sigma^2} \), it can be shown that

    \[ \begin{align} \mathbb{E}\left[ S^2\right] = \mathbb{E}\left[ \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \overline{X}_n\right)^2\right] = \sigma^2 \end{align} \]

    where in the above we refer to the random variables, not their observations.
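  • Computing the formula by hand agrees with R’s var(), which also uses the \( n - 1 \) denominator:
n = length(nhtemp)
sum((nhtemp - mean(nhtemp))^2) / (n - 1)  # 1.601763, identical to var(nhtemp) shown below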

Standard deviation

  • As the variance is, due to the squares, not on the same scale as the original data, it is useful to introduce a normalised dispersion measure.

  • We saw that for the rv \( X \), the theoretical standard deviation \( \sigma \) gives an appropriate measure of spread but in the same scale as \( X \).

  • Given observations \( \{x_i\}_{i=1}^n \) of the random variable \( X \), we denote the sample standard deviation as \( s = \sqrt{s^2} \).

  • Unlike the sample variance estimator, \( S \) is biased: \( \mathbb{E}\left[S\right] \neq \sigma \), and it turns out to systematically underestimate the true parameter.

  • Nonetheless, the available corrections are complicated and of limited practical benefit, so we default to this estimator regardless.

  • Variance and standard deviation are computed simply in R as

var(nhtemp)
[1] 1.601763
sd(nhtemp)
[1] 1.265608

Interquartile range

  • The interquartile range (IQR) of a sample is the difference between the upper quartile \( \tilde{x}_{0.75} \) and the lower quartile \( \tilde{x}_{0.25} \), i.e.

    \[ \mathrm{IQR} = \tilde{x}_{0.75} - \tilde{x}_{0.25}. \]

  • It is also called the midspread or middle fifty, since roughly fifty percent of the observations are found within this range.

  • The IQR is a robust statistic and is therefore preferred to looking at, e.g., the total range of the data.

    • If the min and max values are outliers, this range can strongly distort our view of the data.
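  • In R, the IQR can be computed with the IQR() function or from the quartiles directly:
IQR(nhtemp)                                    # upper minus lower quartile
diff(quantile(nhtemp, probs = c(0.25, 0.75)))  # the same, computed by hand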

Box-Plots

  • The box-plot (or box-whisker plot) is a diagram which describes the distribution of a given data set.

  • It summarizes the location and dispersion measures discussed previously.

  • The box-plot gives a quick glimpse of the observations’ range and empirical distribution.

require(lattice)
bwplot(as.vector(nhtemp), xlab="New Haven Temperature (Fahrenheit)")

[Figure: box-plot of the New Haven temperatures]

  • At the center of the plot, we have the median and the IQR.

  • The whiskers typically extend to at most 1.5 times the IQR beyond the box, and outliers beyond this range are plotted individually.

  • This forms a visual summary similar to the summary() function in R.