Measures of relative standing

02/13/2020


Outline

  • The following topics will be covered in this lecture:
    • Z scores
    • Percentiles
    • Quartiles
    • Box plots

Z scores

  • We will start to look at measures of relative standing.

    • Measures of relative standing are tools to describe the location of observations in a data set with respect to the other data pieces.
  • The most important measure of relative standing is the z score;

    • a z score utilizes our understanding of the spread and concentration of normal data in terms of the standard deviation.
  • Like the coefficient of variation, we will make this score into a measure on a relative scale so we can compare values from different distributions.

  • Specifically, suppose we have a sample value \( x \) from a normal data set with sample mean \( \overline{x} \) and sample standard deviation \( s \).

  • Suppose that the population mean is \( \mu \) and the population standard deviation is \( \sigma \).

  • The z score of \( x \) is given as

    \[ \begin{matrix} \text{Sample z score} = \frac{x - \overline{x}}{s} & & \text{Population z score} = \frac{x - \mu}{\sigma} \end{matrix} \]

  • This measures how far \( x \) deviates from the mean, relative to the size of the standard deviation.

  • Note, we will typically round the z score to two decimal places.

  • Z scores also apply to non-normal data, but their interpretation changes slightly as we cannot use the empirical rule in this context.
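The z score formulas above translate directly into a short Python sketch (the example values here are invented for illustration):

```python
def z_score(x, center, spread):
    """Signed number of standard deviations that x lies from the center.

    Use (x-bar, s) for a sample z score or (mu, sigma) for a population
    z score; per the slides, the result is rounded to two decimal places.
    """
    return round((x - center) / spread, 2)

print(z_score(75, 70, 5))   # 1.0: one standard deviation above the mean
print(z_score(60, 70, 5))   # -2.0: two standard deviations below the mean
```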

Interpreting z scores

Significance of measurements by z score.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Let us recall the empirical rule for normally distributed data:
    • Approximately \( 68\% \) of the sample data will lie within one standard deviation \( \sigma \) of the population mean \( \mu \), i.e., in \[ [\mu - \sigma, \mu + \sigma]. \]
    • Approximately \( 95\% \) of sample data will lie within two standard deviations \( 2\sigma \) of the population mean \( \mu \), i.e., in \[ [\mu - 2\sigma, \mu + 2\sigma]. \]
    • Approximately \( 99.7\% \) of sample data will lie within three standard deviations \( 3\sigma \) of the population mean \( \mu \), i.e., in \[ [\mu - 3\sigma, \mu + 3\sigma]. \]
  • By convention, we will say that an observation is significantly low or high in value if there is a \( 5\% \) or smaller chance of observing a value at least as extreme.
  • Discuss with a neighbor: if an observation from a normal data set has a z score of \( 1 \) is this significant? Why? What is the probability of finding a value at least this extreme?
    • By the empirical rule, there is a \( 68\% \) chance of finding a value within one standard deviation.
    • Therefore, \( 100\% - 68\% = 32\% \) of values lie outside of one standard deviation – i.e., they are at least this extreme. This is not significant.
  • Discuss with a neighbor: if an observation from a normal data set has a z score of \( 2 \) is this significant? Why? What is the probability of finding such a value?
    • By the empirical rule, there is a \( 95\% \) chance of finding a value within two standard deviations, so \( 100\% - 95\% = 5\% \) of values lie outside of two standard deviations – i.e., they are at least this extreme. This is significant.
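The empirical-rule figures can be checked against the exact normal-curve areas using Python's standard library `statistics.NormalDist` (a verification sketch, not part of the slides; the 68/95/99.7 figures are rounded):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution, mu = 0, sigma = 1

def within(k):
    """Probability of lying within k standard deviations of the mean."""
    return nd.cdf(k) - nd.cdf(-k)

print(round(within(1), 4))  # 0.6827 -- approximately 68%
print(round(within(2), 4))  # 0.9545 -- approximately 95%
print(round(within(3), 4))  # 0.9973 -- approximately 99.7%
```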

Interpreting z scores continued

Significance of measurements by z score.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Less formally, when the data is not normally distributed, we will still use the range rule of thumb as an approximation of the empirical rule.
  • If the data is not normally distributed, the empirical rule no longer applies, i.e.,
    • We are not guaranteed that \( 68\% \) of data will lie within one standard deviation of the mean.
    • We are not guaranteed that \( 95\% \) of data will lie within two standard deviations of the mean.
    • We are not guaranteed that \( 99.7\% \) of data will lie within three standard deviations of the mean.
  • However, Chebyshev’s theorem says that at least \( 75\% \) of data lies within two standard deviations of the mean.
    • The actual amount will often be more than this, as this is a lower bound on any data set.
  • Therefore, we can still consider a z score of 2 to be interesting for non-normal data, but we must be more careful about our conclusions.
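As a quick check of Chebyshev's bound, we can count how much of a deliberately skewed, made-up data set falls within two standard deviations of its mean:

```python
from statistics import mean, stdev

# A skewed, clearly non-normal data set (invented for illustration).
data = [1, 1, 1, 2, 2, 3, 3, 4, 5, 100]
m, s = mean(data), stdev(data)

# Fraction of observations within two standard deviations of the mean;
# Chebyshev's theorem guarantees this is at least 0.75 for ANY data set.
frac = sum(abs(x - m) <= 2 * s for x in data) / len(data)
print(frac)  # 0.9 here -- above the guaranteed lower bound of 0.75
```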

Interpreting z scores example

  • Discuss with a neighbor: which of the following two values is more extreme from the data set from which it came?
    1. A baby is born with weight \( 4000.0g \), where the sample data includes \( 400 \) babies with sample mean \( \overline{x}=3152.0g \) and sample standard deviation \( s=693.4g \)
    2. An adult is measured with a body temperature of \( 99^\circ F \) out of sample data of \( 106 \) adults with sample mean \( \overline{x}=98.20^\circ F \) and sample standard deviation of \( 0.62^\circ F \).
  • To compare these two measurements which exist on different scales and units, we compute their z scores as: \[ \begin{matrix} \text{baby z score}=\frac{4000.0g - 3152.0g}{693.4g} = 1.22\text{ std} & & \text{heat z score}=\frac{ 99^\circ F - 98.2^\circ F}{0.62^\circ F} = 1.29\text{ std} \end{matrix} \]
  • By comparing the z scores, we see that the body temperature measurement lies more standard deviations away from its sample mean than the baby’s weight does.
  • Even though the difference in temperature units is small, the relatively small standard deviation in the measurements makes this a more extreme value with respect to its sample data set.
  • This illustrates the purpose of the z score, in that it makes all measurements comparable on a relative, standardized scale.
  • We note that the z score carries a sign \( \pm \). One important property of the z score is that it tells us whether the value lies above or below the mean.
  • In the above, both measurements lie above their sample means, and for this reason their z scores are positive;
    • on the other hand, whenever we see a negative z score, we know immediately that the measurement was below the mean of the samples.
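The comparison above can be reproduced directly in Python, using the sample means and standard deviations quoted on this slide:

```python
def z_score(x, center, spread):
    """Signed standard deviations from the center, rounded to two places."""
    return round((x - center) / spread, 2)

baby = z_score(4000.0, 3152.0, 693.4)  # weight in grams
temp = z_score(99.0, 98.20, 0.62)      # temperature in degrees Fahrenheit

print(baby, temp)  # 1.22 1.29 -- the temperature is the more extreme value
```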

Percentiles

  • Percentiles – these are measures of location, denoted \( P_1, P_2,\cdots, P_{99} \), which divide a set of data into \( 100 \) groups with about \( 1\% \) of the values in each group.

    • An example we know already is the median.
    • Indeed, the median is the \( P_{50} \) percentile, which separates the data into groups with \( 50\% \) of the data above and \( 50\% \) of the data below.
  • There are different ways in which the percentile can be computed, and therefore we will consider one of several possible approaches;

    • the important part is to understand how we can convert a data value into a percentile, and
    • how to convert a percentile back into a data value.
  • We will discuss both of these in the following, but note,

    • converting back and forth, the results can be inconsistent.
  • We should therefore be careful about which question is at hand.

Converting data into percentiles

  • Suppose we have samples given as \( x_1, \cdots, x_n \) where \( n \) is the total number of samples in the data set.

  • Suppose the measurements are quantitative, so that we can arrange the samples in order;

    • that is, up to re-naming samples, we can write \[ x_i \leq x_{i+1} \] for each \( i = 1,\cdots, n-1 \).
  • Then, for a particular value \( x \), its percentile can be computed as,

    \[ \begin{align}\text{Percentile of }x &= \frac{\text{Number of samples with value less than } x}{\text{Number of total samples}}\times 100 \end{align} \]

    • If we can order the sample values as above, we thus look for the index \( i \) for which \[ x_i < x \leq x_{i+1} \]
    • That is, we count the number of samples \( i \) with value strictly less than \( x \);
    • the next ordered sample \( x_{i+1} \) can have a value that is either greater than or equal to the value \( x \).
    • If we choose the index \( i \) as above, the formula becomes \[ \begin{align}\text{Percentile of }x &= \frac{i}{n}\times 100 \end{align} \]
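The counting formula above becomes a short function; the data set in the usage line is invented for illustration:

```python
def percentile_of(x, data):
    """Percent of samples with value strictly less than x."""
    below = sum(1 for v in data if v < x)
    return below / len(data) * 100

# 5 of these 20 values are strictly less than 6, so 6 is in the 25th percentile.
print(percentile_of(6, list(range(1, 21))))  # 25.0
```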

Finding the percentile of some value

Table of sample values ordered low to high.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • In the table to the left, we see an example data set where the samples have been ordered from low to high in the value.
  • There are \( 4 \) rows and \( 10 \) columns to this table.
  • The samples are the number of chocolate chips in a batch of 40 cookies.
  • Discuss with a neighbor: what is the percentile of \( x=23 \)? That is, what is the percent of samples that have value lower than \( 23 \), relative to the total number of samples?
    • Notice there are \( 10 \) columns and the first row consists of samples with value less than \( 23 \).
    • That is to say, \[ x_{10} < 23 \leq x_{11}. \]
    • In this regard, we have, \[ \text{Percentile of }23 = \frac{10}{40}\times 100 = 25. \]
    • Therefore, we say \( x=23 \) is in the \( 25 \)-th percentile.
  • Similar to the median, we can say that a cookie with \( 23 \) chips approximately separates the cookies with the lowest \( 25\% \) of chips from those with the highest \( 75\% \) of chips.
  • Note: we do not say \( P_{25}=23 \), we will show how to find \( P_{25} \) in the following.

Finding the value corresponding to some percentile

Flow chart of converting percentile to data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Converting from a percentile value to a data value can be more complex.
  • This is represented by the flow chart to the left, for which we introduce the following notation:
    • \( n \) will be the total number of samples in the data set.
    • \( k \) is the percentile of interest.
    • \( L \) will be the “locator index”, i.e., if we sort the samples low to high the value \( L \) corresponds to the sample indexed \( x_L \).
    • \( P_k \) will be the \( k \)-th percentile.
  • Suppose once again we have samples \( x_1, \cdots, x_n \), and that we want to find the value corresponding to \( P_k \).
    • E.g., suppose we want to find the value that corresponds to the \( 25 \)-th percentile \( P_{25} \), \( k=25 \).
  • We will assume that, up to re-ordering, \( x_i \leq x_{i+1} \) for every \( i=1,\cdots, n-1 \), i.e., the samples are sorted from low to high.
  • We then compute \[ L= \left(\frac{k}{100}\right) \times n \] to give us a data location.
    • However, \( L \) may not make sense as an index, e.g., it could come up as \( 12.25 \) where the samples are only indexed by whole numbers.

Finding the value corresponding to some percentile continued

Flow chart of converting percentile to data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • The last slide leaves us at the decision if \( L \) is:
    1. a whole number corresponding exactly to some index; or
    2. some decimal number that does not give an exact index.
  • Let’s assume case 1 above:
    • We see that \( L \) corresponds to some sample \( x_L \) for which: \[ x_1 \leq x_2 \leq \cdots \leq x_L \leq x_{L+1}. \]
    • Then consider the value, \[ x = \frac{x_L + x_{L+1}}{2}, \]
    • we know that \[ \begin{align} \text{Percentile of }x& = \frac{\text{Number of samples with value less than } x}{\text{Number of total samples}}\times 100 \\ &= \frac{L}{n}\times 100 \\ &= \frac{\left(\frac{k}{100}\right)n}{n} \times 100 \\ &=k \end{align} \]
    • I.e., \( x = P_k \) by our construction where \( x \) is the mean of \( x_{L} \) and \( x_{L+1} \).

Finding the value corresponding to some percentile – example 1

Flow chart of converting percentile to data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

Table of sample values ordered low to high.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • We will consider the earlier example of the chocolate chip cookies.
  • We will suppose that we wish to find a value corresponding to the \( 25 \)-th percentile, \( P_{25} \) with roughly \( 25\% \) of values lying below this value.
  • Discuss with a neighbor: what is the value corresponding to \( P_{25} \)?
    • Notice that \( L = \left(\frac{25}{100}\right)40 = 10 \), so that \( L \) is a whole number.
    • Therefore we find \( P_{25} = \frac{x_{11} + x_{10}}{2} = \frac{23 + 22}{2} = 22.5 \).
    • Recall in the last example, we said \( 23 \) is in \( P_{25} \) but \( P_{25} \neq 23 \). We need to be careful about this distinction.
    • In this case, exactly \( 25\% \) of data lies below \( P_{25}=22.5 \) and \( 75\% \) of data lies above.
    • However, in many cases such a value can only be found approximately.

Finding the value corresponding to some percentile continued

Flow chart of converting percentile to data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • We are considering the decision if \( L \) is:
    1. a whole number corresponding exactly to some index; or
    2. some decimal number that does not give an exact index.
  • Let’s assume case 2 above:
    • In this case, \( L \) doesn’t correspond to any particular index, so we can round the value \( L \) up to a value we will call \( L^\ast \).
      • For example, if \( L=2.35 \) we will call \( L^\ast =3 \).
    • \( L^\ast \) is a whole number, so that we can find a sample \( x_{L^\ast} \) in our ordered data.
    • Notice then, \[ \begin{align} \text{Percentile of }x_{L^\ast} &= \frac{\text{Number of samples with value less than } x_{L^\ast}}{\text{Number of total samples}}\times 100\\ &= \frac{L^\ast - 1}{n}\times 100 \\ &\approx k \end{align} \] where the above is simply an approximation due to the rounding.
    • Because the percentile cannot always be matched exactly, we define \( P_k = x_{L^\ast} \).
    • This is part of what is meant by, “there are different possible ways to compute the percentile” so that we take this by convention.
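Both branches of the flow chart can be combined into a single sketch. Python lists are 0-indexed, so the slides' \( x_i \) is `ordered[i - 1]`; the sketch assumes \( 0 < k < 100 \). The example data set is the one used on a later slide for the IQR discussion:

```python
import math

def percentile_value(k, data):
    """Value P_k of the k-th percentile, per the flow chart (0 < k < 100)."""
    ordered = sorted(data)
    n = len(ordered)
    L = (k / 100) * n
    if L == int(L):
        # Case 1: L is a whole number -- average x_L and x_{L+1}.
        L = int(L)
        return (ordered[L - 1] + ordered[L]) / 2
    # Case 2: round L up to L* and take x_{L*}.
    return ordered[math.ceil(L) - 1]

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
print(percentile_value(25, data))  # 2.5
print(percentile_value(18, data))  # L = 2.16 rounds up to 3, giving x_3 = 2
```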

Finding the value corresponding to some percentile – example 2

Flow chart of converting percentile to data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

Table of sample values ordered low to high.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • We will consider the earlier example of the chocolate chip cookies.
  • We will suppose that we wish to find a value corresponding to the \( 18 \)-th percentile, \( P_{18} \) with roughly \( 18\% \) of values lying below this value.
  • Discuss with a neighbor: what is the value corresponding to \( P_{18} \)? Is this approximate?
    • Notice that, \[ L =\left(\frac{k}{100}\times n \right) = \left(\frac{18}{100}\right)\times 40 = 7.2, \] which is not a whole number.
    • Therefore, we round \( L=7.2 \) up to the next whole number, \( L^\ast= 8 \).
    • The sample corresponding to \( L^\ast=8 \) is \( x_{8}=22 \).
    • Thus, by convention, we say \( P_{18}=22 \).

Finding the value corresponding to some percentile – example 2 continued

Flow chart of converting percentile to data.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

Table of sample values ordered low to high.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Note: \( \frac{7}{40}\times 100 = 17.5 \), so this is only an approximation – \( 17.5\% \) of the samples precede \( x_{8} \) in the ordering.
  • Moreover, \( 6 \) of \( 40 \) values are below the value \( 22 \), so that \[ \text{Percentile of }22 = \frac{6}{40}\times 100 = 15 \] and actually \( 15\% \) of data lies below the value \( 22 \).
  • This shows why these calculations and approximations can be inconsistent.
  • It is important thus to understand when we are talking about:
    1. The percentile of some value \( x \); or
    2. The value \( P_k \) associated to some percentile \( k \).
  • Complications arise especially when there are samples with repeated measurement values.

Quartiles

  • There are special values of percentiles that are used most often in practice.
  • Typically, we will be concerned with the quartiles of the data – these are defined as follows:
    1. \( Q_1 \) – the first quartile is equal to \( P_{25} \). This separates the data such that approximately \( 25\% \) of samples lie below this value.
    2. \( Q_2 \) – the second quartile is equal to \( P_{50} \) or the median. This separates the data such that approximately \( 50\% \) of samples lie below this value.
    3. \( Q_3 \) – the third quartile is equal to \( P_{75} \). This separates the data such that approximately \( 75\% \) of samples lie below this value.
  • We note that the quartiles thus separate the data into \( 25\% \) chunks – between each quartile lies approximately one quarter of the data.
  • For the same reasons discussed earlier, different ways of computing the percentiles (and the approximations in this) can lead to different values for the quartiles.
  • Small differences can be found using different software on the same data, depending on how and what rules are used.

Quartiles continued

  • We will define some additional statistics that describe the center and the variation in the data using quartiles:
    1. Interquartile Range (IQR) – this is defined as \( Q_3 - Q_1 \) and describes the scale at which the data operates.
      • Specifically, this measures the width of the inner \( 50\% \) of the data.
      • Discuss with a neighbor: does the IQR seem resistant to outliers? Why?
        • Consider a sample data set \( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000 \).
        • The range is \( 1000 - 0 = 1000 \) because it is strongly affected by the outlier \( 1000 \).
        • There are \( 12 \) samples, so that \( L = \left(\frac{25}{100}\right) 12 = 3 \) and \( Q_1 = P_{25} = \frac{x_4 + x_3}{2} = \frac{3 + 2}{2} = 2.5 \).
        • Likewise, \( L =\left(\frac{75}{100}\right) 12 = 9 \) so that \( Q_3 = P_{75} = \frac{x_{10} + x_9}{2} = \frac{9 + 8}{2} = 8.5 \).
        • Therefore, the IQR is given as \( 8.5 - 2.5 = 6 \), which gives a better description of the concentration of the data in the presence of the outlier.
        • We consider the IQR to be a resistant statistic to outliers.
    2. Semi-interquartile range – this is defined as \( \frac{Q_3 - Q_1}{2} \) or half the width of the inner \( 50\% \) of data.
    3. Midquartile – this is defined as \( \frac{Q_3 + Q_1}{2} \) or the mean / midpoint of the inner \( 50\% \) of the data.
      • Note: because the statistic \( Q_3- Q_1 \) is resistant to outliers, both the semi-interquartile range and the midquartile are also resistant.
    4. 10-90 Percentile Range – this is defined as \( P_{90}-P_{10} \) or the width of the inner \( 80\% \) of the data.
      • This is also a fairly resistant statistic, but it can be affected by clusters of extremely large or small values.
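Following the slide's computation, the quartile-based statistics for the same example data can be sketched as below. Note the `n // 4` shortcut only works here because \( L \) comes out as a whole number when \( n = 12 \):

```python
data = sorted([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000])
n = len(data)

# L = (k/100) * n is whole for k = 25 and k = 75 here (n = 12),
# so each quartile is the average of x_L and x_{L+1}.
q1 = (data[n // 4 - 1] + data[n // 4]) / 2          # Q1 = (2 + 3) / 2 = 2.5
q3 = (data[3 * n // 4 - 1] + data[3 * n // 4]) / 2  # Q3 = (8 + 9) / 2 = 8.5

iqr = q3 - q1                # interquartile range: width of the inner 50%
semi_iqr = iqr / 2           # semi-interquartile range
midquartile = (q3 + q1) / 2  # midpoint of the inner 50%
print(iqr, semi_iqr, midquartile)  # 6.0 3.0 5.5
```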

5 Number summary

  • Because of the usefulness of quartiles in summarizing data as resistant statistics, they are part of a standard data summary.
  • It is standard to calculate the following five number summary when getting a “feel” for a data set:
    1. Minimum value – this lets us know one endpoint of the range of data, but is sensitive to outliers.
    2. \( Q_1 \) – the first quartile, resistant to outliers.
    3. \( Q_2 \) – the median, resistant to outliers.
    4. \( Q_3 \) – the third quartile, resistant to outliers.
    5. Maximum value – this lets us know the other endpoint of the range of data, but is sensitive to outliers.
  • Together, these statistics summarize the relative spread of the data, in terms of the min / max and \( Q_1 / Q_3 \), and the relative center of the data, in terms of the median.
  • While the numbers themselves can be useful, we often want to make a graphical representation of these statistics.
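A five number summary helper might look like the following sketch, reusing the percentile convention from the earlier slides; the example data is the outlier demonstration set from the IQR discussion:

```python
import math

def five_number_summary(data):
    """(min, Q1, median, Q3, max) using the slides' percentile convention."""
    ordered = sorted(data)
    n = len(ordered)

    def pk(k):
        L = (k / 100) * n
        if L == int(L):
            L = int(L)
            return (ordered[L - 1] + ordered[L]) / 2  # whole-number case
        return ordered[math.ceil(L) - 1]              # round L up to L*

    return ordered[0], pk(25), pk(50), pk(75), ordered[-1]

print(five_number_summary([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]))
# (0, 2.5, 5.5, 8.5, 1000)
```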

Box plots

Table of sample values ordered low to high.
Box plot of the table values.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • From the earlier example, we recall \( Q_1 = 22.5 \).
  • We find \( \left(\frac{50}{100} \right) 40 = 20 \) such that \[ P_{50} = Q_2 = \frac{x_{21} + x_{20}}{2} = \frac{24 + 24 }{2} = 24. \]
  • And, \( \left(\frac{75}{100}\right)40 = 30 \) such that \[ P_{75}=Q_3 = \frac{x_{31} + x_{30}}{2}=\frac{26+26}{2} =26. \]
  • The min / max values are given by \( 19 \) and \( 30 \) respectively.
  • This information completes the five number summary.
  • We can represent the five number summary graphically as a box plot.
  • Box plots – these are visual representations of the five number summary.
  • These are constructed as follows:
    • Find the min / max and \( Q_1,Q_2,Q_3 \).
    • Construct a number-line range that extends at least to the min and max as endpoints.
    • Draw a box with edges at \( Q_1 \) and \( Q_3 \).
    • Draw lines at the min / max and at \( Q_2 \).
  • The box plot thus gives a picture of how data is centered and how much variation is around the center.

Box plots continued

Example of normal data box plot and skewed data box plot.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Particularly, we can quickly identify differences in the “shape” or distribution of the data.
  • On the top box plot, we see normally distributed data of a simple random sample of the heights of adult women.
  • With normally distributed data we see some of our typical characteristics:
    • The data is strongly peaked, with relatively few values initially, then many concentrated values, then relatively few values.
    • The data is strongly symmetric, with the values of \( Q_1 \) and \( Q_3 \) and the min / max almost symmetric around \( Q_2 \).
  • On the bottom box plot we see strongly skewed data.
  • This box plot represents the salaries of NCAA football coaches in thousands of dollars.
  • Discuss with a neighbor: can you identify whether this data is left or right skewed from the box plot?
    • Recall, we call data right skewed when the tail points towards the right.
    • On the bottom, we see that the median \( Q_2 \) is very close to the min so that \( 50\% \) of data is very small.
    • \( Q_3 \) lies asymmetrically away from the median, with the max far away from this.
    • This indicates right skewness, due to the long tail of data towards the right.
  • Note: this kind of plot is missing other data that can be found in a histogram or a frequency distribution, and these can be used to give more details on a specific data set.
  • However, box plots are very good at showing differences between the shape of two data sets.

Outliers and modified box plots

Table of sample values ordered low to high.
Box plot of the table values.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • So far, we have been intentionally vague about the question “what is an outlier?”
    • We have only loosely described these as very extreme values relative to the other measurements.
  • Outliers have substantial effects on statistical analyses and for this reason it is important to identify them and quantify their presence in an objective way.
    • We can formally quantify the presence of outliers using modified box plots.
  • Modified box plots – these are constructed as follows:
    • Compute \( Q_1,Q_2,Q_3 \) and the IQR, equal to \( Q_3 - Q_1 \).
    • We plot the box as usual, with edges at \( Q_1 \) and \( Q_3 \), and a line through this box at \( Q_2 \).
    • However, we draw whiskers that extend to the values \( Q_1 - 1.5\times IQR \) and \( Q_3 + 1.5\times IQR \).
      • Any value that lies outside of these whiskers is thus called an outlier.
  • In the above, we see a sample of chocolate chip cookies, for which the value of \( 21 \) chips is identified as an outlier.
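The whisker computation is simple enough to sketch directly; here we reuse the cookie quartiles \( Q_1 = 22.5 \) and \( Q_3 = 26 \) found on the earlier slides:

```python
def outlier_fences(q1, q3):
    """Whisker endpoints of a modified box plot; values outside are outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = outlier_fences(22.5, 26)  # cookie quartiles from earlier slides
print(low, high)  # 17.25 31.25 -- chip counts outside this range are outliers
```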

Modified box plots continued

Diagram of the percent of outcomes contained within each standard deviation of the mean
for a standard normal distribution.

Courtesy of Jhguch CC via Wikimedia Commons.

  • The standard box plot earlier did not have the ability to show our last criterion of normal data, the presence of few if any outliers.
  • In the figure to the left, we see a comparison of:
    • the empirical (68-95-99.7) rule; and
    • a modified box plot for a normal distribution.
  • We note, the empirical rule says that an observation that lies outside of \( [\mu - 2 \sigma, \mu + 2 \sigma] \) is significantly high or low.
    • This means that it is statistically interesting to observe.
  • For a normal distribution, the whisker limits \( Q_1 - 1.5 \times IQR \) and \( Q_3 + 1.5\times IQR \) extend to approximately \[ [\mu - 2.698 \sigma, \mu+ 2.698\sigma]. \]
    • The above range contains approximately \( 99.3\% \) of all data, which explains what is meant by “few if any outliers” present in normal data.
      • I.e., in a large data set, about \( 0.7\% \) of observations would be considered outliers.
  • We note, the intervals for the empirical rule and the box plot aren’t the same as one is measured in terms of standard deviations \( \sigma \) and the other in terms of quartiles \( Q_i \).
  • The comparison of these intervals is pictured in the figure.
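The \( 2.698\sigma \) figure can be reproduced with the standard library's `NormalDist` (a verification sketch, not part of the slides):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mu = 0, sigma = 1

q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)  # quartiles at about -0.6745, +0.6745
iqr = q3 - q1                                # about 1.349 standard deviations

fence = q3 + 1.5 * iqr       # upper whisker of a modified box plot
print(round(fence, 3))       # 2.698 -- matches the slide

coverage = nd.cdf(fence) - nd.cdf(-fence)
print(round(coverage, 3))    # 0.993 -- about 99.3% of data inside the whiskers
```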