Summarizing and graphing data

01/30/2020


FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Characteristics of data
- Frequency distributions
- Histograms
- Other kinds of good and bad plots

Lead and IQ example

Frequency table of IQ scores for the sample data set, Table 2-2 from the textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

We have data in which the IQ scores of children with either low or high lead exposure were measured.
If we want to see if lead has a significant effect on IQ, the table doesn’t tell us much.
Indeed, the table is very difficult to interpret and instead we would like to develop tools to summarize and characterize the data.

Characteristics of data

Diagram of the percent of outcomes contained within each standard deviation of the mean
for a standard normal distribution.

Courtesy of M. W. Toews CC via Wikimedia Commons.

We often use visual tools to understand samples and to simplify their analysis.
We try to characterize data by the features that it exhibits; the following are some of the most important:

Center: A representative value that indicates where the middle of the data set is located.
Variation: A measure of the amount that the data values vary.
Distribution: The nature or shape of the spread of the data over the range of values (such as bell-shaped).
Outliers: Sample values that lie very far away from the vast majority of the other sample values.
Time: Any change in the characteristics of the data over time.

Understanding each of these features in data is essential to distinguishing different types of behaviors, and when different kinds of analysis are appropriate or not.
For example, the sample average is sensitive to outliers (extremely large or small values) which can move the sample average away from most of the data.
If we only consider the mean (average) without looking at other features, we will get an incomplete or misleading story from the data.
Our next topic will be how to use visual tools to understand and analyze the data.
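As a small illustration of the point about outliers, the sketch below compares the sample average of a made-up data set with and without one extreme value. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical sample: five typical values, then the same sample plus one outlier.
typical = [98, 99, 100, 101, 102]
with_outlier = typical + [400]

mean_typical = mean(typical)        # average of the typical values
mean_outlier = mean(with_outlier)   # average after adding the extreme value

# Count how many observations lie below the shifted average.
below = sum(1 for x in with_outlier if x < mean_outlier)

print("mean without outlier:", mean_typical)
print("mean with outlier:   ", mean_outlier)
print("observations below the new mean:", below, "of", len(with_outlier))
```

With the outlier included, all of the typical values sit below the average, which is what we mean by the average being moved away from most of the data.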

Frequency distributions

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

A frequency distribution (or frequency table) shows how data are partitioned among several categories (or classes).
We list the categories along with the number (frequency) of data values in each of them.
For all the observations in the original data table, we:

Identify which class (partition) the observation belongs to.

E.g., if we look at an observation with IQ score of 100, this belongs to the class “90-109”.

Tally the number of observations that belong to a class.

E.g., we look back over the entire table of raw data and count how many observations belong to the class “90-109”. This was 35 observations.

One key concept with frequency distributions is the partition of the data.
It is important that all the classes of the data are disjoint and exhaustive.

That is to say, all sample data belongs to one and only one class.
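The two steps above (identify the class, then tally) can be sketched in a few lines of Python. The class limits are the ones from the IQ table in this lecture; the list of scores is hypothetical and is not the textbook data.

```python
# Class limits (lower, upper), both inclusive, as in the IQ table.
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]

# Hypothetical IQ scores, made up for illustration.
scores = [100, 85, 72, 96, 111, 64, 90, 109, 130, 88]

def find_class(score):
    """Return the single class that contains the score."""
    matches = [c for c in classes if c[0] <= score <= c[1]]
    # Disjoint and exhaustive classes mean exactly one match.
    if len(matches) != 1:
        raise ValueError(f"{score} belongs to {len(matches)} classes")
    return matches[0]

# Tally the number of observations in each class.
frequency = {c: 0 for c in classes}
for s in scores:
    frequency[find_class(s)] += 1

for (lower, upper), count in frequency.items():
    print(f"{lower}-{upper}: {count}")
```

The check inside `find_class` is one way to enforce the "one and only one class" requirement: a score that falls in no class, or in two overlapping classes, raises an error instead of being silently miscounted.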

Frequency distributions – definitions

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

There are some common pieces of vocabulary we will use when discussing frequency distributions which we will outline in the following:

Lower class limits – these are the smallest numbers that can belong to the different classes.
Question: what are the lower class limits in the table to the left?
- Answer: these are 50, 70, 90, 110, and 130.
Upper class limits – these are the largest numbers that can belong to the different classes.
Question: what are the upper class limits in the table to the left?

Answer: these are 69, 89, 109, 129, and 149.

Class boundaries – these are the numbers used to separate the classes, but without the gaps created by class limits.

The figure below shows the gaps created by the class limits from the table to the left.
The values of 69.5, 89.5, 109.5, and 129.5 are in the centers of the gaps, and following the pattern of those class boundaries, the lowest is 49.5 with the highest 149.5.

Diagram of class boundaries between class limits.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition
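The class boundaries can be computed directly from the class limits. The following is a minimal sketch, assuming (as in this table) that the gap between consecutive classes is the same everywhere, so that the lowest and highest boundaries follow the same pattern as the interior ones.

```python
lower_limits = [50, 70, 90, 110, 130]
upper_limits = [69, 89, 109, 129, 149]

# Interior boundaries: the center of each gap between one class's upper limit
# and the next class's lower limit.
interior = [(u + l) / 2 for u, l in zip(upper_limits[:-1], lower_limits[1:])]

# Assumption: the same gap applies below the first class and above the last.
gap = lower_limits[1] - upper_limits[0]
boundaries = [lower_limits[0] - gap / 2] + interior + [upper_limits[-1] + gap / 2]

print(boundaries)
```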

Frequency distributions – definitions

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Class midpoints – these are the values in the middle of the classes.
Discuss with a neighbor: can you describe at least one way to compute the class midpoint?

Each class midpoint can be computed by adding the lower class limit to the upper class limit and dividing the sum by 2, e.g. \[ \frac{69 + 50}{2} = \frac{119}{2} = 59.5. \] This gives the average of the lower class limit and the upper class limit.
A second way is to find half the distance between the class limits and to add this to the lower class limit, e.g., \[ \frac{69 - 50}{2} + 50 = \frac{19}{2} + 50 = 9.5 + 50 = 59.5. \]
These are equivalent because \[ \frac{69 - 50}{2} + 50 = \frac{69 - 50}{2} + \frac{2\times 50}{2} = \frac{69 - 50 + 100}{2} = \frac{69 + 50}{2}. \]
The table to the left has class midpoints of 59.5, 79.5, 99.5, 119.5, and 139.5.
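Both ways of computing the midpoints can be checked numerically. This short sketch applies each formula to all five classes of the table and confirms that they agree.

```python
lower_limits = [50, 70, 90, 110, 130]
upper_limits = [69, 89, 109, 129, 149]

# Method 1: average of the lower and upper class limits.
midpoints_average = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Method 2: lower class limit plus half the distance to the upper class limit.
midpoints_half = [lo + (hi - lo) / 2 for lo, hi in zip(lower_limits, upper_limits)]

print(midpoints_average)
print(midpoints_average == midpoints_half)
```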


Frequency distributions – definitions

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Class width – this is the difference between two consecutive lower class limits (or two consecutive lower class boundaries) in a frequency distribution.

For example, this can be computed as \[ 70 - 50 = 20. \]
Note: this is not the same as if we take the difference of the upper class limit with the lower class limit – in this case this is only \( 19 \).
Remember, the class width is always defined as the difference between one lower class limit and the previous lower class limit.
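The distinction between the class width and the upper-minus-lower difference is easy to verify on the table's class limits, as in this small sketch.

```python
lower_limits = [50, 70, 90, 110, 130]
upper_limits = [69, 89, 109, 129, 149]

# Class width: difference between consecutive lower class limits.
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

# Not the class width: upper limit minus lower limit of the same class.
limit_difference = upper_limits[0] - lower_limits[0]

print("class widths:", widths)
print("upper minus lower limit:", limit_difference)
```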


Relative frequency distributions

Relative frequency table of IQ scores for the sample data set, Table 2-4 from the textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

A common variation of the basic frequency distribution is a relative frequency distribution or percentage frequency distribution.
Rather than giving the raw frequencies (counts), it can be useful to express them as percentages.
This is especially the case when we have a large number of observations, and classes with frequencies at many different scales.
It is also possible to represent a relative frequency distribution in its equivalent decimal form;

we will refer to a relative frequency distribution in either decimal or percentage units indistinguishably.

Discuss with a neighbor: how do we convert the raw unit of frequency to relative frequency?

When we refer to relative frequency, in plain English we mean,
“How often did we observe data in this class, out of all observations in all classes?”

This is mathematically equivalent to \[ \begin{align} \text{Relative frequency for a class} &= \frac{\text{frequency (i.e., number of observations in this class)}}{\text{sum of all frequencies (i.e., total number of observations in all classes )}} \\ \\ \text{Percent frequency for a class} &= \text{Relative frequency for a class}\times 100\% \end{align} \]
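The conversion can be sketched in Python as follows. The first three frequencies (2, 33, and 35) are the ones quoted in this lecture; the last two (7 and 1) are not stated in the text here and are assumed for illustration.

```python
labels = ["50-69", "70-89", "90-109", "110-129", "130-149"]
# First three frequencies are quoted in the lecture; the last two are assumed.
frequencies = [2, 33, 35, 7, 1]

total = sum(frequencies)  # total number of observations in all classes

# Relative frequency in decimal form, then in percentage form.
relative = [f / total for f in frequencies]
percent = [round(100 * r, 1) for r in relative]

for label, r, p in zip(labels, relative, percent):
    print(f"{label:>7}: {r:.3f}  ({p}%)")
```

Up to rounding, the relative frequencies sum to 1 and the percentages sum to 100%, which is a useful check on the arithmetic.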

Cumulative frequency distributions

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Another variation of a frequency distribution is a cumulative frequency distribution.
By cumulative we mean,
Cumulative – acquired by or resulting from accumulation, i.e., growth acquired by repeated addition of elements.
In this case, we will look at how the frequency of events accumulates as we look at successive class limits.
Using the original frequencies on the left-hand-side,
we add 2 + 33 to get 35 as the cumulative value in the second row of the right-hand-side.

Cumulative frequency table of IQ scores for the sample data set, Table 2-5 from the textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

For the third row in the cumulative frequency distribution, we add 2 + 33 + 35 to get the new cumulative value.
This process continues until we have exhausted all classes.
Note: the class limits on the right-hand-side are replaced by “less than” expressions that describe the new ranges of values.
The final row contains a value that is larger than the greatest upper class boundary – what does this imply about the cumulative frequency?

The final cumulative value is the total number of all observations.
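The repeated addition can be written as a running sum. In this sketch the first three frequencies are those used in the lecture, the last two (7 and 1) are assumed for illustration, and the "less than" labels are an assumed choice that uses the next class's lower limit.

```python
from itertools import accumulate

# First three frequencies are quoted in the lecture; the last two are assumed.
frequencies = [2, 33, 35, 7, 1]
# Assumed "less than" labels: the lower limit of the following class.
less_than = [70, 90, 110, 130, 150]

# Running total: 2, then 2 + 33, then 2 + 33 + 35, and so on.
cumulative = list(accumulate(frequencies))

for bound, c in zip(less_than, cumulative):
    print(f"Less than {bound}: {c}")
```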

Interpreting frequency distributions

Courtesy of M. W. Toews CC via Wikimedia Commons.

Frequency table for normally distributed data.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Frequency distributions can help us understand the distribution (shape) of a data set.
The pictured distribution is a Normal (Gaussian / bell curve) distribution.
Normally distributed data is characterized by the following features:

The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
The distribution is approximately symmetric.
There are few if any extreme values.

The frequency table is characteristic of the above:

The frequencies start low, increase to the maximum of 56, then decrease to a low frequency.
The frequencies of 1 and 10 that precede the maximum are a mirror image of the frequencies 10 and 1 that follow the maximum.

Real data sets almost never have this kind of perfect symmetry or exact peak;

we will learn tools to determine when the data is “normal enough”.
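The first two criteria can be checked mechanically on the frequencies quoted above (1, 10, 56, 10, 1). This is only a rough sketch of those two checks, not a test for normality; the exact-mirror comparison in particular is far stricter than anything we would demand of real data.

```python
frequencies = [1, 10, 56, 10, 1]

# Position of the maximum frequency.
peak = frequencies.index(max(frequencies))

# Frequencies should not decrease before the peak and not increase after it.
rises = all(a <= b for a, b in zip(frequencies[:peak], frequencies[1:peak + 1]))
falls = all(a >= b for a, b in zip(frequencies[peak:], frequencies[peak + 1:]))

# Exact mirror symmetry about the peak (real data would only be approximate).
symmetric = frequencies == frequencies[::-1]

print("rises then falls:", rises and falls)
print("mirror symmetric:", symmetric)
```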

Example frequency distribution

Frequency table of verbal IQ scores for the sample data set, table from exercise 7 in the textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Discuss with your neighbor: does the frequency plot to the left appear to come from normally distributed data? Which of the criteria does it satisfy? What are reasons it may not be normally distributed?

There are two reasons to believe this could be the case:

The frequencies start low, go to a peak value of 43, and descend back to low.
The frequencies are almost symmetric about the peak.

There is one reason why it might not be the case:

There are two extreme values in the highest data class.

This is why it can be difficult in practice to determine if data is “normal enough”.
In this case, this is “normal enough”:

the extreme values are not so extreme, and with additional samples we might find similarly extreme, but low, values.

Frequency tables (and histograms) are somewhat crude ways to check for “normal” data, and deciding if data is normal in this way can be difficult.
We will develop better tools to evaluate this later.

Example frequency distribution

Frequency table for weight of pennies in grams, Table 2-8 in the text.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Discuss with your neighbor: does the frequency plot to the left appear to come from normally distributed data? Which of the criteria does it satisfy? What are reasons it may not be normally distributed?

This data is highly non-normal; indeed, it doesn’t satisfy any of our criteria.
The most important aspect to notice here is that the weights fall basically into two separate groups.

This is indicated by the large number of middle data classes with zero frequencies.

It appears that we are not observing small random differences in the penny manufacturing and wear process.
Rather, it seems like we are looking at different generations of pennies, manufactured with different metal compositions.

Indeed, pennies made before 1983 are 95% copper and 5% zinc.
However, after 1983 pennies are made of 2.5% copper and 97.5% zinc.

Example frequency distribution

IQ frequency table for both low and high lead exposure, Table 2-9 from text.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

We can now compare the frequency distributions for the low and high lead groups together in a single plot, rather than the large table before.
Discuss with your neighbor: What features do you notice about the low lead versus high lead frequency distributions?

We should note that the high lead group has a much larger (relative) frequency concentrated at low IQ scores.
The high lead group has a much smaller spread of values, with many empty classes.

This may be due in part to fewer observations, but is important to note.

We also remark that the mostly normal structure we saw in the earlier verbal IQ score plot isn’t as apparent in the low lead group.

This is probably due, in part, to the fact that the widths of the classes are different between the two different tables.
We should remember that the data can appear quite a bit different if the class widths are chosen differently.

Histograms

Histogram of IQ scores for low lead exposure group.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Histograms – these are just graphical versions of frequency distributions.

These consist of bars of equal width corresponding to the class width of each data class.
The horizontal scale is the range of values for the data, separated into the distinct data classes.
The heights of the bars are just the frequency of the observations within the class.
Therefore, the two summaries of the data are totally equivalent:

The widths of the bars and the range of the data are taken from the left column.
The heights of the bars are taken from the right column.

We note, the scale for the vertical axis can also be given in relative frequency (percent or proportion units).
The change of scale between a histogram and a relative frequency histogram is equivalent to the way we change scale between frequency distributions and relative frequency distributions.
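To see that a histogram carries the same information as the frequency table, here is a rough text-only sketch that draws one bar per class. The bars are drawn sideways for convenience, the first three frequencies are those quoted in this lecture, and the last two (7 and 1) are assumed for illustration.

```python
labels = ["50-69", "70-89", "90-109", "110-129", "130-149"]
# First three frequencies are quoted in the lecture; the last two are assumed.
frequencies = [2, 33, 35, 7, 1]

def text_histogram(labels, heights):
    """Return one line per class, with a bar of '#' of the given height."""
    return [f"{label:>7} | {'#' * h}" for label, h in zip(labels, heights)]

# Frequency histogram: bar length is the count in each class.
for line in text_histogram(labels, frequencies):
    print(line)

# Relative frequency histogram: same shape, rescaled to whole percents.
total = sum(frequencies)
percents = [round(100 * f / total) for f in frequencies]
for line in text_histogram(labels, percents):
    print(line)
```

Only the scale of the bars changes between the two versions; the overall shape is the same.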

Bell-shaped (normal) histograms

Histogram of IQ scores for low lead exposure group with bell curve superimposed.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Histograms of normally distributed data will generally have a bell shape.
We should mention some caveats to this rule however:

The shape of the histogram, like frequency distributions, is strongly affected by the class widths.

At one extreme, we could choose class widths so that only one observation fits within each class – in this case the histogram would be flat.
At another extreme, we could choose class widths so that all observations fit within a single class – in this case the histogram would also be flat.

Due to this fact, a histogram can’t always be used to accurately assess the normality:

In the top figure, the widths are chosen well so we see a nice bell shape.
In the bottom figure, the bell shape is more obscure, though it would come out clearly with finer class widths.

We will learn more refined ways of studying the normality later in the course.
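The effect of the class width can be seen by binning the same data three ways. The data values below are hypothetical, and `bin_counts` is an illustrative helper written for this sketch, not a library function.

```python
def bin_counts(data, start, width, n_classes):
    """Count observations in n_classes equal-width classes beginning at start."""
    counts = [0] * n_classes
    for x in data:
        counts[int((x - start) // width)] += 1
    return counts

# Hypothetical data set with distinct integer values.
data = [61, 72, 78, 83, 85, 87, 89, 94, 101, 113]

narrow = bin_counts(data, start=60, width=1, n_classes=60)   # one value per class at most
wide = bin_counts(data, start=60, width=60, n_classes=1)     # everything in a single class
moderate = bin_counts(data, start=60, width=10, n_classes=6)

print("tallest bar with width 1: ", max(narrow))
print("counts with width 60:     ", wide)
print("counts with width 10:     ", moderate)
```

With width 1 no bar is taller than one observation, and with width 60 there is a single bar, so neither shows any shape; only the moderate width reveals where the data concentrate.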

Uniform distribution histograms

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

We do see relatively flat histograms in practice, where the class widths aren’t chosen in an extreme way.
In the figure to the left, there are on the order of 1000 observations in every class;
also, the difference in frequency between any pair of classes is small.
When data is shaped (non-trivially) flat like this, it is known as uniformly distributed data.

Skewed histograms

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

When we see data that is strongly asymmetric, we call the data skewed.
It is common to call the extreme values in the distribution the tail of the distribution.
For example, on the left, there is a tail of values that are much smaller than the majority of observations.
On the right, there is a tail of values that are much larger than the majority of observations.
We call a distribution left / right skewed when the tail points left / right.

Bar graphs

Multiple bar graph of median income by gender.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Bar graphs are basically histograms, but ones for which the data classes can be qualitative or discrete.
We will still plot the vertical bar as the frequency of observations occurring within a given data class;
however, there doesn’t need to be any well defined notion of class width or limits.
The classes can be made as simple as “Group A”, “Group B”, “Group C”.
On the left we see a multiple bar graph in which we compare two data sets in the same plot.
The data classes are the decade, while the vertical axis is the median income in a given decade in thousands of dollars.

By using color coding, we can compare two data sets side by side across the data classes.
In this case, we plot the historic income inequality in terms of differences between median income of men and women in the United States.

Bar graph example

Bar graph of fuel consumption with vertical axis not starting at zero.

Bar graph of fuel consumption with vertical axis starting at zero.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Above is an example of two bar charts of the same fuel consumption data.
Discuss with a neighbor: what is different about these two presentations of the data? Which one is more interpretable? Why?

The difference lies in the vertical axis scale – on the left it starts at 30 while on the right it starts at 0.
In this case, the practical difference between the fuel consumption of all cars is relatively small, and the graph on the left exaggerates this difference.
To make the graphs interpretable, we should be careful with the scales – it is usually a bad idea to start the scale away from zero.
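A bit of arithmetic shows how much a truncated axis can exaggerate. The two fuel consumption values below are hypothetical; only the axis starting points (30 and 0) come from the two charts described above.

```python
def drawn_ratio(a, b, axis_start):
    """Ratio of the drawn bar heights for values a and b on an axis starting at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical fuel consumption values for two cars.
car_a, car_b = 32, 36

truncated = drawn_ratio(car_a, car_b, axis_start=30)  # axis starts at 30
honest = drawn_ratio(car_a, car_b, axis_start=0)      # axis starts at 0

print("bar height ratio, axis from 30:", truncated)
print("bar height ratio, axis from 0: ", honest)
```

On the truncated axis the second bar is drawn three times as tall as the first, even though the underlying value is only 12.5% larger.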

Pareto charts

Pareto chart of responses to what contributes most to happiness.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Pareto charts are related to bar graphs and are used when the data classes are purely qualitative.
The key distinguishing factor in a Pareto chart is that the frequencies are listed in descending order.

For purely qualitative data with no natural order, this choice is made for the visual effect and interpretation of the data.

For example, in the Pareto chart pictured, individuals were surveyed on what they felt was the factor that contributes most to life happiness.
Ordering the frequencies from largest to smallest, we can see what was most frequently described as most important, followed by the next most commonly reported factor, and so on.

The units for a Pareto chart can be in either frequency or relative frequency, making this a flexible way to represent data.
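The ordering step of a Pareto chart is just a sort by frequency. The category names and counts in this sketch are hypothetical and are not the survey results from the textbook figure.

```python
# Hypothetical survey counts for purely qualitative categories.
responses = {"Group A": 45, "Group B": 60, "Group C": 15, "Group D": 30}

# Pareto ordering: categories sorted by descending frequency.
pareto = sorted(responses.items(), key=lambda item: item[1], reverse=True)

total = sum(responses.values())
for name, count in pareto:
    # Show both the frequency and the relative frequency for each bar.
    print(f"{name}: {count} ({100 * count / total:.0f}%)")
```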

Pie charts

Pie chart of responses to what contributes most to happiness.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Pie charts serve the same purpose as Pareto charts, but only in a relative frequency scale.
Each “slice” of the pie chart takes up area proportional to the relative frequency of each data class.
This pie chart shows the same information as the Pareto chart, but there are some flaws in this representation by comparison:

The colors serve no real purpose in distinguishing data classes, and they can be confusing for color blind individuals.
It can be more difficult to compare the sizes of different slices.

For these reasons, anywhere a pie chart can be used, it is recommended to use a Pareto chart instead.
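Since each slice is proportional to the relative frequency, its central angle is the relative frequency times 360 degrees. The counts in this sketch are hypothetical and are not the survey results from the textbook figure.

```python
# Hypothetical counts for four qualitative categories.
responses = {"Group A": 45, "Group B": 60, "Group C": 15, "Group D": 30}

total = sum(responses.values())

# Central angle of each slice, in degrees: (count / total) * 360.
angles = {name: count * 360 / total for name, count in responses.items()}

for name, angle in angles.items():
    print(f"{name}: {angle} degrees")
```

Reading off that 108 degrees is larger than 72 degrees from a picture is harder than comparing two bar heights, which is one way to see the second flaw listed above.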

Scatter plots

Scatter plot of arm circumference versus waist circumference.

Scatter plot of pulse rate versus weight.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Scatter plots – these plots are used to plot matching pairs of quantitative observations.
This can be especially useful to determine if two variables are correlated.

We will discuss correlation more in depth later but for now it can be read loosely:
Variable \( x \) is (correlated/ anti-correlated) with variable \( y \) if they tend to vary (together / oppositely).

Discuss with a neighbor: which of the two above scatter plots evidences correlation?

It appears that increase in waist circumference tends to go along with an increase in arm circumference, so these variables appear correlated.
An increase in weight doesn’t generally go along with an increase (or decrease) in pulse rate, so they don’t appear correlated (or anti-correlated).
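One common way to put a number on "tend to vary together" is the Pearson correlation coefficient, which is not defined in this lecture and is used here only as an assumed illustration. The paired measurements below are hypothetical and are not the data from the two scatter plots.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical pairs: arm circumference rises steadily with waist circumference.
waist = [70, 80, 90, 100, 110]
arm = [25, 28, 31, 33, 37]

# Hypothetical pairs: pulse rate shows no clear trend with weight.
weight = [50, 60, 70, 80, 90]
pulse = [72, 64, 80, 66, 73]

r_arm = pearson_r(waist, arm)
r_pulse = pearson_r(weight, pulse)
print(f"waist vs arm:    r = {r_arm:.2f}")
print(f"weight vs pulse: r = {r_pulse:.2f}")
```

Values near 1 or -1 indicate that the variables vary together or oppositely, while values near 0 indicate no clear linear relationship.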

Dot plots and Stem and Leaf plots

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Dot plots and stem and leaf plots are both conceptually very similar.
Every observation will be represented by one dot or digit in each of the above.
Also, every observation can be identified exactly on the numerical scale of interest.
The primary difference between a dot plot and a stem and leaf plot is as follows:

Dot plot – one dot is placed above the precise numerical value.
Stem and leaf plot – we organize numerical observations into units of 10 (or up to another leading numerical unit). For each observation, we then place its final digit behind the leading unit.

This approach allows us to loosely organize data into different classes (the leading unit) and then identify every data point within the class (by the trailing digit).
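A stem and leaf plot can be built by splitting each observation into its leading unit and its final digit. This is a minimal sketch for non-negative integers using units of 10; the data values are hypothetical, and stems with no observations are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by leading unit (stem) and final digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # stem = tens and above, leaf = last digit
    return dict(stems)

# Hypothetical observations.
data = [64, 72, 85, 88, 90, 96, 100, 109, 111, 130]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(d) for d in leaves)}")
```

Every observation can be recovered exactly from the plot: for example, stem 8 with leaves 5 and 8 corresponds to the values 85 and 88.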