Summarizing and graphing data part II

02/04/2020

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward, respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.

Outline

  • The following topics will be covered in this lecture:
    • Additional visual summaries of data
    • Scatter plots
    • Correlation
    • Regression
    • Measures of center

Frequency distributions

Frequency table of IQ scores for sample data set, table 2-2 from textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Frequency distribution (or frequency table) shows how data are partitioned among several categories (or classes).
  • We list the categories along with the number (frequency) of data values in each of them.
  • For all the observations in the original data table, we:
    1. Identify which partition or class the observation belongs to.
      • E.g., if we look at an observation with an IQ score of 100, this belongs to the class “90-109”.
    2. Tally the number of observations that belong to a class.
      • E.g., we look back over the entire table of raw data and count how many observations belong to the class “90-109”. This was 35 observations.
  • One key concept with frequency distributions is the partition of the data.
  • It is important that all the classes of the data are disjoint and exhaustive.
    • That is to say, all sample data belong to one and only one class (a short code sketch of the tally process follows the figure below).
Diagram of class boundaries and class limits.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition
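To make the tally procedure concrete, here is a minimal Python sketch. The IQ scores are hypothetical stand-ins, not the textbook's sample; only the class limits mirror the lecture's example.

```python
# A minimal sketch of tallying a frequency distribution.
iq_scores = [96, 71, 85, 99, 120, 125, 128, 88, 90, 104]  # hypothetical sample

# Disjoint and exhaustive classes, given as (lower limit, upper limit) pairs.
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]

frequency = {limits: 0 for limits in classes}
for score in iq_scores:
    for lower, upper in classes:
        if lower <= score <= upper:  # each score lands in exactly one class
            frequency[(lower, upper)] += 1
            break

for (lower, upper), count in frequency.items():
    print(f"{lower}-{upper}: {count}")
```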

Histograms

Histogram of IQ scores for low lead exposure group.
Frequency table of IQ scores for sample data set, table 2-2 from textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Histograms – these are just graphical versions of frequency distributions.
    • These consist of bars of equal width corresponding to the class width of each data class.
    • The horizontal scale is the range of values for the data, separated into the distinct data classes.
    • The heights of the bars are just the frequency of the observations within the class.
    • Therefore, the two summaries of the data are equivalent:
      • The widths of the bars and the range of the data are taken from the left column.
      • The heights of the bars are taken from the right column.
    • We note, the scale for the vertical axis can also be given in relative frequency (percent or proportion units).
    • The change of scale between a histogram and a relative frequency histogram is equivalent to the way we change scale between frequency distributions and relative frequency distributions.
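As a sketch of how a histogram and its relative frequency version are produced, assuming matplotlib is available (the scores and class boundaries below are hypothetical):

```python
# A minimal sketch of a frequency histogram and a relative frequency
# histogram; same bars, only the vertical scale changes.
import matplotlib.pyplot as plt

iq_scores = [96, 71, 85, 99, 120, 125, 128, 88, 90, 104]  # hypothetical sample
bin_edges = [49.5, 69.5, 89.5, 109.5, 129.5, 149.5]       # class boundaries

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(iq_scores, bins=bin_edges)  # bar heights are class frequencies
ax1.set(title="Frequency histogram", xlabel="IQ score", ylabel="Frequency")

# Rescale each observation to 1/n so the heights become proportions.
weights = [1 / len(iq_scores)] * len(iq_scores)
ax2.hist(iq_scores, bins=bin_edges, weights=weights)
ax2.set(title="Relative frequency histogram", xlabel="IQ score",
        ylabel="Relative frequency")
plt.show()
```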

Frequency polygons

Frequency polygon of IQ scores.
Relative frequency polygon of IQ scores.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Frequency polygons are almost identical to histograms, and provide a graphical tool to visualize frequency distributions.
  • Rather than using bars, dots are plotted above the class midpoints.
  • Lines then connect the dots across the data classes.
  • The same change of units to relative frequency can be performed for frequency polygons, as in the relative frequency polygon on the right-hand side.
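A minimal sketch of drawing a frequency polygon, assuming matplotlib; the first three frequencies (2, 33, 35) come from the lecture's running example, while the last two are hypothetical placeholders:

```python
# Dots at the class midpoints, joined by line segments.
import matplotlib.pyplot as plt

midpoints = [59.5, 79.5, 99.5, 119.5, 139.5]  # midpoints of the classes
frequencies = [2, 33, 35, 7, 1]               # last two are hypothetical

plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("IQ score (class midpoint)")
plt.ylabel("Frequency")
plt.title("Frequency polygon")
plt.show()
```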

Cumulative frequency distributions

Frequency table of IQ scores for sample data set, table 2-2 from textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Another variation of a frequency distribution is a cumulative frequency distribution.
  • By cumulative we mean,
    Cumulative – acquired by or resulting from accumulation, i.e., growth acquired by repeated addition of elements.
  • In this case, we will look at how the frequency of events accumulate as we look at successive class limits.
  • Using the original frequencies on the left-hand side,
  • we add 2 + 33 to get 35 as the cumulative value in the second row of the right-hand side.
Cumulative frequency table of IQ scores for sample data set, table 2-5 from textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • For the third row in the cumulative frequency distribution, we add 2 + 33 + 35 to get 70 as the new cumulative value.
  • This process continues until we have exhausted all classes.
  • Note: the class limits on the right-hand side are replaced by “less than” expressions that describe the new ranges of values.
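The running-total construction can be sketched directly with itertools.accumulate; again, only the first three frequencies are from the lecture:

```python
# A minimal sketch of a cumulative frequency table via a running sum.
from itertools import accumulate

upper_limits = [69, 89, 109, 129, 149]
frequencies = [2, 33, 35, 7, 1]  # last two are hypothetical

for upper, cum in zip(upper_limits, accumulate(frequencies)):
    print(f"Less than {upper + 1}: {cum}")  # e.g., "Less than 90: 35"
```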

Ogive

Cumulative frequency table of IQ scores for sample data set.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • An ogive is an extension of a frequency polygon to a cumulative frequency polygon.
  • Dots are plotted at each class boundary to indicate the number of measurements that are less than certain values.
  • Lines once again connect the dots (see the sketch after the figure below).
Cumulative frequency table of IQ scores for sample data set, table 2-5 from textbook.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition
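A minimal sketch of an ogive, assuming matplotlib; the cumulative counts start at 0 before the first class boundary (frequencies as in the earlier sketches, partly hypothetical):

```python
from itertools import accumulate
import matplotlib.pyplot as plt

boundaries = [49.5, 69.5, 89.5, 109.5, 129.5, 149.5]  # class boundaries
frequencies = [2, 33, 35, 7, 1]                       # last two hypothetical

cumulative = [0] + list(accumulate(frequencies))  # 0 before any class
plt.plot(boundaries, cumulative, marker="o")
plt.xlabel("IQ score")
plt.ylabel("Cumulative frequency")
plt.title("Ogive")
plt.show()
```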

Dot plots and stem-and-leaf plots

Dot plot of IQ scores.
Stem-and-leaf plot of IQ scores.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Dot plots and stem-and-leaf plots are both graphical representations of frequency distributions, like histograms.
  • The two are conceptually very similar.
  • Every observation is represented by one dot or digit in each of the above.
  • Also, every observation can be identified exactly on the numerical scale of interest.
  • The primary difference between a dot plot and a stem-and-leaf plot is as follows:
    • Dot plot – one dot is placed above the precise numerical value of each observation.
    • Stem-and-leaf plot – we organize numerical observations by a leading unit (typically units of 10). For each observation, we then place its final digit behind the leading unit.
      • This approach allows us to loosely organize data into different classes (the leading unit) and then identify every data point within the class (by the trailing digit).
  • Discuss with a neighbor: how many observations are between 120 and 129? What are their values?
    • There are three, with values 120, 125, and 128.
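A minimal sketch of building a stem-and-leaf plot with hypothetical IQ scores (note the three leaves on the 12 stem, matching the discussion above):

```python
# Stems are the leading digits (units of 10); leaves are the final digits.
from collections import defaultdict

iq_scores = [96, 71, 85, 99, 120, 125, 128, 88, 90, 104]  # hypothetical sample

stems = defaultdict(list)
for score in sorted(iq_scores):
    stems[score // 10].append(score % 10)  # e.g., 125 -> stem 12, leaf 5

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:3d} | {leaves}")
```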

Pictograms versus bar charts

Pictogram of airline passenger counts.
Bar chart of the same airline passenger counts.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Pictograms use pictures to display data – these can be popular, but if done poorly can be very difficult to interpret.
  • The pictogram and the bar chart to the left actually present the same data.
  • Discuss with a neighbor: if you didn’t have the bar chart, could you interpret the pictogram? What is lacking to interpret it?
    • The biggest issue with the pictogram, compared to the bar chart, is the lack of units of comparison.
    • For example, from the pictogram alone we cannot tell whether the number of passengers is the same in 1984 and 2010, only that it appears larger in 2010.
    • Likewise, we cannot make a quantitative comparison without having a scale for the units, as in the bar chart.

Time series

Time series of Dow Jones Industrial Average.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • When we look at the evolution of a particular quantity over time-ordered measurements, this is called a time series.
  • Usually, these are given by time-stamped measurements at fixed intervals.
  • In particular, this kind of graph helps us detect patterns in the time evolution of the quantity.

Scatter plots

Scatter plot of arm circumference versus waist circumference.
Scatter plot of pulse rate versus weight.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Scatter plots – these are used to plot matching pairs of quantitative observations.
  • This can be especially useful to determine if two variables are correlated.
    • We will discuss correlation more in depth later, but for now it can be read loosely:
      Variable \( x \) is (correlated/ anti-correlated) with variable \( y \) if they tend to vary (together / oppositely).
  • Discuss with a neighbor: which of the two above scatter plots evidences correlation?
    • It appears that an increase in waist circumference tends to go along with an increase in arm circumference, so these variables appear correlated.
    • An increase in weight doesn’t generally go along with an increase (or decrease) in pulse rate, so they don’t appear correlated (or anti-correlated).
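A minimal sketch of producing a scatter plot from paired observations, assuming matplotlib; the circumference values below are hypothetical:

```python
import matplotlib.pyplot as plt

waist = [80, 95, 70, 110, 88, 102]  # hypothetical paired data (cm)
arm = [30, 34, 27, 40, 32, 37]

plt.scatter(waist, arm)  # one dot per matched pair (x, y)
plt.xlabel("Waist circumference (cm)")
plt.ylabel("Arm circumference (cm)")
plt.show()
```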

Correlation

Scatter plot of arm circumference versus waist circumference.
Scatter plot of pulse rate versus weight.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • More formally, we will give correlation a numerical value \( r \).
    • We define the quantity \( r \) to be the linear correlation coefficient (also called the Pearson correlation), measuring the strength with which two variables vary together or oppositely.
  • This value will be bounded between \( \pm 1 \), so that it always lives in the interval \( [-1,1] \).
  • When the value \( r \) is close to \( \pm 1 \) we consider the variables to vary together (or oppositely) consistently.
  • When the value \( r \) is close to zero we consider the variables to not vary together (or oppositely) consistently. For example:
    • The linear correlation coefficient \( r=0.802 \) for the plot on the left.
    • The linear correlation coefficient \( r=0.082 \) for the plot on the right.
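A minimal sketch of computing \( r \) from its standard definition, reusing the hypothetical paired data from the scatter plot sketch above:

```python
# r = Sxy / (sqrt(Sxx) * sqrt(Syy)), with sums of centered products.
from math import sqrt

x = [80, 95, 70, 110, 88, 102]  # hypothetical paired data
y = [30, 34, 27, 40, 32, 37]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
r = sxy / (sqrt(sxx) * sqrt(syy))
print(f"r = {r:.3f}")  # always lies in [-1, 1]
```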

Critical values for correlation

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In order to say how close to \( \pm 1 \) is close enough for \( r \) to indicate correlation, we will want to develop some tools.
  • On the left are measurements for the shoe print length and height of five individuals.
  • The linear correlation coefficient in this case \( r\approx0.591 \) isn’t close to \( \pm 1 \) or \( 0 \).
  • If we only looked at the value of \( r \), we might have a hard time telling if there is an interesting relationship.
  • NOTE: strong (anti-)correlation does not imply causation, only perhaps an interesting relationship to study.
  • Notice in the figure, we also list the critical values for \( r \).
  • Critical values are one of the tools for deciding if this relationship is truly interesting.
  • If the linear correlation coefficient \( r \) is at least as extreme as the critical values, we can conclude that there is statistical significance.
  • The critical values above are \( \approx \pm 0.878 \); since \( r\approx 0.591 \) is not as extreme as these, we cannot conclude statistical significance here.

Critical values for correlation continued

List of critical values for correlation.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • We can consider graphically “how extreme” the linear correlation coefficient is compared to the critical value.
  • For each number of pairs of measurements, there is an associated critical value that will determine the significance of the correlation.
  • We measured five individuals, getting five pairs of measurements with a corresponding critical value of \( \pm 0.878 \).
Critical value window diagram.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • This critical value corresponds to the inner window of “no correlation” in the diagram on the right-hand side.
  • For any linear correlation coefficient (computed on 5 pairs of measurements) that isn’t at least as extreme as \( \pm 0.878 \),
    • we will say the pair of variables are not correlated.
  • If the linear correlation coefficient (computed on 5 pairs of measurements) lies in either \( [-1,-0.878] \) or \( [0.878,1] \)
    • we will say the pair of variables are correlated.
  • This is what is meant by “the sample value must be at least as extreme as the critical value to be significant”.

Critical values for correlation continued

List of critical values for correlation.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

Car weight and fuel consumption data table.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Above, we have seven measurements of different cars' weights and highway fuel consumption in miles per gallon (MPG).
  • Suppose we use software to compute that the linear correlation coefficient is \( r \approx -0.987 \).
  • Discuss with a neighbor: using the table to the left of the critical values, can you determine whether car weight and highway MPG are (anti-)correlated? If so, what does this relationship signify?
  • We note that there are 7 pairs of measurements, so the corresponding critical value is \( 0.754 \).
    • The linear correlation coefficient \( -0.987 \) is more extreme than \( -0.754 \), so we say the variables are correlated.
  • We recall, the negative sign for the correlation coefficient means that the variables of weight and highway MPG vary oppositely;
    • i.e., as the weight goes up, the highway MPG goes down.
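The decision rule itself is simple to state in code; both example values below are from the lecture:

```python
# A minimal sketch of the critical-value decision rule.
def is_significant(r, critical_value):
    # Significant when r is at least as extreme as the critical value.
    return abs(r) >= critical_value

# Critical values depend on the number of pairs: 0.878 for 5, 0.754 for 7.
print(is_significant(0.591, 0.878))   # False: height vs. shoe print, 5 pairs
print(is_significant(-0.987, 0.754))  # True: weight vs. highway MPG, 7 pairs
```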

P-values for correlation

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • A concept strongly related to critical values is the p-value.
  • As a thought experiment, consider the following.
    • In this example we are trying to determine if there is correlation between height and shoe print length.
    • Suppose we assume the null hypothesis, that is, assume that there is no correlation between these variables.
    • The p-value is the probability of getting a linear correlation coefficient \( r \) at least as extreme as the one we computed, when there is no correlation present.
    • Intuitively, it measures “how surprising it would be” to compute such a linear correlation coefficient \( r \) in the case there is no statistically interesting relationship between the variables.
  • When a p-value is large, this says that getting the value by chance is quite likely.
  • The p-value above is \( \approx 0.293 \), so five random pairs of uncorrelated measurements would produce a linear correlation coefficient as large as, or larger than, \( r\approx 0.591 \) almost \( 30\% \) of the time.
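One hedged way to see where such a p-value comes from is a small simulation: shuffle one variable to break any association, then count how often the shuffled \( r \) is at least as extreme as the observed one. This is a permutation approximation, not the textbook's method, and the five data pairs below are hypothetical (requires Python 3.10+ for statistics.correlation):

```python
import random
from statistics import correlation  # Python 3.10+

shoe = [27.0, 29.0, 31.0, 31.5, 33.0]  # hypothetical shoe print lengths (cm)
height = [170, 168, 180, 175, 183]     # hypothetical heights (cm)

r_obs = correlation(shoe, height)
trials = 10_000
extreme = sum(
    abs(correlation(shoe, random.sample(height, len(height)))) >= abs(r_obs)
    for _ in range(trials)
)
print(f"r = {r_obs:.3f}, simulated p-value = {extreme / trials:.3f}")
```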

P-values for correlation continued

Scatter plot for height versus shoe print length.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • Note: the linear correlation coefficient depends on the sample data.
    • If we take measurements of the height and shoe print of five new people, we will quite likely get a different linear correlation coefficient.
  • Also, the critical values and p-values depend on the number of samples.
  • In the plot to the left, there are 40 total subjects for whom we have pairs of measurements.
  • Discuss with a neighbor: does the relationship between height and shoe print length show more evidence of correlation now? Do you think the linear correlation coefficient will be close to \( 1 \), \( -1 \) or to \( 0 \)?
  • In this case, the linear correlation coefficient is \( \approx 0.813 \), suggesting that the variables are positively correlated.
  • Note: \( r \) is not as extreme as the critical value of \( 0.878 \) from before – however, this critical value was for \( 5 \) samples only.
  • The critical value for \( 40 \) samples is approximately \( 0.304 \), so a coefficient of \( 0.813 \) is much more extreme.
  • Likewise, the p-value is \( \approx 0 \), so there is very little probability of seeing such a correlation coefficient just by chance;
    • usually we say there should be at most a \( 5\% \) chance for the result to be significant.

Regression

Scatter plot for height versus shoe print length with regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • One of the most useful ways we can interpret data is in terms of the “trend” in the data.
  • Noting that there was correlation in height and shoe print length, we could say,
    “On average, an increase in someone’s height usually goes along with an increase in their shoe print length (and vice versa).”
  • A regression line (or best fit line) quantifies what this trend looks like.
  • Recall the equation for a line, \[ y = {\color{red} a} +{\color{blue} b}x \]
    • The coefficient \( a \) is the intercept.
      • When the quantity \( x \) is zero, then \( y = {\color{red} a} \).
    • The coefficient \( b \) is the slope.
      • An increase of 1 unit of variable \( x \) corresponds to \( {\color{blue} b} \) units of increase in \( y \).

Regression continued

Scatter plot for height versus shoe print length with regression line.

Courtesy of Mario Triola, Essentials of Statistics, 6th edition

  • In regression, we write the equation for a line in a special form: \[ \hat{y} = {\color{red} {b_0} } + {\color{blue} {b_1} } x, \] where we re-name the variables as:
    • \( y \) – this is called the response;
    • \( x \) – this is called the predictor;
    • \( {\color{red} a} \) – this is re-named \( {\color{red} {b_0} } \); and
    • \( {\color{blue} b} \) – this is re-named \( {\color{blue} {b_1} } \).
  • In our example, we could use software to compute
    • \( {\color{red} {b_0} \approx 80.9} \); and
    • \( {\color{blue} {b_1} \approx 3.22} \).
  • The regression equation would then be read, \[ \hat{y}_\text{(Height)} = {\color{red} {80.9} } + {\color{blue} {3.22} } x_\text{(Shoe print length)}. \]
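The coefficients themselves come from the least-squares formulas; below is a minimal sketch with hypothetical data (the lecture's values \( b_0 \approx 80.9 \) and \( b_1 \approx 3.22 \) were computed by software from the textbook's data set):

```python
# b1 = Sxy / Sxx, and the fitted line passes through the point of means.
x = [27.0, 29.0, 31.0, 31.5, 33.0]  # hypothetical shoe print lengths (cm)
y = [170, 168, 180, 175, 183]       # hypothetical heights (cm)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x  # intercept: line passes through (mean_x, mean_y)
print(f"y-hat = {b0:.1f} + {b1:.2f} x")
```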

Characteristics of data

Diagram of the percent of outcomes contained within each standard deviation of the mean for a standard normal distribution.

Courtesy of M. W. Toews CC via Wikimedia Commons.

  • Recall, we try to characterize data by a number of the features it exhibits.
  • Some of the key measures are:
    1. Center: A representative value that indicates where the middle of the data set is located.
    2. Variation: A measure of the amount that the data values vary.
    3. Distribution: The nature or shape of the spread of the data over the range of values (such as bell-shaped).
    4. Outliers: Sample values that lie very far away from the vast majority of the other sample values.
    5. Time: Any change in the characteristics of the data over time.
  • We will now begin studying measures of center.
  • There are several main measures of center of a data set:
    1. mean;
    2. median;
    3. mode; and
    4. midrange.
  • Each of these usually gives a different view of where the “most central point” of the data lies.
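As a preview, three of these measures are built into Python's statistics module, and the midrange is a one-liner (the data below are hypothetical):

```python
from statistics import mean, median, mode

data = [96, 71, 85, 99, 120, 125, 128, 88, 90, 90]  # hypothetical sample

print("mean:    ", mean(data))                   # arithmetic average
print("median:  ", median(data))                 # middle of the sorted data
print("mode:    ", mode(data))                   # most frequent value (90 here)
print("midrange:", (min(data) + max(data)) / 2)  # halfway between min and max
```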