Statistics -- The Science of Data

01/21/2020

Getting to know the class

  • For the next 2 minutes, interview someone sitting near you in class.

  • We will have small discussions throughout the course and you should get to know your neighbors.

  • In the following poll, you should be able to report their major and how many years they have studied.

Outline

  • The following topics will be covered in this lecture:

    • What is the subject of “statistics” and why do we study it?
    • What do we mean by “critical thinking” in statistics?
    • What is the key vocabulary in statistics?
    • What is the process of a statistical investigation?

Introduction

  • Broadly, statistics consists of methods for:
    1. collecting;
    2. summarizing;
    3. analyzing; and
    4. interpreting data.
  • We are in a period of history where statistical methods are a driving force in society.
    • When we use internet devices, data is constantly collected in some form and is used to make business decisions.
      • Advertisements are often targeted to our interests based upon the data that is collected.
    • Polls and surveys are frequently used to understand and predict public opinions.
      • We frequently hear about public opinion polls regarding upcoming elections for local and national offices.
    • As voters, we are presented with statistical data to inform our decisions on government policies, including economic forecasts and projections based on different choices.
      • We often must decide if a public project will have benefits that outweigh the costs in, e.g., tax revenue.
  • Critical statistical thinking has become a basic literacy in how we interact with information in society.

An example

Histogram of voluntary computer virus poll.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • In the figure, America OnLine users were asked to respond to the question “Have you ever been hit by a computer virus?”
  • At a glance, it appears that about two to three times as many people have had a computer virus at some point versus those who haven’t.
  • However, this poll and the description of the data has several major flaws:
    1. The vertical axis doesn’t start at zero, which makes the ratio look much greater than it actually is.
    2. The poll collected voluntary responses. Typically, people respond to voluntary polls only if they already have very strong feelings about the topic.
    3. If we want to conclude a statement about all computer users, this sample population may not be representative… Who actually uses America OnLine anymore?
  • Understanding data collection, data interpretation, data presentation and how these affect the decisions based on them is what we call “critical statistical thinking”.
  • One of the primary goals of this course is that everyone can use critical statistical thinking by the end of it.
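
The first flaw above can be checked with a little arithmetic. The following is a minimal Python sketch, assuming made-up poll counts (60 "yes" and 40 "no", not the values in the figure) and a hypothetical helper named apparent_ratio. It compares the ratio of the drawn bar heights when the vertical axis starts at zero with the ratio when the axis starts at 30.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights when the vertical axis begins at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)

# Hypothetical poll counts, NOT the values from the figure.
yes_count, no_count = 60, 40

true_ratio = apparent_ratio(yes_count, no_count)           # axis starts at 0
distorted_ratio = apparent_ratio(yes_count, no_count, 30)  # axis starts at 30

print(f"ratio with axis starting at 0:  {true_ratio:.1f}")
print(f"ratio with axis starting at 30: {distorted_ratio:.1f}")
```

With these invented counts, the bars look twice as lopsided on the truncated axis even though the data have not changed.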

Statistical Vocabulary

  • Discuss: can you describe what the word data might mean in statistics? Please provide an example of data.

    • Data – data are collections of observations, such as experimental measurements, survey responses, etc…
    • One example is the collection of all responses to an opinion poll to UNR students.
    • Another example is the collection of all temperature measurements from a weather balloon taken at one-second intervals.
  • Discuss: can you describe what the word population might mean in statistics? Please provide an example of a population.

    • Population – the complete collection of all measurements (or possible-to-measure data points) being considered.
    • In the example of the opinion poll, the population is every UNR student, regardless of whether they answered.
    • In the example of the weather balloon, the population is the temperature of the entire atmosphere at all times, even if we only measure certain locations at discrete times.
  • Discuss: can you describe what the word sample might mean in statistics? Please provide an example of a sample.

    • Sample – a sample is a subcollection of members selected from a population.
    • In the example of the opinion poll, the sample is the collection of UNR students who actually responded.
    • In the example of the weather balloon, the sample is the collection of temperature measurements at locations and times we have recorded.
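
The relationship between a population and a sample can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population is invented (weekly study hours for 20,000 hypothetical students, drawn from a normal distribution), and the sample is a simple random sample of 200 of them.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly study hours for 20,000 students (invented data).
population = [rng.gauss(15, 5) for _ in range(20_000)]

# A sample is a subcollection of the population: 200 members chosen
# at random, without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population size: {len(population)}, mean: {population_mean:.2f}")
print(f"sample size:     {len(sample)}, mean: {sample_mean:.2f}")
```

In practice we rarely see the whole population, so the sample mean is the only one of these two numbers we would actually be able to compute.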

The process of statistical thinking

  • Statistical thinking or a statistical inquiry has a natural progression.
    • Statistics always relies on some kind of data, which must be collected somehow.
    • Polls must be administered, weather balloons need to be released, etc…
  • The steps of statistical thinking can be loosely summarized as:
    1. Prepare;
    2. Analyze; and
    3. Conclude.
  • Each of these steps includes several elements which we will discuss in the following.

Prepare

  • In preparing a statistical study, we should consider the following:
    1. Context
      • What does the data mean?
      • What was the purpose of the study / data collection?
      • Is this data appropriate for our question of interest?

    2. Source of the data
      • Is the data from a source with a special interest?
      • Would there be pressure to obtain results that are favorable to the source?
      • E.g., health studies from the tobacco industry that denied the link between smoking and lung cancer.
    3. Sampling Method
      • Was the data collected in a way that was unbiased?
      • Are there reasons why the sample wouldn't reflect the population?
      • For example, are the answers all voluntary?
      • Do all participants self-select or are the participants selected methodically?
      • Are there reasons why certain segments of the population wouldn't respond to the poll?
      • In an experiment, was the measurement instrument used appropriate in this context?
      • Is there missing data or are there errors in the data?

Analyze

  • After preparing the data, we use mathematical techniques to analyze the data.
  • Luckily, in these times computing power is cheap and very little of the analysis is done by hand.
  • In homework assignments, we will use StatCrunch to:
    1. Graph the data.
    2. Explore the data.
      • Look at the “shape” of the data, e.g., outliers, many observations of the same value or few very extreme values.
    3. Apply mathematical methods to better understand or summarize the data quantitatively.
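
The course uses StatCrunch for these steps. Purely as an illustration of what "exploring" and "summarizing" can mean, here is a small Python sketch on an invented data set. The 1.5 × IQR rule used to flag outliers is a common convention assumed here, not something specified in the lecture.

```python
import statistics

# Invented measurements with one very extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

mean = statistics.mean(data)
median = statistics.median(data)

# Quartiles split the sorted data into four parts of equal size.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(f"mean = {mean}, median = {median}")
print(f"quartiles = {q1}, {q2}, {q3}")
print(f"flagged as outliers: {outliers}")
```

The single extreme value pulls the mean well above the median, which is one reason to look at the shape of the data before summarizing it.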

Conclude

  • Finally, the goal of a statistical analysis is to reach some kind of conclusion.
  • Honest conclusions in statistics are uncertain and they should be treated carefully.
  • We will always have incomplete, and often somewhat inaccurate, samples of the population of interest.
  • Usually, the uncertainty of a conclusion is quantified by the statistical significance.
  • This loosely measures how surprising the observed result would be if it had arisen purely by random chance.
    • For example, if after taking a certain fertility drug in a clinical trial, 98 out of 100 children born to the parents taking the drug were girls, we would call the result statistically significant because this was very unlikely to happen just by chance.
    • If the trial resulted in 52 girls out of 100, this could very easily happen by chance and we wouldn’t call the result statistically significant.
  • However, even when we are very confident about the conclusions, these conclusions must be qualified carefully by the uncertainty in the analysis.
  • Likewise, we should consider both our confidence in the results (statistical significance) and the practical significance.
    • E.g., in a test of the Atkins weight loss program, 40 subjects using that program had a mean weight loss of 4.6 pounds after one year (see note 1).
    • Using statistical methodology, we can say with relatively strong statistical certainty that the diet is effective in reducing the average weight of participants.
    • However, the actual average weight loss is practically insignificant at only 4.6 pounds, and in this sense we should be careful in calling the diet “effective” because it has low practical significance.
1. Based on data from “Comparison of the Atkins, Ornish, Weight Watchers, and Zone Diets for Weight Loss and Heart Disease Risk Reduction,” by Dansinger et al., Journal of the American Medical Association, Vol. 293, No. 1.
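
The fertility drug example above can be made concrete with a short calculation. This sketch assumes that, if the drug had no effect, each of the 100 births would independently be a girl with probability 0.5, so the number of girls follows a binomial distribution. The function name prob_at_least is chosen here for illustration.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) when X counts successes in n independent trials with chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumption: with no drug effect, each birth is a girl with probability 0.5.
p_98 = prob_at_least(98, 100)  # at least 98 girls out of 100
p_52 = prob_at_least(52, 100)  # at least 52 girls out of 100

print(f"P(at least 98 girls by chance) = {p_98:.3g}")
print(f"P(at least 52 girls by chance) = {p_52:.3g}")
```

Under this assumption the first probability is on the order of 10^-27, while the second is about 0.38, which matches the informal judgment that 98 girls is very surprising and 52 girls is not.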

Potential pitfalls in data analysis

  • There are a number of common errors in the process of data analysis that can dramatically affect the conclusions of a statistical inquiry.

  • A non-exhaustive list includes:

    • Relying on reported results – it is usually preferable to make direct measurements rather than to let participants in a study report their own measurements.
    • For example, if we ask participants in a study what their weight is, they are likely to report their desired weight and not actual weight.
    • Small number of samples – if we have too few observations from a population, the sample may be grossly unrepresentative of that population.
    • Loaded questions – different phrasings of a survey question can dramatically change the responses of those surveyed. When a survey question is written to elicit a certain response, we call this a loaded question.
    • Poorly ordered questions – sometimes the order of questions can lead to a reaction from those responding, intentional or not. If the order of questions isn't chosen carefully, this can affect the results.
    • Not considering non-responses – it is possible that those who don't respond to polls do so for specific reasons and not by random chance. The non-responses as a class of data can have an important impact on our analysis of the population.
    • We will discuss these topics again in the next lecture…
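
The pitfall of a small number of samples can be seen in a simulation. This is a rough sketch, assuming a hypothetical population in which 60% would answer "yes" to some survey question. It repeats the survey many times at two sample sizes and measures how much the estimated proportion varies from one survey to the next.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable
TRUE_PROPORTION = 0.6   # hypothetical share of the population answering "yes"

def sample_proportion(n):
    """Proportion of "yes" answers in one random sample of size n."""
    return sum(rng.random() < TRUE_PROPORTION for _ in range(n)) / n

def spread_of_estimates(n, repeats=500):
    """Standard deviation of the sample proportion over many repeated samples."""
    return statistics.stdev([sample_proportion(n) for _ in range(repeats)])

spread_small = spread_of_estimates(10)    # surveys of 10 people
spread_large = spread_of_estimates(1000)  # surveys of 1000 people

print(f"spread of estimates with n = 10:   {spread_small:.3f}")
print(f"spread of estimates with n = 1000: {spread_large:.3f}")
```

Estimates from the small surveys scatter far more widely around the true proportion, so any single small survey can be badly off.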

Potential pitfalls in making conclusions

Graph of the correlation between the number of letters in the winning word of the Scripps Spelling Bee and the number of deaths by spider bites each year.

Courtesy of Tyler Vigen, Spurious Correlations

  • We must also be careful about the conclusions we make from the analysis.
  • For example, we must always remember that correlation is not the same thing as causation.
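
To see how easily a strong correlation can appear between unrelated quantities, consider the sketch below. The two series are invented for illustration (they are not the values from the figure) and share only the fact that both tend to rise over time. The helper pearson_r computes the usual Pearson correlation coefficient.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented yearly values for two unrelated quantities that both drift upward.
series_a = [8, 9, 9, 10, 11, 11, 12, 13]
series_b = [5, 6, 7, 7, 9, 10, 10, 12]

r = pearson_r(series_a, series_b)
print(f"correlation coefficient r = {r:.2f}")
```

A value of r close to 1 here says only that the two series move together. It gives no evidence that either one causes the other.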