Introduction to statistical concepts part III

01/28/2020

Outline

  • The following topics will be covered in this lecture:

    • Observational studies versus experiments
    • Methods of sampling
    • Types of observational studies
    • Types of experimental design

Methods of sampling

  • We have begun to consider the differences between a sample and population.

  • The motivation of sampling is to represent the population with a smaller collection of data points, the sample.

  • However, if the data is not collected in a methodical way, the sample may grossly misrepresent the population.

    • Loosely speaking, this is what is meant by bias in a sample.
  • We will now consider in detail how we can methodically choose samples, to reduce the effects of bias.

Observational studies versus experiments

  • We typically have data that can be categorized into one of the following two types:

    • Observational studies – in an observational study, we observe and measure specific characteristics, but we don’t attempt to modify the subjects being studied.
    • Experiments – in an experiment, we apply some treatment and then proceed to observe its effects on the subjects. (Subjects in experiments are called experimental units.)
  • The differences between the two types of data can be easily seen considering a clinical trial.

    • Suppose we want to determine the effectiveness of a hair growth drug in a clinical trial.
    • One group will be given a treatment of the new drug.
    • A control group will be given a placebo.
    • We will try to measure the effect of the treatment by the difference between the treatment and control groups.
  • An observational version of this study might take the following form:

    • Ten years after the hair growth drug is released, we examine rates of adult baldness, without having modified any of the subjects.
    • By examining whether population trends have changed, i.e., whether rates of adult baldness have been reduced, we will try to conclude whether the drug has had an effect.
  • A major difference between the two types of study is that observational studies do not have a way to control for non-measured variables that may have an effect on the study.

    • A variable that has an effect on the outcome that is not part of the study is called a “lurking” or “latent” variable.

Observational studies versus experiments

  • Discuss with a neighbor: suppose a poll is given to UNR students about the quality of UNR food services. Is this poll an observational study or an experiment?

    • This is an observational study because the data is collected without modifying the behavior or applying a treatment to the subjects.
  • Discuss with a neighbor: suppose UNR wants to examine student satisfaction with possible menu changes in its food services. One group of students is given the current menu, while another group is given a new menu, and the satisfaction of each student is recorded for a month. Is this study an observational study or an experiment?

    • This is an experiment because there is a group of students given the treatment of a new menu, while the control group keeps the current menu.
    • By examining the differences between the treatment and the control, we can try to measure the effect of the new menu.

Random Sampling

  • We have seen that voluntary sampling is flawed because it leads certain groups (with strong feelings about the questions) to be highly represented, while other groups (who may not care) have limited representation.

  • Put another way, one group has a higher probability of responding than other groups.

  • This is the motivation for random sampling…

    • we can try to make certain that the probability of getting any group of responses is the same as for any other group.
  • Simple random sample: a simple random sample of \( n \) subjects is selected in such a way that every possible sample of the same size \( n \) has the same chance of being chosen.

    • Suppose we are taking a poll of UNR students and we will have a sample size of 1000.
    • This means that every possible combination of 1000 UNR students must be equally likely to be selected based on our sampling method.
    • E.g., we can randomly select students to give interview responses based on their student ID numbers.
    • Note: a simple random sample is often called a random sample, but strictly speaking, a random sample has the weaker requirement that all members of the population have the same chance of being selected.
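
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 20,000 student ID numbers and the helper name `simple_random_sample` are hypothetical, not part of the lecture. The standard library's `random.sample` draws without replacement, so every subset of size \( n \) is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: 20,000 student ID numbers (illustrative only).
student_ids = list(range(100000, 120000))

# A poll of 1000 students, as in the example above.
poll_sample = simple_random_sample(student_ids, 1000, seed=42)
print(len(poll_sample), len(set(poll_sample)))
```

Fixing the seed only makes the draw reproducible for demonstration; in a real poll the seed would normally be left unset.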

Systematic sampling

Diagram of every 3rd individual in a population being sampled.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Systematic sampling: in systematic sampling, we select some starting point and then select every kth (such as every 50th) element in the population.
  • For example, a grocery store decides to give a poll to every 3rd customer who enters in a day.
  • This can be effective when the order of people entering doesn’t hide some pattern – we have no reason to believe that the 2nd, 5th, 7th, etc. person would differ from the people in the sampled sequence in a systematic way.
  • This is different from a simple random sample because not every group of 4 out of 12 people has the same probability of being selected as the group consisting of the 3rd, 6th, 9th and 12th people.
  • For instance, the 1st, 2nd, 3rd and 4th entrants to the grocery store have zero probability of becoming a sample based on this rule.
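
The contrast with a simple random sample can be checked numerically. The sketch below is a hypothetical illustration: the function name `systematic_sample` and the choice of a random starting point among the first \( k \) positions are assumptions, since the slide only says "some starting point". It lists every sample the rule can produce for 12 customers with \( k = 3 \) and compares that with the number of possible groups of 4.

```python
import math
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k positions, then take every k-th element."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based offset: 0, 1, ..., k-1
    return population[start::k]

# 12 customers, numbered by order of entry.
customers = list(range(1, 13))

# Every sample the rule can ever produce when k = 3.
all_possible = {tuple(customers[s::3]) for s in range(3)}
print(sorted(all_possible))

# Number of distinct groups of 4 out of 12 that a simple random sample could produce.
print(math.comb(12, 4))
```

Only 3 of the 495 possible groups of 4 can ever be drawn, and the group (1, 2, 3, 4) is not among them.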

Stratified sampling

Diagram of random sampling within political parties.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Stratified sampling: we subdivide the population into at least two different subgroups (or strata).
  • Groups are chosen such that subjects within the same subgroup share the same characteristics (such as political party).
  • We then draw a sample from each subgroup (or stratum).
  • This technique works best when all members of the population belong to one and only one of the strata.
  • This allows us to keep the sample balanced with respect to the groups, since we can randomly sample each stratum in proportion to its percentage of the population.
  • In the political party example, we need to make sure every member of the population is in one and only one group.
  • In a case like this, it can make sense to have groups such as “Democrat”, “Republican”, “Third Party” and “Unaffiliated” so that we fulfill this requirement.
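
Proportional sampling within strata can be sketched as follows. The party proportions, the voter list, and the helper name `stratified_sample` are all made up for illustration. The sketch assumes each member belongs to exactly one stratum, as required above.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, n, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        # Each member lands in exactly one stratum.
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding means the total can differ slightly from n.
        size = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical electorate: (voter id, party) pairs with invented proportions.
parties = (["Democrat"] * 400 + ["Republican"] * 350
           + ["Third Party"] * 50 + ["Unaffiliated"] * 200)
voters = list(enumerate(parties))

stratified = stratified_sample(voters, stratum_of=lambda v: v[1], n=100, seed=1)
counts = {p: sum(1 for _, party in stratified if party == p) for p in set(parties)}
print(counts)
```

With these invented proportions, a sample of 100 contains 40, 35, 5 and 20 voters from the four strata, matching each stratum's share of the population.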

Cluster sampling

Diagram of every student within certain class sections being sampled -- selection of the classes is random.

Courtesy of Mario Triola, Essentials of Statistics, 5th edition

  • Cluster sampling: we first divide the population area into sections (or clusters).
  • Then we randomly select some of those clusters and choose all the members from those selected clusters.
  • This type of sampling works well when the members within each cluster are heterogeneous, while the clusters themselves are relatively homogeneous with respect to each other.
  • That is, each cluster should be a small-scale representation of the entire population.
  • In this example we randomly select classes at UNR. Each class section is a cluster.
  • We obtain responses from all students in the randomly selected classes.
  • This method of sampling would work in the class section example if we selected classes from a list of general requirements.
  • In this case, students in the classes would be from all majors and thus heterogeneous.
  • However, because students are required to take all of the classes in the list, the clusters are homogeneous with respect to other clusters.
  • The example from Triola is a case where this wouldn’t work well, unless all majors were required to take Architecture, Art History, Biology, Zoology, etc…
  • Otherwise, e.g., the clusters of Art History and Zoology would not be homogeneous with respect to each other.
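
The mechanics of cluster sampling (randomly choose clusters, then keep everyone in them) can be sketched as below. The section names, roster sizes, and the helper name `cluster_sample` are hypothetical. The sketch does not check whether the clusters are actually representative of the population, which is the condition discussed above.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then keep every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical class sections, each mapping to a roster of 30 students.
sections = {
    f"Section {i}": [f"student_{i}_{j}" for j in range(30)]
    for i in range(1, 11)
}

chosen, respondents = cluster_sample(sections, 3, seed=7)
print(chosen, len(respondents))
```

Unlike stratified sampling, which draws some members from every group, this draws every member from only some groups.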

Multistage sampling methods

  • Often for complex, heterogeneous populations like the United States, a multistage design is implemented to get a small sample that can still reflect the highly complex population.
  • For example, the U.S. government’s unemployment statistics are based on surveyed households.
  • It would be very difficult to visit each member of a simple random sample, because individual households are spread all over the country – there is a practical limitation of where sampling can be concentrated.
  • The U.S. Census Bureau and the Bureau of Labor Statistics instead collaborate to conduct a survey called the Current Population Survey. This is sampled in the following steps:
    1. The entire United States is partitioned into 2025 different regions called primary sampling units (PSUs) including metropolitan areas, large counties, or combinations of smaller counties. These 2025 PSUs are then grouped into 824 different strata.
    2. In each of the 824 different strata, one PSU is selected in a way such that the probability of selection is proportional to the size of the population in each primary sampling unit.
    3. In each of the 824 selected PSUs, the households are broken up into census enumeration districts of about 300 households – enumeration districts are then randomly selected.
    4. Finally, about 4 households in each enumeration district are selected randomly.
  • This technique actually utilizes stratified, cluster and random sampling at different stages in the process.
  • This process helps the sample reflect the spatially heterogeneous population, while keeping it practical to send interviewers to fewer, concentrated locations.
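
The staged structure described above can be sketched with toy numbers. This is not the actual Current Population Survey procedure: the data layout, the names `toy_strata` and `multistage_sample`, the population figures, and the choice of a single enumeration district per PSU are all assumptions made for illustration.

```python
import random

def toy_strata(n_strata=3, psus_per_stratum=2, districts_per_psu=5, households=300):
    """Build a small, invented hierarchy: stratum -> PSU -> district -> households."""
    strata = {}
    for s in range(n_strata):
        psus = {}
        for p in range(psus_per_stratum):
            districts = {
                f"D{s}-{p}-{d}": [f"H{s}-{p}-{d}-{h}" for h in range(households)]
                for d in range(districts_per_psu)
            }
            # Invented population sizes so that the PSUs differ in size.
            psus[f"PSU{s}-{p}"] = {"population": (p + 1) * 1000, "districts": districts}
        strata[f"stratum{s}"] = psus
    return strata

def multistage_sample(strata, households_per_district=4, seed=None):
    rng = random.Random(seed)
    selected = []
    for stratum, psus in strata.items():
        names = list(psus)
        weights = [psus[name]["population"] for name in names]
        # One PSU per stratum, with probability proportional to population size.
        psu = rng.choices(names, weights=weights, k=1)[0]
        # Randomly pick one enumeration district in that PSU (one is an assumption).
        districts = psus[psu]["districts"]
        district = rng.choice(list(districts))
        # Randomly pick a few households within the chosen district.
        picked = rng.sample(districts[district], households_per_district)
        selected.extend((stratum, psu, district, h) for h in picked)
    return selected

cps_like = multistage_sample(toy_strata(), seed=3)
print(len(cps_like))
```

Each stage mirrors one of the numbered steps: strata guarantee coverage of every region type, the size-weighted draw picks the PSU, and the last two draws keep interviewers concentrated in a few districts.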

Examples of sampling methods

  • Discuss with a neighbor: what sampling method is being used in each of the following examples?

    • Twitter poll – In a Pew Research Center poll, 1007 adults were called after their telephone numbers were randomly generated by a computer, and 85% of the respondents were able to correctly identify what Twitter is.
    • This is random sampling of users with telephone numbers, where everyone with a telephone is equally likely to be selected.
    • However, we are assuming that the probability of someone answering their phone is equal across variables of interest.
    • Ecology – When collecting data from different sample locations in a lake, a researcher uses the “line transect method” by stretching a rope across the lake and coll