FAIR USE ACT DISCLAIMER: This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education, and research.
Courtesy of Doxiadis, Apostolos, and Christos Papadimitriou. Logicomix: An epic search for truth. Bloomsbury Publishing USA, 2015.
Most objects we have worked with in R can be treated as mathematical sets.
Consider the following vector:
example_vector <- c(1, 2, 3, 4)
Vectors have additional structure that sets don't impose – i.e., there is an order to the elements of example_vector.
However, we can treat example_vector as a set by applying set operations to it.
The statement \( x\in A \) can be read as “x is in the collection named A”.
Q: can you think of an R operation that would give a logical output indicating which elements of example_vector are contained in some other collection B?
A: the %in% operator, as in
B <- c(2, 5, 6, 7, 1)
A <- example_vector
A %in% B
[1] TRUE TRUE FALSE FALSE
With the notion of set containment, we can consider the definition of a set intersection: \[ \begin{align} A \cap B = \{\text{All }x\text{ such that }x\in A\text{ and } x\in B\} \end{align} \]
This operation can be replicated with the %in%
operator, but there are also built-in set operations such as the intersection,
A[ A %in% B]
[1] 1 2
intersect(A,B)
[1] 1 2
There are several other set operations besides the intersection, including
the union, \[ \begin{align} A \cup B = \{\text{All }x\text{ such that }x\in A\text{ or } x\in B\} \end{align} \]
union(A,B)
[1] 1 2 3 4 5 6 7
the set difference (the elements of the first set that are not in the second),
setdiff(A,B)
[1] 3 4
setdiff(B,A)
[1] 5 6 7
and a test for set equality,
setequal(A,B)
[1] FALSE
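Because sets impose no order and no multiplicity, two vectors that differ as vectors can still represent the same set. A small base-R illustration:

```r
# Sets ignore order and repetition: these two vectors differ as
# vectors, but they represent the same set of elements.
setequal(c(1, 2, 2, 3), c(3, 1, 2))
# [1] TRUE

# union() and intersect() likewise drop duplicates from their output:
union(c(1, 1, 2), c(2, 3))
# [1] 1 2 3
```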
A combination of several rolls of a die or tosses of a coin leads to more possible results, such as tossing a coin twice, with the sample space of ordered pairs
\[ \Omega = \{(H, H), (H, T), (T, H), (T, T)\}. \]
Generally, the combination of several different experiments yields a sample space with all possible combinations of the single events.
If, for instance, one needs two coins to fall on the same side, then the favored event consists of two outcomes: \( (H, H) \) and \( (T, T) \).
The prob package, developed by G. Jay Kerns, is designed specifically for probabilistic experiments like the above.
It provides methods for elementary probability calculation on finite sample spaces, including counting tools, defining probability spaces discussed later, performing set algebra, and calculating probabilities.
require(prob)
ev = tosscoin(2)
probspace(ev)
toss1 toss2 probs
1 H H 0.25
2 T H 0.25
3 H T 0.25
4 T T 0.25
The argument of tosscoin() refers to how many coin tosses we perform:
probspace(tosscoin(3))
toss1 toss2 toss3 probs
1 H H H 0.125
2 T H H 0.125
3 H T H 0.125
4 T T H 0.125
5 H H T 0.125
6 T H T 0.125
7 H T T 0.125
8 T T T 0.125
Courtesy of Bin im Garten CC via Wikimedia Commons
Several experiment-generating functions are available in the prob package for R:
urnsamples(x, size, replace = FALSE, ordered = FALSE, ...),
tosscoin(ncoins, makespace = FALSE),
rolldie(ndies, nsides = 6, makespace = FALSE),
cards(jokers = FALSE, makespace = FALSE),
roulette(european = FALSE, makespace = FALSE).
If the argument makespace is set TRUE, the resulting data frame has an additional column showing the (equal) probability of each single event.
In the simple probability model, the probability of an event can be computed as the relative frequency of possible outcomes, as discussed earlier.
However, a general, finite probability space can be formed using
probspace(outcomes, probs)
with a vector of possible simple outcomes and a vector of the associated probabilities for each simple outcome.
The full probability space will then be created as all set combinations of the simple outcomes, with their probabilities constructed using the relationships described earlier.
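As a minimal sketch of a non-uniform space (assuming the prob package is loaded), a biased coin with probability 0.7 of heads could be encoded by pairing each simple outcome with its probability:

```r
require(prob)
# A biased coin: outcomes "H" and "T" with unequal probabilities.
# probspace() pairs each simple outcome with the supplied probability.
biased_coin <- probspace(c("H", "T"), probs = c(0.7, 0.3))
biased_coin
```

The result is a data frame with one row per simple outcome and a probs column, just like the equal-probability spaces produced above.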
For the remaining lecture, we will be focusing on the classical urn sampling problem, and how we can generate different kinds of random sampling.
The basic idea behind an “urn” problem is that we have an urn or a bin in which different objects are stored.
These objects have no order and are assumed to be well-mixed.
We will take two turns drawing one toy vehicle at a time from the urn, without replacement:
ev <- urnsamples(c("bus", "car", "bike", "train"), replace=FALSE, size= 2, ordered=TRUE)
probspace(ev)
X1 X2 probs
1 bus car 0.08333333
2 car bus 0.08333333
3 bus bike 0.08333333
4 bike bus 0.08333333
5 bus train 0.08333333
6 train bus 0.08333333
7 car bike 0.08333333
8 bike car 0.08333333
9 car train 0.08333333
10 train car 0.08333333
11 bike train 0.08333333
12 train bike 0.08333333
In the last example we specified that we do not return an object once it is drawn, that the order in which the objects are drawn matters and that there is only one copy of each object in the urn.
When we sample without replacement, and with order mattering, the number of all possible samples of size \( k \) from a set of \( n \) objects is given by,
\[ \begin{align} \frac{n!}{(n-k)!} & & n! = n (n-1) (n-2) \cdots (2)(1) \end{align} \]
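This count can be checked directly in base R with the factorial() function:

```r
# Number of ordered samples of size k drawn without replacement
# from n distinct objects: n! / (n - k)!
n <- 4
k <- 2
factorial(n) / factorial(n - k)
# [1] 12
```

This matches the twelve rows of the ordered, without-replacement probability space above.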
Suppose we want to find the probability that the second object drawn is a bike,
In total there are \[ \frac{4!}{(4-2)!} = \frac{4!}{2!} = 3\times 4 = 12 \] possible outcomes, so that
Prob(probspace(ev), X2 == "bike" )
[1] 0.25
ev <- urnsamples(c("bus", "car", "bike", "train"), replace=TRUE, size= 2, ordered=TRUE)
probspace(ev)
X1 X2 probs
1 bus bus 0.0625
2 car bus 0.0625
3 bike bus 0.0625
4 train bus 0.0625
5 bus car 0.0625
6 car car 0.0625
7 bike car 0.0625
8 train car 0.0625
9 bus bike 0.0625
10 car bike 0.0625
11 bike bike 0.0625
12 train bike 0.0625
13 bus train 0.0625
14 car train 0.0625
15 bike train 0.0625
16 train train 0.0625
Prob(probspace(ev), X2 == "bike" )
[1] 0.25
Sampling with replacement, we thus obtain the same probability of drawing a bike on the second draw.
ev <- urnsamples(c("bus", "car", "bike", "train"), replace=FALSE, size= 2, ordered=FALSE)
probspace(ev)
X1 X2 probs
1 bus car 0.1666667
2 bus bike 0.1666667
3 bus train 0.1666667
4 car bike 0.1666667
5 car train 0.1666667
6 bike train 0.1666667
In this case, the probability space has only six possible outcomes, given by all un-ordered combinations of the different objects in the urn.
Mathematically, this is described by the “choose” function for \( n \) objects choosing combinations of size \( k \) \[ {n \choose k} = \frac{n!}{k!(n-k)!}. \]
In our example, we have \[ {4 \choose 2} = \frac{4!}{2!(4-2)!} = \frac{4!}{2!\,2!} = \frac{24}{4} = 6. \]
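Base R provides the choose() function for this computation:

```r
# n choose k: the number of unordered samples of size k
# from n distinct objects, without replacement
choose(4, 2)
# [1] 6
```

This matches the six rows of the un-ordered probability space above.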
The examples above are very specific and restricted to a particular sample space, but we can address sampling from a general perspective with the same principles.
In each case, some random selection mechanism is involved: the theory behind this is called probabilistic sampling.
An important example of randomized sampling is the following:
In 1954, a large-scale experiment was designed to test the effectiveness of the Salk vaccine in preventing polio.
The 401,974 children in the Salk vaccine experiment were assigned to the Salk vaccine treatment group or the placebo group via a process of random selection equivalent to flipping a coin.
You can encode whether someone is in the treatment or control group in a binary way:
Randomly drawing “Treatment” or “Control” with equal probability is equivalent to flipping a fair coin.
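As a minimal sketch of such an assignment (the ten subjects here are illustrative, not from the Salk trial), each subject can be assigned by an independent, equal-probability draw:

```r
set.seed(1)  # for reproducibility of the random draws
# Assign 10 hypothetical subjects to groups with equal probability --
# equivalent to flipping a fair coin for each subject.
assignment <- sample(c("Treatment", "Control"), size = 10, replace = TRUE)
table(assignment)
```

With a large number of subjects, the two groups will be close to equal in size, though chance fluctuations remain for small samples.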
The logic behind randomization is to use chance as a way to create two groups that are similar.
With a large enough sample for both treatment and control groups, this can be very effective when it is difficult to balance factors like age, gender, height, weight, etc… across groups that might affect the outcome.
In randomized sampling, chance is being utilized to balance the many population factors across the control and treatment groups.
The most basic way to perform such a randomized sample is like the coin flip or urn example, where all outcomes have equal probability.
Simple random sample: a simple random sample of \( n \) subjects is selected in such a way that every possible sample of the same size \( n \) has the same chance of being chosen.
The purpose of this randomized sampling is to produce a sample that gives an accurate representation of the population being studied with a limited number of observations.
However, especially with small sample sizes, it is possible that a random sample can be biased and not adequately represent the true population.
One approach to handle this is known as stratified sampling.
Groups are chosen such that subjects within the same subgroup share the same characteristics (such as undergraduate versus graduate student) and each population member belongs to one and only one subgroup.
We then draw a simple random sample from each subgroup (or stratum) with a total number of observations taken from each subgroup proportional to the group's membership in the population.
require(sampling)
set.seed(0)
student_df <- data.frame(graduate_status = c(rep("U", 800), rep("G", 200)), student_number = sample(1:1000, size=1000))
head(student_df)
graduate_status student_number
1 U 398
2 U 836
3 U 679
4 U 129
5 U 930
6 U 509
tail(student_df)
graduate_status student_number
995 G 589
996 G 803
997 G 774
998 G 558
999 G 898
1000 G 188
selected_numbers <- sample(1:1000, size=10)
student_df[student_df$student_number %in% selected_numbers,]
graduate_status student_number
66 U 45
151 U 573
200 U 587
307 U 823
584 U 752
597 U 153
610 U 452
624 U 379
627 U 400
992 G 78
selected_numbers <- sample(1:1000, size=100)
selected_students <- student_df[student_df$student_number %in% selected_numbers,]
cat(sum(selected_students$graduate_status=="U"), "total undergraduates selected\n")
75 total undergraduates selected
cat(sum(selected_students$graduate_status=="G"), "total graduates selected\n")
25 total graduates selected
require(sampling)
st <- strata(student_df, stratanames="graduate_status", size=c(16,4), method="srswor")
stratified_sample <- getdata(student_df, m=st)
stratified_sample
student_number graduate_status ID_unit Prob Stratum
65 642 U 65 0.02 1
142 43 U 142 0.02 1
219 853 U 219 0.02 1
305 609 U 305 0.02 1
348 895 U 348 0.02 1
350 652 U 350 0.02 1
442 735 U 442 0.02 1
544 256 U 544 0.02 1
567 805 U 567 0.02 1
590 206 U 590 0.02 1
596 409 U 596 0.02 1
681 507 U 681 0.02 1
727 411 U 727 0.02 1
746 846 U 746 0.02 1
759 395 U 759 0.02 1
775 766 U 775 0.02 1
849 667 G 849 0.02 2
919 708 G 919 0.02 2
975 68 G 975 0.02 2
993 120 G 993 0.02 2
A variety of different sampling procedures can be introduced to better balance a sample with respect to how it will reflect a population.
Because of the notion of randomness in our observations, probability is essential for modeling the results we see in observed data sets.
Particularly, in the next session we will review the notion of a random variable and some fundamental probability distributions.