A review of discrete probability and sets

Outline

  • The following topics will be covered in this lecture:
    • A review of basic set theory
    • A review of basic probability theory
    • Some basic probabilistic experiments with finite sample spaces
    • Sampling procedures

Basic set theory: a history of sets

  • In the nineteenth century, the German mathematician Georg Cantor developed the greater part of today’s set theory.
  • Set theory lies at the basis of mathematical logic and of how, much as in computing, mathematical objects are organized into hierarchies of classes with certain properties.
  • At the turn of the nineteenth and twentieth centuries, Ernst Zermelo, Bertrand Russell, Cesare Burali-Forti and others found contradictions in the originally proposed set theory, with one of the famous results being Russell’s paradox.
Description of Russell's paradox.

Courtesy of Doxiadis, Apostolos, and Christos Papadimitriou. Logicomix: An epic search for truth. Bloomsbury Publishing USA, 2015.

A history of sets

  • Russell’s paradox is popularly illustrated by the story of a village barber: it is decreed that the barber is the one who shaves those that do not shave themselves.
  • The question then is: who shaves the barber, if the barber shaves exactly those that do not shave themselves?
  • In terms of sets, there was an early approach that defined a set as any identifiable collection; however, the question arose:
    • if a set \( A \) is identified as the collection of all collections that do not contain themselves, does \( A \) belong to itself or not?
  • Contradictions like the above led Ernst Zermelo in 1908 to give an axiomatic system which precisely described the existence of certain sets and the formation of sets from other sets.
  • Later extended by Abraham Fraenkel, this Zermelo–Fraenkel set theory is still the most common axiomatic system for set theory.
  • It consists of 9 axioms, which deal with, among other things, set equality, regularity, pairing sets, infinity, and power sets.
  • In this lecture, we will be interested in how this allows us to calculate probabilities for discrete distributions, with the finite possible outcomes treated as sets or collections.

Common sets in mathematics

  • In mathematics, the most commonly considered sets are the following:
    • \( \mathbb{N} \): the set of natural numbers, i.e. \( \{1, 2, 3, 4, \cdots \} \).
    • \( \mathbb{Z} \): the set of integers, i.e. \( \{\cdots, -3, -2, -1, 0, 1, 2, 3, \cdots \} \).
    • \( \mathbb{Q} \): the set of rational numbers, i.e., all numbers \( q \) that can be represented as ratios of two integers, \( q=\frac{z_1}{z_2} \) for \( z_1,z_2\in \mathbb{Z} \) with \( z_2 \neq 0 \).
    • \( \mathbb{R} \): the set of real numbers, i.e., all rationals and irrational numbers like \( \sqrt{2} \).
    • \( \mathbb{C} \): the set of complex numbers, i.e., all real numbers and all extensions of the real numbers to the complex plane.
  • For each set there is a cardinal number which stands for the magnitude or number of elements in the set, even for infinite sets.
    • The cardinal number of \( \mathbb{N} \) is \( \aleph_0 \) (‘Aleph null’), which represents a “smaller” infinity than the infinity of \( \mathbb{R} \), \[ {\mathfrak {c}}=2^{\aleph _{0}} > \aleph_{0}. \]
  • In R, of course, one cannot create infinite sets, due to the finite storage capacity of any computer.
  • The sets above are named by a fixed character or letter, whereas other sets in the literature are labelled arbitrarily with a Latin or Greek letter.

Basics of set theory in R

  • Most objects we have worked with in R can be treated as mathematical sets.

  • Consider the following vector:

example_vector <- c(1, 2, 3, 4)
  • Vectors have additional structure that sets do not have: there is an order to the elements of the example_vector.

  • However, we can treat the example_vector as a set by using set operations on it.

  • The statement \( x\in A \) can be read as “x is in the collection named A”.

  • Q: can you think of an R operation that would give a logical output indicating whether each element of the example_vector is contained in some collection B?

    • A: This can be performed with the %in% operator, as in
B <- c(2, 5, 6, 7, 1)
A <- example_vector
A %in% B
[1]  TRUE  TRUE FALSE FALSE

Basics of set theory in R

  • With the notion of set containment, we can consider the definition of a set intersection: \[ \begin{align} A \cap B = \{\text{All }x\text{ such that }x\in A\text{ and } x\in B\} \end{align} \]

  • This operation can be replicated with the %in% operator, but there are also built-in set operations such as the intersection,

A[ A %in% B]
[1] 1 2
intersect(A,B)
[1] 1 2
  • There are several other set operations besides containment and intersection, including

  • the union, \[ \begin{align} A \cup B = \{\text{All }x\text{ such that }x\in A\text{ or } x\in B\} \end{align} \]

union(A,B)
[1] 1 2 3 4 5 6 7

Basics of set theory in R

  • the difference,
setdiff(A,B)
[1] 3 4
setdiff(B,A)
[1] 5 6 7
  • and test for equality,
setequal(A,B)
[1] FALSE
  • All these functions contained in base R can be applied to vectors or matrices.
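
  • Base R does not provide a dedicated subset test, but one is easy to sketch with the %in% operator; the helper below is our own, not a built-in:

# test whether every element of A is also an element of B
is_subset <- function(A, B) all(A %in% B)
is_subset(c(1, 2), A)   # TRUE: 1 and 2 both occur in A
is_subset(B, A)         # FALSE: 5, 6 and 7 do not occur in A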

Probabilistic experiments with finite outcome spaces

  • When working with data that is subject to random variation, the theory of probability becomes important.
  • There are two types of experiments:
    1. deterministic, in which all elements of the experiment are controlled precisely, and
    2. random, usually in the form of how the data is sampled or by the way error is introduced by uncertainty in measurements and experimental parameters.
  • In the following we will review the main ideas about nondeterministic processes with a finite number of possible outcomes.
  • A random trial (or experiment) yields one of the distinct outcomes that altogether form the sample or event space, which we will call \( \Omega \).
  • All possible outcomes constitute the universal event \( \Omega \), and for an experiment with finite possible outcomes like a single coin flip, we can actually write out \( \Omega \) as a finite collection or set.
  • Subsets of \( \Omega \) are called events,
    • e.g. \( \Omega \) itself is the universal event.
  • Q: suppose our experiment is to make a single, fair coin flip where a “heads” is represented by \( H \) and a tails is represented by \( T \). Can you identify the collection of all possible outcomes \( \Omega \)?
    • A: in this case \( \Omega = \{H,T\} \).

Probabilistic experiments with finite outcome spaces

  • A combination of several rolls of a die or tosses of a coin leads to more possible results, such as tossing a coin twice, with the sample space
    \[ \Omega = \{(H, H), (H, T), (T, H), (T, T)\}. \]

  • Generally, the combination of several different experiments yields a sample space with all possible combinations of the single events.

  • If, for instance, one needs two coins to fall on the same side, then the favored event is a set of two elements: \( \{(H, H), (T, T)\} \).

  • The prob package, developed by G. Jay Kerns, is specifically for probabilistic experiments like the above.

  • It provides methods for elementary probability calculation on finite sample spaces, including counting tools, defining probability spaces discussed later, performing set algebra, and calculating probabilities.

Probabilistic experiments with finite outcome spaces

  • In the below, we will run the two fair coin flip experiment and provide a summary of the probability distribution of the events in \( \Omega \):
require(prob)
ev = tosscoin(2)
probspace(ev)
  toss1 toss2 probs
1     H     H  0.25
2     T     H  0.25
3     H     T  0.25
4     T     T  0.25
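  • As a quick sketch, we can compute the probability of the favored event from earlier, that both coins fall on the same side:

# the event "both tosses agree" contains 2 of the 4 equally likely outcomes
Prob(probspace(ev), toss1 == toss2)   # 0.5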
  • The argument of tosscoin() refers to how many coin tosses we perform
probspace(tosscoin(3))
  toss1 toss2 toss3 probs
1     H     H     H 0.125
2     T     H     H 0.125
3     H     T     H 0.125
4     T     T     H 0.125
5     H     H     T 0.125
6     T     H     T 0.125
7     H     T     T 0.125
8     T     T     T 0.125

Probabilistic experiments with finite outcome spaces

  • In the simple probability model, we follow the logic that is used in games of chance like the above:
    • We will assume that the experiment’s outcome can be represented by an element \( \omega \in \Omega \), and that the range of possible outcomes can be represented by combinations of these elements.
    • We will assume that all possible outcomes of the experiment are equally likely.
    • We will assume that we can list out all possible outcomes into a finite collection.
    • Therefore, the probability of attaining some outcome in a collection of outcomes \( A \subset \Omega \) can be written as,

      \[ P(A) = P(\omega \in A) = \frac{\text{number of ways that the experiment's outcome satisfies }\omega\in A}{\text{number of ways for the experiment to have an outcome }\omega\in \Omega} \]
  • Q: given the above “simple probability model”, what possible values can \( P(A) \) take and what is the meaning of the maximum and the minimum value?
    • A: if there are no possible ways to attain an outcome in \( A \), the numerator is zero, so the minimum is \( P(A)=0 \).
    • If the number of ways for an outcome to occur in \( A \) equals the total number of ways the experiment can end, then \( P(A)=1 \).
    • \( P(A)=0 \) means we are certain that \( A \) is impossible, while \( P(A)=1 \) means we are certain that \( A \) will occur.
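
  • As a quick sketch of this counting rule in R, using rolldie() from the prob package (introduced formally later), the probability of rolling an even number with one fair die is \( 3/6 \):

S <- probspace(rolldie(1))
# the event "the roll X1 is even" contains 3 of the 6 equally likely outcomes
Prob(S, X1 %% 2 == 0)   # 0.5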

Basic probability

Venn diagram of events \( A \) and \( B \) with nontrivial intersection.

Courtesy of Bin im Garten CC via Wikimedia Commons

  • We will recall some basic properties of probability, as they relate to sets.
  • These rules can be easily understood graphically in terms of the Venn diagram.
    • \( P(\Omega)=1 \) and \( P(\emptyset)=0 \);
    • the probability of a set union is given as \[ P(A\cup B) = P(A) + P(B) - P(A\cap B); \]
    • consequently, if \( A \) and \( B \) are disjoint, \[ P(A\cup B) = P(A) + P(B); \]
    • The conditional probability of \( A \) given \( B \) intuitively restricts the outcome space \( \Omega \) to \( \Omega \cap B = B \) and therefore, \[ P(A\vert B) = \frac{P(A \cap B)}{P(B)}. \]
  • Note, the above only makes sense when \( P(B)\neq 0 \) and we should only consider the conditional probability when conditioning on possible events.
  • Consequently, when \( B \) is possible, we recover the relationship \[ P(A \cap B) = P(A\vert B) P( B). \]
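
  • A sketch of a conditional probability with the prob package, whose Prob() function also accepts a given argument: for two coin tosses, the probability that both are heads, given at least one is heads, is \( \frac{1/4}{3/4} = \frac{1}{3} \):

S <- probspace(tosscoin(2))
# P(both heads | at least one head) = P(A and B) / P(B) = (1/4) / (3/4)
Prob(S, toss1 == "H" & toss2 == "H", given = toss1 == "H" | toss2 == "H")   # 1/3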

Basic probability

  • Let’s assume that both \( A \) and \( B \) are possible.
    • Consider applying the relationship for \( P(A \cap B) \) symmetrically in \( A \) and \( B \): \[ \begin{align} P(A \cap B) &= P(A\vert B) P( B);\\ P(B\cap A) &= P(B \vert A) P(A); \\ P(A \cap B)& = P(B\cap A); \end{align} \]
    • applying these together we get Bayes' law: \[ P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}. \]
  • Define \( A^\mathrm{c} \) to be the complement of an event, i.e., \[ A^\mathrm{c} = \{x \in \Omega \vert x \notin A\}, \]
  • then the probability of unions also shows us that \[ \begin{align} & P\left(A \cup A^\mathrm{c}\right) = P(\Omega) \\ \Leftrightarrow & P(A) + P\left(A^\mathrm{c}\right) - P\left(A \cap A^\mathrm{c}\right) = 1 \\ \Leftrightarrow & P(A) + P\left(A^\mathrm{c}\right) - P(\emptyset) = 1 \\ \Leftrightarrow &P(A) + P\left(A^\mathrm{c}\right) =1 \end{align} \]
  • Therefore it is equivalent to compute the probability of \( A \) as \[ P(A) = 1 - P\left(A^\mathrm{c}\right). \]
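
  • As a quick illustration of the complement rule with the two-coin toss space, the probability of at least one head can be computed directly or via the complement:

S <- probspace(tosscoin(2))
# direct computation: at least one of the two tosses is heads
Prob(S, toss1 == "H" | toss2 == "H")       # 0.75
# via the complement rule: 1 minus the probability of no heads
1 - Prob(S, toss1 == "T" & toss2 == "T")   # 0.75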

Simple probability experiments in R

  • The following are common experiments that can be run in the prob package for R:
 urnsamples(x, size, replace = FALSE, ordered = FALSE, ...),
 tosscoin(times, makespace = FALSE),
 rolldie(times, nsides = 6, makespace = FALSE),
 cards(jokers = FALSE, makespace = FALSE),
 roulette(european = FALSE, makespace = FALSE).
  • If the argument makespace is set TRUE, the resulting data frame has an additional column showing the (equal) probability of each single event.

  • In the simple probability model, the probability of an event can be computed as the relative frequency of possible outcomes, as discussed earlier.

  • However, a general, finite probability space can be formed using

    probspace(outcomes, probs)
    

    with a vector of possible simple outcomes and a vector of the associated probabilities for each simple outcome.

  • The full probability space is then created from these simple outcomes, with the probabilities of compound events computed using the relationships described earlier.
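
  • For instance, a minimal sketch of a hypothetical biased coin, where heads occurs with probability 0.7:

# two simple outcomes paired with unequal probabilities
biased <- probspace(c("H", "T"), probs = c(0.7, 0.3))
biased   # a data frame pairing each outcome with its probability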

Simple probability experiments in R

  • For the remainder of the lecture, we will focus on the classical urn sampling problem and how we can generate different kinds of random samples.

  • The basic idea behind an “urn” problem is that we have an urn or a bin in which different objects are stored.

  • These objects have no order and are assumed to be well-mixed;

    • if the proportion of each type of object is equal, the events can be described as sets, and the simple probability model applies in terms of any type of object.
  • We will take two turns drawing one toy vehicle at a time from the urn, without replacement:

ev <- urnsamples(c("bus", "car", "bike", "train"), replace=FALSE, size= 2, ordered=TRUE)
probspace(ev)
      X1    X2      probs
1    bus   car 0.08333333
2    car   bus 0.08333333
3    bus  bike 0.08333333
4   bike   bus 0.08333333
5    bus train 0.08333333
6  train   bus 0.08333333
7    car  bike 0.08333333
8   bike   car 0.08333333
9    car train 0.08333333
10 train   car 0.08333333
11  bike train 0.08333333
12 train  bike 0.08333333

Simple probability experiments in R

  • In the last example we specified that we do not return an object once it is drawn, that the order in which the objects are drawn matters and that there is only one copy of each object in the urn.

  • When we sample without replacement, and with order mattering, the number of all possible samples of size \( k \) from a set of \( n \) objects is given by,

\[ \begin{align} \frac{n!}{(n-k)!} & & n! = n (n-1) (n-2) \cdots (2)(1) \end{align} \]
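
  • As a quick check of this formula in R with the base factorial() function:

factorial(4) / factorial(4 - 2)   # 12 ordered samples of size 2 from 4 objects
nrow(ev)                          # 12 rows in the sample space above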

  • Suppose we want to find the probability that the second object drawn is a bike,

    • we can pair “bike” with the other objects as the first draw in three different ways.
  • In total there are \[ \frac{4!}{(4-2)!} = \frac{4!}{2!} = 3\times 4 = 12 \] possible outcomes, so that

Prob(probspace(ev), X2 == "bike" )
[1] 0.25

Simple probability experiments in R

  • If, on the other hand, we do replace objects in the urn before we draw new ones, notice the difference in the outcome space:
ev <- urnsamples(c("bus", "car", "bike", "train"), replace=TRUE, size= 2, ordered=TRUE)
probspace(ev)
      X1    X2  probs
1    bus   bus 0.0625
2    car   bus 0.0625
3   bike   bus 0.0625
4  train   bus 0.0625
5    bus   car 0.0625
6    car   car 0.0625
7   bike   car 0.0625
8  train   car 0.0625
9    bus  bike 0.0625
10   car  bike 0.0625
11  bike  bike 0.0625
12 train  bike 0.0625
13   bus train 0.0625
14   car train 0.0625
15  bike train 0.0625
16 train train 0.0625
  • If we replace an object after we draw it, we can calculate the total number of possible outcomes as a choice among \( n \) objects \( k \) times with \( n^k \) total outcomes.
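
  • Again, as a quick check in R:

4^2        # 16 ordered samples of size 2, with replacement, from 4 objects
nrow(ev)   # 16 rows in the sample space above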

Simple probability experiments in R

  • In this case,
Prob(probspace(ev), X2 == "bike" )
[1] 0.25
  • because we now have one more possible way to draw a bike on the second draw: four of the sixteen equally likely outcomes, giving \( 4/16 = 0.25 \) once again.

Simple probability experiments in R

  • Instead of replacing objects, we can also consider the experiment where we only care which two objects are retrieved, but we do not care about the order in which they are drawn.
ev <- urnsamples(c("bus", "car", "bike", "train"), replace=FALSE, size= 2, ordered=FALSE)
probspace(ev)
    X1    X2     probs
1  bus   car 0.1666667
2  bus  bike 0.1666667
3  bus train 0.1666667
4  car  bike 0.1666667
5  car train 0.1666667
6 bike train 0.1666667
  • In this case, the probability space has only six possible outcomes, given by all un-ordered combinations of the different objects in the urn.

  • Mathematically, this is described by the “choose” function for \( n \) objects choosing combinations of size \( k \) \[ {n \choose k} = \frac{n!}{k!(n-k)!}. \]

  • In our example, we have \[ {4 \choose 2} = \frac{4!}{2!(4-2)!} = \frac{4!}{2!\,2!} = 6. \]
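
  • R provides this count directly through the choose() function:

choose(4, 2)   # 6 unordered samples of size 2 from 4 objects
nrow(ev)       # 6 rows in the sample space above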

Sampling procedures

  • The examples above are very specific and restricted to a particular sample space, but we can address sampling from a general perspective with the same principles.

  • In each case, some random selection mechanism is involved: the theory behind this is called probabilistic sampling.

  • An important example of randomized sampling is the following:

  • In 1954, a large-scale experiment was designed to test the effectiveness of the Salk vaccine in preventing polio.

    • Treatment group – 200,745 children were given a treatment consisting of Salk vaccine injections.
    • Control group – 201,229 children were injected with a placebo that contained no drug.
  • The 401,974 children in the Salk vaccine experiment were assigned to the Salk vaccine treatment group or the placebo group via a process of random selection equivalent to flipping a coin.

  • You can encode whether someone is in the treatment or control group in a binary way:

    • Treatment / 1 / H
    • Control / 0 / T
  • Randomly drawing “Treatment” or “Control” with equal probability is equivalent to flipping a fair coin.
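
  • A minimal sketch of such a randomized assignment in base R (the sample size of 10 here is purely illustrative):

set.seed(42)
# assign each subject independently, as by a fair coin flip
sample(c("Treatment", "Control"), size = 10, replace = TRUE)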

  • The logic behind randomization is to use chance as a way to create two groups that are similar.

  • With a large enough sample for both treatment and control groups, this can be very effective when it is difficult to balance factors like age, gender, height, weight, etc… across groups that might affect the outcome.

Sampling procedures

  • In randomized sampling, chance is being utilized to balance the many population factors across the control and treatment groups.

    • This makes it so that each sub-sample better reflects the full population.
  • The most basic way to perform such a randomized sample is like the coin flip or urn example, where all outcomes have equal probability.

  • Simple random sample: a simple random sample of \( n \) subjects is selected in such a way that every possible sample of the same size \( n \) has the same chance of being chosen.

    • Suppose we are taking a poll of UNR students and we will have a sample size of 1000.
    • This means that every possible combination of 1000 UNR students must be equally likely to be selected under our sampling method.
    • E.g., we can randomly select students to give interview responses based on their student ID numbers.
    • Note: a simple random sample is often called a random sample, but strictly speaking, a random sample has the weaker requirement that all members of the population have the same chance of being selected.

Stratified sampling

  • The purpose of this randomized sampling is to produce a sample that gives an accurate representation of the population being studied with a limited number of observations.

  • However, especially with small sample sizes, it is possible that a random sample can be biased and not adequately represent the true population.

    • For example, if a small number of interviews are to be conducted with UNR students, randomly selected based on their student number, there may be a disproportionate number of undergraduates versus graduate students interviewed.
  • One approach to handle this is known as stratified sampling.

    • Stratified sampling: we subdivide the population into at least two different subgroups (or strata).
  • Groups are chosen such that subjects within the same subgroup share the same characteristics (such as undergraduate versus graduate student) and each population member belongs to one and only one subgroup.

  • We then draw a simple random sample from each subgroup (or stratum) with a total number of observations taken from each subgroup proportional to the group's membership in the population.

Stratified sampling

  • In the following example, it is assumed that a university student body consists of exactly 1000 students, with \( 80\% \) being undergraduate and \( 20\% \) being graduate.
require(sampling)
set.seed(0)
student_df <- data.frame(graduate_status =  c(rep("U", 800), rep("G", 200)), student_number = sample(1:1000, size=1000))
head(student_df)
  graduate_status student_number
1               U            398
2               U            836
3               U            679
4               U            129
5               U            930
6               U            509
tail(student_df)
     graduate_status student_number
995                G            589
996                G            803
997                G            774
998                G            558
999                G            898
1000               G            188

Stratified sampling

  • Now, let's randomly select 10 students based on their student number for interviews,
selected_numbers <- sample(1:1000, size=10)
student_df[student_df$student_number %in% selected_numbers,]
    graduate_status student_number
66                U             45
151               U            573
200               U            587
307               U            823
584               U            752
597               U            153
610               U            452
624               U            379
627               U            400
992               G             78
  • In this particular random outcome, we have only selected a single graduate student and we might not get all the perspectives we want.

Stratified sampling

  • If we take a larger sample size, the proportions of undergraduate and graduate students in the sample will, in general, become closer to the proportions in the true population:
selected_numbers <- sample(1:1000, size=100)
selected_students <- student_df[student_df$student_number %in% selected_numbers,]
cat(sum(selected_students$graduate_status=="U"), "total undergraduates selected\n")
75 total undergraduates selected
cat(sum(selected_students$graduate_status=="G"), "total graduates selected\n")
25 total graduates selected

Stratified sampling

  • However, if we have reason to believe that it is important to keep the proportions of the interviews exactly matched to the population, we can take a stratified sample as follows:
require(sampling)
st <- strata(student_df, stratanames="graduate_status", size=c(16,4), method="srswor")
stratified_sample = getdata(student_df, m=st)
stratified_sample
    student_number graduate_status ID_unit Prob Stratum
65             642               U      65 0.02       1
142             43               U     142 0.02       1
219            853               U     219 0.02       1
305            609               U     305 0.02       1
348            895               U     348 0.02       1
350            652               U     350 0.02       1
442            735               U     442 0.02       1
544            256               U     544 0.02       1
567            805               U     567 0.02       1
590            206               U     590 0.02       1
596            409               U     596 0.02       1
681            507               U     681 0.02       1
727            411               U     727 0.02       1
746            846               U     746 0.02       1
759            395               U     759 0.02       1
775            766               U     775 0.02       1
849            667               G     849 0.02       2
919            708               G     919 0.02       2
975             68               G     975 0.02       2
993            120               G     993 0.02       2

Sampling

  • A variety of different sampling procedures can be introduced to better balance a sample with respect to how it will reflect a population.

    • We will not cover sampling in general, but we mention it here to demonstrate how randomness is often introduced into an experiment, even if there isn't randomness in the experimental process itself.
  • Because of the notion of randomness in our observations, probability is essential for modeling the results we see in observed data sets.

  • Particularly, in the next session we will review the notion of a random variable and some fundamental probability distributions.