A review of random variables


Outline

  • The following topics will be covered in this lecture:
    • Basic set theory
    • Probabilistic experiments with finite outcome spaces
    • Basic probability
    • Discrete random variables
    • Continuous random variables
    • Parameters of probability distributions

Basic set theory: a history of sets

  • In the nineteenth century, Georg Cantor developed the greater part of modern set theory.
  • Set theory is at the basis of mathematical logic and how mathematical objects are ordered in hierarchies of classes with certain properties.
  • At the turn of the nineteenth and twentieth centuries, Ernst Zermelo, Bertrand Russell, Cesare Burali-Forti and others found contradictions in the originally proposed set theory, with one of the famous results being Russell’s paradox.
Description of Russell's paradox.

Courtesy of Doxiadis, Apostolos, and Christos Papadimitriou. Logicomix: An epic search for truth. Bloomsbury Publishing USA, 2015.

A history of sets

  • In the famous barber paradox, it is decreed that the barber is the one who shaves all those, and only those, who do not shave themselves.
  • The question then is: who shaves the barber, if the barber shaves exactly those who do not shave themselves?
  • In terms of sets, there was an early approach that defined a set as any identifiable collection; however, the question arose:
    • if a set \( A \) is identified as the collection of all collections that are not contained in themselves,
      • does \( A \) belong to itself or not?
  • Contradictions like the above led Ernst Zermelo in 1908 to give an axiomatic system which precisely described the existence of certain sets and the formation of sets from other sets.
  • This Zermelo–Fraenkel set theory is still the most common axiomatic system for set theory.
  • There are nine axioms, dealing, among other things, with set equality, regularity, pairing, infinity, and power sets.
  • This set theory formalism is at the heart of theoretical probability, and thus statistical inference.

Common sets in mathematics

  • In mathematics, the most commonly considered sets are the following:
    • \( \mathbb{N} \): the set of natural numbers, i.e., \( \{1, 2, 3, 4, \cdots \} \).
    • \( \mathbb{Z} \): the set of integers, i.e., \( \{\cdots, -3, -2, -1, 0, 1, 2, 3, \cdots \} \).
    • \( \mathbb{Q} \): the set of rational numbers, i.e., all numbers \( q \) that can be represented as ratios of two integers \( q=\frac{z_1}{z_2} \) for \( z_1,z_2\in \mathbb{Z} \) with \( z_2 \neq 0 \).
    • \( \mathbb{R} \): the set of real numbers, i.e., all rationals and irrational numbers like \( \sqrt{2} \).
    • \( \mathbb{C} \): the set of complex numbers, i.e., all real numbers and all extensions of the real numbers to the complex plane.
  • For each set there is a cardinal number which stands for the magnitude or number of elements in the set, even for infinite sets.
    • The cardinal number of \( \mathbb{N} \) is \( \aleph_0 \) (“Aleph null”), which represents a “smaller” infinity than the infinity of \( \mathbb{R} \), \[ {\mathfrak {c}}=2^{\aleph _{0}} > \aleph_{0}. \]
  • The sets above are named by a fixed character or letter, whereas other sets in the literature are labeled arbitrarily with a Latin or Greek letter.

Probabilistic experiments with finite outcome spaces

  • When working with data that is subject to random variation, the theory behind this probabilistic situation becomes important.
  • There are two types of experiments:
    1. deterministic, in which all elements of the experiment are controlled precisely, and
    2. random, usually in the form of how the data is sampled or by the way error is introduced by uncertainty in measurements and experimental parameters.
  • In the following we will review the main ideas about non-deterministic processes with a finite number of possible outcomes.
    • This will bring us to the idea of a discrete random variable.
Random trials and event spaces
A random trial (or experiment) is a procedure that yields one of the distinct outcomes that altogether form the sample or event space, denoted \( \Omega \). All possible outcomes constitute the universal event \( \Omega \), and subsets of \( \Omega \) define all events in the sample space.
  • For an experiment with finite possible outcomes like a single coin flip, we can actually write out \( \Omega \) as a finite collection or set.
  • Question: suppose our experiment is to make a single, fair coin flip where a “heads” is represented by \( H \) and a “tails” is represented by \( T \). Can you identify the collection of all possible outcomes \( \Omega \)?
    • Answer: in this case \( \Omega = \{H,T\} \).
  • A combination of multiple coin tosses leads to more possible results – with two coin flips, each outcome is an ordered pair, and we define \[ \Omega = \{(H, H), (H, T), (T, H), (T, T)\}. \]

    • Generally, the combination of several different experiments yields a sample space with all possible combinations of the single events.
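
As a concrete illustration, the following minimal Python sketch (an addition for this review, with variable names of our own) enumerates \( \Omega \) for two coin flips as ordered pairs:

```python
from itertools import product

# Enumerate the sample space for two successive coin flips;
# each outcome is an ordered pair such as ('H', 'T').
omega = list(product("HT", repeat=2))
print(omega)  # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
```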

Probabilistic experiments with finite outcome spaces

  • In the simple probability model, we follow the logic that is used in games of chance like the above:
    • Assume that the experiment’s outcome can be represented by an element \( \omega \in \Omega \).
    • Assume that all possible outcomes of the experiment are equally likely.
    • Assume that all possible outcomes in the sample space can be listed as a finite collection.
    • The simple probability model
      The probability of attaining some outcome in a collection of outcomes \( A \subset \Omega \) can be written as,

      \[ \mathcal{P}(A) = \mathcal{P}(\omega \in A) = \frac{\text{number of ways that the experiment's outcome satisfies }\omega\in A}{\text{number of ways for the experiment to have an outcome }\omega\in \Omega} \]
  • Question: given the above “simple probability model”, what possible values can \( \mathcal{P}(A) \) take and what is the meaning of the maximum and the minimum value?
    • Answer: if there are no possible ways to attain some outcome in \( A \), the smallest numerator is zero.
    • If the number of ways for an outcome to occur in \( A \) equals the total number of ways the experiment can end, then \( \mathcal{P}(A)=1 \).
    • If \( \mathcal{P}(A)=0 \) then we say we are certain that \( A \) is impossible;
    • respectively, \( \mathcal{P}(A)=1 \) means we are certain \( A \) will occur.
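
The simple probability model reduces to counting. A short Python sketch of our own computes \( \mathcal{P}(A) \) for the event \( A \) of attaining at least one head in two fair coin flips:

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coin flips; all outcomes equally likely.
omega = list(product("HT", repeat=2))

# Event A: at least one head appears.
A = [w for w in omega if "H" in w]

# Simple probability model: P(A) = |A| / |Omega|.
print(Fraction(len(A), len(omega)))  # 3/4
```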

Basic probability

Venn diagram of events \( A \) and \( B \) with nontrivial intersection.

Courtesy of Bin im Garten CC via Wikimedia Commons

  • We will recall some basic properties of probability:
    • Probabilities for the full sample space and empty set are defined \( \mathcal{P}(\Omega)=1 \) and \( \mathcal{P}(\emptyset)=0 \).
    • The probability of a set union is given as \[ \mathcal{P}(A\cup B) = \mathcal{P}(A) + \mathcal{P}(B) - \mathcal{P}(A\cap B). \]
    • Consequentially, if \( A \) and \( B \) are disjoint, \[ \mathcal{P}(A\cup B) = \mathcal{P}(A) + \mathcal{P}(B). \]
    • The conditional probability of \( A \) given \( B \) describes the probability of \( A \) when it is assumed that \( B \) occurs.
      • Intuitively, conditioning restricts the sample space \( \Omega \) to \( \Omega \cap B = B \); therefore, \[ \mathcal{P}(A\vert B) = \frac{\mathcal{P}(A \cap B)}{\mathcal{P}(B)}. \]
  • Note, the above only makes sense when \( \mathcal{P}(B)\neq 0 \) and we should only consider the conditional probability when conditioning on possible events.
  • Consequently, when \( B \) is possible, we recover the multiplication rule, \[ \mathcal{P}(A \cap B) = \mathcal{P}(A\vert B) \mathcal{P}( B). \]
  • Both forms of conditional probability above are equivalent, forming different axiomatic approaches to defining the concept.
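
As a sketch of how conditioning restricts the sample space, consider one roll of a fair die (a small example of our own): \( A \) is “the roll is even” and \( B \) is “the roll exceeds 3”.

```python
from fractions import Fraction

# One roll of a fair six-sided die.
omega = set(range(1, 7))
A = {w for w in omega if w % 2 == 0}  # event: the roll is even
B = {w for w in omega if w > 3}       # event: the roll exceeds 3

def prob(E):
    # Simple probability model: |E| / |Omega|.
    return Fraction(len(E), len(omega))

# Conditional probability P(A|B) = P(A n B) / P(B).
p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)  # 2/3

# Multiplication rule: P(A n B) = P(A|B) * P(B).
print(prob(A & B) == p_A_given_B * prob(B))  # True
```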

Basic probability

  • Closely related notions to the conditional probability are independence and dependence of events.
    • Dependence – two events are said to be dependent if the outcome of one event directly affects the probability of the other.
    • Independence – two events are said to be independent if the outcome of either event has no impact on the probability of the other.
      • Mathematically, we can see the meaning of independence clearly as,
      • Independence
        \( A \) and \( B \) are independent by definition if and only if both of the following hold, \[ \begin{matrix} \mathcal{P}(A\vert B) = \mathcal{P}(A) & \text{and} & \mathcal{P}(B\vert A) = \mathcal{P}(B). \end{matrix} \]
    • Using the definition of conditional probability, this states equivalently that \[ \begin{align} &\mathcal{P}(A\vert B) = \frac{\mathcal{P}(A\cap B)}{\mathcal{P}(B)} & & \mathcal{P}(B\vert A) = \frac{\mathcal{P}(B \cap A)}{\mathcal{P}(A)} \\ \Leftrightarrow & \mathcal{P}(A\cap B) = \mathcal{P}(A) \mathcal{P}(B). \end{align} \]
    • Although the two ideas are equivalent, the first version is a more intuitive statement, while the second statement is more useful in computation.
    • Again, both form slightly different axiomatic approaches to defining probability.
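
The product criterion is easy to check computationally. In this sketch (again our own example), the parities of two independent die rolls satisfy \( \mathcal{P}(A\cap B) = \mathcal{P}(A)\mathcal{P}(B) \):

```python
from fractions import Fraction
from itertools import product

# Two rolls of a fair die; each outcome is an ordered pair.
omega = set(product(range(1, 7), repeat=2))
A = {w for w in omega if w[0] % 2 == 0}  # first roll is even
B = {w for w in omega if w[1] % 2 == 0}  # second roll is even

def prob(E):
    return Fraction(len(E), len(omega))

# Product criterion for independence.
print(prob(A & B) == prob(A) * prob(B))  # True
```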

Basic probability

  • Assume that both \( A,B\subset \Omega \) are possible.
    • Let us apply the relationship for \( \mathcal{P}(A \cap B) \) symmetrically in \( A \) and \( B \): \[ \begin{align} \mathcal{P}(A \cap B)= \mathcal{P}(A\vert B) \mathcal{P}( B); & & \mathcal{P}(B\cap A) = \mathcal{P}(B \vert A) \mathcal{P}(A). \end{align} \]
    • Recalling that \( \mathcal{P}(A \cap B) = \mathcal{P}(B\cap A) \) and equating the two expressions, we arrive at Bayes' law.
    • Bayes' law
      Let \( A,B\subset \Omega \) such that \( \mathcal{P}(A), \mathcal{P}(B) > 0 \). Then the probability of \( A\vert B \) can be “inverted” via Bayes' law as \[ \mathcal{P}(A \vert B) = \frac{\mathcal{P}(B \vert A) \mathcal{P}(A)}{\mathcal{P}(B)}. \]
    • Bayes' law is a very simple statement of conditional probability, but with profound consequences on the computation of probabilities.
    • Bayesian analysis was not widely used until advances in digital computers, due to the difficulty in deriving analytical statistics with this framework.
    • However, using Monte-Carlo or ensemble methods, Bayes provides a numerically efficient framework to produce a recursive conditional update of our beliefs in time.
    • If \( B \) represents incoming information about a time-evolving system, \( \mathcal{P}(A\vert B) \) represents our updated belief about an event \( A \) given the new information \( B \).
    • We will return to this theme when we revisit conditional probabilities later.
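
As a small numerical sketch of Bayes' law, with hypothetical numbers of our choosing, suppose \( A \) is a condition with prior probability \( 0.01 \) and \( B \) is a positive test result, with \( \mathcal{P}(B\vert A) = 0.95 \) and \( \mathcal{P}(B \vert A^\mathrm{c}) = 0.05 \); the denominator \( \mathcal{P}(B) \) is computed by summing over the disjoint events \( B \cap A \) and \( B \cap A^\mathrm{c} \):

```python
# Hypothetical numbers for illustration only.
p_A = 0.01             # prior P(A)
p_B_given_A = 0.95     # P(B | A)
p_B_given_Ac = 0.05    # P(B | A^c)

# P(B) = P(B|A)P(A) + P(B|A^c)P(A^c), summing over disjoint events.
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' law: P(A|B) = P(B|A)P(A) / P(B).
print(round(p_B_given_A * p_A / p_B, 4))  # 0.161
```

Even with a fairly accurate test, the updated belief \( \mathcal{P}(A\vert B) \approx 0.16 \) remains modest because the prior \( \mathcal{P}(A) \) is small.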

Basic probability

  • Define \( A^\mathrm{c} \) to be the complement of an event, i.e., \[ A^\mathrm{c} = \{\omega \in \Omega \vert \omega \notin A\}, \]
  • then the probability of unions also shows us that \[ \begin{align} & \mathcal{P}\left(A \cup A^\mathrm{c}\right) = \mathcal{P}(\Omega) \\ \Leftrightarrow & \mathcal{P}(A) + \mathcal{P}\left(A^\mathrm{c}\right) - \mathcal{P}\left(A \cap A^\mathrm{c}\right) = 1 \\ \Leftrightarrow & \mathcal{P}(A) + \mathcal{P}\left(A^\mathrm{c}\right) - \mathcal{P}(\emptyset) = 1 \\ \Leftrightarrow &\mathcal{P}(A) + \mathcal{P}\left(A^\mathrm{c}\right) =1 \end{align} \]
  • Complementary probability
    Let \( A, A^c\subset \Omega \) be defined as above, then their probabilities are complementary \[ \mathcal{P}(A) = 1 - \mathcal{P}\left(A^\mathrm{c}\right). \]

Discrete random variables

  • The outcomes of a probabilistic experiment can be modeled by a random variable \( X \).
Random variables
A random variable (RV) \( X \) is a map that assigns to each real-world outcome \( \omega\in\Omega \) a real value in \( \mathbb{R} \).
  • Prototypically, \( X \) may be considered an observable of the trial, though \( X \) in general may not be observable.

  • Such a model is needed because there may not be any intrinsic numerical meaning to the events in the sample space \( \Omega \).

  • For example, \( X \) can be the number of heads in the outcome of two coin flips

    • \( \Omega \) will be the set of possible outcomes

    \[ \Omega = \{ (H,H), (H,T), (T, H), (T,T)\} \]

    • The event \( (H,H) \) has no intrinsic numerical meaning;
    • \( X \) will take the event \( (H,H) \) and map it to the value \( 2 \in \mathbb{R} \).
  • With a finite number \( k \) of possible outcomes we can create a set of possible values \( x_i \) for \( X \), for \( i \in \{1, \cdots , k\} \).

  • More generally, we might consider when a countably-infinite index set can enumerate all possible outcomes.

Discrete random variables
A RV \( X \) is discrete if the values that \( X \) can attain, \( x_i \), can be made into a collection \( \{x_i\}_{i=1}^k \) for a finite \( k \) or for an infinite collection \( \{x_i\}_{i=1}^\infty \).
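
In code, such a map is simply a function or lookup table from outcomes to numbers; a minimal sketch of our own for the two-coin-flip example:

```python
from itertools import product

# X: number of heads, mapping each outcome in Omega to a real value.
omega = list(product("HT", repeat=2))
X = {w: w.count("H") for w in omega}
print(X[("H", "H")])  # 2
```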

Discrete random variables

  • A probability distribution is a complete description of the outcomes and associated probabilities of a random trial.

  • The distribution of a discrete RV is typically described by its probability mass function \( p(x_i) \) and the cumulative distribution function \( P(x) \).

Probability mass function
The probability mass function \( p \) of a discrete RV \( X \) is a function that returns the probability that \( X \) is exactly equal to some value. For an indexed possible outcome \( x_i \), \[ \begin{align} p(x_i) = \mathcal{P}(X=x_i). \end{align} \]
  • The cumulative distribution function (cdf) \( P \) is defined with respect to the ordering \( x_i < x_{i+1} \).
Cumulative distribution function
Let \( X \) be a discrete RV, then the mapping \[ P:\mathbb{R} \rightarrow [0,1] \] defined by \[ \begin{align} P(x) :&= \mathcal{P}(X \leq x) \\ &= \sum_{i \in \mathcal{I},\, x_i \leq x} p(x_i) \end{align} \] is called the cumulative distribution function (cdf) of the RV \( X \).
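
Both functions are easy to tabulate for the RV \( X \) counting heads in two fair coin flips; a sketch of ours using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# X = number of heads in two fair coin flips.
omega = list(product("HT", repeat=2))
values = sorted({w.count("H") for w in omega})  # [0, 1, 2]

def pmf(x):
    # p(x_i) = P(X = x_i) under the simple probability model.
    return Fraction(sum(1 for w in omega if w.count("H") == x), len(omega))

def cdf(x):
    # P(x) = P(X <= x), summing the mass at or below x.
    return sum(pmf(xi) for xi in values if xi <= x)

print([pmf(x) for x in values])  # [1/4, 1/2, 1/4]
print([cdf(x) for x in values])  # [1/4, 3/4, 1]
```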

Discrete random variables

  • Using the probability mass function, we can define the expected value of a discrete RVs.
Expected value of a discrete RV
Let \( X \) be a discrete RV with a range of possible values \( \{x_i\}_{i \in \mathcal{I}} \) where \( \mathcal{I}\subset\mathbb{Z} \). The expected value is defined \[ \begin{align} \overline{x} := \mathbb{E}\left[ X\right] &:= \sum_{i\in \mathcal{I}} x_i \mathcal{P}(X=x_i)\\ &= \sum_{i\in \mathcal{I}} x_i p(x_i) \end{align} \]
  • Using the property that events associated to distinct observable values (\( X=x_i \)) are disjoint,

    \[ \sum_{i\in \mathcal{I}} \mathcal{P}(X = x_i) =1 \] as seen by taking the probability of all events as a union.

  • \( \mathbb{E} \) in the above represents a weighted average of all possible values that \( X \) might attain.

    • The classical interpretation of the expected value in this way is as the center of mass for the probability of attainable values of \( X \).
  • Suppose we were able to take infinitely many replications of a random trial and record the observed value of \( X \) in each trial;

    • intuitively, the average of the observed values for \( X \) over all the replications would be given by the expected value \( \mathbb{E}\left[X\right] \).
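
The following sketch of our own computes \( \mathbb{E}[X] \) for the number of heads in two fair coin flips and compares it with the average over many simulated replications:

```python
import random
from fractions import Fraction

# X = number of heads in two fair coin flips.
values = [0, 1, 2]
p = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

# Expected value: the pmf-weighted sum of the attainable values.
mean = sum(x * px for x, px in zip(values, p))
print(mean)  # 1

# Monte Carlo intuition: the sample average over many replications
# of the trial approaches E[X].
random.seed(0)
trials = [sum(random.random() < 0.5 for _ in range(2)) for _ in range(100_000)]
print(sum(trials) / len(trials))  # close to 1.0
```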

Discrete random variables

  • The expected value describes how the probability mass for the observable of the trial is centered.

  • However, a critical notion is “how much spread or dispersion will there be in the data around this center of mass?”

Variance of a discrete RV
Let \( X \) be a discrete RV with distinct values \( \{x_i \}_{i\in \mathcal{I}} \) and probability mass function \( p(x_i)=\mathcal{P}(X = x_i ) \in [0, 1] \) for \( i \in \mathcal{I} \). Then the variance of \( X \) is defined to be, \[ \begin{align} \sigma^2 := \mathrm{var}(X) &:= \mathbb{E}\left[\left\{X - \mathbb{E}\left(X\right)\right\}^2\right]\\ &=\sum_{i\in \mathcal{I}} \left\{x_i - \overline{x}\right\}^2 p(x_i) \end{align} \]
  • The two forms above can be shown equivalent by defining

    \[ Y := X - \overline{x}, \] and applying the definition of the expected value to \( Y^2 \).

  • Notice that the variance describes a weighted-average, squared-deviation of each possible observable from the center of mass.

  • This means that the variance gives a non-negative measure of the average dispersion of the possible-to-observe outcomes around the center of probability mass.

    • Particularly, the variance is zero if and only if there is only one possible outcome;
    • the variance increases with possible-to-observe values further from the center.
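
Continuing the two-coin-flip sketch from above, the variance is the pmf-weighted, squared deviation from the mean:

```python
from fractions import Fraction

# X = number of heads in two fair coin flips.
values = [0, 1, 2]
p = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

mean = sum(x * px for x, px in zip(values, p))               # E[X] = 1
var = sum((x - mean) ** 2 * px for x, px in zip(values, p))  # var(X)
print(var)  # 1/2
```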

Continuous random variables

  • Unlike discrete RVs, continuous RVs can take on an uncountably infinite number of possible values.

    • This is to say that if \( X \) is a continuous RV, there is no possible index set \( \mathcal{I}\subset \mathbb{Z} \) which can enumerate the possible values \( X \) can attain.
    • For discrete RVs, we could perform this enumeration with a possibly infinite index set, \( \{x_i\}_{i=1}^\infty \).
    • This has to do with how the infinity of the continuum \( \mathbb{R} \) is actually larger than the infinity of the counting numbers, \( \aleph_0 \);
    • in the continuum you can arbitrarily sub-divide the units of measurement.
  • A good example to think of is if \( X \) is the daily high temperature in degrees Celsius.

    • If we had a sufficiently accurate thermometer, we could measure temperature to arbitrary decimal precision and the value would still be meaningful.
    • \( X \) thus takes the weather from the outcome space and gives us a number in a continuous unit of measurement.
  • This type of measurement differs fundamentally from counting, e.g., heads of coin flips.

Continuous random variables

  • These RVs are characterized by a cumulative distribution function and a probability density function.
Cumulative distribution function
Let \( X \) be a continuous RV, then the mapping \[ P:\mathbb{R} \rightarrow [0,1] \] defined by \( P(x) = \mathcal{P}(X \leq x) \), is called the cumulative distribution function (cdf) of the RV \( X \).
  • While the cdf is essentially the same for continuous and discrete RVs, the properties of the continuum imply differences between the probability density and the probability mass functions.
Probability density function
A mapping \( p: \mathbb{R} \rightarrow \mathbb{R}^+ \) is called the probability density function (PDF) of an RV \( X \) if:
  • \( p(x) = \frac{\mathrm{d} P}{\mathrm{d}x} \) exists for all \( x\in \mathbb{R} \); and
  • the density is integrable, i.e., \[ \int_{-\infty}^\infty p (x) \mathrm{d}x = 1. \]
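
As a numerical sanity check (a sketch of our own), the standard normal density integrates to one; a midpoint Riemann sum over a wide interval approximates the improper integral, since the tails contribute negligibly:

```python
import math

# Standard normal density as a concrete example of a PDF.
def p(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Midpoint Riemann sum over [-10, 10]; tails beyond are negligible.
n, a, b = 100_000, -10.0, 10.0
h = (b - a) / n
total = sum(p(a + (i + 0.5) * h) for i in range(n)) * h
print(total)  # approximately 1.0
```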

Continuous random variables

  • Question: How can you use the definition of the PDF and the fundamental theorem of calculus \[ \begin{align} p(x) = \frac{\mathrm{d} P}{\mathrm{d}x} & & \text{ and }& & \int_{a}^b \frac{\mathrm{d}f}{\mathrm{d}x} \mathrm{d}x = f(b) - f(a) \end{align} \] to find another form for the CDF?

    • Answer: Notice that \( p= \frac{\mathrm{d} P}{\mathrm{d}x} \) means that the CDF can be written in terms of the anti-derivative of the density.
    • If \( s \) and \( t \) are arbitrary values with \( s < t \), the definite integral is written as

    \[ \begin{align} \int_{s}^t p(x) \mathrm{d}x &= \int_{s}^t \frac{\mathrm{d} P}{\mathrm{d}x} \mathrm{d}x\\ &= P(t) - P(s) \\ & = \mathcal{P}(X \leq t) - \mathcal{P}(X \leq s) = \mathcal{P}(s < X \leq t) \end{align} \]

    • If we take the limit as \( s \rightarrow -\infty \) we thus recover that

    \[ \begin{align} \lim_{s\rightarrow - \infty} \int_{s}^t p(x) \mathrm{d} x & = \lim_{s \rightarrow -\infty} \mathcal{P}(s < X \leq t) \\ & = \mathcal{P}(X\leq t) = P(t) \end{align} \]

  • This is analogous to the property for discrete RVs, where we write,

    \[ \begin{align} P(t) = \mathcal{P}(X \leq t) = \sum_{j \in \mathcal{I},\, x_j \leq t} \mathcal{P}(X = x_j) = \sum_{j \in \mathcal{I},\, x_j \leq t} p(x_j) \end{align} \]

Continuous random variables

  • The notion of the probability density function directly extends the idea of the probability mass function for discrete RVs.
  • However, a key difference is that \[ \begin{align} \int_{a}^a p(x) \mathrm{d}x = \mathcal{P}(X=a)=0 \end{align} \] for any value \( a \).
    • I.e., the probability of any single point measurement is always zero.
    • Particularly, we can only compute non-zero probability in ranges for continuous RVs.
  • Based on this last result, it might appear that our model of a continuous RV is useless.
  • In practice, when a particular measurement such as \( 14.47 \) degrees Celsius is observed,
    • this result can be interpreted as the rounded value of a measurement at finite precision, with a true value in a range such as \[ 14.465\leq x \leq 14.475 \] at the limit of our precision (see the sketch after this list).
  • The probability that the rounded value \( 14.47 \) is observed as the value for \( X \) is the probability that \( X \) assumes a value in the interval \[ [14.465, 14.475], \] which is not zero.
  • Because each point has zero probability, we need not distinguish between inequalities such as \( < \) or \( \leq \) for continuous RVs.
  • Probability ranges for continuous RVs
    For a continuous RV \( X \), for any \( x_1< x_2 \) \[ \mathcal{P}(x_1 \leq X \leq x_2 ) = \mathcal{P}(x_1 < X \leq x_2) = \mathcal{P}(x_1 \leq X < x_2) = \mathcal{P}(x_1 < X < x_2). \]
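
A sketch of the rounding interpretation, under a hypothetical model of our choosing in which the daily high temperature is normally distributed with mean \( 14 \) and standard deviation \( 5 \) degrees Celsius:

```python
import math

# Hypothetical model: X ~ Normal(mean=14, sd=5), in degrees Celsius.
mu, sigma = 14.0, 5.0

def cdf(x):
    # Normal CDF via the error function: P(X <= x).
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(X = 14.47) is zero, but the rounded reading 14.47 corresponds
# to the interval [14.465, 14.475], which has positive probability.
print(cdf(14.475) - cdf(14.465))  # small but nonzero, ~0.0008
```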

Continuous random variables

  • As with discrete RVs, elementary properties of the probability distribution of a continuous RV can be described by an expectation and a variance.
  • The substantial difference with continuous RVs is in the use of integrals, rather than sums, over the possible values of the RV.
  • Expected value of a continuous RV
    Let \( X \) be a continuous RV with a density function \( p(x) \), then the expected value of \( X \) is defined as \[ \overline{x} := \mathbb{E}\left[X\right] := \int_{-\infty}^{+\infty} x p(x)\mathrm{d}x. \]
  • In the above, we have the same interpretation of \( \overline{x} \) as giving the center of mass for the probability density.
  • Variance of a continuous RV
    If the expectation of \( X \) exists, the variance is defined as \[ \begin{align} \sigma^2:= \mathrm{var} \left(X\right)& := \mathbb{E}\left[\left(X - \mathbb{E}\left[X\right]\right)^2\right] \\ &=\int_{-\infty}^\infty \left(x - \overline{x} \right)^2 p(x)\mathrm{d}x \end{align} \]
  • Once again, this is a measure of dispersion by averaging the deviation of each case from the mean in the square sense, weighted by the probability density.

  • This quantity is also known as the second centered moment, as part of a general family of moments for the distribution.
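
A numerical sketch of ours of both definitions for the standard normal density, using the same midpoint rule as before:

```python
import math

# Standard normal density as the example PDF.
def p(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Midpoint rule on [-10, 10]; tails beyond are negligible.
n, a, b = 100_000, -10.0, 10.0
h = (b - a) / n
xs = [a + (i + 0.5) * h for i in range(n)]

mean = sum(x * p(x) for x in xs) * h               # E[X]
var = sum((x - mean) ** 2 * p(x) for x in xs) * h  # var(X)
print(round(mean, 6), round(var, 6))  # approximately 0.0 and 1.0
```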

Moments of continuous random variables

  • A general form for higher-order moments is given as follows:
General moments formula
Let \( X \) be a random variable with PDF \( p \), then the \( n \)-th moment is defined as \[ \begin{align} \mathbb{E}\left[X^n\right]=\int_{-\infty}^\infty x^n p(x) \mathrm{d}x \end{align} \]
  • In particular, we call the second moment \( \mathbb{E}\left[X^2\right] \) the mean-square.

  • In this way, like the first moment as the center of mass, the second centered moment (the variance) is analogous to the rotational inertia at the center of mass.

  • It can be shown that the formula for variance can be reduced as follows:

    \[ \begin{align} \mathrm{var}(X) = \mathbb{E}\left[X^2\right] - \mathbb{E}\left[X\right]^2, \end{align} \] i.e., as the difference between the mean-square and the mean squared.

  • Although this is a useful formula for simple calculations, it can exhibit very poor numerical stability in floating point arithmetic, as the difference of two large, nearly equal numbers loses precision to cancellation (as sketched below).

  • More generally with multivariate distributions, we will consider a numerically stable approach to compute the second centered moment.
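
The cancellation problem is easy to demonstrate on a sample with a large mean and a tiny spread (a contrived example of our own):

```python
import statistics

# A sample with a huge mean and tiny spread; the one-pass formula
# E[X^2] - E[X]^2 cancels catastrophically in double precision.
data = [1e9 + x for x in (0.0, 0.1, 0.2, 0.3)]

n = len(data)
mean = sum(data) / n
naive = sum(x * x for x in data) / n - mean * mean  # one-pass formula
centered = sum((x - mean) ** 2 for x in data) / n   # centered formula

print(naive)     # badly wrong, possibly even negative
print(centered)  # ~0.0125, the correct population variance
print(statistics.pvariance(data))  # exact library reference value
```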

Continuous random variables

  • While the variance is a more “fundamental” theoretical quantity, in practice we are usually concerned with the standard deviation of the RV \( X \).
  • This is due to the fact that the variance \( \sigma^2 \) carries the units of \( X^2 \), since deviations are squared in its definition.
  • Standard deviation of a RV
    Let \( X \) be a RV (continuous or discrete) with variance \( \sigma^2 \). We define the standard deviation of \( X \) as \[ \mathrm{std}(X):=\sqrt{\mathrm{var}\left(X\right)} = \sigma. \]
    • For example, if the units of \( X \) are \( \mathrm{cm} \), then \( \sigma^2 \) will be in \( \mathrm{cm}^2 \).

  • Taking a square root on the variance gives us the standard deviation \( \sigma \) in the units of \( X \) itself.
  • The wide applicability of the standard deviation as a measure of spread can be understood by Chebyshev's theorem.

Chebyshev's Theorem

    Chebyshev’s Theorem
    Let \( X \) be an integrable RV with finite expected value \( \overline{x} \) and finite, non-zero variance \( \sigma^2 \). Then for any real number \( k > 0 \), \[ \begin{align} \mathcal{P}(\vert X -\overline{x}\vert > k \sigma ) \leq \frac{1}{k^2} \end{align} \]
  • Question: Suppose \( k=2 \), what does this statement tell us?
    • Answer: For \( k=2 \), we say there is a probability of at least \[ 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} \] that \( X \) takes a realization within \( k=2 \) standard deviations of the mean.
  • In terms of a collection of measurements, this can be interpreted that (on average over infinitely many replicated trials)
    • at least \( \frac{3}{4} \) of the measurements will lie within two standard deviations of the center of mass.
  • Note that Chebyshev’s theorem only gives a lower bound on how much data lies within \( k \) standard deviations of the mean.
  • This is a weaker statement than might be true for a specific random variable, but this holds for any random variable.
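
An empirical sketch of ours of the bound for \( k=2 \), sampling from an exponential distribution as a deliberately skewed example:

```python
import random
import statistics

# Sample from Exp(1), a skewed distribution, to test the k=2 bound.
random.seed(1)
sample = [random.expovariate(1.0) for _ in range(100_000)]

mean = statistics.fmean(sample)
sd = statistics.pstdev(sample)

# Fraction of the sample within two standard deviations of the mean;
# Chebyshev guarantees this is at least 1 - 1/4 = 0.75.
within = sum(abs(x - mean) <= 2 * sd for x in sample) / len(sample)
print(within)  # about 0.95, comfortably above the 0.75 floor
```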

Quantiles / percentiles

  • While together the mean \( \overline{x} \) and the standard deviation \( \sigma \) give a picture of the center and dispersion of a probability distribution, we can analyze this in a different way.

  • For example, while the mean is the notion of the “center of mass”, we may also be interested in where the upper and lower \( 50\% \) of values are separated as a different notion of “center”.

    • The value that separates this upper and lower half does not need to equal the center of mass in general, and it is known commonly as the median.
  • More generally, for any univariate cumulative distribution function \( P \) and for \( 0 < q < 1 \), we can identify \( q \) with the fraction of the probability mass that lies below a given value under the graph of the density curve.

    • We might be interested in where the lower \( q \) area is separated from the upper \( 1-q \) area.
Percentile of \( P \)
Let \( X \) be a random variable with CDF \( P \). The quantity \[ \begin{align} P^{-1}(q)=\inf \left\{x \vert P(x) \geq q \right\} \end{align} \] is called the theoretical \( q \)-th quantile or percentile of \( P \).
  • The “\( \inf \)” in the above refers to the greatest lower bound (infimum) of the set on the right-hand side.

  • We will usually refer to the \( q \)-th quantile as \( x_q \).

  • \( P^{-1} \) is called the quantile function.

    • Particularly, \( x_{\frac{1}{2}} \) is known as the theoretical median of a distribution.
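
Since \( P \) is monotone, the quantile function can be computed by bisection; a sketch of our own for the standard normal CDF:

```python
import math

# Standard normal CDF via the error function.
def cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def quantile(q, lo=-10.0, hi=10.0, tol=1e-10):
    # Smallest x with cdf(x) >= q, located by interval bisection.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) >= q:
            hi = mid
        else:
            lo = mid
    return hi

print(round(quantile(0.5), 6))    # 0.0, the theoretical median
print(round(quantile(0.975), 4))  # 1.96
```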

Median and mode

  • The median provides a different notion of the “center” as the middle of the data.
  • Particularly, for a continuous RV we have \[ \begin{align} P(x_q) := \mathcal{P}(X \leq x_q) = q \end{align} \] by the definition.
  • For \( q=\frac{1}{2} \), the median thus describes the point which separates the upper 50% from the lower 50% of possible-to-observe values.
  • Another notion of the most “central point” in the data can be the value that is measured most frequently.
  • The mode of \( p \)
    Let \( X \) be a RV with PDF \( p \). We say \( x^\ast \) is a (local) mode of \( p \) provided that there is a neighborhood \( x^\ast \in \mathcal{N} \) such that for all \( x\in \mathcal{N} \) not equal to \( x^\ast \), \[ \begin{align} p\left(x^\ast \right) > p(x). \end{align} \]
  • The above refers to a mode as a local maximum for the density function, i.e., a point at which the density is greatest in a neighborhood.
  • This corresponds to a peak of the density curve.

Differences in mean, median and mode

Differences between mean, median and mode with non-symmetric distributions.

Courtesy of Diva Jain CC via Wikimedia Commons.

  • Usually, the mean, median and mode tell us different characteristics of what we call the “center” of the data.
  • In the special case when data is symmetric and unimodal, these coincide.
  • In the figure, we see data that is all unimodal, but in three different cases.
  • In the left case, we have right skewness:
    • Here, the mean and median are displaced to the right, away from the mode.
    • Additionally, the mean and median do not coincide.
  • In the right case, we have left skewness:
    • In this case, the mean and the median are displaced to the left, away from the mode.
  • Note: the precise locations of the mean and median do not need to hold this way for all skewed distributions – this is only one example of how this can look.