2  Sampling distributions

A statistic is a quantity that can be calculated from sample data. Before observing data, a statistic is an unknown quantity and is, therefore, a rv.

Definition 2.1 (Statistic) Let X_1, \dots, X_n be observable rvs and let g be an arbitrary real-valued function of n random variables. The rv T = g(X_1, \dots, X_n) is a statistic.

We refer to the probability distribution for a statistic as a sampling distribution. The sampling distribution illustrates how the statistic will vary across possible sample data. The sampling distribution contains information about the values a statistic is likely to assume and how likely it is to assume those values prior to observing data.

Definition 2.2 (Sampling distribution) Suppose rvs X_1, \dots, X_n are a random sample from F(\theta), a distribution depending on a parameter \theta whose value is unknown. Let the rv T = g(X_1, \dots, X_n, \theta) be a function of X_1, \dots, X_n and (possibly) \theta. The distribution of T (given \theta) is the sampling distribution of T.

The sampling distribution of T is derived from the distribution of the random sample. Often we will be interested in a statistic T that is an estimator for a parameter \theta (that is, T will not depend on \theta).

In what follows, we review several special families of distributions that are widely used in probability and statistics. These special families of distributions will be indexed by one or more parameters and include discrete distributions (Bernoulli, Binomial, Poisson, and discrete Uniform) as well as continuous distributions (continuous Uniform, Normal, Student’s \mathsf{t}, \chi^2, and \mathsf{F}).

2.1 Bernoulli distribution

The Bernoulli distribution describes a single trial with two possible outcomes, often coded as 1 (success) and 0 (failure). It is the basic building block behind proportions and binomial counts.

Definition 2.3 (Bernoulli distribution) A discrete rv X has a Bernoulli distribution with parameter p \in (0,1) if P(X=x; p) = \begin{cases} p, & x=1,\\ 1-p, & x=0,\\ 0, & \text{otherwise}. \end{cases} Equivalently, P(X=x; p) = p^x(1-p)^{1-x}, \quad x\in\{0,1\}. \tag{2.1} We write X \sim \mathsf{Bernoulli}(p).

Warning: Parameter

The parameter p is the probability of success, i.e. p = P(X=1).

If X \sim \mathsf{Bernoulli}(p), then it can be shown that \mathbf{E}[X] = p \qquad \text{and} \qquad \mathop{\mathrm{Var}}(X) = p(1-p). \tag{2.2}

The pmf for two Bernoulli rvs with different values of p is shown in Figure 2.1.

Figure 2.1: The pmf of a Bernoulli rv for two values of the success probability p.

Example 2.1 Suppose that a component passes a quality check with probability p = 0.95. Let X=1 if the component passes and X=0 otherwise. Then X \sim \mathsf{Bernoulli}(0.95), and the probability that the component fails is P(X=0) = 1 - p = 0.05.
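The Bernoulli pmf in Equation 2.1 can be evaluated in base R with dbinom() using size = 1, since \mathsf{Bernoulli}(p) is the one-trial special case of the binomial. A quick sketch of Example 2.1:

```r
# Bernoulli(p) is Binomial(1, p), so dbinom() with size = 1 gives P(X = x).
p <- 0.95
p_pass <- dbinom(1, size = 1, prob = p)  # P(X = 1) = p
p_fail <- dbinom(0, size = 1, prob = p)  # P(X = 0) = 1 - p
c(pass = p_pass, fail = p_fail)
```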

Important: Indicator variables

A Bernoulli rv is often used as an indicator: it takes value 1 when a property holds, and 0 when it does not. This is exactly the setup used for estimating proportions in Section 4.2.1.

2.2 Binomial distribution

The binomial distribution describes the number of successes in k independent Bernoulli trials, each having the same success probability p.

Definition 2.4 (Binomial distribution) Let X denote the number of successes in k independent trials, where each trial is a \mathsf{Bernoulli}(p) rv. Then X has a binomial distribution with parameters k \in \mathbf{N}_{>} and p \in (0,1) if P(X=x; k,p) = \binom{k}{x} p^x (1-p)^{k-x}, \quad x = 0,1,\dots,k. \tag{2.3} We write X \sim \mathsf{Binomial}(k,p).

Warning: Parameters

The binomial distribution has two parameters: the number of trials k and the success probability p.

If X \sim \mathsf{Binomial}(k,p), then \mathbf{E}[X] = kp \qquad \text{and} \qquad \mathop{\mathrm{Var}}(X) = kp(1-p). \tag{2.4}

The pmf for \mathsf{Binomial}(10,p) for several values of p is shown in Figure 2.2.

Figure 2.2: The pmf of X \sim \mathsf{Binomial}(10,p) for several values of p.

Example 2.2 A student answers k=10 multiple-choice questions by guessing. Each question has probability p=0.25 of being correct. If X is the number of correct answers, then X \sim \mathsf{Binomial}(10, 0.25). The probability of getting exactly 3 correct answers is P(X=3) = \binom{10}{3}(0.25)^3(0.75)^7. This can be computed in R as dbinom(3, size = 10, prob = 0.25).

Tip: Binomial counts are sums of Bernoulli trials

If X_1,\dots,X_k \sim \mathsf{Bernoulli}(p) are independent, then X = \sum_{i=1}^k X_i \sim \mathsf{Binomial}(k,p). This is why binomial distributions appear naturally when counting the number of times an event occurs in repeated trials.
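This sum-of-Bernoullis representation can be illustrated by simulation; the sketch below (the simulation size is an arbitrary choice) compares the empirical frequency of 3 successes in Example 2.2 with the exact pmf value:

```r
# Simulate many batches of k = 10 independent Bernoulli(0.25) trials and
# compare the distribution of their sum with dbinom(., 10, 0.25).
set.seed(1)
k <- 10; p <- 0.25; n_sim <- 1e5
sums <- colSums(matrix(rbinom(k * n_sim, size = 1, prob = p), nrow = k))
empirical <- mean(sums == 3)            # estimate of P(X = 3)
exact     <- dbinom(3, size = k, prob = p)
c(empirical = empirical, exact = exact)
```

The two values agree to within simulation error, as Tip 2.2 predicts.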

2.3 Poisson distribution

The Poisson distribution is widely used to model the number of events in a fixed unit of exposure (often time or space), such as the number of arrivals at a desk per hour or the number of faults detected per kilometre.

Definition 2.5 (Poisson distribution) A discrete rv X has a Poisson distribution with parameter \lambda > 0 if P(X=x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0,1,2,\dots \tag{2.5} We write X \sim \mathsf{Poisson}(\lambda).

Warning: Parameter

The parameter \lambda is the mean number of events in the unit of exposure (e.g. “per hour”).

If X \sim \mathsf{Poisson}(\lambda), then \mathbf{E}[X] = \lambda \qquad \text{and} \qquad \mathop{\mathrm{Var}}(X) = \lambda. \tag{2.6}

That is, the Poisson distribution has the distinctive property that the mean equals the variance. The pmf for several values of \lambda is shown in Figure 2.3.

Figure 2.3: The pmf of a Poisson rv for several values of the rate parameter \lambda.

Example 2.3 Suppose the number of emails you receive in an hour is modelled as X \sim \mathsf{Poisson}(\lambda) with \lambda = 2.5. The probability of receiving no emails in the next hour is P(X=0) = e^{-2.5}\frac{2.5^0}{0!} = e^{-2.5}. In R, this is dpois(0, lambda = 2.5).
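Both the formula and dpois() give the same answer; a short check, with ppois() added for a cumulative probability:

```r
# P(X = 0) for X ~ Poisson(2.5), by the pmf formula and via dpois().
lambda <- 2.5
by_formula <- exp(-lambda) * lambda^0 / factorial(0)
by_dpois   <- dpois(0, lambda = lambda)
# P(X <= 3), the probability of at most three emails in the hour:
at_most_3 <- ppois(3, lambda = lambda)
c(formula = by_formula, dpois = by_dpois, at_most_3 = at_most_3)
```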

Note: When does Poisson modelling make sense?

Poisson models are commonly used when (i) events occur one at a time, (ii) they occur “at random” throughout the exposure period, and (iii) the average event rate is roughly constant over the period being observed.

2.4 Uniform Distribution

The uniform distribution places equal weight on each of the values being sampled. These values can form a finite (discrete) set or a continuum (an interval).

2.4.1 Discrete uniform distribution

The discrete uniform places equal probability on a finite set of possible values.

Definition 2.6 (Discrete uniform distribution) A discrete rv X has a discrete uniform distribution on the integers \{a,a+1,\dots,b\} with a<b, if P(X=x; a,b) = \begin{cases} \frac{1}{b-a+1}, & x \in \{a,a+1,\dots,b\},\\[4pt] 0, & \text{otherwise}. \end{cases} \tag{2.7} We write X \sim \mathsf{Unif}\{a,\dots,b\}.

Warning: Parameters

The parameters a and b determine the finite support \{a,a+1,\dots,b\}.

If X \sim \mathsf{Unif}\{a,\dots,b\}, then \mathbf{E}[X] = \frac{a+b}{2} \qquad \text{and} \qquad \mathop{\mathrm{Var}}(X) = \frac{(b-a+1)^2 - 1}{12}. \tag{2.8}

A familiar example is the outcome of a fair six-sided die: X \sim \mathsf{Unif}\{1,\dots,6\}. The pmf is shown in Figure 2.4.

Figure 2.4: The pmf of a discrete uniform rv on \{1,2,3,4,5,6\} (a fair die).

Example 2.4 Let X be the outcome of a fair die roll. Then X \sim \mathsf{Unif}\{1,\dots,6\}. The probability of rolling a value at least 5 is P(X \geq 5) = P(X=5) + P(X=6) = \frac{2}{6} = \frac{1}{3}.
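Both the direct computation over the support and the formulas in Equation 2.8 can be checked in R:

```r
# For X ~ Unif{1,...,6}, compare Equation 2.8 with direct computation
# over the support.
support <- 1:6
probs <- rep(1 / length(support), length(support))
mean_direct <- sum(support * probs)                    # E[X]
var_direct  <- sum((support - mean_direct)^2 * probs)  # Var(X)
a <- 1; b <- 6
mean_formula <- (a + b) / 2
var_formula  <- ((b - a + 1)^2 - 1) / 12
# P(X >= 5), as in Example 2.4:
p_ge5 <- sum(probs[support >= 5])
c(mean_direct, mean_formula, var_direct, var_formula, p_ge5)
```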

2.4.2 Continuous uniform distribution

In contrast to the discrete version, the continuous uniform distribution spreads probability evenly across an interval.

Definition 2.7 ((Continuous) Uniform distribution) A continuous rv X has a uniform distribution on [a,b] with a<b, if X has pdf f(x; a,b) = \frac{1}{b-a}\,, \quad a < x < b\,, or zero otherwise. We write X \sim \mathsf{Unif}(a,b).

Warning: Parameters

Note that a and b are parameters in Definition 2.7.

Exercise 2.1 As an exercise, derive the cdf using the definition. Derive a formula for the mean and variance in terms of the parameters a and b.
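One way to check a derivation numerically is to integrate against the density in R; the values a = 2 and b = 5 below are arbitrary choices:

```r
# Numerically approximate E[X] = \int_a^b x/(b-a) dx and E[X^2] for
# X ~ Unif(a, b) using integrate(), then form the variance.
a <- 2; b <- 5
ex  <- integrate(function(x) x   / (b - a), lower = a, upper = b)$value
ex2 <- integrate(function(x) x^2 / (b - a), lower = a, upper = b)$value
c(mean = ex, variance = ex2 - ex^2)
```

Your closed-form answers in terms of a and b should match these numerical values.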

2.5 Normal distribution

Normal distributions play an important role in probability and statistics as they describe many natural phenomena. For instance, the Central Limit Theorem tells us that the sample mean of a large random sample (size m) of rvs with mean \mu and variance \sigma^2 is approximately normal in distribution with mean \mu and variance \sigma^2/m.

Definition 2.8 (Normal or Gaussian distribution) A continuous rv X has a normal distribution with parameters \mu and \sigma^2, where -\infty < \mu < \infty and \sigma > 0, if X has pdf f(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi} \sigma}e^{-(x-\mu)^2/(2\sigma^2)}\,, \quad -\infty < x < \infty \,. We write X \sim \mathsf{N}(\mu, \sigma^2).

For X\sim \mathsf{N}(\mu,\sigma^2), it can be shown that \mathbf{E}(X) = \mu and \mathop{\mathrm{Var}}(X) = \sigma^2, that is, \mu is the mean and \sigma^2 is the variance of X. The pdf forms a bell-shaped curve that is symmetric about \mu, as illustrated in Figure 2.5. The value \sigma (standard deviation) is the distance from \mu to the inflection points of the curve. As \sigma increases, the dispersion in the density increases, as illustrated in Figure 2.6. Thus, the distribution’s position (location) and spread depend on \mu and \sigma.

Figure 2.5: The pdfs of two normal rvs, X_1 \sim \mathsf{N}(-2, 1) and X_2 \sim \mathsf{N}(2, 1), with different means and the same standard deviations.
Figure 2.6: The pdfs of two normal rvs, X_1 \sim \mathsf{N}(0, 9) and X_2 \sim \mathsf{N}(0, 1), with the same means and different standard deviations.

Definition 2.9 (Standard normal distribution) We say that X has a standard normal distribution if \mu=0 and \sigma = 1 and we will usually denote standard normal rvs by Z \sim \mathsf{N}(0,1) (why Z? tradition!1). We denote the cdf of the standard normal by \Phi(z) = P(Z \leq z) and write \varphi = \Phi' for its density function.

Important: Useful facts about normal variates
  1. If X \sim \mathsf{N}(\mu, \sigma^2), then Z = (X - \mu) / \sigma \sim \mathsf{N}(0,1).
  2. If Z \sim \mathsf{N}(0, 1), then X = \mu + \sigma Z \sim \mathsf{N}(\mu, \sigma^2).
  3. If X_i \sim \mathsf{N}(\mu_i, \sigma_i^2) for i = 1, \dots, n are independent rvs, then \sum_{i=1}^{n} X_i \sim \mathsf{N} \left( \sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2 \right) \,.
Warning: Variances add

In particular, for differences of independent rvs X_1 \sim \mathsf{N}(\mu_1, \sigma_1^2) and X_2 \sim \mathsf{N}(\mu_2, \sigma_2^2) then the variances add: X_1 - X_2 \sim \mathsf{N}(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2) \,.

Probabilities P(a \leq X \leq b) are found by converting a problem about X \sim \mathsf{N}(\mu, \sigma^2) into one about the standard normal distribution Z \sim \mathsf{N}(0, 1), whose probability values \Phi(z) = P(Z\leq z) can then be looked up in a table. From (1.) above, \begin{aligned} P(a < X < b) &= P\left( \frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma} \right) \\ &= \Phi \left( \frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right) \,. \end{aligned} This process is often referred to as standardising (the normal rv).
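The standardising identity can be verified in R, since pnorm() accepts mean and sd arguments directly (the values \mu = 5, \sigma = 3, a = 4, b = 6 below are arbitrary choices):

```r
# P(a < X < b) for X ~ N(mu, sigma^2), computed directly and via the
# standard normal cdf after standardising.
mu <- 5; sigma <- 3; a <- 4; b <- 6
direct       <- pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
standardised <- pnorm((b - mu) / sigma) - pnorm((a - mu) / sigma)
c(direct = direct, standardised = standardised)
```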

Example 2.5 Let X \sim \mathsf{N}(5, 9) and find P(X \geq 5.5).

\begin{aligned} P(X \geq 5.5) &= P\left(Z \geq \frac{5.5 - 5}{3}\right) \\ &= P(Z \geq 0.1667) \\ &= 1 - P(Z \leq 0.1667) \\ &= 1 - \Phi(0.1667) \\ &= 1 - 0.5662 \\ &= 0.4338\,, \end{aligned} where we look up the value of \Phi(z) = P(Z\leq z) in a table of standard normal curve areas.

The probability corresponds to the shaded area under the normal density \varphi(x) = \Phi'(x) corresponding to x \geq 5.5 (see Figure 2.7). To calculate this area, we can also use the R code: pnorm(5.5, mean = 5, sd = 3, lower.tail = FALSE).

Figure 2.7: The normal density \mathsf{N}(5,9) with the (one-sided) interval shaded in blue that corresponds to the probability P(X \geq 5.5).

Example 2.6 Let X \sim \mathsf{N}(5, 9) and find P(4 \leq X \leq 5.25).

\begin{aligned} P(4 \leq X \leq 5.25) &= P\left(\frac{4-5}{3} \leq Z \leq \frac{5.25-5}{3}\right) \\ &= P(-0.3333 \leq Z \leq 0.0833) \\ &= \Phi(0.0833) - \Phi(-0.3333) \\ &= 0.5332 - 0.3694 \\ &= 0.1638\,, \end{aligned} where we look up the values of \Phi(z) = P(Z\leq z) in a table of standard normal curve areas.

The probability corresponds to the shaded area under the normal density \varphi(x) = \Phi'(x) corresponding to 4 \leq x \leq 5.25 (see Figure 2.8). To calculate this area, we can use the R code: pnorm(5.25, mean = 5, sd = 3) - pnorm(4, mean = 5, sd = 3).

Figure 2.8: The normal density \mathsf{N}(5,9) with the (two-sided) interval shaded in blue that corresponds to the probability P(4 \leq X \leq 5.25).
Important: Empirical rule (68-95-99.7 rule)

For samples from a normal distribution, the percentage of values that lie within one, two, and three standard deviations of the mean are 68.27\%, 95.45\%, and 99.73\%, respectively. That is, for X \sim \mathsf{N}(\mu, \sigma^2), P(\mu - 1 \sigma \leq X \leq \mu + 1 \sigma ) \approx 0.6827\,, P(\mu - 2 \sigma \leq X \leq \mu + 2 \sigma ) \approx 0.9545\,, P(\mu - 3 \sigma \leq X \leq \mu + 3 \sigma ) \approx 0.9973\,. For a normal population, nearly all the values lie within “three sigmas” of the mean.
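These percentages follow directly from the standard normal cdf and can be reproduced with pnorm():

```r
# After standardising, P(mu - c*sigma <= X <= mu + c*sigma) reduces to
# pnorm(c) - pnorm(-c) for any normal rv.
within <- function(c) pnorm(c) - pnorm(-c)
round(c(one = within(1), two = within(2), three = within(3)), 4)
```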

2.6 Student’s \mathsf{t} distribution

Student’s \mathsf{t} distribution gets its peculiar name as it was first published under the pseudonym “Student”.2 This bit of obfuscation was to protect the identity of his employer,3 and thereby vital trade secrets, in a highly competitive and lucrative industry.

Definition 2.10 (Student’s \mathsf{t} distribution) A continuous rv X has a \mathsf{t} distribution with parameter \nu > 0, if X has pdf f(x; \nu) = \frac{\Gamma\left(\tfrac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \Gamma \left(\tfrac{\nu}{2}\right)} \left( 1 + \tfrac{x^2}{\nu} \right)^{- \frac{\nu+1}{2}} \,, \quad -\infty < x < \infty\,. We write X \sim \mathsf{t}(\nu). Note \Gamma is the standard gamma function.4

The density for \mathsf{t}(\nu) for several values of \nu are plotted below in Figure 2.9.

Figure 2.9: The density for \mathsf{t}(\nu) for several values of \nu (df).
Important: Properties of \mathsf{t} distributions
  1. The density for \mathsf{t}(\nu) is a bell-shaped curve centred at 0.
  2. The density for \mathsf{t}(\nu) is more spread out than the standard normal density (i.e., it has “fatter tails” than the normal).
  3. As \nu \to \infty, the \mathsf{t}(\nu) density converges to the standard normal density (i.e., its spread decreases towards that of the standard normal as \nu grows).

If X \sim \mathsf{t}(\nu), then \mathbf{E}[X] = 0 for \nu > 1 (otherwise the mean is undefined) and \mathop{\mathrm{Var}}(X) = \nu/(\nu - 2) for \nu > 2 (the variance is infinite for 1 < \nu \leq 2 and undefined otherwise).
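The fatter tails are easy to see numerically; the sketch below (the cut-off 2 and the df values are arbitrary choices) compares P(|X| > 2) under \mathsf{t}(\nu) with the standard normal:

```r
# By symmetry, P(|X| > 2) = 2 * P(X < -2); pt() gives the t cdf.
tail_prob <- function(nu) 2 * pt(-2, df = nu)
c(t_3 = tail_prob(3), t_30 = tail_prob(30), normal = 2 * pnorm(-2))
```

The tail probability shrinks towards the normal value as \nu grows, matching property 3 above.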

Note: Cauchy distribution

A \mathsf{t} distribution with \nu = 1 has pdf f(x) = \frac{1}{\pi (1 + x^2)}\,, and we call this the Cauchy distribution.

2.7 \chi^2 distribution

The \chi^2 distribution arises as the distribution of a sum of the squares of \nu independent standard normal rvs.

Definition 2.11 (\chi^2 distribution) A continuous rv X has a \chi^2 distribution with parameter \nu \in \mathbf{N}_{>}, if X has pdf \begin{equation*} f(x; \nu) = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} x^{(\nu/2)-1} e^{-x/2} \,, \end{equation*} with support x \in (0, \infty) if \nu=1, otherwise x \in [0, \infty). We write X \sim \chi^2(\nu).

The pdf f(x; \nu) of the \chi^2(\nu) distribution depends on a positive integer \nu referred to as the df. The densities for several values of \nu are plotted below in Figure 2.10. The density f(x;\nu) is positively skewed, i.e., the right tail is longer, so the mass is concentrated towards the left of Figure 2.10. The distribution becomes more symmetric as \nu increases. We denote critical values of the \chi^2(\nu) distribution by \chi^2_{\alpha, \nu}.

Figure 2.10: The density for \chi^2(\nu) for several values of \nu (df).
Warning: Skew

Unlike the normal and \mathsf{t} distributions, the \chi^2 distribution is not symmetric! This means that critical values, e.g., \chi^2_{0.99, \nu} \quad \text{and}\quad \chi^2_{0.01,\nu}\,, are not equal. Hence, it will be necessary to look up both values for CIs based on \chi^2 critical values.

If X \sim \chi^2(\nu), then \mathbf{E}[X] = \nu and \mathop{\mathrm{Var}}[X] = 2\nu.
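Because the distribution is asymmetric, both critical values must be looked up separately; in R this is done with qchisq() (the df \nu = 10 and the 95% level below are arbitrary choices):

```r
# Critical values chi^2_{alpha, nu} have area alpha to the RIGHT, so they
# come from qchisq() at probability 1 - alpha (or via lower.tail = FALSE).
nu <- 10
lower <- qchisq(0.025, df = nu)   # chi^2_{0.975, 10}
upper <- qchisq(0.975, df = nu)   # chi^2_{0.025, 10}
c(lower = lower, upper = upper, mean = nu, var = 2 * nu)
```

Note the two values are far from symmetric about the mean \nu = 10.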

2.8 \mathsf{F} distribution

The \mathsf{F} distribution (“F” for Fisher) arises as the distribution of test statistics used when comparing population variances and in the analysis of variance (see Chapter 6).

Definition 2.12 (\mathsf{F} distribution) A continuous rv X has an \mathsf{F} distribution with df parameters \nu_1 and \nu_2, if X has pdf f(x; \nu_1, \nu_2) = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right) \nu_1^{\nu_1/2} \nu_2^{\nu_2/2}} {\Gamma\left(\frac{\nu_1}{2}\right) \Gamma\left(\frac{\nu_2}{2}\right)} \frac{x^{\nu_1/2 - 1}}{(\nu_2+\nu_1 x)^{(\nu_1+\nu_2)/2}} \,, \quad x > 0\,. We write X \sim \mathsf{F}(\nu_1, \nu_2).

The pdf f(x; \nu_1, \nu_2) of the \mathsf{F}(\nu_1, \nu_2) distribution depends on two positive integers \nu_1 and \nu_2 referred to, respectively, as the numerator and denominator df. The density is plotted below for several combinations of (\nu_1, \nu_2) in Figure 2.11.

Figure 2.11: The density for \mathsf{F}(\nu_1, \nu_2) for several combinations of (\nu_1, \nu_2).
Tip: Where do the terms numerator and denominator df come from?

The \mathsf{F} distribution is related to ratios of \chi^2 rvs, as captured in Theorem 2.1.

Theorem 2.1 (Ratio of \chi^2 rvs) If X_1 \sim \chi^2(\nu_1) and X_2 \sim \chi^2(\nu_2) are independent rvs, then the rv F = \frac{X_1 / \nu_1}{X_2 / \nu_2} \sim \mathsf{F}(\nu_1,\nu_2)\,; that is, the ratio of two independent \chi^2 rvs, each divided by its df, has an \mathsf{F}(\nu_1, \nu_2) distribution.
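Theorem 2.1 can be illustrated by simulation (the df and cut-off choices below are arbitrary): the scaled ratio of independent \chi^2 rvs should agree with probabilities computed from pf():

```r
# Simulate the scaled ratio of independent chi^2 rvs and compare an
# empirical tail probability with the exact F(nu1, nu2) value from pf().
set.seed(1)
nu1 <- 5; nu2 <- 8; n_sim <- 1e5
f_sim <- (rchisq(n_sim, df = nu1) / nu1) / (rchisq(n_sim, df = nu2) / nu2)
empirical_tail <- mean(f_sim > 2)
exact_tail     <- pf(2, df1 = nu1, df2 = nu2, lower.tail = FALSE)
c(empirical = empirical_tail, exact = exact_tail)
```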


  1. “Traditions, traditions… Without our traditions, our lives would be as shaky as a fiddler on the roof!” [https://www.youtube.com/watch?v=gRdfX7ut8gw].↩︎

  2. William Sealy Gosset (1876–1937) wrote under the pseudonym “Student” [https://mathshistory.st-andrews.ac.uk/Biographies/Gosset/].↩︎

  3. Gosset invented the t-test to handle small samples for quality control in brewing, specifically for the Guinness brewery in Dublin [https://www.wikiwand.com/en/Guinness_Brewery].↩︎

  4. The gamma function is defined by \Gamma(z) = \int_0^\infty x^{z-1}e^{-x} dx when the real part of z is positive. For any positive integer n, \Gamma(n) = (n-1)! and for half-integers \Gamma(\tfrac{1}{2} + n) = \frac{(2n)!}{4^n n!} \sqrt{\pi}.↩︎