4 Two-sample inferences
We consider inferences—estimators, confidence intervals, and hypothesis tests—for comparing means (Section 4.1), proportions (Section 4.3), and variances (Section 4.4) based on two independent samples from different populations. We also consider inferences when the samples are not independent, so-called paired samples, in Section 4.2.
4.1 Comparing means
Let us assume that we have two normal populations with iid samples X_1, \dots, X_m \sim \mathsf{N}(\mu_X, \sigma_X^2) and Y_1, \dots, Y_n \sim \mathsf{N}(\mu_Y, \sigma_Y^2) and, moreover, that the X and Y samples are independent of one another. When comparing the means of two populations, the quantity of interest is the difference: \mu_X - \mu_Y.
Proposition 4.1 If we consider the sample means \overline{X} and \overline{Y}, then the mean of the variable \overline{X} - \overline{Y} is, \mu_{\overline{X} - \overline{Y}} = \mathbf{E}\left[ \overline{X} - \overline{Y} \right] = \mu_X - \mu_Y\,, and the variance is, \sigma_{\overline{X} - \overline{Y}}^2 = \mathop{\mathrm{Var}}\left[ \overline{X} - \overline{Y} \right] = \frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n} \,.
Proposition 4.1 follows directly from the definition of the sample mean in Equation 2.1 and properties of expectation and variance. If our parameter of interest is \theta = \mu_X - \mu_Y\,, then its estimator, \widehat{\theta} = \overline{X} - \overline{Y}\,, is normally distributed with mean and variance given by Proposition 4.1. If the sample sizes m and n are large, then the estimator is approximately normally distributed by the Central Limit Theorem regardless of the underlying population distributions. We now discuss CIs and hypothesis tests for comparing population means via \theta = \mu_X - \mu_Y. We consider three cases when comparing means:
- normal populations when the variances \sigma_X^2 and \sigma_Y^2 are known,
- any populations with unknown variances \sigma_X^2 and \sigma_Y^2, when the sample sizes m and n are large,
- normal populations when the variances \sigma_X^2 and \sigma_Y^2 are unknown and the sample sizes m and n are small,
noting that the development parallels that of Section 3.1.
4.1.1 Comparing means of normal populations when variances are known
When \sigma_X^2 and \sigma_Y^2 are known, standardizing \overline{X} - \overline{Y} yields the standard normal variable:
Z = \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}}}\quad \sim \mathsf{N}(0,1)\,. \tag{4.1}
Inferences for the parameter of interest \theta then proceed as in the single-sample case, using the standardized variable in Equation 4.1.
Proposition 4.2 A 100(1-\alpha)\% CI for the parameter \theta = \mu_X - \mu_Y based on samples of size m from a normal population \mathsf{N}(\mu_X, \sigma_X^2) and of size n from \mathsf{N}(\mu_Y, \sigma_Y^2) with known variances, is given by (\overline{x} - \overline{y}) \pm z_{\alpha/2} \cdot \sqrt{\frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}} \,.
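For concreteness, here is a minimal sketch in Python (using numpy and scipy; the function name and the summary statistics in the usage comment are purely illustrative) of the CI in Proposition 4.2.

```python
import numpy as np
from scipy import stats

def two_sample_z_ci(xbar, ybar, sigma_x, sigma_y, m, n, alpha=0.05):
    """CI for mu_X - mu_Y with known population variances (Proposition 4.2)."""
    se = np.sqrt(sigma_x**2 / m + sigma_y**2 / n)
    z_crit = stats.norm.ppf(1 - alpha / 2)        # z_{alpha/2}
    diff = xbar - ybar
    return diff - z_crit * se, diff + z_crit * se

# e.g. two_sample_z_ci(52.1, 49.8, sigma_x=3.0, sigma_y=2.5, m=40, n=35)
```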
Proposition 4.3 Assume that we sample iid X_1, \dots, X_m \sim \mathsf{N}(\mu_X, \sigma_X^2) and iid Y_1, \dots, Y_n \sim \mathsf{N}(\mu_Y, \sigma_Y^2) and that the X and Y samples are independent.
Consider H_0 : \mu_X - \mu_Y = \theta_0. The test statistic is Z = \frac{\overline{X} - \overline{Y} - \theta_0}{\sqrt{\frac{\sigma_{X}^2}{m} + \frac{\sigma_{Y}^2}{n}}}\,. \tag{4.2}
For a hypothesis test at level \alpha, we use the following procedure:
If H_a : \mu_X - \mu_Y > \theta_0, then P = 1 - \Phi(z), i.e., upper-tail R = \{z > z_{\alpha}\}.
If H_a : \mu_X - \mu_Y < \theta_0, then P = \Phi(z), i.e., lower-tail R = \{z < - z_{\alpha}\}.
If H_a : \mu_X - \mu_Y \neq \theta_0, then P = 2(1-\Phi(|z|)), i.e., two-tailed R = \{|z| > z_{\alpha/2}\}.
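A small sketch of the z test in Proposition 4.3, mirroring the three alternatives listed above; the helper name and its arguments are illustrative, not part of any library.

```python
import numpy as np
from scipy import stats

def two_sample_z_test(xbar, ybar, sigma_x, sigma_y, m, n, theta0=0.0,
                      alternative="two-sided"):
    """Two-sample z test of H0: mu_X - mu_Y = theta0 with known variances."""
    z = (xbar - ybar - theta0) / np.sqrt(sigma_x**2 / m + sigma_y**2 / n)
    if alternative == "greater":        # H_a: mu_X - mu_Y > theta0
        p = 1 - stats.norm.cdf(z)
    elif alternative == "less":         # H_a: mu_X - mu_Y < theta0
        p = stats.norm.cdf(z)
    else:                               # H_a: mu_X - mu_Y != theta0
        p = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p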
4.1.2 Comparing means when the sample sizes are large
When the samples are large, the assumptions about the normality of the populations and knowledge of the variances \sigma_X^2 and \sigma_Y^2 can be relaxed. For sufficiently large m and n, the difference of the sample means, \overline{X} - \overline{Y}, has approximately a normal distribution for any underlying population distributions by the Central Limit Theorem. Moreover, if m and n are large enough, replacing the population variances with the sample variances S_X^2 and S_Y^2 will not increase the variability of the estimator or the test statistic too much.
Proposition 4.4 For m and n sufficiently large, an approximate 100(1-\alpha)\% CI for \mu_X - \mu_Y for two samples from populations with any underlying distribution is given by (\overline{x} - \overline{y}) \pm z_{\alpha/2} \cdot \sqrt{\frac{s_{X}^2}{m} + \frac{s_{Y}^2}{n}}\,.
Proposition 4.5 Under the same assumptions and procedures as in Proposition 4.3, a large-sample, i.e., m > 40 and n > 40, test statistic, Z = \frac{\overline{X} - \overline{Y} - \theta_0}{\sqrt{\frac{S_{X}^2}{m} + \frac{S_{Y}^2}{n}}}\,, can be used in place of Equation 4.2 for hypothesis testing.
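A minimal sketch of the large-sample CI of Proposition 4.4, assuming the raw observations are available as arrays; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def large_sample_ci(x, y, alpha=0.05):
    """Approximate CI for mu_X - mu_Y (Proposition 4.4), m and n large."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, n = len(x), len(y)
    se = np.sqrt(x.var(ddof=1) / m + y.var(ddof=1) / n)   # sample variances S_X^2, S_Y^2
    z_crit = stats.norm.ppf(1 - alpha / 2)
    diff = x.mean() - y.mean()
    return diff - z_crit * se, diff + z_crit * se
```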
4.1.3 Comparing means of normal populations when variances are unknown and the sample size is small
If \sigma_X^2 and \sigma_Y^2 are unknown and either sample is small (e.g., m < 30 or n < 30), but both populations are normally distributed, then we can use Student’s \mathsf{t} distribution to make inferences. We provide the following theorem without proof.
Theorem 4.1 When both population distributions are normal, the standardised variable T = \frac{\overline{X}-\overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{m} + \frac{S_Y^2}{n}}} \quad \sim \mathsf{t}(\nu)\,, where the df \nu is estimated from the data. Namely, \nu is given by (rounding down to the nearest integer): \nu = \frac{ \left( \frac{s_X^2}{m} + \frac{s_Y^2}{n} \right)^2}{\frac{(s_X^2 / m)^2}{m-1} + \frac{(s_Y^2/n)^2}{n-1}} = \frac{ \left( s_{\overline{X}}^2 + s_{\overline{Y}}^2 \right)^2}{\frac{s_{\overline{X}}^4}{m-1} + \frac{s_{\overline{Y}}^4}{n-1}}\,, \tag{4.3} where s_X^2 and s_Y^2 are the observed sample variances; the second form shows that Equation 4.3 can also be written in terms of the estimated standard errors of the sample means, s_{\overline{X}} = \frac{s_X}{\sqrt{m}} \quad \text{and} \quad s_{\overline{Y}} = \frac{s_Y}{\sqrt{n}} \,.
The formula in Equation 4.3 for the data-driven choice of \nu thus only requires the estimated standard errors of the two sample means.
Proposition 4.6 A 100(1-\alpha)\% CI for \mu_X - \mu_Y for two samples of size m and n from normal populations where the variances are unknown is given by (\overline{x} - \overline{y}) \pm t_{\alpha/2, \nu} \sqrt{ \frac{s_X^2}{m} + \frac{s_Y^2}{n}}\,, where we recall that t_{\alpha/2, \nu} is the \alpha/2 critical value of \mathsf{t}(\nu) with \nu given by Equation 4.3.
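The following sketch (illustrative names; numpy and scipy assumed) computes the data-driven df of Equation 4.3 and the CI of Proposition 4.6.

```python
import numpy as np
from scipy import stats

def welch_df(x, y):
    """Data-driven degrees of freedom nu from Equation 4.3 (rounded down)."""
    m, n = len(x), len(y)
    vx, vy = np.var(x, ddof=1) / m, np.var(y, ddof=1) / n   # s_X^2/m and s_Y^2/n
    nu = (vx + vy) ** 2 / (vx**2 / (m - 1) + vy**2 / (n - 1))
    return int(np.floor(nu))

def welch_ci(x, y, alpha=0.05):
    """CI for mu_X - mu_Y from Proposition 4.6 (two-sample t, unequal variances)."""
    m, n = len(x), len(y)
    se = np.sqrt(np.var(x, ddof=1) / m + np.var(y, ddof=1) / n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=welch_df(x, y))   # t_{alpha/2, nu}
    diff = np.mean(x) - np.mean(y)
    return diff - t_crit * se, diff + t_crit * se
```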
Proposition 4.7 Assume that we sample iid X_1, \dots, X_m and iid Y_1, \dots, Y_n from normal populations with unknown variances and means \mu_X and \mu_Y, respectively, and that the X and Y samples are independent.
Consider H_0 : \mu_X - \mu_Y = \theta_0. The test statistic is T = \frac{\overline{X} - \overline{Y} - \theta_0}{\sqrt{\frac{S_{X}^2}{m} + \frac{S_{Y}^2}{n}}}\,. \tag{4.4}
For a hypothesis test at level \alpha, we use the following procedure:
If H_a : \mu_X - \mu_Y > \theta_0, then the P-value is the area under \mathsf{t}(\nu) to the right of t, i.e., upper-tail R = \{t > t_{\alpha,\nu}\}.
If H_a : \mu_X - \mu_Y < \theta_0, then the P-value is the area under \mathsf{t}(\nu) to the left of t, i.e., lower-tail R = \{t < - t_{\alpha,\nu}\}.
If H_a : \mu_X - \mu_Y \neq \theta_0, then the P-value is twice the area under \mathsf{t}(\nu) to the right of |t|, i.e., two-tailed R = \{|t| > t_{\alpha/2, \nu}\}.
Here \nu is given by Equation 4.3.
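In practice, scipy implements this two-sample t test directly: scipy.stats.ttest_ind with equal_var=False uses the data-driven df of Equation 4.3 and tests H_0 with \theta_0 = 0, returning a two-sided P-value by default. The data below are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=12)   # hypothetical small sample from N(mu_X, sigma_X^2)
y = rng.normal(loc=4.4, scale=1.8, size=15)   # hypothetical small sample from N(mu_Y, sigma_Y^2)

# equal_var=False requests the two-sample (Welch) t test of Proposition 4.7.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)   # two-sided P-value
```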
If the variances of the normal populations are unknown but are the same, \sigma_X^2 = \sigma_Y^2, then deriving CIs and test statistics for comparing the means can be simplified by considering a combined or pooled estimator for the single parameter \sigma^2. If we have two samples from populations with variance \sigma^2, each sample provides an estimate for \sigma^2. That is, S_X^2, based on the m observations of the first sample, is one estimator for \sigma^2 and another is given by S_Y^2, based on the n observations of the second sample. The correct way to combine these two estimators into a single estimator of the common variance \sigma^2 is to consider the pooled estimator of \sigma^2, S_{\mathsf{p}}^2 = \frac{m-1}{m+n-2} S_X^2 + \frac{n-1}{m+n-2} S_Y^2 \,. \tag{4.5} The pooled estimator is a weighted average that adjusts for differences between the sample sizes m and n.
If m \neq n, then the estimator based on the larger sample contains more information about the parameter \sigma^2. Thus, the simple average (S_X^2 + S_Y^2)/2 wouldn’t be fair, would it?
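A one-function sketch of the pooled estimator in Equation 4.5 (illustrative name; numpy assumed):

```python
import numpy as np

def pooled_variance(x, y):
    """Pooled estimator of the common variance sigma^2 (Equation 4.5)."""
    m, n = len(x), len(y)
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)   # sample variances
    return ((m - 1) * sx2 + (n - 1) * sy2) / (m + n - 2)
```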
Proposition 4.8 A 100(1-\alpha)\% CI for \mu_X - \mu_Y for two samples of size m and n from normal populations with a common unknown variance \sigma^2 = \sigma_X^2 = \sigma_Y^2 is given by (\overline{x} - \overline{y}) \pm t_{\alpha/2, m + n - 2} \cdot \sqrt{ s_{\mathsf{p}}^2 \left( \frac{1}{m} + \frac{1}{n} \right)} \,, where we recall that t_{\alpha/2, m+n-2} is the \alpha/2 critical value of \mathsf{t}(\nu) with \nu = m + n - 2 df.
Similarly, one can consider a pooled \mathsf{t} test, i.e., a hypothesis test based on the pooled estimator for the variance as opposed to the two-sample \mathsf{t} test in Proposition 4.7. In the case of a pooled \mathsf{t} test, the test statistic T = \frac{\overline{X} - \overline{Y} - \theta_0}{\sqrt{S_{\mathsf{p}}^2 \left(\frac{1}{m} + \frac{1}{n}\right)}}\,, with the pooled estimator of the variance, replaces Equation 4.4 in Proposition 4.7 and the same procedures are followed for determining the P-value with \nu = m+n-2 in place of Equation 4.3. If you have reasons to believe that \sigma_X^2 = \sigma_Y^2, these pooled \mathsf{t} procedures are appealing because \nu is very easy to compute.
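When the raw data are available and \theta_0 = 0, the pooled t test is what scipy.stats.ttest_ind computes with equal_var=True (its default). A short illustrative sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.2, size=10)   # hypothetical sample; variances assumed equal
y = rng.normal(2.5, 1.2, size=14)

# equal_var=True gives the pooled t test with nu = m + n - 2 degrees of freedom.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)
```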
Pooled \mathsf{t} procedures are not robust to violations of the equal-variance assumption. Theoretically, you could first carry out a statistical test of H_0 : \sigma_X^2 = \sigma_Y^2 on the equality of variances and then use a pooled \mathsf{t} procedure if the null hypothesis is not rejected. However, there is no free lunch: the typical \mathsf{F} test for equal variances (see Section 4.4) is sensitive to the normality assumptions. The two-sample \mathsf{t} procedures, with the data-driven choice of \nu in Equation 4.3, are therefore recommended unless, of course, you have a very compelling reason to believe \sigma_X^2 = \sigma_Y^2.
4.2 Comparing paired samples
The preceding analysis for comparing population means was based on the assumption that a random sample X_1, \dots, X_m is drawn from a distribution with mean \mu_X and that a completely independent random sample Y_1, \dots, Y_n is drawn from a distribution with mean \mu_Y. Some situations, e.g., comparing observations on the same subjects before and after a treatment or exposure, necessitate the consideration of paired values.
Consider a random sample of iid pairs, (X_1, Y_1), \dots, (X_n, Y_n)\,, with \mathbf{E}[X_i] = \mu_X and \mathbf{E}[Y_i] = \mu_Y. If we are interested in making inferences about the difference \mu_X - \mu_Y, then the paired differences D_i = X_i - Y_i \,,\quad i=1, \dots, n\,, constitute a sample with mean \mu_D = \mu_X - \mu_Y that can be treated using single-sample CIs and tests, e.g., see Section 3.1.3.
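A brief sketch of the paired analysis: form the differences D_i and apply a one-sample t procedure, or equivalently call scipy.stats.ttest_rel. The before/after data below are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same n = 20 subjects.
rng = np.random.default_rng(2)
before = rng.normal(100.0, 10.0, size=20)
after = before + rng.normal(-2.0, 4.0, size=20)

# Paired analysis: differences treated as a single sample with mean mu_D.
d = before - after
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Equivalent paired t test (same t statistic, two-sided P-value).
t_stat, p_value = stats.ttest_rel(before, after)
```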
4.3 Comparing proportions
Consider a population containing a proportion p_X of individuals satisfying a given property. For a sample of size m from this population, we denote the sample proportion by \widehat{p}_X. Likewise, we consider a population containing a proportion p_Y of individuals satisfying the same given property. For a sample of size n from this population, we denote the sample proportion by \widehat{p}_Y. We assume the samples from the X and Y populations are independent. The natural estimator for the difference in population proportions p_X - p_Y is the difference in the sample proportions \widehat{p}_X - \widehat{p}_Y.
Provided the samples are much smaller than the population sizes (i.e., each population is at least about 20 times larger than its sample), \mu_{(\widehat{p}_X - \widehat{p}_Y)} = \mathbf{E}[\widehat{p}_X - \widehat{p}_Y] = p_X - p_Y\,, and \sigma_{(\widehat{p}_X - \widehat{p}_Y)}^2 = \mathop{\mathrm{Var}}[\widehat{p}_X - \widehat{p}_Y] = \frac{p_X(1-p_X)}{m} + \frac{p_Y(1-p_Y)}{n}\,, because the counts of individuals satisfying the given property in the two samples are independent draws from \mathsf{Binom}(m, p_X) and \mathsf{Binom}(n, p_Y), respectively. Further, if m and n are large (e.g., m \geq 30 and n \geq 30), then \widehat{p}_X and \widehat{p}_Y are (approximately) normally distributed. Standardizing \widehat{p}_X - \widehat{p}_Y yields Z = \frac{\widehat{p}_X - \widehat{p}_Y - (p_X - p_Y)}{\sqrt{\frac{p_X(1-p_X)}{m} + \frac{p_Y(1-p_Y)}{n}}} \quad \sim \mathsf{N}(0,1)\,. A CI for p_X - p_Y then follows from the large-sample CI considered in Section 3.1.2.
Proposition 4.9 An approximate 100(1-\alpha)\% CI for p_X - p_Y is given by \widehat{p}_X - \widehat{p}_Y \pm z_{\alpha/2}\sqrt{\frac{\widehat{p}_X (1 - \widehat{p}_X)}{m} + \frac{\widehat{p}_Y (1 - \widehat{p}_Y)}{n}}\,, \tag{4.6} and, as a rule of thumb, can be reliably used if m \widehat{p}_X, m (1 - \widehat{p}_X), n \widehat{p}_Y, and n (1-\widehat{p}_Y) are greater than or equal to 10.
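A minimal sketch of the CI in Equation 4.6, assuming the success counts and sample sizes are given; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def two_prop_ci(x_successes, m, y_successes, n, alpha=0.05):
    """Approximate CI for p_X - p_Y (Proposition 4.9 / Equation 4.6)."""
    px, py = x_successes / m, y_successes / n            # sample proportions
    se = np.sqrt(px * (1 - px) / m + py * (1 - py) / n)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    diff = px - py
    return diff - z_crit * se, diff + z_crit * se
```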
Proposition 4.9 does not pool the estimators for the population proportions. However, if we are considering a hypothesis test for the equality of the population proportions with the null hypothesis H_0 : p_X - p_Y = 0 \,, then we assume p_X = p_Y as our default position. Therefore, as a matter of consistency, the standard error in Equation 4.6 should be replaced by one based on the pooled estimator of the common population proportion, \widehat{p} = \frac{m}{m + n} \widehat{p}_X + \frac{n}{m + n} \widehat{p}_Y \,.
Proposition 4.10 Assume that m \widehat{p}_X, m (1-\widehat{p}_X), n\widehat{p}_Y, n(1-\widehat{p}_Y) are all greater than 10.
Consider H_0 : p_X - p_Y = 0. The test statistic is Z = \frac{\widehat{p}_X - \widehat{p}_Y}{\sqrt{\widehat{p} (1 - \widehat{p}) \left( \frac{1}{m} + \frac{1}{n} \right)}} \,.
For a hypothesis test at level \alpha, we use the following procedure:
If H_a : p_X - p_Y > 0, then P = 1 - \Phi(z), i.e., upper-tail R = \{z > z_{\alpha}\}.
If H_a : p_X - p_Y < 0, then P = \Phi(z), i.e., lower-tail R = \{z < - z_{\alpha}\}.
If H_a : p_X - p_Y \neq 0, then P = 2(1-\Phi(|z|)), i.e., two-tailed R = \{|z| > z_{\alpha/2}\}.
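A sketch of the pooled two-proportion z test of Proposition 4.10, mirroring the three alternatives above (illustrative names; numpy and scipy assumed).

```python
import numpy as np
from scipy import stats

def two_prop_z_test(x_successes, m, y_successes, n, alternative="two-sided"):
    """z test of H0: p_X = p_Y using the pooled estimate p_hat (Proposition 4.10)."""
    px, py = x_successes / m, y_successes / n
    p_pool = (x_successes + y_successes) / (m + n)   # = (m*px + n*py) / (m + n)
    z = (px - py) / np.sqrt(p_pool * (1 - p_pool) * (1 / m + 1 / n))
    if alternative == "greater":        # H_a: p_X - p_Y > 0
        p = 1 - stats.norm.cdf(z)
    elif alternative == "less":         # H_a: p_X - p_Y < 0
        p = stats.norm.cdf(z)
    else:                               # H_a: p_X - p_Y != 0
        p = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p
```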
4.4 Comparing variances
For a random sample X_1, \dots, X_m \sim \mathsf{N}(\mu_X, \sigma_X^2) and an independent random sample Y_1, \dots, Y_n \sim \mathsf{N}(\mu_Y, \sigma_Y^2)\,, the rv F = \frac{S_X^2 / \sigma_X^2}{S_Y^2 / \sigma_Y^2} \quad \sim \mathsf{F}(m-1, n-1)\,, \tag{4.7} that is, F has an \mathsf{F} distribution with df \nu_1 = m-1 and \nu_2 = n-1. The statistic F in Equation 4.7 involves the ratio of the variances \sigma_X^2 / \sigma_Y^2 rather than their difference; therefore, the plausibility of \sigma_X^2 = \sigma_Y^2 will be assessed by how much this ratio differs from 1.
Proposition 4.11 For the null hypothesis H_0 : \sigma_X^2 = \sigma_Y^2, the test statistic to consider is: f = \frac{s_X^2}{s_Y^2} and the P-values are determined by the \mathsf{F}(m-1, n-1) distribution where m and n are the respective sample sizes.
A 100(1-\alpha)\% CI for the ratio \sigma_X^2 / \sigma_Y^2 is based on the probability statement, P(F_{1-\alpha/2, \nu_1, \nu_2} < F < F_{\alpha/2, \nu_1, \nu_2}) = 1 - \alpha\,, where F_{\alpha/2, \nu_1, \nu_2} is the \alpha/2 critical value of the \mathsf{F}(\nu_1 = m-1, \nu_2 = n-1) distribution. Substituting Equation 4.7 for F, manipulating the inequalities to isolate the ratio \sigma_X^2 / \sigma_Y^2, and replacing S_X^2 and S_Y^2 with the point estimates s_X^2 and s_Y^2 gives P \left( \frac{1}{F_{\alpha/2, \nu_1, \nu_2}} \frac{s_X^2}{s_Y^2} < \frac{\sigma_X^2}{\sigma_Y^2} < \frac{1}{F_{1-\alpha/2, \nu_1, \nu_2}} \frac{s_X^2}{s_Y^2} \right) = 1 - \alpha \,.
Proposition 4.12 A 100(1-\alpha)\% CI for the ratio of population variances \sigma_X^2 / \sigma_Y^2 is given by \left(F_{\alpha/2, m-1, n-1}^{-1} s_X^2 / s_Y^2 \,, F_{1-\alpha/2, m-1, n-1}^{-1} s_X^2 / s_Y^2 \right)\,.
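A sketch of the CI in Proposition 4.12 using scipy's F quantile function (illustrative function name); note that the text's upper-tail critical value F_{\alpha/2, m-1, n-1} corresponds to scipy's ppf(1 - \alpha/2).

```python
import numpy as np
from scipy import stats

def variance_ratio_ci(x, y, alpha=0.05):
    """CI for sigma_X^2 / sigma_Y^2 (Proposition 4.12)."""
    m, n = len(x), len(y)
    ratio = np.var(x, ddof=1) / np.var(y, ddof=1)                # s_X^2 / s_Y^2
    f_upper = stats.f.ppf(1 - alpha / 2, dfn=m - 1, dfd=n - 1)   # F_{alpha/2, m-1, n-1}
    f_lower = stats.f.ppf(alpha / 2, dfn=m - 1, dfd=n - 1)       # F_{1-alpha/2, m-1, n-1}
    return ratio / f_upper, ratio / f_lower
```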
Proposition 4.13 Assume the population distributions are normal and the random samples are independent of one another.
Consider H_0 : \sigma_X^2 = \sigma_Y^2. The test statistic is F = S_X^2 / S_Y^2 \,.
For a hypothesis test at level \alpha, we use the following procedure:
If H_a : \sigma_X^2 > \sigma_Y^2, then the P-value is A_R, the area under the \mathsf{F}(m-1, n-1) curve to the right of f.
If H_a : \sigma_X^2 < \sigma_Y^2, then the P-value is A_L, the area under the \mathsf{F}(m-1, n-1) curve to the left of f.
If H_a : \sigma_X^2 \neq \sigma_Y^2, then the P-value is 2 \cdot \min(A_R, A_L).
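A sketch of the two-tailed F test of Proposition 4.13 (illustrative name; numpy and scipy assumed).

```python
import numpy as np
from scipy import stats

def variance_f_test(x, y):
    """F test of H0: sigma_X^2 = sigma_Y^2 (Proposition 4.13), two-tailed P-value."""
    m, n = len(x), len(y)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)            # observed f = s_X^2 / s_Y^2
    a_right = 1 - stats.f.cdf(f, dfn=m - 1, dfd=n - 1)   # area to the right of f
    a_left = stats.f.cdf(f, dfn=m - 1, dfd=n - 1)        # area to the left of f
    return f, 2 * min(a_right, a_left)
```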