7  Categorical data

7.1 Multinomial experiments

Suppose we have a population divided into k > 2 distinct categories. We consider an experiment in which we select m individuals (or objects) from the population and categorise each one. We denote the population proportion in the ith category by p_i. If the sample size m is much smaller than the population size M (so that the m trials are approximately independent), this experiment will be approximately multinomial with probability p_i that a trial results in category i, for i=1, \dots, k.

Before the experiment is performed, we denote the number (or count) of the trials resulting in category i by the rv N_i. The expected number of trials that result in category i is given by \E[N_i] = m p_i\,, \quad i=1, \dots, k\,. \tag{7.1} After the experiment is performed, we denote the corresponding observed value by n_i. Since each trial results in exactly one of the categories, \sum_{i=1}^k N_i = \sum_{i=1}^{k} n_i = m \,, which indicates that, for a given m, we only need to observe k-1 of the counts to be able to work out the remaining one.
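As a concrete illustration, here is a minimal sketch (assuming NumPy is available; the values of m and the p_i are made up) of the expected counts in Equation 7.1 alongside one simulated realisation of the counts:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 200                        # sample size (hypothetical)
p = np.array([0.5, 0.3, 0.2])  # category proportions p_i (hypothetical), k = 3

expected = m * p                # E[N_i] = m p_i, Equation 7.1
counts = rng.multinomial(m, p)  # one realisation of (N_1, ..., N_k)

print(expected)      # [100.  60.  40.]
print(counts.sum())  # always m = 200: each trial lands in exactly one category
```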

7.2 Goodness-of-fit for a single factor

We are interested in making inferences about the proportion parameters p_i. Specifically, we will consider the null hypothesis, H_0 : p_1 = p_{10}\,, p_2 = p_{20}\,, \cdots\,, p_k = p_{k0}\,, \tag{7.2} that completely specifies a value p_{i0} for each p_i. The alternative hypothesis H_a will state that H_0 is not true, i.e., that at least one p_i is different from the value p_{i0} claimed under the null H_0.

Notation

Here for i=1, \dots, k we use the notation p_{i0} to denote the value of p_i claimed under the null hypothesis.

Provided the null hypothesis in Equation 7.2 is true, the expected values Equation 7.1 can be written in terms of the expected frequencies, \E[N_i] = m p_{i0}\,, \quad i=1,\dots,k\,. Often the n_i, referred to as the observed cell counts, and the corresponding m p_{i0}, referred to as the expected cell counts, are tabulated, for example, as in Table 7.1.

Table 7.1: Observed and expected cell counts.
Category i=1 i=2 \cdots i=k Row total
Observed n_1 n_2 \cdots n_k m
Expected mp_{10} mp_{20} \cdots mp_{k0} m

The test procedure assesses the discrepancy between the observed and expected cell counts. This goodness of fit is measured by summing, over the cells, the squared deviation of the observed count from the expected count, divided by the expected count.

Why divide by expected cell counts?

Dividing by the expected cell count puts each squared deviation on a comparable scale: a deviation of a given size is much more surprising in a cell with a small expected count than in a cell with a large one, so without the division the cells with large expected counts would dominate the statistic.

Theorem 7.1 Provided m p_i \geq 5 for all i = 1, \dots, k, the rv V = \sum_{i=1}^k \frac{(N_i - m p_i)^2}{m p_i} \;\sim\; \chi^2(k-1) \quad \text{(approximately)}\,, that is, V has approximately a \chi^2 distribution with \nu = k-1 df.
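Theorem 7.1 can be checked empirically. The sketch below (again assuming NumPy; the choices of m, p, and the number of replications are arbitrary) simulates many realisations of V and compares their sample mean with k - 1, the mean of the \chi^2(k-1) distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

m = 500
p = np.array([0.4, 0.3, 0.2, 0.1])  # k = 4 hypothetical category proportions
k = len(p)

reps = 20_000
counts = rng.multinomial(m, p, size=reps)          # reps draws of (N_1, ..., N_k)
v = ((counts - m * p) ** 2 / (m * p)).sum(axis=1)  # V for each replication

# E[chi^2(k-1)] = k - 1 = 3, so the simulated mean should be close to 3.
print(round(v.mean(), 2))
```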

Proposition 7.1 Consider the null H_0 : p_1 = p_{10}, p_2 = p_{20}, \cdots, p_k = p_{k0}\,, and the alternative H_a : p_i \neq p_{i0}\; \text{for at least one}\; i\,. The test statistic is V = \sum_{i=1}^k \frac{(N_i - m p_{i0})^2}{m p_{i0}}\,. As a rule of thumb, provided m p_{i0} \geq 5 for all i = 1, \dots, k, the test is upper-tailed and the P-value is the area under the \chi^2(k-1) density to the right of the observed value v of V.
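A minimal sketch of the procedure in Proposition 7.1, assuming SciPy is available and using made-up counts for k = 4 categories under a uniform null p_{i0} = 1/4:

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([30, 21, 28, 21])  # hypothetical observed cell counts n_i
m = observed.sum()                     # m = 100
p0 = np.full(4, 0.25)                  # null proportions p_{i0}
expected = m * p0                      # expected cell counts m p_{i0}, all >= 5

v = ((observed - expected) ** 2 / expected).sum()  # observed test statistic
p_value = chi2.sf(v, df=len(observed) - 1)         # area under chi^2(k-1) right of v

print(v)        # 66/25 = 2.64 by hand
print(p_value)  # well above 0.05: no evidence against H_0 here
```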

If m p_{i0} < 5 for some i, then it may be possible to combine categories so that the new, coarser categorisation satisfies the assumptions of Proposition 7.1.

What about partial information?

Things are more complicated if the null hypothesis specifies the category probabilities only up to unknown parameters. In that case the parameters must first be estimated from the data, and the number of df of the approximating \chi^2 distribution is reduced by the number of parameters estimated.

7.3 Test for the independence of factors

In Section 7.2, we considered categorising a population into a single factor. We now consider a single population where each individual is categorised into two factors with I distinct categories for the first factor and J distinct categories for the second factor. Each individual from the population belongs to exactly one of the I categories of the first factor and exactly one of the J categories of the second factor. We want to determine whether or not there is any dependency between the two factors.

For a sample of m individuals, we denote by n_{ij} the count of the m samples that fall both in category i of the first factor and category j of the second factor, for i = 1, \dots, I and j = 1, \dots, J. A contingency table with I rows and J columns (i.e., IJ cells) will be used to record the counts, with n_{ij} in cell (i,j). Let p_{ij} be the proportion of individuals in the population who belong in category i of factor 1 and category j of factor 2. Then, the probability that a randomly selected individual falls in category i of factor 1 is found by summing over all j: p_{i} = \sum_{j=1}^J p_{ij}\,, and likewise, the probability that a randomly selected individual falls in category j of factor 2 is found by summing over all i: p_{j} = \sum_{i=1}^I p_{ij}\,. The null hypothesis that we will be interested in testing is H_0 : p_{ij} = p_{i} \cdot p_{j} \; \forall (i,j)\,, \tag{7.3} that is, an individual’s category in factor 1 is independent of the category in factor 2.

Following the same program as for the single category goodness-of-fit test, we note that assuming the null hypothesis Equation 7.3 is true, then the expected count in cell i,j is \E[N_{ij}] = m p_{ij} = m p_{i} p_{j}\,; and we estimate p_i and p_j by the appropriate sample proportion: \widehat{p}_i = \frac{n_i}{m}\,, \qquad n_i = \sum_{j} n_{ij} \quad \text{(row totals)}\,, and \widehat{p}_j = \frac{n_j}{m}\,, \qquad n_j = \sum_{i} n_{ij}\quad \text{(column totals)}\,. Thus, the expected cell count is given by \widehat{e}_{ij} = m \widehat{p}_i \widehat{p}_j = \frac{n_i n_j}{m}\,, and we assess the goodness of fit between the observed cell counts n_{ij} and the expected cell counts \widehat{e}_{ij}.
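The estimated expected counts \widehat{e}_{ij} = n_i n_j / m are just the outer product of the row and column totals divided by m. A sketch with a hypothetical 2 x 3 table, assuming NumPy:

```python
import numpy as np

n = np.array([[20, 30, 10],   # hypothetical observed counts n_ij
              [25, 10,  5]])
m = n.sum()                   # total sample size, 100 here

row = n.sum(axis=1)           # row totals n_i:    [60, 40]
col = n.sum(axis=0)           # column totals n_j: [45, 40, 15]

e_hat = np.outer(row, col) / m  # e_hat[i, j] = n_i * n_j / m
print(e_hat)  # rows and columns of e_hat keep the same totals as n
```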

Proposition 7.2 Consider the null hypothesis H_0 : p_{ij} = p_i p_j \; \text{for all } i=1, \dots, I\,, j=1, \dots, J\,, against the alternative hypothesis H_a : H_0 \;\text{is not true}\,. The test statistic is V = \sum_{i=1}^I \sum_{j=1}^J \frac{(N_{ij} - \widehat{e}_{ij})^2}{\widehat{e}_{ij}} \,. As a rule of thumb, provided \widehat{e}_{ij} \geq 5 for all i,j, then when H_0 is true the test statistic has approximately a \chi^2(\nu) distribution with \nu = (I-1)(J-1) df. For a hypothesis test at level \alpha, the procedure is upper-tailed, and the P-value is the area under \chi^2(\nu) to the right of the observed value v.
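The whole of Proposition 7.2 is packaged in SciPy as scipy.stats.chi2_contingency, which computes the expected counts, the statistic, the df, and the P-value directly from the raw table (a sketch, reusing the same hypothetical 2 x 3 counts as above):

```python
import numpy as np
from scipy.stats import chi2_contingency

n = np.array([[20, 30, 10],   # hypothetical observed counts n_ij
              [25, 10,  5]])

stat, p_value, dof, expected = chi2_contingency(n)

print(dof)       # (I-1)(J-1) = 1 * 2 = 2
print(expected)  # the e_hat_ij = n_i n_j / m, all >= 5 here
print(p_value)   # below 0.05: evidence that the two factors are dependent
```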

Alternative lingo

In this context, contingency is just another word for dependency: a contingency table records counts classified by, and possibly contingent (dependent) on, the levels of the two factors.