1 Exploratory data analysis

Exploratory data analysis (EDA) is the process of understanding a data set before performing formal inference or fitting a statistical model. The goal is to identify the important features of the data (and any problems) early.

In practice, EDA answers questions like:

What are the variables and units?
How much missing data is there?
What do the marginal distributions look like?
Are there outliers or data errors?
Which variables appear associated?

1.1 A first look in R

Cyclamen Data

The Cyclamen Data contain 150 obs. of six features (sequence [], petal length [mm], petal width [mm], aperture diameter [mm], color [pink, white], and group [alpha, beta, gamma, delta, epsilon]) for cyclamen hederifolium. These measurements were taken by small groups of former MA22004 students using the same measurement protocol at the University Botanic Gardens in the autumn of 2024.

Code

cyclamen |> glimpse()

Rows: 150
Columns: 6
$ sequence     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
$ petal.width  <dbl> 9, 7, 11, 13, 11, 10, 12, 14, 9, 9, 9, 9, 11, 10, 15, 14, 11, 11, 9…
$ petal.length <dbl> 16, 13, 23, 19, 16, 18, 20, 20, 15, 25, 17, 19, 21, 23, 20, 23, 19,…
$ aperture     <dbl> 5, 4, 6, 6, 7, 6, 7, 8, 7, 6, 6, 6, 7, 8, 9, 7, 9, 8, 6, 7, 6, 7, 7…
$ colour       <fct> white, pink, pink, white, pink, white, pink, white, white, pink, wh…
$ group        <fct> epsilon, epsilon, epsilon, epsilon, epsilon, epsilon, epsilon, epsi…

A useful first step is to compute basic summaries and check that the categorical variables have the expected levels.

Code

cyclamen |> summary()

    sequence     petal.width     petal.length     aperture        colour       group   
 Min.   : 1.0   Min.   : 4.00   Min.   : 9.0   Min.   : 4.000   pink :83   alpha  :30  
 1st Qu.: 8.0   1st Qu.: 9.00   1st Qu.:18.0   1st Qu.: 6.000   white:67   beta   :30  
 Median :15.5   Median :10.00   Median :21.0   Median : 7.000              delta  :30  
 Mean   :15.5   Mean   :10.29   Mean   :20.5   Mean   : 7.173              epsilon:30  
 3rd Qu.:23.0   3rd Qu.:12.00   3rd Qu.:23.0   3rd Qu.: 8.000              gamma  :30  
 Max.   :30.0   Max.   :19.00   Max.   :31.0   Max.   :11.000

The summary above includes the sample mean of the continuous variables (e.g., petal.width) and the counts in each category for factors (e.g., group).

Missingness

In R, most functions will silently return NA if you have missing values. When computing summaries, decide in advance how you want to handle missingness (for example, removing incomplete rows for a particular calculation, or imputing missing values).

A simple missingness check is:

Code

cyclamen |> summarise(across(everything(), ~ sum(is.na(.))))

# A tibble: 1 × 6
  sequence petal.width petal.length aperture colour group
     <int>       <int>        <int>    <int>  <int> <int>
1        0           0            0        0      0     0

In EDA we typically combine numerical summaries with visual summaries. The next few sections outline basic visualisations.

1.2 Histograms and boxplots

Below are some standard plots for petal measurements. Histograms help you see shape (symmetry, skewness, multimodality), and boxplots highlight spread and potential outliers.

Code

cyclamen |>
    ggplot(aes(x = petal.length)) +
    geom_histogram(bins = 15)

Code

cyclamen |>
    ggplot(aes(x = petal.width)) +
    geom_histogram(bins = 15)

Code

cyclamen |>
    ggplot(aes(x = aperture)) +
    geom_histogram(bins = 8)

Code

cyclamen |>
    pivot_longer(cols = c(petal.length, petal.width, aperture),
        names_to = "variable",
        values_to = "value") |>
            ggplot(aes(x = variable, y = value)) +
            geom_boxplot()

Outliers are not necessarily errors

A point that looks like an outlier is not automatically a data error. It may be a legitimate observation (for example, an unusually large flower), but it is always worth checking whether the measurement could have been recorded incorrectly (for example, were the correct units used).

1.3 Comparing groups

A common EDA question is whether distributions differ across a group variable. Here, color is a natural grouping to start with.

Code

cyclamen |>
    ggplot(aes(x = colour, y = petal.length)) +
    geom_boxplot()

1.4 Relationships between variables

For two numerical variables, a scatter plot is usually the first plot to make.

Code

cyclamen |>
    ggplot(aes(x = petal.length, y = petal.width)) +
    geom_point()

Sometimes it is helpful to add a grouping variable (here: colour) to see whether patterns differ between groups.

Code

cyclamen |>
    ggplot(aes(x = petal.length, y = petal.width, shape = colour)) +
    geom_point()

1.5 Correlation as a summary of association

Correlation coefficients give a one-number summary of association between two numerical variables. Correlation does not imply causation!

In this module, we will use three common correlation coefficients:

Pearson correlation measures linear association (it is the default and is most directly connected to linear regression).
Spearman correlation measures monotone association by applying Pearson correlation to the ranks of the data.
Kendall correlation (Kendall’s \tau) measures association using concordant and discordant pairs, and also measures monotone association.

In EDA, it is common to compute Pearson, Spearman, and Kendall correlations together. If Pearson is small but Spearman/Kendall are large (in magnitude), that often suggests a relationship that is monotone but not well-approximated by a straight line.

A good EDA habit

Make a scatter plot first.
Compute Pearson correlation.
If the relationship looks monotone but not linear, or if outliers look influential, also compute Spearman and Kendall as a sensitivity check.

Figure 1.1: Pairwise scatterplot matrix for cyclamen petal width, petal length, and aperture diameter (using `GGally:ggpairs`). Points are shaped by flower colour. The diagonal panels show each variable’s distribution (one density per colour); the upper panels summarise pairwise association using the *Pearson* correlation.

The built-in function cor can be used to compute different correlation coefficients by specifying the method. For the cyclamen data, we will compute correlations between the three numerical flower measurements (here we exclude sequence, which is an index rather than a flower measurement).

Code

cyclamen_num <- cyclamen |>
    select(petal.width, petal.length, aperture)

1.5.1 Pearson correlation matrix

Code

cor(cyclamen_num, use = "complete.obs", method = "pearson")

             petal.width petal.length  aperture
petal.width    1.0000000    0.4474288 0.4277849
petal.length   0.4474288    1.0000000 0.4238784
aperture       0.4277849    0.4238784 1.0000000

1.5.2 Spearman correlation matrix

Code

cor(cyclamen_num, use = "complete.obs", method = "spearman")

             petal.width petal.length  aperture
petal.width    1.0000000    0.4852075 0.4734670
petal.length   0.4852075    1.0000000 0.4154867
aperture       0.4734670    0.4154867 1.0000000

1.5.3 Kendall correlation matrix

Code

cor(cyclamen_num, use = "complete.obs", method = "kendall")

             petal.width petal.length  aperture
petal.width    1.0000000    0.3689397 0.3751712
petal.length   0.3689397    1.0000000 0.3353495
aperture       0.3751712    0.3353495 1.0000000

If Pearson is close to 0 but Spearman/Kendall are not, that often suggests a relationship that is monotone but not well-approximated by a straight line. If Pearson differs substantially from Spearman/Kendall, it is also worth checking for outliers.

1.6 Principal Component Analysis

Principal Component Analysis (PCA) is an exploratory technique for understanding variation in multivariate numerical data. It constructs new variables (principal components) which are linear combinations of the original variables and which capture decreasing amounts of variation.

For cyclamen, we will apply PCA to the three numerical flower measurements: petal width, petal length, and aperture diameter.

Rescaling

PCA is sensitive to the scale of the variables. When variables are measured on different scales (or have very different spreads), it is standard practice to standardise them before doing PCA. In prcomp, this is done with scale. = TRUE.

Code

cyclamen_pca <- cyclamen |>
    select(petal.width, petal.length, aperture, colour)

pca_fit <- cyclamen_pca |> 
    select(!colour) |>
    prcomp(scale. = TRUE)

summary(pca_fit)

Importance of components:
                         PC1    PC2    PC3
Standard deviation     1.366 0.7625 0.7432
Proportion of Variance 0.622 0.1938 0.1841
Cumulative Proportion  0.622 0.8159 1.0000

1.6.1 Understanding the PCA summary

For three variables, PCA produces three principal components (PC1, PC2, PC3). The summary() output reports:

Standard deviation of each principal component: this is the spread of the data along that component.
Proportion of Variance explained by each component: the fraction of the total variability captured by that component.
Cumulative Proportion or the fraction captured by the first k (reading left to right across the table).

In particular, the variance explained by PCk is pca_fit$sdev[k]^2, and the proportion of variance explained is obtained by dividing by the total: \mathrm{PVE}_k = \frac{\mathrm{sdev}_k^2}{\sum_{j=1}^3 \mathrm{sdev}_j^2} \,. Because we used scale. = TRUE, each original variable is standardised to have variance 1, so the total variance is approximately 3 (up to rounding). This is why \sum_{k=1}^3 \mathrm{sdev}_k^2 \approx 3.

For the cylcamen data, the summary output shows:

PC1 explains about 62.2\% of the variation in the three measurements.
PC2 explains about 19.4\%, and PC3 about 18.4\%.
The first two components together explain about 81.6\% of the variation.

Code

pca_var <- pca_fit$sdev^2
pca_pve <- pca_var / sum(pca_var)

pve_tbl <-
    tibble(
        PC = factor(paste0("PC", seq_along(pca_pve)),
        levels = paste0("PC", seq_along(pca_pve))),
        proportion = pca_pve,
        cumulative = cumsum(pca_pve)
        )

pve_tbl

# A tibble: 3 × 3
  PC    proportion cumulative
  <fct>      <dbl>      <dbl>
1 PC1        0.622      0.622
2 PC2        0.194      0.816
3 PC3        0.184      1

Figure 1.2: **Scree plot**: proportion of variance explained by each principal component in the cyclamen PCA.

1.6.2 PCA scores plot

A useful EDA plot is the projection of the observations onto the first two principal components. Here we colour by colour to check whether colour is associated with the main modes of variation in the measurements.

Code

scores <-
    as_tibble(pca_fit$x) |>
    bind_cols(cyclamen_pca |> select(colour))

scores |>
    ggplot(aes(x = PC1, y = PC2, shape = colour)) +
    geom_point()

The scatter plot above displays cyclamen PCA scores (PC1 vs PC2). Each point is a flower; points are shaped by flower colour.

1.6.3 Interpreting the principal components

To interpret the components, we examine the loadings (the coefficients defining each principal component as a linear combination of the original variables). In prcomp, these are stored in pca_fit$rotation.

Code

as_tibble(pca_fit$rotation, rownames = "variable")

# A tibble: 3 × 4
  variable       PC1    PC2     PC3
  <chr>        <dbl>  <dbl>   <dbl>
1 petal.width  0.581 -0.347 -0.736 
2 petal.length 0.580 -0.458  0.674 
3 aperture     0.571  0.818  0.0649

A loading with larger magnitude means that the corresponding variable contributes more strongly to that principal component.

PCA is not a hypothesis test

PCA is an exploratory tool to help you describe patterns and identify dominant directions of variation. A common outcome is dimension reduction: here, PC1 and PC2 capture about 81.6\% of the variation, so a two-dimensional view retains most of the information in the three measurements.

{{< include preamble.qmd >}} ```{r cyclamen-import} #| include: false cyclamen <- read_csv("data/cyclamen24.csv") cyclamen$colour <- factor(cyclamen$colour) cyclamen$group <- factor(cyclamen$group) ``` # Exploratory data analysis {#sec-eda} Exploratory data analysis (EDA) is the process of understanding a data set before performing formal inference or fitting a statistical model. The goal is to identify the important features of the data (and any problems) early. :::{.callout-tip} ## In practice, EDA answers questions like: - What are the variables and units? - How much missing data is there? - What do the marginal distributions look like? - Are there outliers or data errors? - Which variables appear associated? ::: ## A first look in R :::{.callout-note collapse="true"} ## Cyclamen Data The **Cyclamen Data** contain 150 obs. of six features (sequence [], petal length [mm], petal width [mm], aperture diameter [mm], color [pink, white], and group [alpha, beta, gamma, delta, epsilon]) for *cyclamen hederifolium*. These measurements were taken by small groups of former MA22004 students using the same measurement protocol at the University Botanic Gardens in the autumn of 2024. ```{r cyclamen-glimpse} #| echo: true #| warning: false #| error: false cyclamen |> glimpse() ``` ::: A useful first step is to compute basic summaries and check that the categorical variables have the expected levels. ```{r cyclamen-summary} #| echo: true #| warning: false #| message: false cyclamen |> summary() ``` The summary above includes the sample mean of the continuous variables (e.g., `petal.width`) and the counts in each category for factors (e.g., `group`). :::{.callout-warning} ## Missingness In R, most functions will silently return `NA` if you have missing values. When computing summaries, decide in advance how you want to handle missingness (for example, removing incomplete rows for a particular calculation, or imputing missing values). ::: A simple missingness check is: ```{r cyclamen-missingness} #| echo: true #| warning: false #| message: false cyclamen |> summarise(across(everything(), ~ sum(is.na(.)))) ``` In EDA we typically combine numerical summaries with visual summaries. The next few sections outline basic visualisations. ## Histograms and boxplots Below are some standard plots for petal measurements. Histograms help you see shape (symmetry, skewness, multimodality), and boxplots highlight spread and potential outliers. ```{r cyclamen-hist-l} #| echo: true #| warning: false #| message: false cyclamen |> ggplot(aes(x = petal.length)) + geom_histogram(bins = 15) ``` ```{r cyclamen-hist-w} #| echo: true #| warning: false #| message: false cyclamen |> ggplot(aes(x = petal.width)) + geom_histogram(bins = 15) ``` ```{r cyclamen-hist-a} #| echo: true #| warning: false #| message: false cyclamen |> ggplot(aes(x = aperture)) + geom_histogram(bins = 8) ``` ```{r cyclamen-boxplot} #| echo: true #| warning: false #| message: false cyclamen |> pivot_longer(cols = c(petal.length, petal.width, aperture), names_to = "variable", values_to = "value") |> ggplot(aes(x = variable, y = value)) + geom_boxplot() ``` :::{.callout-note} ## Outliers are not necessarily errors A point that looks like an outlier is not automatically a data *error*. It may be a legitimate observation (for example, an unusually large flower), but it is always worth checking whether the measurement could have been recorded incorrectly (for example, were the correct units used). ::: ## Comparing groups A common EDA question is whether distributions differ across a group variable. Here, color is a natural grouping to start with. ```{r cylcamen-boxplot-colour} #| echo: true #| warning: false #| message: false cyclamen |> ggplot(aes(x = colour, y = petal.length)) + geom_boxplot() ``` ## Relationships between variables For two numerical variables, a scatter plot is usually the first plot to make. ```{r cyclamen-scatter} #| echo: true #| warning: false #| message: false cyclamen |> ggplot(aes(x = petal.length, y = petal.width)) + geom_point() ``` Sometimes it is helpful to add a grouping variable (here: `colour`) to see whether patterns differ between groups. ```{r cyclamen-scatter-colour} #| echo: true #| warning: false #| message: false cyclamen |> ggplot(aes(x = petal.length, y = petal.width, shape = colour)) + geom_point() ``` ## Correlation as a summary of association Correlation coefficients give a one-number summary of association between two numerical variables. Correlation does not imply causation! In this module, we will use three common correlation coefficients: - **Pearson** correlation measures linear association (it is the default and is most directly connected to linear regression). - **Spearman** correlation measures monotone association by applying Pearson correlation to the ranks of the data. - **Kendall** correlation (Kendall's $\tau$) measures association using concordant and discordant pairs, and also measures monotone association. In EDA, it is common to compute Pearson, Spearman, and Kendall correlations together. If Pearson is small but Spearman/Kendall are large (in magnitude), that often suggests a relationship that is monotone but not well-approximated by a straight line. :::{.callout-tip} ## A good EDA habit 1. Make a scatter plot first. 2. Compute Pearson correlation. 3. If the relationship looks monotone but not linear, or if outliers look influential, also compute Spearman and Kendall as a sensitivity check. ::: ```{r fig-cyclamen-pairwise-scatter} #| echo: false #| warning: false #| message: false #| fig-cap: "Pairwise scatterplot matrix for cyclamen petal width, petal length, and aperture diameter (using `GGally:ggpairs`). Points are shaped by flower colour. The diagonal panels show each variable's distribution (one density per colour); the upper panels summarise pairwise association using the *Pearson* correlation." library(GGally) cyclamen |> select(petal.width, petal.length, aperture, colour) |> ggpairs( columns = c("petal.width", "petal.length", "aperture"), mapping = aes(shape = colour) ) ``` The built-in function `cor` can be used to compute different correlation coefficients by specifying the method. For the cyclamen data, we will compute correlations between the three numerical flower measurements (here we exclude `sequence`, which is an index rather than a flower measurement). ```{r cyclamen-num} #| echo: true #| warning: false #| message: false cyclamen_num <- cyclamen |> select(petal.width, petal.length, aperture) ``` ### Pearson correlation matrix ```{r cyclamen-pearson} #| echo: true #| warning: false #| message: false cor(cyclamen_num, use = "complete.obs", method = "pearson") ``` ### Spearman correlation matrix ```{r cyclamen-spearman} #| echo: true #| warning: false #| message: false cor(cyclamen_num, use = "complete.obs", method = "spearman") ``` ### Kendall correlation matrix ```{r cyclamen-kendall} #| echo: true #| warning: false #| message: false cor(cyclamen_num, use = "complete.obs", method = "kendall") ``` If Pearson is close to $0$ but Spearman/Kendall are not, that often suggests a relationship that is monotone but not well-approximated by a straight line. If Pearson differs substantially from Spearman/Kendall, it is also worth checking for outliers. ## Principal Component Analysis Principal Component Analysis (PCA) is an exploratory technique for understanding variation in multivariate numerical data. It constructs new variables (principal components) which are linear combinations of the original variables and which capture decreasing amounts of variation. For cyclamen, we will apply PCA to the three numerical flower measurements: petal width, petal length, and aperture diameter. :::{.callout-warning} ## Rescaling PCA is sensitive to the scale of the variables. When variables are measured on different scales (or have very different spreads), it is standard practice to *standardise* them before doing PCA. In `prcomp`, this is done with `scale. = TRUE`. ::: ```{r cyclamen-pca} #| echo: true #| warning: false #| message: false cyclamen_pca <- cyclamen |> select(petal.width, petal.length, aperture, colour) pca_fit <- cyclamen_pca |> select(!colour) |> prcomp(scale. = TRUE) summary(pca_fit) ``` ### Understanding the PCA summary For three variables, PCA produces three principal components (PC1, PC2, PC3). The `summary()` output reports: - **Standard deviation** of each principal component: this is the spread of the data along that component. - **Proportion of Variance** explained by each component: the fraction of the total variability captured by that component. - **Cumulative Proportion** or the fraction captured by the first $k$ (reading left to right across the table). In particular, the variance explained by PC$k$ is `pca_fit$sdev[k]^2`, and the proportion of variance explained is obtained by dividing by the total: $$\mathrm{PVE}_k = \frac{\mathrm{sdev}_k^2}{\sum_{j=1}^3 \mathrm{sdev}_j^2} \,.$$ Because we used `scale. = TRUE`, each original variable is standardised to have variance $1$, so the total variance is approximately $3$ (up to rounding). This is why $\sum_{k=1}^3 \mathrm{sdev}_k^2 \approx 3$. For the cylcamen data, the summary output shows: - PC1 explains about $62.2\%$ of the variation in the three measurements. - PC2 explains about $19.4\%$, and PC3 about $18.4\%$. - The first two components together explain about $81.6\%$ of the variation. ```{r pca-tbl} #| echo: true #| warning: false #| message: false pca_var <- pca_fit$sdev^2 pca_pve <- pca_var / sum(pca_var) pve_tbl <- tibble( PC = factor(paste0("PC", seq_along(pca_pve)), levels = paste0("PC", seq_along(pca_pve))), proportion = pca_pve, cumulative = cumsum(pca_pve) ) pve_tbl ``` ```{r fig-cyclamen-scree} #| echo: false #| warning: false #| message: false #| fig-cap: "**Scree plot**: proportion of variance explained by each principal component in the cyclamen PCA." pve_tbl |> ggplot(aes(x = PC, y = proportion)) + geom_col() ``` ### PCA scores plot A useful EDA plot is the projection of the observations onto the first two principal components. Here we colour by `colour` to check whether colour is associated with the main modes of variation in the measurements. ```{r pca-scores} #| echo: true #| warning: false #| message: false scores <- as_tibble(pca_fit$x) |> bind_cols(cyclamen_pca |> select(colour)) scores |> ggplot(aes(x = PC1, y = PC2, shape = colour)) + geom_point() ``` The scatter plot above displays cyclamen PCA scores (PC1 vs PC2). Each point is a flower; points are shaped by flower colour. ### Interpreting the principal components To interpret the components, we examine the **loadings** (the coefficients defining each principal component as a linear combination of the original variables). In `prcomp`, these are stored in `pca_fit$rotation`. ```{r pca-loadings} #| echo: true #| warning: false #| message: false as_tibble(pca_fit$rotation, rownames = "variable") ``` A loading with larger magnitude means that the corresponding variable contributes more strongly to that principal component. :::{.callout-warning} ## PCA is not a hypothesis test PCA is an exploratory tool to help you describe patterns and identify dominant directions of variation. A common outcome is dimension reduction: here, PC1 and PC2 capture about $81.6\%$ of the variation, so a two-dimensional view retains most of the information in the three measurements. :::