1  Exploratory data analysis

Exploratory data analysis (EDA) is the process of understanding a data set before performing formal inference or fitting a statistical model. The goal is to identify the important features of the data (and any problems) early.

TipIn practice, EDA answers questions like:
  • What are the variables and units?
  • How much missing data is there?
  • What do the marginal distributions look like?
  • Are there outliers or data errors?
  • Which variables appear associated?

1.1 A first look in R

The Cyclamen Data contain 150 obs. of six features (sequence [], petal length [mm], petal width [mm], aperture diameter [mm], color [pink, white], and group [alpha, beta, gamma, delta, epsilon]) for cyclamen hederifolium. These measurements were taken by small groups of former MA22004 students using the same measurement protocol at the University Botanic Gardens in the autumn of 2024.

Code
cyclamen |> glimpse()
Rows: 150
Columns: 6
$ sequence     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
$ petal.width  <dbl> 9, 7, 11, 13, 11, 10, 12, 14, 9, 9, 9, 9, 11, 10, 15, 14, 11, 11, 9…
$ petal.length <dbl> 16, 13, 23, 19, 16, 18, 20, 20, 15, 25, 17, 19, 21, 23, 20, 23, 19,…
$ aperture     <dbl> 5, 4, 6, 6, 7, 6, 7, 8, 7, 6, 6, 6, 7, 8, 9, 7, 9, 8, 6, 7, 6, 7, 7…
$ colour       <fct> white, pink, pink, white, pink, white, pink, white, white, pink, wh…
$ group        <fct> epsilon, epsilon, epsilon, epsilon, epsilon, epsilon, epsilon, epsi…

A useful first step is to compute basic summaries and check that the categorical variables have the expected levels.

Code
cyclamen |> summary()
    sequence     petal.width     petal.length     aperture        colour       group   
 Min.   : 1.0   Min.   : 4.00   Min.   : 9.0   Min.   : 4.000   pink :83   alpha  :30  
 1st Qu.: 8.0   1st Qu.: 9.00   1st Qu.:18.0   1st Qu.: 6.000   white:67   beta   :30  
 Median :15.5   Median :10.00   Median :21.0   Median : 7.000              delta  :30  
 Mean   :15.5   Mean   :10.29   Mean   :20.5   Mean   : 7.173              epsilon:30  
 3rd Qu.:23.0   3rd Qu.:12.00   3rd Qu.:23.0   3rd Qu.: 8.000              gamma  :30  
 Max.   :30.0   Max.   :19.00   Max.   :31.0   Max.   :11.000                          

The summary above includes the sample mean of the continuous variables (e.g., petal.width) and the counts in each category for factors (e.g., group).

WarningMissingness

In R, most functions will silently return NA if you have missing values. When computing summaries, decide in advance how you want to handle missingness (for example, removing incomplete rows for a particular calculation, or imputing missing values).

A simple missingness check is:

Code
cyclamen |> summarise(across(everything(), ~ sum(is.na(.))))
# A tibble: 1 × 6
  sequence petal.width petal.length aperture colour group
     <int>       <int>        <int>    <int>  <int> <int>
1        0           0            0        0      0     0

In EDA we typically combine numerical summaries with visual summaries. The next few sections outline basic visualisations.

1.2 Histograms and boxplots

Below are some standard plots for petal measurements. Histograms help you see shape (symmetry, skewness, multimodality), and boxplots highlight spread and potential outliers.

Code
cyclamen |>
    ggplot(aes(x = petal.length)) +
    geom_histogram(bins = 15)

Code
cyclamen |>
    ggplot(aes(x = petal.width)) +
    geom_histogram(bins = 15)

Code
cyclamen |>
    ggplot(aes(x = aperture)) +
    geom_histogram(bins = 8)

Code
cyclamen |>
    pivot_longer(cols = c(petal.length, petal.width, aperture),
        names_to = "variable",
        values_to = "value") |>
            ggplot(aes(x = variable, y = value)) +
            geom_boxplot()

NoteOutliers are not necessarily errors

A point that looks like an outlier is not automatically a data error. It may be a legitimate observation (for example, an unusually large flower), but it is always worth checking whether the measurement could have been recorded incorrectly (for example, were the correct units used).

1.3 Comparing groups

A common EDA question is whether distributions differ across a group variable. Here, color is a natural grouping to start with.

Code
cyclamen |>
    ggplot(aes(x = colour, y = petal.length)) +
    geom_boxplot()

1.4 Relationships between variables

For two numerical variables, a scatter plot is usually the first plot to make.

Code
cyclamen |>
    ggplot(aes(x = petal.length, y = petal.width)) +
    geom_point()

Sometimes it is helpful to add a grouping variable (here: colour) to see whether patterns differ between groups.

Code
cyclamen |>
    ggplot(aes(x = petal.length, y = petal.width, shape = colour)) +
    geom_point()

1.5 Correlation as a summary of association

Correlation coefficients give a one-number summary of association between two numerical variables. Correlation does not imply causation!

In this module, we will use three common correlation coefficients:

  • Pearson correlation measures linear association (it is the default and is most directly connected to linear regression).
  • Spearman correlation measures monotone association by applying Pearson correlation to the ranks of the data.
  • Kendall correlation (Kendall’s \tau) measures association using concordant and discordant pairs, and also measures monotone association.

In EDA, it is common to compute Pearson, Spearman, and Kendall correlations together. If Pearson is small but Spearman/Kendall are large (in magnitude), that often suggests a relationship that is monotone but not well-approximated by a straight line.

TipA good EDA habit
  1. Make a scatter plot first.
  2. Compute Pearson correlation.
  3. If the relationship looks monotone but not linear, or if outliers look influential, also compute Spearman and Kendall as a sensitivity check.
Figure 1.1: Pairwise scatterplot matrix for cyclamen petal width, petal length, and aperture diameter (using GGally:ggpairs). Points are shaped by flower colour. The diagonal panels show each variable’s distribution (one density per colour); the upper panels summarise pairwise association using the Pearson correlation.

The built-in function cor can be used to compute different correlation coefficients by specifying the method. For the cyclamen data, we will compute correlations between the three numerical flower measurements (here we exclude sequence, which is an index rather than a flower measurement).

Code
cyclamen_num <- cyclamen |>
    select(petal.width, petal.length, aperture)

1.5.1 Pearson correlation matrix

Code
cor(cyclamen_num, use = "complete.obs", method = "pearson")
             petal.width petal.length  aperture
petal.width    1.0000000    0.4474288 0.4277849
petal.length   0.4474288    1.0000000 0.4238784
aperture       0.4277849    0.4238784 1.0000000

1.5.2 Spearman correlation matrix

Code
cor(cyclamen_num, use = "complete.obs", method = "spearman")
             petal.width petal.length  aperture
petal.width    1.0000000    0.4852075 0.4734670
petal.length   0.4852075    1.0000000 0.4154867
aperture       0.4734670    0.4154867 1.0000000

1.5.3 Kendall correlation matrix

Code
cor(cyclamen_num, use = "complete.obs", method = "kendall")
             petal.width petal.length  aperture
petal.width    1.0000000    0.3689397 0.3751712
petal.length   0.3689397    1.0000000 0.3353495
aperture       0.3751712    0.3353495 1.0000000

If Pearson is close to 0 but Spearman/Kendall are not, that often suggests a relationship that is monotone but not well-approximated by a straight line. If Pearson differs substantially from Spearman/Kendall, it is also worth checking for outliers.

1.6 Principal Component Analysis

Principal Component Analysis (PCA) is an exploratory technique for understanding variation in multivariate numerical data. It constructs new variables (principal components) which are linear combinations of the original variables and which capture decreasing amounts of variation.

For cyclamen, we will apply PCA to the three numerical flower measurements: petal width, petal length, and aperture diameter.

WarningRescaling

PCA is sensitive to the scale of the variables. When variables are measured on different scales (or have very different spreads), it is standard practice to standardise them before doing PCA. In prcomp, this is done with scale. = TRUE.

Code
cyclamen_pca <- cyclamen |>
    select(petal.width, petal.length, aperture, colour)

pca_fit <- cyclamen_pca |> 
    select(!colour) |>
    prcomp(scale. = TRUE)

summary(pca_fit)
Importance of components:
                         PC1    PC2    PC3
Standard deviation     1.366 0.7625 0.7432
Proportion of Variance 0.622 0.1938 0.1841
Cumulative Proportion  0.622 0.8159 1.0000

1.6.1 Understanding the PCA summary

For three variables, PCA produces three principal components (PC1, PC2, PC3). The summary() output reports:

  • Standard deviation of each principal component: this is the spread of the data along that component.
  • Proportion of Variance explained by each component: the fraction of the total variability captured by that component.
  • Cumulative Proportion or the fraction captured by the first k (reading left to right across the table).

In particular, the variance explained by PCk is pca_fit$sdev[k]^2, and the proportion of variance explained is obtained by dividing by the total: \mathrm{PVE}_k = \frac{\mathrm{sdev}_k^2}{\sum_{j=1}^3 \mathrm{sdev}_j^2} \,. Because we used scale. = TRUE, each original variable is standardised to have variance 1, so the total variance is approximately 3 (up to rounding). This is why \sum_{k=1}^3 \mathrm{sdev}_k^2 \approx 3.

For the cylcamen data, the summary output shows:

  • PC1 explains about 62.2\% of the variation in the three measurements.
  • PC2 explains about 19.4\%, and PC3 about 18.4\%.
  • The first two components together explain about 81.6\% of the variation.
Code
pca_var <- pca_fit$sdev^2
pca_pve <- pca_var / sum(pca_var)

pve_tbl <-
    tibble(
        PC = factor(paste0("PC", seq_along(pca_pve)),
        levels = paste0("PC", seq_along(pca_pve))),
        proportion = pca_pve,
        cumulative = cumsum(pca_pve)
        )

pve_tbl
# A tibble: 3 × 3
  PC    proportion cumulative
  <fct>      <dbl>      <dbl>
1 PC1        0.622      0.622
2 PC2        0.194      0.816
3 PC3        0.184      1    
Figure 1.2: Scree plot: proportion of variance explained by each principal component in the cyclamen PCA.

1.6.2 PCA scores plot

A useful EDA plot is the projection of the observations onto the first two principal components. Here we colour by colour to check whether colour is associated with the main modes of variation in the measurements.

Code
scores <-
    as_tibble(pca_fit$x) |>
    bind_cols(cyclamen_pca |> select(colour))

scores |>
    ggplot(aes(x = PC1, y = PC2, shape = colour)) +
    geom_point()

The scatter plot above displays cyclamen PCA scores (PC1 vs PC2). Each point is a flower; points are shaped by flower colour.

1.6.3 Interpreting the principal components

To interpret the components, we examine the loadings (the coefficients defining each principal component as a linear combination of the original variables). In prcomp, these are stored in pca_fit$rotation.

Code
as_tibble(pca_fit$rotation, rownames = "variable")
# A tibble: 3 × 4
  variable       PC1    PC2     PC3
  <chr>        <dbl>  <dbl>   <dbl>
1 petal.width  0.581 -0.347 -0.736 
2 petal.length 0.580 -0.458  0.674 
3 aperture     0.571  0.818  0.0649

A loading with larger magnitude means that the corresponding variable contributes more strongly to that principal component.

WarningPCA is not a hypothesis test

PCA is an exploratory tool to help you describe patterns and identify dominant directions of variation. A common outcome is dimension reduction: here, PC1 and PC2 capture about 81.6\% of the variation, so a two-dimensional view retains most of the information in the three measurements.