6  Linear regression

Regression analysis allows us to study the relationship among two or more rvs. Typically, we are interested in the relationship between a response or dependent rv Y and a covariate X. The relationship between X and Y will be explained through a regression function, r(x) = \mathbf{E}[Y \mid X = x] = \int y f(y\mid x) dy \,. In particular, we shall assume that r is linear, r(x) = \beta_0 + \beta_1 x\,, \tag{6.1} and estimate the intercept \beta_0 and slope \beta_1 of this linear model from sample data (Y_1, X_1), \dots, (Y_m, X_m) \sim F_{Y,X}\,.

Alternative lingo

The covariates X are also called predictor variables, explanatory variables, independent variables, and/or features depending on who you are talking to.

6.1 Simple linear regression models

The simplest regression is when X_i is one-dimensional and r(x) is linear, as in Equation 6.1. A linear regression posits that the expected value of Y_i is a linear function of the data X_i, while, for a fixed x_i, Y_i deviates from its expected value by a random amount; a short simulation of such a model is sketched after Definition 6.1.

Definition 6.1 The simple linear regression model relates a random response Y_i to an independent variable X_i, Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \,, \tag{6.2} where the intercept \beta_0 and slope \beta_1 are unknown parameters and the random deviation or random error \epsilon_i is a rv assumed to satisfy:

  1. \mathbf{E}[\epsilon_i \mid X_i = x_i] = 0,
  2. \mathop{\mathrm{Var}}[\epsilon_i \mid X_i = x_i] = \sigma^2 does not depend on x_i,
  3. \epsilon_i and \epsilon_j are independent for i \neq j, with i, j = 1, \dots, m.
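
To make Definition 6.1 concrete, the following sketch simulates a sample from Equation 6.2 with iid normal errors, which satisfy conditions 1–3; the parameter values \beta_0 = 2, \beta_1 = 0.5, and \sigma = 1 are illustrative choices, not taken from any data set.

Code
set.seed(1)                            # for reproducibility
m     <- 50                            # sample size
beta0 <- 2                             # illustrative intercept
beta1 <- 0.5                           # illustrative slope
sigma <- 1                             # illustrative error standard deviation
x   <- runif(m, 0, 10)                 # covariate values
eps <- rnorm(m, mean = 0, sd = sigma)  # iid errors: conditions 1-3 hold
y   <- beta0 + beta1 * x + eps         # responses from Equation 6.2
plot(x, y)                             # scatter plot of the simulated sample
abline(beta0, beta1)                   # the true regression line r(x)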

From the assumptions on \epsilon_i, the linear model Equation 6.2 implies \mathbf{E}[Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i \,. Thus, if \widehat{\beta}_0 and \widehat{\beta}_1 are estimators of \beta_0 and \beta_1, then the fitted line is \widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x and the predicted or fitted value \widehat{Y}_i = \widehat{r}(X_i) is an estimator for \mathbf{E}[Y_i \mid X_i = x_i]. The residuals are defined to be \widehat{\epsilon}_i = Y_i - \widehat{Y}_i = Y_i - \left( \widehat{\beta}_0 + \widehat{\beta}_1 X_i \right) \,. \tag{6.3} The residual sum of squares, \mathsf{RSS} = \sum_{i=1}^m \widehat{\epsilon}_i^2\,, \tag{6.4} measures how well the regression line \widehat{r} fits the data (Y_1, X_1), \dots, (Y_m, X_m). The least squares estimates \widehat{\beta}_0 and \widehat{\beta}_1 are the values of \beta_0 and \beta_1 that minimize the \mathsf{RSS} in Equation 6.4.

Theorem 6.1 The least squares estimates for \widehat{\beta}_1 and \widehat{\beta}_0 are given by, respectively, \widehat{\beta}_1 = \frac{\sum_{i=1}^m (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^m (X_i - \overline{X})^2} = \frac{S_{xy}}{S_{xx}} \,, \tag{6.5} and \widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X}\,. \tag{6.6}

By the definition of the residuals in Equation 6.3, the \mathsf{RSS} in Equation 6.4 is a function of \widehat{\beta}_0 and \widehat{\beta}_1. Equation 6.5 and Equation 6.6 then follow by setting the partial derivatives of Equation 6.4 with respect to \widehat{\beta}_0 and \widehat{\beta}_1 equal to zero; \widehat{\beta}_0 and \widehat{\beta}_1 are the unique solution to this linear system.

Alternative lingo

The \mathsf{RSS} is sometimes referred to as the error sum of squares and abbreviated \mathsf{SSE} (no, the order is not a typo).

Example 6.1 In Figure 6.1 and Figure 6.2, we consider the Cherry Tree Data (see Table 2.1 and the accompanying discussion). We fit a least squares regression of timber volume (the response variable) on the tree's diameter (the independent variable). As you would expect, the timber yield increases with diameter.

The R code below can be used to calculate the least squares regression and residuals.

Code
data(trees)            # Cherry Tree Data (shipped with R)
y <- trees$Volume      # response: timber volume
x <- trees$Girth       # NB: this is the diameter; data mislabeled!
fit <- lm(y ~ x)       # least squares fit of y on x
e <- resid(fit)        # residuals (Equation 6.3)
yhat <- predict(fit)   # fitted values

The fitted model object fit contains the estimates for \widehat{\beta}_0 and \widehat{\beta}_1:

Code
fit$coefficients
(Intercept)           x 
 -36.943459    5.065856 
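
Continuing the example (a sketch reusing the x, y, e, and fit objects defined above), the closed-form estimates of Theorem 6.1 and the \mathsf{RSS} of Equation 6.4 can be computed directly and compared against the output of lm().

Code
Sxy <- sum((x - mean(x)) * (y - mean(y)))   # S_xy
Sxx <- sum((x - mean(x))^2)                 # S_xx
b1  <- Sxy / Sxx                            # slope estimate, Equation 6.5
b0  <- mean(y) - b1 * mean(x)               # intercept estimate, Equation 6.6
c(b0, b1)                                   # should match fit$coefficients
sum(e^2)                                    # RSS computed from the residuals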

Both Figure 6.1 and Figure 6.2 are scatter plots of the observed values y. In Figure 6.1, the regression line \widehat{y} is plotted along with the residuals \widehat{\epsilon}. In Figure 6.2, the sample mean \overline{y} is plotted together with the deviations y - \overline{y}.

Figure 6.1: Linear regression (or least squares fit) of Volume to Diameter from the Cherry Tree Data. The vertical bars between the observed data points and the regression line indicate the errors in the fit (the least squares residuals). The residuals are squared and summed to yield the \mathsf{RSS} (alt: \mathsf{SSE}).
Figure 6.2: The deviations about the sample mean \overline{y}. The sum of the squared deviations or \mathsf{SST} (total sum of squares) is a measure of the total variation in the observations.

6.2 Estimating \sigma^2 for linear regressions

The parameter \sigma^2 (the variance of the random deviation) determines the variability in the regression model.

Theorem 6.2 An unbiased estimate of \sigma^2 is given by \widehat{\sigma}^2 = s^2 = \frac{\mathsf{RSS}}{m-2} = \frac{1}{m-2} \sum_{i=1}^m (y_i - \widehat{y}_i)^2\,. \tag{6.7}
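
As a numerical sketch (reusing y and fit from Example 6.1), Equation 6.7 can be evaluated directly for the diameter regression; the result should agree with the squared residual standard error reported by summary().

Code
m  <- length(y)                    # sample size
s2 <- sum(resid(fit)^2) / (m - 2)  # Equation 6.7; compare with the estimates quoted below
s2
summary(fit)$sigma^2               # the same estimate, as reported by lm()/summary()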

In Figure 6.3, we present two least squares regressions of timber volume for the Cherry Tree Data: one on tree diameter and one on tree height. As expected, the regressions indicate that the volume increases with both covariates. The estimates Equation 6.7 of the variance of the random deviation in the two regression models, \sigma_{D}^2 and \sigma_{H}^2, are computed to be s^2_{D} = 18.08 and s^2_{H} = 179.48, respectively. Thus, we see that small variances lead to observations (x_i, y_i) that sit tightly around the regression line, in contrast to large variances that lead to a large cloud of points.

Figure 6.3: For the Cherry Tree Data, we estimate the variance to be s^2_{D} = 18.08 (for Diameter) and s^2_{H} = 179.48 (for Height); small variances lead to observations of (x_i, y_i) that sit tightly around the regression line, in contrast to large variances that lead to a large cloud of points.
Why do we lose two degrees of freedom?

In Theorem 6.2, the number in the denominator is the df associated with the \mathsf{RSS} and s^2. To calculate \mathsf{RSS}, you must estimate two parameters \beta_0 and \beta_1, which results in the loss of two df. Hence the m-2.

We note that, to make inferences, the statistic S^2 = \frac{\mathsf{RSS}}{m-2} is an unbiased estimator of \sigma^2 and the random variable \frac{(m-2) S^2}{\sigma^2} \sim \chi^2(m-2)\,. Moreover, the statistic S^2 is independent of both \widehat{\beta}_0 and \widehat{\beta}_1.

6.3 Inferences for least-squares parameters

If \epsilon_i in Equation 6.2 is assumed to be normally distributed, then we can derive the sampling distributions of the estimators \widehat{\beta}_0 and \widehat{\beta}_1. Hence, we can use these sampling distributions to make inferences about the parameters \beta_0 and \beta_1.

Provided the \epsilon_i are iid with \epsilon_i \mid X_i \sim \mathsf{N}(0, \sigma^2), the least-squares estimators possess the following properties.

  1. Both \widehat{\beta}_0 and \widehat{\beta}_1 are normally distributed.
  2. Both \widehat{\beta}_0 and \widehat{\beta}_1 are unbiased, i.e., \mathbf{E}[\widehat{\beta}_i] = \beta_i for i = 0,1.
  3. \mathop{\mathrm{Var}}[\widehat{\beta}_0] = c_{00} \sigma^2 where c_{00} = \sum_{i=1}^m x_i^2 / (m S_{xx}).
  4. \mathop{\mathrm{Var}}[\widehat{\beta}_1] = c_{11} \sigma^2 where c_{11} = 1/S_{xx}.
  5. \mathop{\mathrm{Cov}}[\widehat{\beta}_0, \widehat{\beta}_1] = c_{01} \sigma^2 where c_{01} = - \overline{x} / S_{xx}.

These properties can be determined by working directly from Equation 6.5 and Equation 6.6.
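
These formulas can also be checked numerically: in R, vcov(fit) returns the estimated covariance matrix of (\widehat{\beta}_0, \widehat{\beta}_1), which equals S^2 times the matrix of c_{ij} terms above. The sketch below reuses x and fit from Example 6.1.

Code
m   <- length(x)
Sxx <- sum((x - mean(x))^2)
c00 <- sum(x^2) / (m * Sxx)          # Var[beta0_hat] = c00 * sigma^2
c11 <- 1 / Sxx                       # Var[beta1_hat] = c11 * sigma^2
c01 <- -mean(x) / Sxx                # Cov[beta0_hat, beta1_hat] = c01 * sigma^2
matrix(c(c00, c01, c01, c11), 2, 2)
vcov(fit) / summary(fit)$sigma^2     # the same matrix, recovered from lm()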

Proposition 6.1 Consider H_0 : \beta_i = \beta_{i0}. The test statistic is T = \frac{\widehat{\beta}_i - \beta_{i0}}{S\sqrt{c_{ii}}} \,. \tag{6.8}

For a hypothesis test at level \alpha, we use the following procedure (illustrated in R after the list):

If H_a : \beta_i > \beta_{i0}, then the P-value is the area under \mathsf{t}(m-2) to the right of t.

If H_a : \beta_i < \beta_{i0}, then the P-value is the area under \mathsf{t}(m-2) to the left of t.

If H_a : \beta_i \neq \beta_{i0}, then the P-value is twice the area under \mathsf{t}(m-2) to the right of |t|.
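
For the common special case H_0 : \beta_1 = 0 (that is, \beta_{10} = 0) against a two-sided alternative, the statistic in Equation 6.8 and its P-value can be computed directly and compared with the coefficient table printed by summary(); this is a sketch reusing x and fit from Example 6.1.

Code
m   <- length(x)
Sxx <- sum((x - mean(x))^2)
s   <- sqrt(sum(resid(fit)^2) / (m - 2))         # S, from Equation 6.7
t1  <- coef(fit)[2] / (s * sqrt(1 / Sxx))        # T of Equation 6.8 with beta_10 = 0
2 * pt(abs(t1), df = m - 2, lower.tail = FALSE)  # two-sided P-value
summary(fit)$coefficients                        # same t value and Pr(>|t|) for x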

A confidence interval for \beta_i, based on the statistic Equation 6.8, can be given following the procedures in Chapter 3.

Proposition 6.2 A 100(1-\alpha)\% CI for \beta_i is given by \widehat{\beta}_i \pm t_{\alpha/2, m-2} S \sqrt{c_{ii}} \,.
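
In R, confint() produces these intervals directly; the sketch below also rebuilds the interval for \beta_1 from Proposition 6.2 with c_{11} = 1/S_{xx} (again reusing x and fit from Example 6.1).

Code
confint(fit, level = 0.95)         # 95% CIs for the intercept and slope
m   <- length(x)
Sxx <- sum((x - mean(x))^2)
s   <- summary(fit)$sigma          # S
coef(fit)[2] + c(-1, 1) * qt(0.975, df = m - 2) * s * sqrt(1 / Sxx)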

6.4 Correlation

Let (X_1, Y_1), \dots, (X_m, Y_m) denote a random sample from a bivariate normal distribution with \mathbf{E}[X_i] = \mu_X, \mathbf{E}[Y_i] = \mu_Y, \mathop{\mathrm{Var}}[X_i] = \sigma_X^2, \mathop{\mathrm{Var}}[Y_i] = \sigma_Y^2, and correlation coefficient \rho. The sample correlation coefficient is given by r = \frac{\sum_{i=1}^m (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^m (X_i - \overline{X})^2 \sum_{i=1}^m (Y_i - \overline{Y})^2}}\,, \tag{6.9} which can be rewritten in terms of S_{xx}, S_{xy}, and S_{yy}: r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \widehat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}}\,, using Equation 6.5; in particular, r and \widehat{\beta}_1 have the same sign. An |r| close to 1 means that the regression line is a good fit to the data, while an |r| close to 0 means a poor fit to the data. Note that the correlation coefficient (and the least squares regression) is only suitable for describing linear relationships; a strong nonlinear relationship can still yield an r near zero (see Figure 6.4).

Figure 6.4: Correlations range from -1 to 1 with |r|=1 indicating a strong linear relationship and r near zero indicating the absence of a linear relationship.
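
For the Cherry Tree Data of Example 6.1, the sample correlation in Equation 6.9 can be computed with cor() or recovered from the least squares slope; a sketch reusing x, y, and fit from above.

Code
cor(x, y)                          # sample correlation coefficient, Equation 6.9
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
coef(fit)[2] * sqrt(Sxx / Syy)     # r = beta1_hat * sqrt(Sxx / Syy); same value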

6.5 Prediction using linear models

Once a model is fit, it can be used to predict a value of y for a given x. However, the fitted model only provides a point prediction for y; a corresponding prediction interval, which quantifies the uncertainty in that prediction, is usually more appropriate.

Proposition 6.3 A 100(1-\alpha)\% prediction interval for an actual value of Y when x = x^* is given by (\widehat{\beta}_0 + \widehat{\beta}_1 x^*) \pm t_{\alpha/2, m-2} S \sqrt{1 + \frac{1}{m} + \frac{(x^* - \overline{x})^2}{S_{xx}}} \,.

Prediction versus confidence intervals

The prediction interval is different from the confidence interval for the expected value of Y. Note that the length of the confidence interval for \mathbf{E}[Y] when x=x^* is given by 2 \cdot t_{\alpha/2, m-2} S \sqrt{\frac{1}{m} + \frac{(x^* - \overline{x})^2}{S_{xx}}}, whereas the length of the prediction interval for Y is 2 \cdot t_{\alpha/2, m-2} S \sqrt{1 + \frac{1}{m} + \frac{(x^* - \overline{x})^2}{S_{xx}}} \,. Thus the prediction intervals for an actual value of Y are longer than the confidence intervals for \mathbf{E}[Y] if both are determined at the same value x^*.
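
Both intervals are available from predict(). The sketch below evaluates them at the illustrative value x^* = 15 (chosen only for this example), reusing fit from Example 6.1; the prediction interval is visibly the wider of the two.

Code
newdat <- data.frame(x = 15)   # x* = 15 is an illustrative choice
predict(fit, newdata = newdat, interval = "confidence", level = 0.95)
predict(fit, newdata = newdat, interval = "prediction", level = 0.95)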

The linear model \mathbf{E}[ Y \mid X = x ] = \beta_0 + \beta_1 x assumes that the conditional expectation of Y for a fixed value of X is a linear function of the x value. If we assume that (X,Y) has a bivariate normal distribution, then \beta_1 = \frac{\sigma_Y}{\sigma_X} \rho \,, and thus, for the simple hypothesis tests we have considered (Table 2.2), testing H_0 : \beta_1 = 0 is equivalent to testing H_0 : \rho = 0.
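
This equivalence can be seen numerically with the Cherry Tree Data: cor.test() for H_0 : \rho = 0 reports the same t statistic and P-value as the t test of H_0 : \beta_1 = 0 from the regression summary (a sketch reusing x, y, and fit from Example 6.1).

Code
cor.test(x, y)                       # tests H0: rho = 0 (Pearson)
summary(fit)$coefficients["x", ]     # t test of H0: beta_1 = 0; same t and P-value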