Regression analysis allows us to study the relationship among two or more rvs. Typically, we are interested in the relationship between a response or dependent rv Y and a covariate X. The relationship between X and Y will be explained through a regression function, r(x) = \mathbf{E}[Y \mid X = x] = \int y f(y\mid x) dy \,. In particular, we shall assume that r is linear, r(x) = \beta_0 + \beta_1 x\,, \tag{6.1} and estimate the intercept \beta_0 and slope \beta_1 of this linear model from sample data (Y_1, X_1), \dots, (Y_m, X_m) \sim F_{Y,X}\,.
The covariates X are also called predictor variables, explanatory variables, independent variables, and/or features depending on who you are talking to.
The simplest regression is when X_i is one-dimensional and r(x) is linear, as in Equation 6.1. A linear regression posits that the expected value of Y_i is a linear function of X_i, and that Y_i deviates from this expected value by a random amount for a fixed x_i.
Definition 6.1 The simple linear regression model relates a random response Y_i to a set of independent variables X_i, Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \,, \tag{6.2} where the intercept \beta_0 and slope \beta_1 are unknown parameters and the random deviation or random error \epsilon_i is a rv assumed to satisfy \mathbf{E}[\epsilon_i \mid X_i] = 0 and \mathop{\mathrm{Var}}[\epsilon_i \mid X_i] = \sigma^2, with the \epsilon_i mutually independent.
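To make Equation 6.2 concrete, here is a minimal simulation sketch in R (the values of \beta_0, \beta_1, \sigma, and the sample size below are arbitrary choices for illustration, not quantities from the text): we generate data from the model and recover the parameters with lm.
set.seed(1)                               # for reproducibility
m <- 50                                   # sample size (arbitrary)
beta0 <- 2; beta1 <- 0.5; sigma <- 1      # true parameters (arbitrary)
X <- runif(m, 0, 10)                      # covariate values
eps <- rnorm(m, mean = 0, sd = sigma)     # random deviations, mean 0, variance sigma^2
Y <- beta0 + beta1 * X + eps              # Equation 6.2
coef(lm(Y ~ X))                           # estimates should be close to (beta0, beta1)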
From the assumptions on \epsilon_i, the linear model Equation 6.2 implies \mathbf{E}[Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i \,. Thus, if \widehat{\beta}_0 and \widehat{\beta}_1 are estimators of \beta_0 and \beta_1, then the fitted line is \widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x and the predicted or fitted value \widehat{Y}_i = \widehat{r}(X_i) is an estimator for \mathbf{E}[Y_i \mid X_i = x_i]. The residuals are defined to be \widehat{\epsilon}_i = Y_i - \widehat{Y}_i = Y_i - \left( \widehat{\beta}_0 + \widehat{\beta}_1 X_i \right) \,. \tag{6.3} The residual sum of squares, \mathsf{RSS} = \sum_{i=1}^m \widehat{\epsilon}_i^2\,, \tag{6.4} measures how well the regression line \widehat{r} fits the data (Y_1, X_1), \dots, (Y_m, X_m). The least squares estimates \widehat{\beta}_0 and \widehat{\beta}_1 are the values that minimize the \mathsf{RSS} in Equation 6.4.
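In R, once a model has been fit with lm (a sketch assuming a fitted object fit and a response vector y, exactly as produced by the Cherry Tree code later in this section), the fitted values, residuals, and RSS of Equation 6.3 and Equation 6.4 are obtained as follows.
yhat <- fitted(fit)                           # fitted values \hat{Y}_i
ehat <- y - yhat                              # residuals, Equation 6.3
RSS  <- sum(ehat^2)                           # residual sum of squares, Equation 6.4
all.equal(unname(ehat), unname(resid(fit)))   # matches R's built-in residuals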
Theorem 6.1 The least squares estimates for \widehat{\beta}_1 and \widehat{\beta}_0 are given by, respectively, \widehat{\beta}_1 = \frac{\sum_{i=1}^m (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^m (X_i - \overline{X})^2} = \frac{S_{xy}}{S_{xx}} \,, \tag{6.5} and \widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X}\,. \tag{6.6}
By the definition of the residuals in Equation 6.3, the \mathsf{RSS} in Equation 6.4 is a function of \widehat{\beta}_0 and \widehat{\beta}_1. Equation 6.5 and Equation 6.6 then follow by setting the partial derivatives of Equation 6.4 with respect to \widehat{\beta}_0 and \widehat{\beta}_1 equal to zero; \widehat{\beta}_0 and \widehat{\beta}_1 are the unique solution to the resulting linear system.
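Theorem 6.1 is easy to check numerically; the sketch below computes S_{xy}, S_{xx}, and the estimates of Equation 6.5 and Equation 6.6 directly from their definitions, using the Cherry Tree variables x and y defined in the R code below.
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1  <- Sxy / Sxx                 # slope, Equation 6.5
b0  <- mean(y) - b1 * mean(x)    # intercept, Equation 6.6
c(b0, b1)                        # agrees with coef(lm(y ~ x))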
The \mathsf{RSS} is sometimes referred to as the error sum of squares and abbreviated \mathsf{SSE} (no, the order is not a typo).
Example 6.1 In Figure 6.1 and Figure 6.2, we consider the Cherry Tree Data (see Table 2.1 and the accompanying discussion). We fit a least squares regression of timber volume (response variable) on the tree's diameter (independent variable). As you would expect, the timber yield increases with diameter.
The R code below can be used to calculate the least squares regression and residuals.
data(trees)
y <- trees$Volume      # response: timber volume
x <- trees$Girth       # NB: this is the diameter; data mislabeled!
fit <- lm(y ~ x)       # least squares fit
e <- resid(fit)        # residuals
yhat <- predict(fit)   # fitted values
The fit object returned by lm contains the estimates for \widehat{\beta}_0 and \widehat{\beta}_1:
fit$coefficients
(Intercept) x
-36.943459 5.065856
Both Figure 6.1 and Figure 6.2 are scatter plots of the observed values y. In Figure 6.1, the regression line \widehat{y} is plotted along with the residuals \widehat{\epsilon}. In Figure 6.2, the sample mean \overline{y} is plotted together with the deviations y - \overline{y}.
The parameter \sigma^2 (the variance of the random deviation) determines the variability in the regression model.
Theorem 6.2 An unbiased estimate of \sigma^2 is given by \widehat{\sigma}^2 = s^2 = \frac{\mathsf{RSS}}{m-2} = \frac{1}{m-2} \sum_{i=1}^m (y_i - \widehat{y}_i)^2\,. \tag{6.7}
In Figure 6.3, we present least squares regressions of timber volume on tree diameter and on tree height (for the Cherry Tree Data). As expected, the regressions indicate that the volume increases with both covariates. The estimates (Equation 6.7) of the variances of the random deviations in the two regression models, \sigma_{D}^2 and \sigma_{H}^2, are computed to be s^2_{D} = 18.08 and s^2_{H} = 179.48, respectively. Thus, we see that small variances lead to observations (x_i, y_i) that sit tightly around the regression line, whereas large variances lead to a large cloud of points.
In Theorem 6.2, the number in the denominator is the df associated with the \mathsf{RSS} and s^2. To calculate \mathsf{RSS}, you must estimate two parameters \beta_0 and \beta_1, which results in the loss of two df. Hence the m-2.
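As a check on Theorem 6.2, the sketch below computes s^2 = \mathsf{RSS}/(m-2) from the fit object defined above; for the diameter regression it reproduces the value s^2_{D} = 18.08 quoted earlier and agrees with the squared residual standard error reported by summary.
RSS <- sum(resid(fit)^2)         # Equation 6.4
m   <- length(resid(fit))        # number of observations
s2  <- RSS / (m - 2)             # Equation 6.7; approximately 18.08 here
c(s2, summary(fit)$sigma^2)      # the two values coincide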
We note that, to make inferences, the statistic S^2 = \frac{\mathsf{RSS}}{m-2} is an unbiased estimator of \sigma^2 and the random variable \frac{(m-2) S^2}{\sigma^2} \sim \chi^2(m-2)\,. Moreover, the statistic S^2 is independent of both \widehat{\beta}_0 and \widehat{\beta}_1.
If \epsilon_i in Equation 6.2 is assumed to be normally distributed, then we can derive the sampling distributions of the estimators \widehat{\beta}_0 and \widehat{\beta}_1. Hence, we can use these sampling distributions to make inferences about the parameters \beta_0 and \beta_1.
Provided the \epsilon_i \mid X_i \overset{\text{iid}}{\sim} \mathsf{N}(0, \sigma^2), the least-squares estimators possess the following properties: each \widehat{\beta}_i is an unbiased estimator of \beta_i and is normally distributed, \widehat{\beta}_i \sim \mathsf{N}(\beta_i, \sigma^2 c_{ii}), where c_{00} = \sum_{j=1}^m X_j^2 / (m S_{xx}) and c_{11} = 1/S_{xx}.
These properties can be determined by working directly from Equation 6.5 and Equation 6.6.
Proposition 6.1 Consider H_0 : \beta_i = \beta_{i0}. The test statistic is T = \frac{\widehat{\beta}_i - \beta_{i0}}{S\sqrt{c_{ii}}} \,. \tag{6.8} Under H_0, T has a \mathsf{t}(m-2) distribution.
For a hypothesis test at level \alpha, we use the following procedure:
If H_a : \beta_i > \beta_{i0}, then the P-value is the area under the \mathsf{t}(m-2) density to the right of t.
If H_a : \beta_i < \beta_{i0}, then the P-value is the area under the \mathsf{t}(m-2) density to the left of t.
If H_a : \beta_i \neq \beta_{i0}, then the P-value is twice the area under the \mathsf{t}(m-2) density to the right of |t|.
A confidence interval for \beta_i, based on the statistic Equation 6.8, can be given following the procedures in Chapter 3.
Proposition 6.2 A 100(1-\alpha)\% CI for \beta_i is given by \widehat{\beta}_i \pm t_{\alpha/2, m-2} S \sqrt{c_{ii}} \,.
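In R, the quantities appearing in Proposition 6.1 and Proposition 6.2 can be computed directly or read off from summary and confint; the sketch below tests H_0 : \beta_1 = 0 (the default hypothesis reported by summary) for the Cherry Tree fit.
se_b1 <- summary(fit)$coefficients["x", "Std. Error"]     # S * sqrt(c_11)
t     <- (coef(fit)["x"] - 0) / se_b1                     # Equation 6.8 with beta_10 = 0
2 * pt(abs(t), df = fit$df.residual, lower.tail = FALSE)  # two-sided P-value
summary(fit)$coefficients                                 # same t value and P-value
confint(fit, level = 0.95)                                # 95% CIs (Proposition 6.2)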
Let (X_1, Y_1), \dots, (X_m, Y_m) denote a random sample from a bivariate normal distribution with \mathbf{E}[X_i] = \mu_X, \mathbf{E}[Y_i] = \mu_Y, \mathop{\mathrm{Var}}[X_i] = \sigma_X^2, \mathop{\mathrm{Var}}[Y_i] = \sigma_Y^2, and correlation coefficient \rho. The sample correlation coefficient is given by r = \frac{\sum_{i=1}^m (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^m (X_i - \overline{X})^2 \sum_{i=1}^m (Y_i - \overline{Y})^2}}\,, \tag{6.9} which can be rewritten in terms of S_{xx}, S_{xy}, and S_{yy}: r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \widehat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}}\,, using Equation 6.5; in particular, r and \widehat{\beta}_1 have the same sign. An |r| close to 1 means that the regression line is a good fit to the data, whereas an |r| close to 0 means a poor fit. Note that the correlation coefficient (and the least squares regression) are only suitable for describing linear relationships; a nonlinear relationship can also yield r near zero (see Figure 6.4).
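The sketch below confirms Equation 6.9 and the identity r = \widehat{\beta}_1 \sqrt{S_{xx}/S_{yy}} for the Cherry Tree fit, and compares both with R's built-in cor.
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
r   <- Sxy / sqrt(Sxx * Syy)                        # Equation 6.9
c(r, coef(fit)["x"] * sqrt(Sxx / Syy), cor(x, y))   # all three agree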
Once a model is fit, it can be used to predict a value of y for a given x. However, the fitted model only provides a point prediction (an estimate of the expected value of y); a corresponding prediction interval is usually more informative.
Proposition 6.3 A 100(1-\alpha)\% prediction interval for an actual value of Y when x = x^* is given by (\widehat{\beta}_0 + \widehat{\beta}_1 x^*) \pm t_{\alpha/2, m-2} S \sqrt{1 + \frac{1}{m} + \frac{(x^* - \overline{x})^2}{S_{xx}}} \,.
The prediction interval is different from the confidence interval for the expected value of Y. Note that the length of the confidence interval for \mathbf{E}[Y] when x=x^* is 2 \, t_{\alpha/2, m-2} S \sqrt{\frac{1}{m} + \frac{(x^* - \overline{x})^2}{S_{xx}}}\,, whereas the length of the prediction interval for Y is 2 \, t_{\alpha/2, m-2} S \sqrt{1 + \frac{1}{m} + \frac{(x^* - \overline{x})^2}{S_{xx}}} \,. Thus the prediction interval for an actual value of Y is longer than the confidence interval for \mathbf{E}[Y] when both are determined at the same value x^*.
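Both intervals are available from predict; the sketch below evaluates them for the Cherry Tree fit at an arbitrary illustrative diameter x^* = 12 (not a value taken from the text), and the prediction interval is visibly wider.
newx <- data.frame(x = 12)                                           # x* = 12 (arbitrary)
predict(fit, newdata = newx, interval = "confidence", level = 0.95)  # CI for E[Y | x = x*]
predict(fit, newdata = newx, interval = "prediction", level = 0.95)  # PI for a new Y at x = x*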
The linear model \mathbf{E}[ Y \mid X = x ] = \beta_0 + \beta_1 x \,, assumes that the conditional expectation of Y for a fixed value of X is a linear function of the x value. If we assume that (X,Y) has a bivariate normal distribution, then \beta_1 = \frac{\sigma_Y}{\sigma_X} \rho \,, and thus, for the simple hypothesis tests we have considered (Table 2.2), statistical tests for \beta_1 and \rho are equivalent.
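The sample analogue of this identity, \widehat{\beta}_1 = r \, s_y / s_x, holds exactly for the least squares fit and is easy to verify for the Cherry Tree data (a sketch using the x and y defined earlier).
c(coef(fit)["x"], cor(x, y) * sd(y) / sd(x))   # the slope equals r * s_y / s_x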