Section 1: Multiple Regression

The in-built dataset trees contains data pertaining to the Volume, Girth and Height of 31 felled black cherry trees. In the Simple Regression session, we constructed a simple linear model for Volume using Girth as the independent variable. Now we will expand this by considering Height as another predictor.

Start by plotting the dataset:

plot(trees)

This plots all variables against each other, enabling visual information about correlations within the dataset.

Re-create the original model of Volume against Girth:

m1 = lm(Volume~Girth,data=trees)
summary(m1)
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

Now include Height as an additional variable:

m2 = lm(Volume~Girth+Height,data=trees)
summary(m2)
## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Note that the R^2 has improved, yet the Height term is less significant than the other two parameters.

Try including the interaction term between Girth and Height:

m3 = lm(Volume~Girth*Height,data=trees)
summary(m3)
## 
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.39632   23.83575   2.911  0.00713 ** 
## Girth        -5.85585    1.92134  -3.048  0.00511 ** 
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16

All terms are highly significant. Note that the Height is more significant than in the previous model, despite the introduction of an additional parameter.

We’ll now try a different functional form - rather than looking for an additive model, we can explore a multiplicative model by applying a log-log transformation (leaving out the interaction term for now).

m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees)
summary(m4)
## 
## Call:
## lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168561 -0.048488  0.002431  0.063637  0.129223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.63162    0.79979  -8.292 5.06e-09 ***
## log(Girth)   1.98265    0.07501  26.432  < 2e-16 ***
## log(Height)  1.11712    0.20444   5.464 7.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared:  0.9777, Adjusted R-squared:  0.9761 
## F-statistic: 613.2 on 2 and 28 DF,  p-value: < 2.2e-16

All terms are significant. Note that the residual standard error is much lower than for the previous models. However, this value cannot be compared with the previous models due to transforming the response variable. The R^2 value has increased further, despite reducing the number of parameters from four to three.

confint(m4)
##                 2.5 %    97.5 %
## (Intercept) -8.269912 -4.993322
## log(Girth)   1.828998  2.136302
## log(Height)  0.698353  1.535894

Looking at the confidence intervals for the parameters reveals that the estimated power of Girth is around 2, and Height around 1. This makes a lot of sense, given the well-known dimensional relationship between Volume, Girth and Height!

For completeness, we’ll now add the interaction term.

m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees)
summary(m5)
## 
## Call:
## lm(formula = log(Volume) ~ log(Girth) * log(Height), data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.165941 -0.048613  0.006384  0.062204  0.132295 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             -3.6869     7.6996  -0.479    0.636
## log(Girth)               0.7942     3.0910   0.257    0.799
## log(Height)              0.4377     1.7788   0.246    0.808
## log(Girth):log(Height)   0.2740     0.7124   0.385    0.704
## 
## Residual standard error: 0.08265 on 27 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9753 
## F-statistic: 396.4 on 3 and 27 DF,  p-value: < 2.2e-16

The R^2 value has increased (of course, as all we’ve done is add an additional parameter), but interestingly none of the four terms are significant. This means that none of the individual terms alone are vital for the model - there is duplication of information between the variables. So we will revert back to the previous model.

Given that it would be reasonable to expect the power of Girth to be 2, and Height to be 1, we will now fix those parameters, and instead just estimate the one remaining parameter.

m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees)
summary(m6)
## 
## Call:
## lm(formula = log(Volume) - log((Girth^2) * Height) ~ 1, data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168446 -0.047355 -0.003518  0.066308  0.136467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.16917    0.01421  -434.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0791 on 30 degrees of freedom

Note that there is no R^2 (as only the intercept was included in the model), and that the Residual Standard Error is incomparable with previous models due to changing the response variable.

We can alternatively construct a model with the response being y, and the error term additive rather than multiplicative.

m7 = lm(Volume~0+I(Girth^2):Height,data=trees)
summary(m7)
## 
## Call:
## lm(formula = Volume ~ 0 + I(Girth^2):Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6696 -1.0832 -0.3341  1.6045  4.2944 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## I(Girth^2):Height 2.108e-03  2.722e-05   77.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.455 on 30 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.9949 
## F-statistic:  5996 on 1 and 30 DF,  p-value: < 2.2e-16

Note that the parameter estimates for the last two models are slightly different… this is due to differences in the error model.

Section 2: Model Selection

Of the last two models, the one with the log-Normal error model would seem to have the more Normal residuals. This can be inspected by looking at diagnostic plots, by and using the shapiro.test():

fitted = fitted(m6)
resid = resid(m6)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")

fitted = fitted(m7)
resid = resid(m7)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")

shapiro.test(residuals(m6))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m6)
## W = 0.97013, p-value = 0.5225
shapiro.test(residuals(m7))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m7)
## W = 0.95846, p-value = 0.2655

The Akaike Information Criterion (AIC) can help to make decisions regarding which model is the most appropriate. Now calculate the AIC for each of the above models:

summary(m1)
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
AIC(m1)
## [1] 181.6447
summary(m2)
## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
AIC(m2)
## [1] 176.91
summary(m3)
## 
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.39632   23.83575   2.911  0.00713 ** 
## Girth        -5.85585    1.92134  -3.048  0.00511 ** 
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16
AIC(m3)
## [1] 155.4692
summary(m4)
## 
## Call:
## lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168561 -0.048488  0.002431  0.063637  0.129223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.63162    0.79979  -8.292 5.06e-09 ***
## log(Girth)   1.98265    0.07501  26.432  < 2e-16 ***
## log(Height)  1.11712    0.20444   5.464 7.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared:  0.9777, Adjusted R-squared:  0.9761 
## F-statistic: 613.2 on 2 and 28 DF,  p-value: < 2.2e-16
AIC(m4)
## [1] -62.71125
summary(m5)
## 
## Call:
## lm(formula = log(Volume) ~ log(Girth) * log(Height), data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.165941 -0.048613  0.006384  0.062204  0.132295 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             -3.6869     7.6996  -0.479    0.636
## log(Girth)               0.7942     3.0910   0.257    0.799
## log(Height)              0.4377     1.7788   0.246    0.808
## log(Girth):log(Height)   0.2740     0.7124   0.385    0.704
## 
## Residual standard error: 0.08265 on 27 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9753 
## F-statistic: 396.4 on 3 and 27 DF,  p-value: < 2.2e-16
AIC(m5)
## [1] -60.88061
summary(m6)
## 
## Call:
## lm(formula = log(Volume) - log((Girth^2) * Height) ~ 1, data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168446 -0.047355 -0.003518  0.066308  0.136467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.16917    0.01421  -434.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0791 on 30 degrees of freedom
AIC(m6)
## [1] -66.34198
summary(m7)
## 
## Call:
## lm(formula = Volume ~ 0 + I(Girth^2):Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6696 -1.0832 -0.3341  1.6045  4.2944 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## I(Girth^2):Height 2.108e-03  2.722e-05   77.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.455 on 30 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.9949 
## F-statistic:  5996 on 1 and 30 DF,  p-value: < 2.2e-16
AIC(m7)
## [1] 146.6447

Whilst the AIC can help differentiate between similar models, it cannot help deciding between models that have different responses. Which model would you select as the most appropriate?

Section 3: Practical Exercises

Puromycin

The in-built R dataset Puromycin contains data regarding the reaction velocity versus substrate concentration in an enzymatic reaction involving untreated cells or cells treated with Puromycin.

  • Plot conc (concentration) against rate. What is the nature of the relationship between conc and rate?
plot(conc~rate,data=Puromycin)

# There is a non-linear positive relationship between conc and rate
  • Find a transformation that linearises the data and stabilises the variance, making it possible to use linear regression. Create the corresponding linear regression model. Are all terms significant?
plot(log(conc)~rate,data=Puromycin)

m10 = lm(log(conc)~rate,data=Puromycin)
fitted = fitted(m10)
resid = resid(m10)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")

summary(m10)
## 
## Call:
## lm(formula = log(conc) ~ rate, data = Puromycin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66740 -0.37570 -0.00859  0.25076  1.12635 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.247389   0.293626  -17.87 3.52e-14 ***
## rate         0.026352   0.002174   12.12 6.04e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4845 on 21 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.869 
## F-statistic: 146.9 on 1 and 21 DF,  p-value: 6.039e-11
# Both terms are significant
  • Add the state term to the model. What type of variable is this? Is the inclusion of this term appropriate?
m11 = lm(log(conc)~rate+state,data=Puromycin)
fitted = fitted(m11)
resid = resid(m11)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")

summary(m11)
## 
## Call:
## lm(formula = log(conc) ~ rate + state, data = Puromycin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6516 -0.1915  0.0066  0.1544  0.6669 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.915289   0.248309 -23.822 3.74e-16 ***
## rate            0.028912   0.001612  17.936 8.55e-14 ***
## stateuntreated  0.717807   0.149953   4.787 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3389 on 20 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9359 
## F-statistic: 161.6 on 2 and 20 DF,  p-value: 4.515e-13
# `state` is a boolean factor or indicator variable
# The inclusion of `state` is appropriate, as the term is significant and the diagnostic plots look reasonable 
  • Now add a term representing the interaction between rate and state. Are all terms significant? What can you conclude?
m12 = lm(log(conc)~rate*state,data=Puromycin)
summary(m12)
## 
## Call:
## lm(formula = log(conc) ~ rate * state, data = Puromycin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38444 -0.25402  0.03685  0.21395  0.39790 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.504198   0.235291 -23.393 1.81e-15 ***
## rate                 0.026008   0.001565  16.624 8.91e-13 ***
## stateuntreated      -0.436875   0.362822  -1.204  0.24334    
## rate:stateuntreated  0.009619   0.002848   3.378  0.00316 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2748 on 19 degrees of freedom
## Multiple R-squared:  0.9636, Adjusted R-squared:  0.9578 
## F-statistic: 167.6 on 3 and 19 DF,  p-value: 7.622e-14
# The `state` term is not significant when the interaction between `rate` and `state` is included in the model. So it may be better to remove the `state` term from the model.
  • Given this information, create the regression model you believe to be the most appropriate for modelling conc. Regenerate the plot of conc against rate. Draw curves corresponding to the fitted values of the final model onto this plot (note that two separate curves should be drawn, corresponding to the two levels of state).
m13 = lm(log(conc)~rate+rate:state,data=Puromycin)
summary(m13)
## 
## Call:
## lm(formula = log(conc) ~ rate + rate:state, data = Puromycin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47172 -0.20654  0.01401  0.17728  0.49947 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.6879279  0.1811094 -31.406  < 2e-16 ***
## rate                 0.0271582  0.0012530  21.675 2.31e-15 ***
## rate:stateuntreated  0.0063885  0.0009651   6.619 1.91e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2779 on 20 degrees of freedom
## Multiple R-squared:  0.9608, Adjusted R-squared:  0.9569 
## F-statistic: 245.2 on 2 and 20 DF,  p-value: 8.539e-15
# Solution one:
plot(conc~rate,data=Puromycin)
idx = order(Puromycin$rate)
treated = Puromycin$state[idx] == "treated"
untreated = Puromycin$state[idx] == "untreated"
lines(exp(fitted(m13))[idx][treated]~Puromycin$rate[idx][treated])
lines(exp(fitted(m13))[idx][untreated]~Puromycin$rate[idx][untreated],col="red")

# Solution two (better - more general):
plot(conc~rate,data=Puromycin)
xvals = range(Puromycin$rate)[1]:range(Puromycin$rate)[2]
lines(exp(coef(m13)[1] + coef(m13)[2]*xvals) ~ xvals)
lines(exp(coef(m13)[1] + coef(m13)[2]*xvals + coef(m13)[3]*xvals) ~ xvals, col="red")