Let’s look at some real data. The in-built dataset trees contains data pertaining to the Volume, Girth and Height of 31 felled black cherry trees. We will now attempt to construct a simple linear model that uses Girth to predict Volume.
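We can quantify the strength of the association using cor.test(); the call matching the output below is:
cor.test(trees$Volume, trees$Girth)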
##
## Pearson's product-moment correlation
##
## data: trees$Volume and trees$Girth
## t = 20.478, df = 29, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9322519 0.9841887
## sample estimates:
## cor
## 0.9671194
It is evident that Volume and Girth are highly correlated.
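We fit the model with lm(); the formula follows from the stated aim of predicting Volume from Girth, and the name m1 matches the diagnostic code below:
m1 = lm(Volume ~ Girth, data = trees)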
Model residuals can be readily accessed using the resid() function:
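For example, to inspect the first few residuals:
head(resid(m1))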
Diagnostic plots for the model can reveal whether the modelling assumptions are reasonable. In this case, there is visual evidence to suggest that the assumptions are not satisfied; note in particular the trend observed in the plot of residuals vs fitted values:
fitted = fitted(m1)
resid = resid(m1)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
Let’s see what happens if we try to describe a non-linear relationship using a linear model. Consider the sine function in the range [0,1.5*pi):
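The data-generating step here is a sketch: the model summary below reports 22 residual degrees of freedom, implying 24 observations, so we assume 24 evenly spaced points on [0, 1.5*pi):
z = (0:23) * 1.5 * pi / 24  # 24 points in [0, 1.5*pi)
plot(z, sin(z))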
In this case, it is clear that a linear model is not appropriate for describing the relationship. However, we are able to fit a linear model, and the linear model summary does not identify any major concerns:
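The fit and its summary (the formula matches the Call line below):
m2 = lm(sin(z) ~ z)
summary(m2)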
##
## Call:
## lm(formula = sin(z) ~ z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02542 -0.37054 0.01294 0.42276 0.59274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.02542 0.18641 5.501 1.58e-05 ***
## z -0.35443 0.06944 -5.104 4.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.471 on 22 degrees of freedom
## Multiple R-squared: 0.5422, Adjusted R-squared: 0.5214
## F-statistic: 26.05 on 1 and 22 DF, p-value: 4.094e-05
Here we see that the overall p-value is low enough to suggest that the model has significant utility, and both terms (the intercept and the coefficient of z) are significantly different from zero.
Indeed, the linear model summary does not check whether the underlying model assumptions are satisfied. The strong patterns in the diagnostic plots show that the modelling assumptions are not satisfied in this case.
fitted = fitted(m2)
resid = resid(m2)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
It is sometimes possible to use linear models to describe non-linear relationships (which is perhaps counterintuitive!). This can be achieved by applying transformations to the variable(s) in order to linearise the relationship, whilst ensuring that modelling assumptions are satisfied.
Another in-built dataset, cars, provides the speeds and associated stopping distances of cars in the 1920s. Let’s construct a linear model to predict stopping distance using speed:
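The fit and its summary (the name m3 matches the diagnostic code below, and the formula matches the Call line):
m3 = lm(dist ~ speed, data = cars)
summary(m3)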
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The model summary indicates that the intercept term is only marginally significant (p ≈ 0.012), and its negative estimate is hard to interpret physically (a stopping distance cannot be negative), so we might consider removing that term from the model.
In addition, the plot of residuals versus fitted values indicates potential issues with variance stability:
fitted = fitted(m3)
resid = resid(m3)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
In this case, variance stability can be improved by applying a square-root transformation to the response variable:
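The transformed fit (the formula matches the Call line of the summary further below):
m4 = lm(sqrt(dist) ~ speed, data = cars)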
fitted = fitted(m4)
resid = resid(m4)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
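And its summary:
summary(m4)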
##
## Call:
## lm(formula = sqrt(dist) ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0684 -0.6983 -0.1799 0.5909 3.1534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27705 0.48444 2.636 0.0113 *
## speed 0.32241 0.02978 10.825 1.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 48 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
## F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
Note again that the intercept term is only marginally significant (p ≈ 0.011).
We’ll now try a log-log transformation, that is, applying a log transformation to both the predictor and the response variables. This represents a power relationship between the two variables, since log(dist) = a + b*log(speed) is equivalent to dist = exp(a)*speed^b.
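The log-log fit (the formula matches the Call line below):
m5 = lm(log(dist) ~ log(speed), data = cars)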
fitted = fitted(m5)
resid = resid(m5)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
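And the summary:
summary(m5)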
##
## Call:
## lm(formula = log(dist) ~ log(speed), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00215 -0.24578 -0.02898 0.20717 0.88289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7297 0.3758 -1.941 0.0581 .
## log(speed) 1.6024 0.1395 11.484 2.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.7276
## F-statistic: 131.9 on 1 and 48 DF, p-value: 2.259e-15
The diagnostic plots don’t look too unreasonable. However, the intercept term again does not have significant utility (p ≈ 0.058), so we’ll now remove it from the model:
m6 = lm(log(dist)~0+log(speed),data=cars)  # the 0 suppresses the intercept
fitted = fitted(m6)
resid = resid(m6)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
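The summary of the no-intercept model:
summary(m6)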
##
## Call:
## lm(formula = log(dist) ~ 0 + log(speed), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21083 -0.22501 0.01129 0.25636 0.85978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## log(speed) 1.33466 0.02187 61.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4166 on 49 degrees of freedom
## Multiple R-squared: 0.987, Adjusted R-squared: 0.9867
## F-statistic: 3724 on 1 and 49 DF, p-value: < 2.2e-16
This model seems reasonable. We can now transform the model back, and display the regression curve on the plot:
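A sketch of the back-transformation: we exponentiate the fitted values of m6 to return from the log scale to the original scale, ordering by speed so the curve draws from left to right (the axis labels are our own choice):
plot(dist ~ speed, data = cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
idx = order(cars$speed)
lines(exp(fitted(m6))[idx] ~ cars$speed[idx])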
The inbuilt R dataset faithful pertains to the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Create a linear model that predicts eruption duration faithful$eruptions using waiting time faithful$waiting as the independent variable, storing the model in a variable. Look at the summary of the model.
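A fit consistent with the summary below; the name m7 is the one used in the later diagnostic code:
m7 = lm(eruptions ~ waiting, data = faithful)
summary(m7)
##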
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29917 -0.37689 0.03508 0.34909 1.19329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
## waiting 0.075628 0.002219 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
# In the absence of other information, this summary would indicate a linear relationship between the two variables. However, we cannot conclude that without first checking that the modelling assumptions have been satisfied...
# At a glance, the model seems to describe the overall dependence of eruptions on waiting time reasonably well. However, this is misleading...
fitted = fitted(m7)
resid = resid(m7)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
# There is strong systematic behaviour in the plot of residuals versus fitted values. This indicates that the relationship is different from, or more complicated than, what can be described with the simple linear model.
# Specifically, we should identify what causes observations to fall into one or the other of the two groups, and account for the differences between the groups when modelling the relationship. It seems that the direct dependence of `eruptions` on `waiting` is not as strong as the simple linear model suggests.
Consider the inbuilt R dataset anscombe. This dataset contains four x-y datasets, contained in the columns: (x1,y1), (x2,y2), (x3,y3) and (x4,y4).
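The correlation and correlation test for the first pair; analogous calls with (x2,y2), (x3,y3) and (x4,y4) produce the remaining outputs:
cor(anscombe$x1, anscombe$y1)
cor.test(anscombe$x1, anscombe$y1)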
## [1] 0.8164205
##
## Pearson's product-moment correlation
##
## data: anscombe$x1 and anscombe$y1
## t = 4.2415, df = 9, p-value = 0.00217
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4243912 0.9506933
## sample estimates:
## cor
## 0.8164205
## [1] 0.8162365
##
## Pearson's product-moment correlation
##
## data: anscombe$x2 and anscombe$y2
## t = 4.2386, df = 9, p-value = 0.002179
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4239389 0.9506402
## sample estimates:
## cor
## 0.8162365
## [1] 0.8162867
##
## Pearson's product-moment correlation
##
## data: anscombe$x3 and anscombe$y3
## t = 4.2394, df = 9, p-value = 0.002176
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4240623 0.9506547
## sample estimates:
## cor
## 0.8162867
## [1] 0.8165214
##
## Pearson's product-moment correlation
##
## data: anscombe$x4 and anscombe$y4
## t = 4.243, df = 9, p-value = 0.002165
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4246394 0.9507224
## sample estimates:
## cor
## 0.8165214
# All four datasets seem to exhibit positive linear relationships, with essentially the same correlation and the same p-value.
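Fitting a simple linear model to each pair (the formulas match the Call lines below):
summary(lm(anscombe$y1 ~ anscombe$x1))
summary(lm(anscombe$y2 ~ anscombe$x2))
summary(lm(anscombe$y3 ~ anscombe$x3))
summary(lm(anscombe$y4 ~ anscombe$x4))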
##
## Call:
## lm(formula = anscombe$y1 ~ anscombe$x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## anscombe$x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
##
## Call:
## lm(formula = anscombe$y2 ~ anscombe$x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## anscombe$x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
##
## Call:
## lm(formula = anscombe$y3 ~ anscombe$x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## anscombe$x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
##
## Call:
## lm(formula = anscombe$y4 ~ anscombe$x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## anscombe$x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
# The four datasets are very different, with very different relationships between the x and y variables.
# This demonstrates how very different datasets can appear to be very similar when looking solely at summary statistics.
# We conclude that it is always important to perform exploratory data analysis, and look at the data before modelling.
Consider the inbuilt R dataset Indometh, which contains data on the pharmacokinetics of indometacin.
- Plot Indometh$time versus Indometh$conc (concentration). What is the nature of the relationship between time and conc? Construct a linear model to describe this relationship, applying transformations if appropriate (a sketch of the plot follows this list).
- After creating the linear model, inspect the diagnostic plots to
ensure that the assumptions are not violated (too much). Are there any
outliers with large influence? What are the parameter estimates? Are
both terms significant?
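A sketch of the initial scatterplot, mirroring the plotting call used at the end of this exercise:
plot(time ~ conc, data = Indometh)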
m8 = lm(log(time)~log(conc),data=Indometh)  # a log-log model: a power relationship between time and conc
#plot(m8)  # uncomment for the standard lm diagnostic plots
fitted = fitted(m8)
resid = resid(m8)
plot(fitted, resid, xlab="Fitted values", ylab="Raw residuals")
# The diagnostic plots indicate that the residuals aren't perfectly Normally distributed, but the modelling assumptions aren't violated so much as to inhibit construction of a model.
summary(m8)
##
## Call:
## lm(formula = log(time) ~ log(conc), data = Indometh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59838 -0.21398 0.02711 0.20979 0.77285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4203 0.0474 -8.867 9.67e-13 ***
## log(conc) -0.9066 0.0298 -30.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2743 on 64 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9343
## F-statistic: 925.9 on 1 and 64 DF, p-value: < 2.2e-16
# Intercept = -0.4203
# Coefficient of log(conc) = -0.9066
# Both terms are significantly different from zero.
- Now regenerate the original plot of time versus conc (i.e. the untransformed data). Using the lines function, add a curve to the plot corresponding to the fitted values of the model.
plot(time~conc,data=Indometh)
idx <- order(Indometh$conc)  # order by conc so the curve draws from left to right
lines(exp(fitted(m8))[idx]~Indometh$conc[idx])  # back-transform fitted values to the original scale