Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law

Based on the course RNAseq analysis in R delivered on May 11/12th 2016

Simulating linear models

We are going to simulate an experiment with 2 factors:

Treatment <- rep(c("Treatment A", "Control"), 3)
X <- cbind(c(1,0,1,0,1,0), c(0,1,0, 1, 0, 1))
Treatment <- factor(Treatment, levels=c("Treatment A", "Control"))
model.matrix(~Treatment)
  (Intercept) TreatmentControl
1           1                0
2           1                1
3           1                0
4           1                1
5           1                0
6           1                1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$Treatment
[1] "contr.treatment"
model.matrix(~Treatment -1)
  TreatmentTreatment A TreatmentControl
1                    1                0
2                    0                1
3                    1                0
4                    0                1
5                    1                0
6                    0                1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$Treatment
[1] "contr.treatment"
Treatment <- rep(c("Treatment A", "No Treatment"), 4)
ER <- rep(c("+", "-"), c(4,4))

We can obtain the design matrix for a model without and with interaction with the following commands:

model.matrix(~Treatment + ER)
  (Intercept) TreatmentTreatment A ER+
1           1                    1   1
2           1                    0   1
3           1                    1   1
4           1                    0   1
5           1                    1   0
6           1                    0   0
7           1                    1   0
8           1                    0   0
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Treatment
[1] "contr.treatment"

attr(,"contrasts")$ER
[1] "contr.treatment"
model.matrix(~Treatment * ER)
  (Intercept) TreatmentTreatment A ER+ TreatmentTreatment A:ER+
1           1                    1   1                        1
2           1                    0   1                        0
3           1                    1   1                        1
4           1                    0   1                        0
5           1                    1   0                        0
6           1                    0   0                        0
7           1                    1   0                        0
8           1                    0   0                        0
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$Treatment
[1] "contr.treatment"

attr(,"contrasts")$ER
[1] "contr.treatment"

Now we will simulate a model without interaction. We need to provide a set of theoretical variable for our parameters, and a level of noise in our model:

params <- c(3, 2, -4)
names(params) <- c("Intercept", "TreatA", "ER+")
noise <- 1
X <- model.matrix(~Treatment + ER)
Y <- X %*% params
errors <- rnorm(length(Y), 0, 1)
Y <- Y + errors

Now we can compute the estimates of our parameters: We need to use specific functions for matrix algebra, like: Transpose: t(X) Multiply matrices: %*% Inverse of a matrix: solve(X)

beta.pars <- solve(t(X) %*% X) %*% t(X) %*% Y

However, we can use the linear model function in R, lm() and inspect the results of the fit, compute confidence intervals, etc.:

m <- lm(Y ~ Treatment + ER)
summary(m)

Call:
lm(formula = Y ~ Treatment + ER)

Residuals:
       1        2        3        4        5        6        7        8 
-1.09088  1.47036  0.03393 -0.41342  0.22952 -0.32015  0.82742 -0.73679 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)   
(Intercept)            3.2550     0.6066   5.366  0.00303 **
TreatmentTreatment A   1.8778     0.7005   2.681  0.04378 * 
ER+                   -3.8231     0.7005  -5.458  0.00281 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9906 on 5 degrees of freedom
Multiple R-squared:  0.8809,    Adjusted R-squared:  0.8332 
F-statistic: 18.49 on 2 and 5 DF,  p-value: 0.004897
confint(m)
                           2.5 %    97.5 %
(Intercept)           1.69558831  4.814381
TreatmentTreatment A  0.07715716  3.678429
ER+                  -5.62371468 -2.022443

Large Scale Hypothesis testing: FDR

When we are doing thousands of tests for differential expression, the overall significance level of a test is very difficult to control. Let’s see why: First, we simulate 40,000 genes not differentially expressed (with a mean of zero). We assume that we have 10 replicates of this experiment:

N <- 40000
R <- 10
X <- matrix(rnorm(N* R, 0, 1), nrow=N)

Now we assume that we run a t-test under the null hypothesis that the mean is zero for each of these genes, that is each row in the matrix:

t.test(X[1,])$p.value
[1] 0.5803333
pvals <- apply(X, 1, function(y) t.test(y)$p.value)

Because we have generated this data with mean zero, we know that none of these genes are differentially expressed, so we would like to be able to not reject any of the hypothesis. However, if you choose a significance level of 0.05 we get

sum(pvals<0.05)
[1] 2064

Too many rejections!!! In fact, if we look at the distributions of the p-values obtained we get:

hist(pvals)

That is, if the null hypothesis is true, the p-values will follow a uniform distribution. This is the key to all methods that aim to control the proportion of false positives amongs the genes that we call differentially expressed. Let’s add 1000 genes to our set that are really differentially expressed (mean of 1):

df <- 1000
Y <- matrix(rnorm(df* R, 1, 1), nrow=df)
Z <- rbind(X, Y)
pvals <- apply(Z, 1, function(y) t.test(y)$p.value)

Let’s look at the distribution of p-values now:

hist(pvals)

What would be the number of false positives now? How many would we expect if we reject p-values samller than our significance level, 0.05?

exp.sig<- (nrow(Z))*0.05
obs.sig <- sum(pvals<0.05)
FDR <- exp.sig / obs.sig
FDR
[1] 0.7167832

We can compare this with the Benjamini-Hochberg method:

pvals.adj <- p.adjust(pvals, method="BH")
plot(pvals, pvals.adj)
abline(v=0.05, col=2)

LS0tCnRpdGxlOiAiUk5BLXNlcSBhbmFseXNpcyBpbiBSIgphdXRob3I6ICJPc2NhciIKZGF0ZTogJ2ByIGZvcm1hdChTeXMudGltZSgpLCAiTGFzdCBtb2RpZmllZDogJWQgJWIgJVkiKWAnCm91dHB1dDoKICBodG1sX25vdGVib29rOgogICAgdG9jOiB5ZXMKICAgIHRvY19mbG9hdDogeWVzCiAgaHRtbF9kb2N1bWVudDoKICAgIHRvYzogeWVzCiAgICB0b2NfZmxvYXQ6IHllcwptaW51dGVzOiAzMDAKbGF5b3V0OiBwYWdlCnN1YnRpdGxlOiBMaW5lYXIgTW9kZWxzCmJpYmxpb2dyYXBoeTogcmVmLmJpYgotLS0KCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUpCmBgYAoqKk9yaWdpbmFsIEF1dGhvcnM6IEJlbGluZGEgUGhpcHNvbiwgQW5uYSBUcmlnb3MsIE1hdHQgUml0Y2hpZSwgTWFyaWEgRG95bGUsIEhhcnJpZXQgRGFzaG5vdywgQ2hhcml0eSBMYXcqKgoKCkJhc2VkIG9uIHRoZSBjb3Vyc2UgW1JOQXNlcSBhbmFseXNpcyBpbiBSXShodHRwOi8vY29tYmluZS1hdXN0cmFsaWEuZ2l0aHViLmlvLzIwMTYtMDUtMTEtUk5Bc2VxLykgZGVsaXZlcmVkIG9uIE1heSAxMS8xMnRoIDIwMTYKCiMjIFNpbXVsYXRpbmcgbGluZWFyIG1vZGVscwoKV2UgYXJlIGdvaW5nIHRvIHNpbXVsYXRlIGFuIGV4cGVyaW1lbnQgd2l0aCAyIGZhY3RvcnM6CgpgYGB7ciBwcmVzc3VyZSwgZWNobz1UUlVFfQpUcmVhdG1lbnQgPC0gcmVwKGMoIlRyZWF0bWVudCBBIiwgIkNvbnRyb2wiKSwgMykKWCA8LSBjYmluZChjKDEsMCwxLDAsMSwwKSwgYygwLDEsMCwgMSwgMCwgMSkpClRyZWF0bWVudCA8LSBmYWN0b3IoVHJlYXRtZW50LCBsZXZlbHM9YygiVHJlYXRtZW50IEEiLCAiQ29udHJvbCIpKQptb2RlbC5tYXRyaXgoflRyZWF0bWVudCkKbW9kZWwubWF0cml4KH5UcmVhdG1lbnQgLTEpCgpUcmVhdG1lbnQgPC0gcmVwKGMoIlRyZWF0bWVudCBBIiwgIk5vIFRyZWF0bWVudCIpLCA0KQpFUiA8LSByZXAoYygiKyIsICItIiksIGMoNCw0KSkKCmBgYApXZSBjYW4gb2J0YWluIHRoZSBkZXNpZ24gbWF0cml4IGZvciBhIG1vZGVsIHdpdGhvdXQgYW5kIHdpdGggaW50ZXJhY3Rpb24gd2l0aCB0aGUgZm9sbG93aW5nIGNvbW1hbmRzOgpgYGB7cn0KbW9kZWwubWF0cml4KH5UcmVhdG1lbnQgKyBFUikKbW9kZWwubWF0cml4KH5UcmVhdG1lbnQgKiBFUikKYGBgCk5vdyB3ZSB3aWxsIHNpbXVsYXRlIGEgbW9kZWwgd2l0aG91dCBpbnRlcmFjdGlvbi4gV2UgbmVlZCB0byBwcm92aWRlIGEgc2V0IG9mIHRoZW9yZXRpY2FsIHZhcmlhYmxlIGZvciBvdXIgcGFyYW1ldGVycywgYW5kIGEgbGV2ZWwgb2Ygbm9pc2UgaW4gb3VyIG1vZGVsOgpgYGB7cn0KcGFyYW1zIDwtIGMoMywgMiwgLTQpCm5hbWVzKHBhcmFtcykgPC0gYygiSW50ZXJjZXB0IiwgIlRyZWF0QSIsICJFUisiKQpub2lzZSA8LSAxClggPC0gbW9kZWwubWF0cml4KH5UcmVhdG1lbnQgKyBFUikKWSA8LSBYICUqJSBwYXJhbXMKZXJyb3JzIDwtIHJub3JtKGxlbmd0aChZKSwgMCwgMSkKWSA8LSBZICsgZXJyb3JzCmBgYApOb3cgd2UgY2FuIGNvbXB1dGUgdGhlIGVzdGltYXRlcyBvZiBvdXIgcGFyYW1ldGVyczogV2UgbmVlZCB0byB1c2Ugc3BlY2lmaWMgZnVuY3Rpb25zIGZvciBtYXRyaXggYWxnZWJyYSwgbGlrZTogClRyYW5zcG9zZTogdChYKQpNdWx0aXBseSBtYXRyaWNlczogJSolCkludmVyc2Ugb2YgYSBtYXRyaXg6IHNvbHZlKFgpICAKYGBge3J9CmJldGEucGFycyA8LSBzb2x2ZSh0KFgpICUqJSBYKSAlKiUgdChYKSAlKiUgWQpgYGAKSG93ZXZlciwgd2UgY2FuIHVzZSB0aGUgbGluZWFyIG1vZGVsIGZ1bmN0aW9uIGluIFIsIGxtKCkgYW5kIGluc3BlY3QgdGhlIHJlc3VsdHMgb2YgdGhlIGZpdCwgY29tcHV0ZSBjb25maWRlbmNlIGludGVydmFscywgZXRjLjoKYGBge3J9Cm0gPC0gbG0oWSB+IFRyZWF0bWVudCArIEVSKQpzdW1tYXJ5KG0pCmNvbmZpbnQobSkKYGBgCiMjIExhcmdlIFNjYWxlIEh5cG90aGVzaXMgdGVzdGluZzogRkRSCgpXaGVuIHdlIGFyZSBkb2luZyB0aG91c2FuZHMgb2YgdGVzdHMgZm9yIGRpZmZlcmVudGlhbCBleHByZXNzaW9uLCB0aGUgb3ZlcmFsbCBzaWduaWZpY2FuY2UgbGV2ZWwgb2YgYSB0ZXN0IGlzIHZlcnkgZGlmZmljdWx0IHRvIGNvbnRyb2wuIExldCdzIHNlZSB3aHk6CkZpcnN0LCB3ZSBzaW11bGF0ZSA0MCwwMDAgZ2VuZXMgbm90IGRpZmZlcmVudGlhbGx5IGV4cHJlc3NlZCAod2l0aCBhIG1lYW4gb2YgemVybykuIFdlIGFzc3VtZSB0aGF0IHdlIGhhdmUgMTAgcmVwbGljYXRlcyBvZiB0aGlzIGV4cGVyaW1lbnQ6CmBgYHtyfQoKTiA8LSA0MDAwMApSIDwtIDEwClggPC0gbWF0cml4KHJub3JtKE4qIFIsIDAsIDEpLCBucm93PU4pCmBgYApOb3cgd2UgYXNzdW1lIHRoYXQgd2UgcnVuIGEgdC10ZXN0IHVuZGVyIHRoZSBudWxsIGh5cG90aGVzaXMgdGhhdCB0aGUgbWVhbiBpcyB6ZXJvIGZvciBlYWNoIG9mIHRoZXNlIGdlbmVzLCB0aGF0IGlzIGVhY2ggcm93IGluIHRoZSBtYXRyaXg6CmBgYHtyfQp0LnRlc3QoWFsxLF0pJHAudmFsdWUKcHZhbHMgPC0gYXBwbHkoWCwgMSwgZnVuY3Rpb24oeSkgdC50ZXN0KHkpJHAudmFsdWUpCmBgYApCZWNhdXNlIHdlIGhhdmUgZ2VuZXJhdGVkIHRoaXMgZGF0YSB3aXRoIG1lYW4gemVybywgd2Uga25vdyB0aGF0IG5vbmUgb2YgdGhlc2UgZ2VuZXMgYXJlIGRpZmZlcmVudGlhbGx5IGV4cHJlc3NlZCwgc28gd2Ugd291bGQgbGlrZSB0byBiZSBhYmxlIHRvIG5vdCByZWplY3QgYW55IG9mIHRoZSBoeXBvdGhlc2lzLiBIb3dldmVyLCBpZiB5b3UgY2hvb3NlIGEgc2lnbmlmaWNhbmNlIGxldmVsIG9mIDAuMDUgd2UgZ2V0IApgYGB7cn0Kc3VtKHB2YWxzPDAuMDUpCmBgYApUb28gbWFueSByZWplY3Rpb25zISEhCkluIGZhY3QsIGlmIHdlIGxvb2sgYXQgdGhlIGRpc3RyaWJ1dGlvbnMgb2YgdGhlIHAtdmFsdWVzIG9idGFpbmVkIHdlIGdldDoKYGBge3J9Cmhpc3QocHZhbHMpCmBgYAoKClRoYXQgaXMsIGlmIHRoZSBudWxsIGh5cG90aGVzaXMgaXMgdHJ1ZSwgdGhlIHAtdmFsdWVzIHdpbGwgZm9sbG93IGEgdW5pZm9ybSBkaXN0cmlidXRpb24uClRoaXMgaXMgdGhlIGtleSB0byBhbGwgbWV0aG9kcyB0aGF0IGFpbSB0byBjb250cm9sIHRoZSBwcm9wb3J0aW9uIG9mIGZhbHNlIHBvc2l0aXZlcyBhbW9uZ3MgdGhlIGdlbmVzIHRoYXQgd2UgY2FsbCBkaWZmZXJlbnRpYWxseSBleHByZXNzZWQuIExldCdzIGFkZCAxMDAwIGdlbmVzIHRvIG91ciBzZXQgdGhhdCBhcmUgcmVhbGx5IGRpZmZlcmVudGlhbGx5IGV4cHJlc3NlZCAobWVhbiBvZiAxKToKYGBge3J9CmRmIDwtIDEwMDAKWSA8LSBtYXRyaXgocm5vcm0oZGYqIFIsIDEsIDEpLCBucm93PWRmKQpaIDwtIHJiaW5kKFgsIFkpCnB2YWxzIDwtIGFwcGx5KFosIDEsIGZ1bmN0aW9uKHkpIHQudGVzdCh5KSRwLnZhbHVlKQpgYGAKTGV0J3MgbG9vayBhdCB0aGUgZGlzdHJpYnV0aW9uIG9mIHAtdmFsdWVzIG5vdzoKYGBge3J9Cmhpc3QocHZhbHMpCmBgYAoKCldoYXQgd291bGQgYmUgdGhlIG51bWJlciBvZiBmYWxzZSBwb3NpdGl2ZXMgbm93PyBIb3cgbWFueSB3b3VsZCB3ZSBleHBlY3QgaWYgd2UgcmVqZWN0IHAtdmFsdWVzIHNhbWxsZXIgdGhhbiBvdXIgc2lnbmlmaWNhbmNlIGxldmVsLCAwLjA1PwpgYGB7cn0KZXhwLnNpZzwtIChucm93KFopKSowLjA1Cm9icy5zaWcgPC0gc3VtKHB2YWxzPDAuMDUpCkZEUiA8LSBleHAuc2lnIC8gb2JzLnNpZwpGRFIKYGBgCldlIGNhbiBjb21wYXJlIHRoaXMgd2l0aCB0aGUgQmVuamFtaW5pLUhvY2hiZXJnIG1ldGhvZDoKYGBge3J9CnB2YWxzLmFkaiA8LSBwLmFkanVzdChwdmFscywgbWV0aG9kPSJCSCIpCnBsb3QocHZhbHMsIHB2YWxzLmFkaikKYWJsaW5lKHY9MC4wNSwgY29sPTIpCmBgYAo=