```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Loading and pre-processing the NKI data We start by loading the NKI breast cancer microarray data set: ```{r echo=TRUE, message=FALSE, results="hide"} library(breastCancerNKI) data(nki) ``` ```{r } nki ``` There are some missing data elements in the data set, which we'll eliminate or replace to make life easier. First we'll get rid of samples for which we don't have survival data: ```{r } nki$e.dmfs nki <- nki[,!is.na(nki$e.dmfs)] ncol(nki) ``` Next we'll impute some missing values. Less represent less than 1% of the total data so this shouldn't be a problem: ```{r} library(e1071) sum(is.na(exprs(nki))) exprs(nki)=impute(exprs(nki)) ``` # Separating Training and Validation sets of samples ```{r} totalSamples <- ncol(nki) validationSamples <- rep(TRUE,totalSamples) trainingSamples <- sample(totalSamples,totalSamples * .75) validationSamples[trainingSamples] <- FALSE validationSamples <- which(validationSamples) training <- nki[,trainingSamples] validation <- nki[,validationSamples] ``` # Differential Expression Analysis Next we'll do a differential expression analysis. For this we use the "event" data which separates the samples into "high" and a "low" risk groups. We can use these groups to make a design matrix. ```{r} SampleGroup <- factor(training$e.dmfs) design <- model.matrix(~0+SampleGroup) colnames(design) <- c("Low","High") ``` Next we'll filter out probes with low variability: ```{r} library(genefilter) nrow(training) training <- varFilter(training) nrow(training) ``` Now we can use the *limma* package to fit a linear model and test our contrast: : ```{r} library(limma) fit <- lmFit(exprs(training), design) contrasts <- makeContrasts(High - Low, levels=design) fit2 <- contrasts.fit(fit, contrasts) fit2 <- eBayes(fit2) ``` # Extracting the 70-gene Signature We can add gene symbol annotations before extracting the 70 features with the lowest FDR: ```{r} anno <- fData(training)[,"NCBI.gene.symbol"] sig <-topTable(fit2,genelist=anno,number=70) head(sig) ``` # Testing the 70-gene Signature Now that we have a 70-gene signature, we'll need to test it. We'll test it separately on the training set, and then on the validation set. First we extract the data we are interested in: ```{r} ids <- rownames(sig) sigTrain <- training[(rownames(training) %in% ids),] trainExpr <- exprs(sigTrain) sigValid <- validation[(rownames(validation) %in% ids),] validExpr <- exprs(sigValid) ``` ## Generating Low and High Risk expression templates Now we need a way to map the signature to a predicted outcome class. We create a "template" expression vector for the Low and High risk groups by taking the mean value for each gene amongst samples in that group: ```{r} riskTrain <- sigTrain$e.dmfs lowOnly <- sigTrain[,riskTrain==0] lowTemplate <- rowMeans(exprs(lowOnly)) highOnly <- sigTrain[,riskTrain==1] highTemplate <- rowMeans(exprs(highOnly)) riskValid <- sigValid$e.dmfs ``` Now we can evaluate each sample by computing how well they are correlated to the Low Risk and the High Risk templates: ```{r} numSamplesTrain <- ncol(sigTrain) lows <- rep(0,numSamplesTrain) highs <- rep(0,numSamplesTrain) for(i in 1:numSamplesTrain) { lows[i] <- cor(lowTemplate,trainExpr[,i]) highs[i] <- cor(highTemplate,trainExpr[,i]) } resultsTrain <- cbind(lows,highs,riskTrain) colnames(resultsTrain) <- c("Low","High","Risk") ``` ## Plotting correlations with the templates We can plot the the correlation values comparing each sample to each of the two vectors: ```{r} plot(resultsTrain) ``` Showing that the values are themselves negatively correlated, which is what we expect. We can also make some boxplots showing how the correlation values for each sample are distributed based on their known risk group. We can compare how they correlate with the Low Risk template, and then with the High Risk template: ```{r} par(mfrow=c(1,2)) boxplot(Low~Risk,resultsTrain,sub="Training Set - Low Risk Template") boxplot(High~Risk,resultsTrain,sub="Training Set - High Risk Template") ``` We can see that Low Risk samples correlate more highly with the Low Risk template, and High Risk Samples correlate more highly with the High Risk template. ## Validation set Now let's try to correlate the expression vectors for the signature genes for the samples in the validation set. ```{r} numSamplesValid <- ncol(sigValid) lows <- rep(0,numSamplesValid) highs <- rep(0,numSamplesValid) for(i in 1:numSamplesValid) { lows[i] <- cor(lowTemplate,validExpr[,i]) highs[i] <- cor(highTemplate,validExpr[,i]) } resultsValid <- cbind(lows,highs,riskValid) colnames(resultsValid) <- c("Low","High","Risk") par(mfrow=c(1,2)) boxplot(Low~Risk,resultsValid,sub="Validation Set - Low Risk Template") boxplot(High~Risk,resultsValid,sub="Validation Set - High Risk Template") ``` # Survival Analysis Now, we want to do some survival analysis. We need to separate the samples into two risk groups based on the correlation values. There are many ways to do this. We'll look at two of them ## Classifying samples First, we can assign each sample to a risk group based on which template it is more highly correlated with: ```{r} classifyTrain <- apply(resultsTrain[,1:2],1,which.max)-1 resultsTrain <- cbind(resultsTrain,classifyTrain) colnames(resultsTrain)[4] <- "Predicted" ``` We can look at some stats to see how will this does as a classifier: ```{r} cor(classifyTrain,riskTrain) correct <- riskTrain == classifyTrain lowErr <- riskTrain < classifyTrain highErr <- riskTrain > classifyTrain sum(correct)/numSamplesTrain sum(lowErr)/sum(riskTrain==0) sum(highErr)/sum(riskTrain==1) ``` Now we can do the same for the samples in the validation set: ```{r} classifyValid <- apply(resultsValid[,1:2],1,which.max)-1 resultsValid <- cbind(resultsValid,classifyValid) colnames(resultsValid)[4] <- "Predicted" cor(classifyValid,riskValid) correct <- riskValid == classifyValid lowErr <- riskValid < classifyValid highErr <- riskValid > classifyValid sum(correct)/numSamplesValid sum(lowErr)/sum(riskValid==0) sum(highErr)/sum(riskValid==1) ``` ## Computing a p-value using ChiSq test Now we can get a p-value indicating how well the two groups separate in terms of survival, based on expression of genes in the signature. We can do this separately for the training set and the validation set. ```{r} library(survival) trainDMFS <- sigTrain$t.dmfs pvalTrain <- pchisq(survdiff(Surv(trainDMFS, riskTrain) ~classifyTrain)$chisq, 1, lower.tail=FALSE) pvalTrain validDMFS <- sigValid$t.dmfs pvalValid <- pchisq(survdiff(Surv(validDMFS, riskValid) ~classifyValid)$chisq, 1, lower.tail=FALSE) pvalValid ``` ## Kaplan-Meier Plot Finally, KM plots show us visually how the samples separate. ```{r} par(mfrow=c(1,2)) plot(survfit(Surv(trainDMFS, riskTrain) ~ classifyTrain),col=c(4,2),sub=pvalTrain) plot(survfit(Surv(validDMFS, riskValid) ~ classifyValid),col=c(4,2),sub=pvalValid) ``` #Conclusion If you have time, some things to try include: * Try other values for the length of the gene signatures. You can try increasing numbers to see what is the optimal signature length. * Given a signature of a certain size n, the genes don't necessary have to be the top n genes from topTable. What other ways can you try to select genes (eg by fold change) * Try other methods of classifying the samples based on the signature * Try the signature in another data set