```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Loading and pre-processing the NKI data

We start by loading the NKI breast cancer microarray data set:

```{r echo=TRUE, message=FALSE, results="hide"}
library(breastCancerNKI)
data(nki)
```
```{r }
nki
```

There are some missing data elements in the data set, which we'll eliminate or 
replace to make life easier.

First we'll get rid of samples for which we don't have survival data:

```{r }
nki$e.dmfs
nki <- nki[,!is.na(nki$e.dmfs)]
ncol(nki)
```

Next we'll impute some missing values. Less represent less than 1% of the
total data so this shouldn't be a problem:

```{r}
library(e1071)
sum(is.na(exprs(nki)))
exprs(nki)=impute(exprs(nki))
```

# Separating Training and Validation sets of samples

```{r}
totalSamples      <- ncol(nki)
validationSamples <- rep(TRUE,totalSamples)
trainingSamples   <- sample(totalSamples,totalSamples * .75)
validationSamples[trainingSamples] <- FALSE
validationSamples <- which(validationSamples)
training   <- nki[,trainingSamples]
validation <- nki[,validationSamples]
```

# Differential Expression Analysis

Next we'll do a differential expression analysis. For this we use the "event"
data which separates the samples into "high" and a "low" risk groups. 
We can use these groups to make a design matrix.

```{r}
SampleGroup <- factor(training$e.dmfs)
design <- model.matrix(~0+SampleGroup)
colnames(design) <- c("Low","High")
```

Next we'll filter out probes with low variability:

```{r}
library(genefilter)
nrow(training)
training <- varFilter(training)
nrow(training)
```

Now we can use the *limma* package to fit a linear model and test our contrast: :

```{r}
library(limma)
fit <- lmFit(exprs(training), design)
contrasts <- makeContrasts(High - Low, levels=design)
fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2)
```

# Extracting the 70-gene Signature

We can add gene symbol annotations before extracting the 70 features with
the lowest FDR:

```{r}
anno <- fData(training)[,"NCBI.gene.symbol"]
sig <-topTable(fit2,genelist=anno,number=70)
head(sig)
```


# Testing the 70-gene Signature

Now that we have a 70-gene signature, we'll need to test it.
We'll test it separately on the training set, and then on the validation set.

First we extract the data we are interested in:

```{r}
ids <- rownames(sig)
sigTrain <- training[(rownames(training) %in% ids),]
trainExpr <- exprs(sigTrain)
sigValid <- validation[(rownames(validation) %in% ids),]
validExpr <- exprs(sigValid)
```

## Generating Low and High Risk expression templates

Now we need a way to map the signature to a predicted outcome class.
We create a "template" expression vector for the Low and High risk groups
by taking the mean value for each gene amongst samples in that group:

```{r}
riskTrain <- sigTrain$e.dmfs

lowOnly <- sigTrain[,riskTrain==0]
lowTemplate <- rowMeans(exprs(lowOnly))

highOnly <- sigTrain[,riskTrain==1]
highTemplate <- rowMeans(exprs(highOnly))

riskValid <- sigValid$e.dmfs
```

Now we can evaluate each sample by computing how well they are 
correlated to the Low Risk and the High Risk templates:

```{r}
numSamplesTrain <- ncol(sigTrain)

lows  <- rep(0,numSamplesTrain)
highs <- rep(0,numSamplesTrain)
for(i in 1:numSamplesTrain) {
   lows[i]  <- cor(lowTemplate,trainExpr[,i])
   highs[i] <- cor(highTemplate,trainExpr[,i])
}

resultsTrain <- cbind(lows,highs,riskTrain)
colnames(resultsTrain) <- c("Low","High","Risk")
```

## Plotting correlations with the templates

We can plot the the correlation values comparing each sample to each 
of the two vectors:

```{r}
plot(resultsTrain)
```

Showing that the values are themselves negatively correlated,
which is what we expect. 

We can also make some boxplots showing how the correlation
values for each sample are distributed based on their known risk group. 
We can compare how they correlate with the Low Risk template, and then with
the High Risk template:

```{r}
par(mfrow=c(1,2))
boxplot(Low~Risk,resultsTrain,sub="Training Set - Low Risk Template")
boxplot(High~Risk,resultsTrain,sub="Training Set - High Risk Template")
```

We can see that Low Risk samples correlate more highly with the 
Low Risk template, and High Risk Samples correlate more
highly with the High Risk template.

## Validation set

Now let's try to correlate the expression vectors for the signature genes 
for the samples in the validation set.
```{r}
numSamplesValid <- ncol(sigValid)
lows  <- rep(0,numSamplesValid)
highs <- rep(0,numSamplesValid)
for(i in 1:numSamplesValid) {
   lows[i]  <- cor(lowTemplate,validExpr[,i])
   highs[i] <- cor(highTemplate,validExpr[,i])
}
resultsValid <- cbind(lows,highs,riskValid)
colnames(resultsValid) <- c("Low","High","Risk")
par(mfrow=c(1,2))
boxplot(Low~Risk,resultsValid,sub="Validation Set - Low Risk Template")
boxplot(High~Risk,resultsValid,sub="Validation Set - High Risk Template")
```

# Survival Analysis

Now, we want to do some survival analysis. We need to separate
the samples into two risk groups based on the correlation values.
There are many ways to do this. We'll look at two of them

## Classifying samples

First, we can assign each sample to a risk group based on which template
it is more highly correlated with:

```{r}
classifyTrain <- apply(resultsTrain[,1:2],1,which.max)-1
resultsTrain  <- cbind(resultsTrain,classifyTrain)
colnames(resultsTrain)[4] <- "Predicted"
```

We can look at some stats to see how will this does as a classifier:

```{r}
cor(classifyTrain,riskTrain)
correct <- riskTrain == classifyTrain
lowErr  <- riskTrain < classifyTrain
highErr <- riskTrain > classifyTrain
sum(correct)/numSamplesTrain
sum(lowErr)/sum(riskTrain==0)
sum(highErr)/sum(riskTrain==1)
```

Now we can do the same for the samples in the validation set:
```{r}
classifyValid <- apply(resultsValid[,1:2],1,which.max)-1
resultsValid <- cbind(resultsValid,classifyValid)
colnames(resultsValid)[4] <- "Predicted"
cor(classifyValid,riskValid)
correct <- riskValid == classifyValid
lowErr  <- riskValid < classifyValid
highErr <- riskValid > classifyValid
sum(correct)/numSamplesValid
sum(lowErr)/sum(riskValid==0)
sum(highErr)/sum(riskValid==1)
```

## Computing a p-value using ChiSq test

Now we can get a p-value indicating how well the two groups separate
in terms of survival, based on expression of genes in the signature.
We can do this separately for the training set and the validation set.

```{r}
library(survival)
trainDMFS <- sigTrain$t.dmfs
pvalTrain <- pchisq(survdiff(Surv(trainDMFS, riskTrain) ~classifyTrain)$chisq,
               1, lower.tail=FALSE)
pvalTrain

validDMFS <- sigValid$t.dmfs
pvalValid <- pchisq(survdiff(Surv(validDMFS, riskValid) ~classifyValid)$chisq,
               1, lower.tail=FALSE)
pvalValid
```

## Kaplan-Meier Plot

Finally, KM plots show us visually how the samples separate.

```{r}
par(mfrow=c(1,2))
plot(survfit(Surv(trainDMFS, riskTrain) ~ classifyTrain),col=c(4,2),sub=pvalTrain)
plot(survfit(Surv(validDMFS, riskValid) ~ classifyValid),col=c(4,2),sub=pvalValid)
```


#Conclusion

If you have time, some things to try include:
* Try other values for the length of the gene signatures. 
You can try increasing numbers to see what is
the optimal signature length.

* Given a signature of a certain size n, 
the genes don't necessary have to be the top n genes from topTable. 
What other ways can you try to select genes (eg by fold change)

* Try other methods of classifying the samples based on the signature

* Try the signature in another data set