If you haven't done so already, you can check out our R crash course for an overview of why R is great and some of the basic syntax. Here are some great examples of people who have used R to do cool stuff with data:
The BBC makes code available for some of their stories (e.g. gender bias in music festivals)
R's role in promoting reproducible research cannot be overstated.
Two biostatisticians (later termed "Forensic Bioinformaticians") from M.D. Anderson used R extensively during their re-analysis and investigation of a clinical prognostication paper from Duke. The subsequent scandal put reproducible research at the forefront of everyone's mind.
Keith Baggerly's talk on the subject is highly recommended.
About the practicals for this workshop
The traditional way to enter R commands is via the Terminal, or using the console in RStudio (the bottom-left panel when RStudio opens for the first time).
However, for this course we will use a relatively new feature called R-notebooks.
An R-notebook mixes plain text with R code
we use the terms notebook and markdown interchangeably; they are pretty much the same thing
Markdown is a very simple way of writing a template to produce a PDF, HTML or Word document. The compiled version of this document is available online and is more convenient to browse here.
"chunks" of R code can be added using the Insert option from the toolbar, or the CTRL + ALT + I shortcut
Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
Or you can press the green triangle on the right-hand side to run everything in the chunk
Try this now!
The code might have different options which dictate how the output is displayed in the compiled document (e.g. HTML)
e.g. you might see eval = FALSE or echo = FALSE
you don't have to worry about this if stepping through the markdown interactively
print("Hello World")
This will be displayed in italic
This will be displayed in bold
this
is
a
list
this is a sub-list
You can also add hyperlinks, images and tables. More help is available through RStudio Help -> Markdown Quick Reference
To create markdown files for your own analysis: File -> New File -> R Markdown…
About the Bioconductor project
Established in 2001 as a means of distributing tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, but now has many tools for NGS data
primary processing of NGS data is nearly always done in other languages
R is extensively used for visualisation and interpretation once data are in a manageable form
installation instructions are given, which involves running the biocLite command
this will install any additional dependencies
you only need to run this procedure once for each version of R
## You don't need to run this, edgeR should already be installed for the course
source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")
Once installed, a Bioconductor package can be loaded in the usual way with the library function. All packages are required to have a vignette which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as edgeR have very comprehensive user guides with lots of use-cases.
Package documentation can also be accessed via the Help tab in RStudio.
Structures for data analysis
Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available through Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.
the data are from a microarray experiment. We will be concentrating on more modern technologies in this class, but most of the R techniques required will be similar
experimental data packages are available through Bioconductor, and can be installed in the way we just described
the package should already be installed on your computer, so you won't need to run this.
## No need to run this - for reference only!
##
biocLite("breastCancerVDX")
To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for breastCancerVDX in RStudio, we find that it provides an object called vdx which we load into R's memory using the data function.
library(breastCancerVDX)
data(vdx)
The object vdx is a representation of a breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don't envisage that we will want to view the entire dataset at once, so have provided a number of ways to interact with the data
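One of these is the exprs accessor from the Biobase package, which pulls out the matrix of expression values; the chunks below store this matrix as eValues. A sketch for reference only (it assumes breastCancerVDX and Biobase are installed, as they should be for this course):

```r
# Extract the matrix of expression values from the vdx object
# rows = features (probes), columns = samples
library(breastCancerVDX)
library(Biobase)
data(vdx)

eValues <- exprs(vdx)
dim(eValues)  # number of probes and number of samples
```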
typing the name of the object provides a summary
how many genes in the dataset, how many samples etc
we can also omit certain rows or columns from the output by prefixing the indices with a -
eValues[1:3,-(1:4)]
Simple visualisations
The most basic plotting function in R is plot
using the plot function with a vector will plot the values of that vector against the index
what do you think is displayed in the plot below?
plot(eValues[1,])
## Various aspects of the plot can be controlled by specifying additional arguments
one possible use is to compare the values in a vector with respect to a given factor
but we don't know the clinical variables in our dataset yet (to come later)
a boxplot can also accept a matrix or data frame as an argument
what do you think the following plot shows?
boxplot(eValues,outline=FALSE)
Accessing patient data
The metadata, or phenotypic data, for the samples is retrieved using the pData function.
metadata <- pData(vdx)
metadata
individual columns can be accessed using the $ notation.
auto-complete is available in RStudio; once you type the $ it should give you a list of possible options.
the data are returned as a vector, which can be fed into other standard plotting and analysis functions
metadata$samplename
Exercise
what type of R object is used to store the metadata?
why is this different to the object used for the expression values?
use the square bracket notation [] to print
the first 10 rows of the metadata object, first five columns
last 10 rows of the metadata object, columns 7 to 9
visualise the distribution of the patient ages using a histogram
calculate the average age of patients in the study with the mean function
what do you notice about the result?
can you change the arguments to mean to get a more meaningful result
So far we have been able to print out a subset of our data by specifying a set of numeric indices (e.g. the first 10 rows etc.). Let's say we're interested in high-grade tumours, in which case we might not know in advance which rows these correspond to
==, >, <, != can be used to make a logical comparison
the result is a vector of TRUE or FALSE values indicating whether each entry satisfies the test condition, or not.
however, if a particular entry in the vector is NA the resulting logical vector will have NA at the same positions
metadata$grade == 3
metadata$age > 50
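The behaviour with NA values, and how which() deals with them, can be seen in a small self-contained example (a toy vector rather than the vdx metadata):

```r
# A toy 'age' vector with one missing value
age <- c(45, 62, NA, 51)

# Logical comparison: NA entries stay NA in the result
age > 50
# FALSE TRUE NA TRUE

# which() drops the NAs and returns the numeric indices of the TRUEs
which(age > 50)
# 2 4
```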
Such a logical vector can then be used for subsetting
which can be used to make sure there are no NA values in the logical vectors
it gives the numeric indices that correspond to TRUE
here, we don't specify any column indices inside the []
R will print all the columns
however, don't forget the comma!
if you do, R will still try and do something. It almost certainly won't be what you expected
metadata[which(metadata$grade == 3),]
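What happens without the comma can be illustrated with a toy data frame (not the vdx metadata): a single-bracket subset without the comma selects columns, not rows.

```r
# A toy data frame standing in for the metadata
df <- data.frame(x = 1:4, y = letters[1:4], z = c(10, 20, 30, 40))

df[1:2, ]  # rows 1 and 2, all columns -- what we usually want
df[1:2]    # no comma: columns 1 and 2 of ALL rows instead!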
Can use same expression to subset the columns of the expression matrix
why can we do this? Because the columns of the expression matrix are in the same order as the rows of the metadata
don't believe me? see below…
this isn't a coincidence. The data have been carefully curated to ensure that this is the case
data stored in online repositories are often organised in this way
in fact, we can subset the entire vdx object by sample subsets if we wish
vdx[,metadata$grade==3]
Previously, we used a boxplot to visualise the expression levels of all genes in a given sample to look for trends across the dataset. Another use for a boxplot is to visualise the expression level of a particular gene with respect to the sample metadata
we can extract the column of interest with a $ and use the formula syntax
table in this case will tell us how many observations of each category are present
R will be clever and convert the grouping variable into a factor if required
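A minimal self-contained sketch of the formula syntax and table(), using toy values rather than the vdx data:

```r
# Toy expression-like values split by a two-level grouping variable
values <- c(5.1, 4.8, 6.2, 6.0, 5.9, 4.7)
group  <- c("neg", "neg", "pos", "pos", "pos", "neg")

# table() counts the observations in each category
table(group)

# the formula syntax (values ~ group) draws one box per category;
# the character vector is converted to a factor automatically
boxplot(values ~ group)
```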
Performing a test to assess significance follows similar syntax
t.test is the generic function to perform a t-test, and can be adapted to different circumstances
e.g. if our data are paired, or not
see the help for t.test for more details
for now, we will gloss over testing assumptions on the data, such as requiring a normal (Gaussian) distribution, and multiple testing correction
t.test(eValues[1,]~fac)
Welch Two Sample t-test
data: eValues[1, ] by fac
t = -1.7476, df = 244.15, p-value = 0.08179
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2101311 0.0125555
sample estimates:
mean in group 0 mean in group 1
11.72515 11.82394
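The paired argument mentioned above can be sketched with simulated before/after measurements (toy data, not the vdx values):

```r
set.seed(42)
before <- rnorm(20, mean = 10)
after  <- before + rnorm(20, mean = 0.5)  # simulated shift of about 0.5

# Unpaired (the default) vs paired:
# the paired test works on the within-pair differences
unpaired <- t.test(after, before)
paired   <- t.test(after, before, paired = TRUE)

unpaired$p.value
paired$p.value  # often smaller, since pairing removes between-subject noise
```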
Accessing feature (gene) information
We could be lucky, and the first row in the expression matrix could be our favourite gene of interest! However, this is unlikely to be the case and we will have to figure out which row we want to plot
we can use another aspect of the vdx object; the feature data
there is a handy fData function to access these data
again, this gives us a data frame
this is a pre-built table supplied with the dataset
later in the course we will look at using online services and databases for annotation and converting between identifiers
features <- fData(vdx)
class(features)
[1] "data.frame"
features[1:10,]
As we know, gene symbols (the more common gene names) can be accessed using the $ syntax, returning a vector
features$Gene.symbol
We could inspect the data frame manually (e.g. using View(features)) and identify the row number corresponding to our gene of interest. However, as aspiring R programmers, there is a better way
== to test for exact matching
match will return the index of the first match
grep can be used for partial matches
each of the above will give a vector that can be used to subset the expression values
which(features$Gene.symbol == "BRCA1")
[1] 4058 11245
match("BRCA1",features$Gene.symbol)
[1] 4058
grep("BRCA1",features$Gene.symbol)
[1] 4058 11245
grep("BRCA",features$Gene.symbol)
[1] 4058 7869 11245 14103
Exercise
Verify that the rows of the feature matrix and the expression values are in the same order
Find the row corresponding to your favourite gene
if you don't have one, try ESR1
if you find multiple matches, pick the first one that occurs
Does the expression level of this gene appear to be associated with the ER status?
Testing all genes for significance
Later in the course we will see how to execute a differential expression analysis for RNA-seq data, and discuss some of the issues surrounding this. For now we will perform a simple two-sample t-test for all genes in our study, and derive a results table
firstly, we load the genefilter package, which has a very convenient function for performing many t-tests in parallel
rowttests will run a t-test for each row in a given matrix and produce an output table
statistic; test statistic
dm; difference in means
p.value; the p-value
rowttests expects a factor as the second argument, so we have to use as.factor
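What rowttests is doing can be sketched in base R with apply, one t-test per row; this is slower than genefilter's implementation but shows the idea (toy matrix and grouping used here, not the vdx data):

```r
set.seed(1)
# Toy expression matrix: 5 genes (rows) x 10 samples (columns)
mat <- matrix(rnorm(50), nrow = 5)
fac <- as.factor(rep(c("A", "B"), each = 5))

# One t-test per row, collecting the p-values
pvals <- apply(mat, 1, function(row) t.test(row ~ fac)$p.value)
pvals
```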
The rows are ordered in the same way as the input matrix
to change this to increasing significance we can use the order function
when given a vector, order will return a vector of the same length giving the permutation that rearranges that vector into ascending or descending order
how can we find out what the default way of ordering is?
x <- c(9, 3, 4, 2, 1, 6, 5, 10, 8, 7)
x
[1] 9 3 4 2 1 6 5 10 8 7
order(x)
[1] 5 4 2 3 7 6 10 9 1 8
x[order(x)]
[1] 1 2 3 4 5 6 7 8 9 10
so if we want to order by p-value we first use order on the p-value vector
this can then be used to re-arrange the rows of the table
tstats[order(tstats$p.value,decreasing = FALSE),]
note that we could save the result of order as a variable and then use in the subsetting
either is fine, it's a matter of personal preference
From the annotated table above, select all genes with p-values less than 0.05
Write this data frame as a csv file
(OPTIONAL) Use the p.adjust function to produce a vector of adjusted p-values. Add this as an extra column to your table of results and write as a file
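The general pattern for p.adjust and write.csv can be sketched with toy p-values (standing in for the p.value column of your results table):

```r
# Toy p-values standing in for tstats$p.value
pvals <- c(0.001, 0.02, 0.04, 0.3, 0.7)

# Benjamini-Hochberg adjustment; the default method is "holm",
# "BH" controls the false discovery rate instead
adjusted <- p.adjust(pvals, method = "BH")

results <- data.frame(p.value = pvals, adj.p.value = adjusted)
write.csv(results, file = "results.csv", row.names = FALSE)
```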
More advanced visualisation
Clustering
Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space.
it is known as an unsupervised learning method
classes are unknown; we want to discover them from the data
although we can compare to the classes we know about afterwards
We can cluster samples (columns)
e.g. identification of new / unknown tumour classes using gene expression profiles
We can cluster genes (rows)
e.g. using large numbers of yeast experiments to identify groups of co-regulated genes
we can cluster genes to reduce redundancy (i.e. variable selection) in predictive models
it has been used to good effect for many years
The procedure is
Preprocess the data
Choose a dissimilarity measure
Calculate a distance matrix
Choose a cluster algorithm
Cluster the samples
Visualise
Fortunately, there are easy-to-use functions for calculating the distance matrix (dist) and performing the clustering (hclust). The tricky part is choosing the correct subset of the data and interpreting the results.
if you look at the help for dist (?dist) you will see that it takes a numeric matrix as its argument
how you define this matrix is up to you…
also notice that it computes distances between the rows of the data matrix
the most common application of clustering is to discover relationships between samples in our dataset
which are the matrix columns in our case
so we need to transpose our expression values before using this function
?dist
distMat <- dist(t(eValues))
We can see which samples are grouped together by cutting the tree at a particular height, or to give a pre-determined number of groups
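Both ways of cutting the tree are illustrated below with cutree, using a toy matrix with two obvious groups; here the "samples" are already the rows, so no transpose is needed (unlike the vdx expression matrix):

```r
set.seed(7)
# Toy data: 6 'samples' (rows) drawn from two clearly separated groups
m <- rbind(matrix(rnorm(15, mean = 0), nrow = 3),
           matrix(rnorm(15, mean = 5), nrow = 3))
rownames(m) <- paste0("sample", 1:6)

clusters <- hclust(dist(m))
plot(clusters)

# cut the tree at a particular height...
cutree(clusters, h = 4)
# ...or ask for a pre-determined number of groups
cutree(clusters, k = 2)
```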
Let's just go ahead and run the function heatmap
Cool! There seems to be some kind of structure to the data. However, the plot is a little ugly. Fortunately there are plenty of options in the function that we can use to tweak the plot (?heatmap)
we can specify a block of colour to go underneath each sample to indicate whether it belongs to the "ER Positive" or "ER Negative" group
this has to be a vector with the same length as the number of samples
we can build such a vector in two stages
first repeating one colour (e.g. blue) for each sample
then replacing entries corresponding to a particular sample type with a different colour
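The two stages above can be sketched with a toy ER status vector (0/1 coding assumed for illustration; check the actual coding in your metadata):

```r
# Toy ER status vector standing in for the metadata column
er_status <- c(1, 0, 1, 1, 0)

# Stage 1: one colour repeated for every sample
sample_cols <- rep("blue", length(er_status))

# Stage 2: overwrite the entries for ER-positive samples
sample_cols[er_status == 1] <- "yellow"
sample_cols
# "yellow" "blue" "yellow" "yellow" "blue"
```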
The colour scheme for the cells in the heatmap can also be changed. The RColorBrewer package has a number of pre-defined palettes, several of which are suitable for those with colour-blindness (the commonly used red-green scale for heatmaps is clearly a bad choice).
A good choice for a heatmap is a palette which diverges from one extreme to another. If we use the RdBu palette, samples with lower gene expression will be shown in red
the function brewer.pal is used to create the palette with a certain number of colours
Modify the labels on the side of the heatmap so the gene names are displayed rather than probe names
You will need to check the help page of heatmap to see which argument needs to be specified to change the labels
(Optional) Can you remove the samples names on the bottom of the heatmap?
Of course we should not be surprised that our heatmap showed good separation; we used genes that we previously found to be statistically different between samples.
the heatmap function requires a numeric matrix as its argument
it doesn't really care how this matrix was generated
so it's your choice about what genes / samples are present in the matrix
For those who might be interested, here is an overview of the commands one might use to download data from the Gene Expression Omnibus. The GEOquery package is used, and you have to know the accession number of the dataset that you want to analyse. Typically this would be stated in a publication in the form GSE…
## If not installed already, install GEOquery with
## source("http://www.bioconductor.org/biocLite.R")
## biocLite("GEOquery")
library(GEOquery)
mydata <- getGEO("GSE1729")
The mydata object that is created is a list in R. This is used because some studies might involve data generated on different platforms, which have separate annotations.