If you haven’t learned the basics of R prior to attending this course, you should check out our R crash course for an overview of R’s syntax. It’s also a great refresher if you feel it has been a while since you last worked with R.
Markdown is a very simple way of writing a template to produce a pdf, HTML or word document. For example, the compiled version of this document is available online and is more convenient to browse here.
EVAL = FALSE
or echo = FALSE
print("Hello World")
## [1] "Hello World"
When viewing the R notebooks directly, not the compiled documents, sections may have additional characters so as to format them nicely when compiled. This is markdown, a simple language that formats the text whle, crucially, being still readable in the raw state.
For example:
This will be displayed in italic
This will be displayed in bold
You can also add hyperlinks, images and tables.
Lastly, you can even embed chunks of code written in other programming languages.
a='Wow python'
print(a.split())
## ['Wow', 'python']
More help is available through RStudio Help -> Markdown Quick Reference or you can view a cheat sheet here.
To create markdown files for your own analysis; File -> New File -> R Markdown…
Established in 2001, Bioconductor provided a convenient method to distribute tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, Bioconductor now has packages (read: software) to process data obtained from most modern data sources.
On the Bioconductor website, you will find
For this session, we will introduce the Bioconductor project as a means of analysing high-throughput data
All Bioconductor software packages are listed under
install
command from the BiocManager package
BiocManager::install
forgoes needing to load the package manager into the work enviroment## You don't need to run this, edgeR should already be installed for the course
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
Once installed, a Bioconductor package can be loaded in the usual way with the library
function. All packages are required to have a vignette which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as edgeR
have very comprehensive user guides with lots of use-cases.
library(edgeR)
vignette("edgeR")
edgeRUsersGuide()
Package documentation can also be accessed via the Help tab in RStudio, which can also be invoked in the console using “?”
?edgeR
Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available via Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.
## No need to run this - for reference only!
BiocManager::install("breastCancerVDX")
To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for breastCancerVDX
in RStudio, we find that it provides an object called vdx
which we load into R’s memory using the data
function.
library(breastCancerVDX)
data(vdx)
The object vdx
is a representation of breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don’t envisage that we will want to view the entire dataset at once, so have provided a number of ways to interact with the data
vdx
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 344 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283
## total)
## fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 17420468
## Annotation: hgu133a
The expression values can be obtained by the exprs
function:
<-
is used for assignment to create a new variablematrix
in R
dim
, ncol
, nrow
etc.eValues <- exprs(vdx) # also found at vdx@assayData$exprs
class(eValues)
## [1] "matrix"
dim(eValues)
## [1] 22283 344
ncol(eValues)
## [1] 344
nrow(eValues)
## [1] 22283
[row, column]
notation
c
is used to make a one-dimensional vector:
can used to stand for a sequence of consecutive numberseValues[c(1,2,3),c(1,2,3,4)]
## VDX_3 VDX_5 VDX_6 VDX_7
## 1007_s_at 11.965135 11.798593 11.777625 11.538577
## 1053_at 7.895424 7.885696 7.949535 7.481396
## 117_at 8.259272 7.052025 8.225930 8.382408
eValues[1:3,1:4]
## VDX_3 VDX_5 VDX_6 VDX_7
## 1007_s_at 11.965135 11.798593 11.777625 11.538577
## 1053_at 7.895424 7.885696 7.949535 7.481396
## 117_at 8.259272 7.052025 8.225930 8.382408
-
eValues[1:3,1:4][,2:3]
## VDX_5 VDX_6
## 1007_s_at 11.798593 11.777625
## 1053_at 7.885696 7.949535
## 117_at 7.052025 8.225930
eValues[1:3,1:4][,-(2:3)]
## VDX_3 VDX_7
## 1007_s_at 11.965135 11.538577
## 1053_at 7.895424 7.481396
## 117_at 8.259272 8.382408
The most basic plotting function in R is plot
plot
function with a vector will plot the values of that vector against the index
plot(eValues[1,])
boxplot(eValues,outline=FALSE)
The metadata, or phenotypic data, for the samples is retrieved using the pData
function.
metadata <- pData(vdx) #vdx@phenoData@data
head(metadata)
head(metadata,10)
$
notation.$
, it should give you a list of possible options.head(metadata$samplename,15)
## [1] "VDX_3" "VDX_5" "VDX_6" "VDX_7" "VDX_8" "VDX_9" "VDX_11"
## [8] "VDX_14" "VDX_15" "VDX_17" "VDX_18" "VDX_19" "VDX_20" "VDX_21"
## [15] "VDX_22"
[]
to print
mean
function
#FILL IN SOLUTIONS HERE
So far we have been able to print out a subset of our data by specifying a set of numeric indices (e.g. first 10 rows etc). Lets say we’re interested in high-grade tumours, in which case we might not know in advance which rows these correspond to
==
>
, <
, !=
can used to make a logical comparisonTRUE
or FALSE
indicating whether each entry satisfies the test condition, or not.
NA
the resulting logical vector will have NA
at the same positionstable(metadata$grade == 3)
##
## FALSE TRUE
## 49 148
table(metadata$grade == 3,useNA="ifany")
##
## FALSE TRUE <NA>
## 49 148 147
Such a logical vector can then be used for subsetting
which
can be used to make sure there are no NA
values in the logical vectors
TRUE
[]
metadata[which(metadata$grade == 3),]
Can use same expression to subset the columns of the expression matrix
head(colnames(eValues))
head(rownames(metadata))
table(colnames(eValues) == rownames(metadata))
eValues[,which(metadata$grade==3)][1:10,1:10]
## VDX_5 VDX_6 VDX_7 VDX_9 VDX_11 VDX_14
## 1007_s_at 11.798593 11.777625 11.538577 12.248698 12.259243 11.695620
## 1053_at 7.885696 7.949535 7.481396 7.708049 8.182891 7.970394
## 117_at 7.052025 8.225930 8.382408 7.731319 8.226412 7.710118
## 121_at 10.666845 10.888819 10.795472 10.886687 10.192909 11.275077
## 1255_g_at 5.452859 6.456149 6.147714 6.551516 6.376777 6.620586
## 1294_at 9.585901 9.422906 9.456970 9.831782 9.404290 9.292552
## 1316_at 7.644577 7.065012 8.121534 7.573647 7.068241 8.377644
## 1320_at 5.066089 4.661065 6.400879 4.153805 3.678072 4.209453
## 1405_i_at 10.043711 9.350276 11.520619 13.012869 9.896030 8.618386
## 1431_at 6.230741 6.263034 5.951868 6.183883 5.741467 6.797013
## VDX_15 VDX_17 VDX_18 VDX_20
## 1007_s_at 11.577193 12.483665 11.899130 12.414157
## 1053_at 6.721099 8.005625 6.815063 7.067165
## 117_at 7.375908 7.488644 7.209453 9.775939
## 121_at 10.614526 10.664625 10.772892 10.623607
## 1255_g_at 6.044394 6.244126 5.311067 6.266787
## 1294_at 8.572511 10.301839 9.430871 8.914684
## 1316_at 7.302867 7.780048 8.203103 7.651769
## 1320_at 4.887525 6.456149 5.385431 6.230741
## 1405_i_at 8.355792 9.875289 7.210428 9.133656
## 1431_at 5.744161 6.407693 6.560715 5.617651
vdx
object by sample subsets if we wishvdx[,which(metadata$grade==3)]
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 148 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: VDX_5 VDX_6 ... VDX_913 (148 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283
## total)
## fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 17420468
## Annotation: hgu133a
Previously, we used a boxplot to visualise the expression levels of all genes in a given sample to look for trends across the dataset. Another use for a boxplot is to visualise the expression level of a particular gene with respect to the sample metadata
$
and use the formula syntax
table
in this case will tell us how many observations of each category are presentfactor
type if requiredfac <- metadata$er
table(fac)
## fac
## 0 1
## 135 209
boxplot(eValues[1,] ~ fac,
xlab="ER Status",
ylab="Expression Level",
col=c("steelblue","goldenrod"))
Performing a test to assess significance follows similar syntax
t.test
is the generic function to perform a t-test, and can be adapted to different circumstances
t.test
for more detailst.test(eValues[1,]~fac)
##
## Welch Two Sample t-test
##
## data: eValues[1, ] by fac
## t = -1.7476, df = 244.15, p-value = 0.08179
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2101311 0.0125555
## sample estimates:
## mean in group 0 mean in group 1
## 11.72515 11.82394
We could be lucky, and the first row in the expression matrix could be our favourite gene of interest! However, this is unlikely to be the case and we will have to figure out which row we want to plot
vdx
object; the feature datafData
function to access these datafeatures <- fData(vdx) #try to figure out the location in memory of this data.frame via @ and $
class(features)
## [1] "data.frame"
head(features[,1:5])
colnames(features)
## [1] "probe" "Gene.title"
## [3] "Gene.symbol" "Gene.ID"
## [5] "EntrezGene.ID" "UniGene.title"
## [7] "UniGene.symbol" "UniGene.ID"
## [9] "Nucleotide.Title" "GI"
## [11] "GenBank.Accession" "Platform_CLONEID"
## [13] "Platform_ORF" "Platform_SPOTID"
## [15] "Chromosome.location" "Chromosome.annotation"
## [17] "GO.Function" "GO.Process"
## [19] "GO.Component" "GO.Function.1"
## [21] "GO.Process.1" "GO.Component.1"
As we know, gene symbols (more-common gene names) can be accessed using the $
syntax; returning a vector
head(features$Gene.symbol)
## [1] "DDR1" "RFC2" "HSPA6" "PAX8" "GUCA1A" "UBA7"
We could inspect the data frame manually (e.g. using View(features)
) and identify the row number corresponding to our gene of interest. However, as aspiring R programmers, there is a better way
==
to test for exact matchingmatch
will return the index of the first matchgrep
can be used for partial matcheswhich(features$Gene.symbol == "BRCA1")
## [1] 4058 11245
match("BRCA1",features$Gene.symbol)
## [1] 4058
grep("BRCA1",features$Gene.symbol)
## [1] 4058 11245
grep("BRCA",features$Gene.symbol)
## [1] 4058 7869 11245 14103
ESR1
#FILL IN SOLUTIONS HERE
Later in the course we will see how to execute a differential expression analysis for RNA-seq data, and discuss some of the issues surrounding this. For now we will perform a simple two-sample t-test for all genes in our study, and derive a results table
genefilter
package which has a very convenient function for performing many t tests in parallelrowttests
will run a t-test for each row in a given matrix and produce an output table
statistic
; test statisticdm
; difference in meansp.value
; the p-valuerowttests
expects a factor as the second argument, so we have to use as.factor
?rowttests
library(genefilter)
tstats <- rowttests(eValues, as.factor(metadata$er))
head(tstats)
hist(tstats$statistic)
The rows are ordered in the same way as the input matrix
order
functionorder
will return a vector of the same length giving the permutation that rearranges that vector into ascending or descending ordersort
function shortcuts using the output of order
to rearrange the original vector by returning the sorted vectorx <- c(9, 3, 4, 2, 1, 6,5, 10, 8, 7)
x
## [1] 9 3 4 2 1 6 5 10 8 7
order(x)
## [1] 5 4 2 3 7 6 10 9 1 8
x[order(x)]
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
head(tstats[order(tstats$p.value,decreasing = FALSE),])
However, the table we get isn’t immediately useful unless we can recognise the manufacturer probe IDs
cbind
) the columns of test statistic with those from the feature matrixtable(rownames(tstats) == rownames(features))
##
## TRUE
## 22283
tstats.annotated <- cbind(features[,c("Gene.symbol","EntrezGene.ID","Chromosome.location")], tstats)
head(tstats.annotated)
Now when we order by p-value, the extra columns that we just added allow us to interpret the results more easily
tstats.ordered <- tstats.annotated[order(tstats$p.value,decreasing = FALSE),]
head(tstats.ordered)
We can also query this table to look for our favourite genes of interest
%in%
is a simplified way to perform matches to multiple items in a vectortstats.ordered[grep("ESR1",tstats.ordered$Gene.symbol),]
tstats.ordered[tstats.ordered$Gene.symbol %in% c("ESR1","GATA3","FOXA1"),]
csv
file (hint: use write.csv
)
p.adjust
to produce a vector of p-values that are adjusted. Add this as an extra column to your table of results and write as a file#FILL IN SOLUTIONS HERE
The object vdx
, which contains the data we’ve been using, is an ExpressionSet object (check using the class
function). ExpressionSet’s are used as input in a variety of bioinformatic packages and pipelines. As we found above, an ExpressionSet usually contains the following data:
Let’s generate some data and create our own ExpressionSet from scratch - rnorm
generates normally distrubruted values - matrix
takes a vector and arranges it into the specified number of rows and columns - rep
is used to repeat a vector or its elements - sample
samples the input vector, with or without replacement - AnnotatedDataFrame
is a special container that conveniently stores and manipulates data - Remeber that you can use help (?
) to understand unknown functions
eData <- matrix(rnorm(100000),nrow = 1000,ncol = 100)
mData <- AnnotatedDataFrame(data.frame(Sex=rep(c("M","F"),each=50),biomarker.X=rnorm(100)))
fData <- AnnotatedDataFrame(data.frame(Gene=1:1000,Chr=sample(1:23,1000,replace = T)))
exampleSet <- ExpressionSet(
assayData=eData,
phenoData=mData,
featureData=fData
)
p.adjust
)eData[sample(1:1000,1),1:50] <- rnorm(50) + 1
#FILL IN SOLUTIONS HERE