About R

If you haven’t learned the basics of R prior to attending this course, you should check out our R crash course for an overview of R’s syntax. It’s also a great refresher if you feel it has been a while since you last worked with R.

About the practicals for this workshop

  • The traditional way to enter R commands is by opening a terminal, or by using the console in RStudio (the bottom-left panel when RStudio opens for the first time).
  • For this course we will instead be using a relatively new feature called R Notebooks.
  • An R notebook mixes plain text written in markdown with “chunks” of R code.

Markdown is a very simple way of writing a template to produce a PDF, HTML or Word document. For example, the compiled version of this document is available online and is more convenient to browse here.

  • “Chunks” of R code can be added using the insert option from the toolbar, or the CTRL + ALT + I shortcut
  • Each line of R can be executed by clicking on the line and pressing CTRL + ENTER
  • Or you can execute the whole chunk by pressing CTRL + SHIFT + ENTER
  • Or you can press the green triangle on the right-hand side to run everything in the chunk
  • The code might have different options which dictate how the output is displayed in the compiled document (e.g. HTML)
    • e.g. you might see eval = FALSE or echo = FALSE
    • you don’t have to worry about this if stepping through the markdown interactively
print("Hello World")
[1] "Hello World"

When viewing the R notebooks directly, not the compiled documents, sections may have additional characters so as to format them nicely when compiled. For example:

This will be displayed in italic

This will be displayed in bold

  • this
  • is
  • a
  • list
    • this is a sub-list

You can also add hyperlinks, images and tables.

Lastly, you can even embed chunks of code written in other programming languages.

More help is available through RStudio Help -> Markdown Quick Reference or you can view a cheat sheet here.

To create markdown files for your own analysis, use File -> New File -> R Markdown…

About the Bioconductor project

Established in 2001, Bioconductor provides a convenient method to distribute tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, Bioconductor now has packages (read: software) to process data obtained from most modern data sources.

  • R is rarely used for the primary processing of modern data
    • R is often much slower than compiled programming languages because it is an interpreted language (Interpreted vs Compiled)
    • R is extensively used for visualisation, interpretation and inference once data have been parsed into a more manageable form, e.g., a CSV file.
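
As a minimal sketch of that last point, the following assumes a small, made-up table of gene counts (the gene names, sample names and values are all hypothetical) that some upstream tool has written to a CSV file. It reads the file back into R and produces a quick summary, using base R only.

```r
# Hypothetical summarised data, standing in for the output of an upstream tool
counts <- data.frame(
  gene    = c("geneA", "geneB", "geneC"),
  sample1 = c(10, 0, 25),
  sample2 = c(12, 3, 30)
)

# Write to a temporary CSV file, then read it back as we would a real file
csv_file <- tempfile(fileext = ".csv")
write.csv(counts, csv_file, row.names = FALSE)
parsed <- read.csv(csv_file)

# Inspect the structure and compute a per-sample total
str(parsed)
colSums(parsed[, c("sample1", "sample2")])
```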

On the Bioconductor website, you will find

For this session, we will introduce the Bioconductor project as a means of analysing high-throughput data.

Installing a package

All Bioconductor software packages are listed under

  • bioconductor.org -> Install -> Packages -> Analysis software packages
    • Many thousands of packages have been added over the years, so I would suggest just googling “bioconductor [package_name]”
    • e.g. edgeR landing page
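  • installation instructions are given, which involve running the BiocManager::install command (Bioconductor releases before 3.8 used the biocLite command instead)
    • this will install and update any additional dependencies
  • you only need to run this procedure once for each version of R
## You don't need to run this, edgeR should already be installed for the course
## Install the BiocManager helper from CRAN if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("edgeR")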

Once installed, a Bioconductor package can be loaded in the usual way with the library function. All packages are required to have a vignette which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as edgeR have very comprehensive user guides with lots of use-cases.

library(edgeR)
vignette("edgeR")
edgeRUsersGuide()

Package documentation can also be accessed via the Help tab in RStudio, or from the console using “?”.

?edgeR
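
The sketch below shows one way to check that a package is available before loading it, and to list the vignettes it ships with. It relies on base R only; the helper name `has_package` is made up for illustration and is not part of R or Bioconductor.

```r
# Hypothetical helper: TRUE if a package is installed, without attaching it
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (has_package("edgeR")) {
  # List the vignettes shipped with the package
  vignette_info <- vignette(package = "edgeR")$results
  print(vignette_info[, c("Item", "Title"), drop = FALSE])
} else {
  message("edgeR is not installed; see the installation instructions above")
}
```

Checking with `requireNamespace` rather than `library` avoids an error that would stop a script part-way through when the package is missing.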

Structures for data analysis

Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available through Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.

  • the data are from a microarray experiment. We will be concentrating on more modern technologies in this class, but most of the R techniques required will be similar
  • experimental data packages are available through Bioconductor, and can be installed in the way we just described
    • the package should already be installed on your computer, so you won’t need to run this.
## No need to run this - for reference only!
## (BiocManager::install replaces the older biocLite command)
BiocManager::install("breastCancerVDX")

To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for breastCancerVDX in RStudio, we find that it provides an object called vdx which we load into R’s memory using the data function.

library(breastCancerVDX)
data(vdx)
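
If you are unsure which objects a data package provides, `data(package = ...)` will list them. The sketch below uses the `datasets` package, which ships with every R installation, as a stand-in for a data package such as breastCancerVDX. The environment name `sandbox` is arbitrary.

```r
# List the datasets provided by a package; "datasets" ships with R and is
# used here as a stand-in for a data package such as breastCancerVDX
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Load one dataset by name into a separate environment, then inspect it
sandbox <- new.env()
data("iris", package = "datasets", envir = sandbox)
ls(sandbox)
dim(sandbox$iris)
```

Loading into a separate environment is optional; by default `data` places the object in your workspace, which is what the course code does.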

The object vdx is a representation of a breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don’t envisage that we will want to view the entire dataset at once, so they have provided a number of ways to interact with the data:

  • typing the name of the object provides a summary, e.g.,
    • how many genes in the dataset
    • how many samples
vdx
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22283 features, 344 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
  fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 17420468 
Annotation: hgu133a 
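
To get a feel for how such an object is put together, here is a minimal sketch that builds a toy object of the same class from a small matrix and a table of sample annotation. All probe names, sample names and values are made up. It assumes the Biobase package (which defines the ExpressionSet class) is installed, and skips the construction step otherwise.

```r
# Toy expression matrix: 3 probes (rows) x 2 samples (columns); all values
# and names are made up for illustration
toy_exprs <- matrix(
  c(7.1, 8.4, 5.2,
    7.3, 8.1, 5.9),
  nrow = 3,
  dimnames = list(c("probe1", "probe2", "probe3"), c("S1", "S2"))
)

# Sample annotation; row names must match the matrix column names
toy_pheno <- data.frame(status = c(1, 0), row.names = c("S1", "S2"))

if (requireNamespace("Biobase", quietly = TRUE)) {
  toy_set <- Biobase::ExpressionSet(
    assayData = toy_exprs,
    phenoData = Biobase::AnnotatedDataFrame(toy_pheno)
  )
  print(toy_set)                  # summary, as when typing vdx
  print(dim(toy_set))             # number of features and samples
  print(Biobase::pData(toy_set))  # sample annotation table
}
```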

Accessing expression values

The expression values can be obtained using the exprs function:

  • remember, <- is used for assignment to create a new variable
  • the data are stored in a matrix in R
    • it is a good idea to check the dimensions using dim, ncol, nrow etc.
eValues <- exprs(vdx) # also found at vdx@assayData$exprs
class(eValues)
[1] "matrix"
dim(eValues)
[1] 22283   344
ncol(eValues)
[1] 344
nrow(eValues)
[1] 22283
  • the row names are the manufacturer-assigned ID for a particular probe
  • the column names are the identifiers for each patient in the study
  • each entry is a normalised \(\log_2\) intensity value for a particular gene in a given sample
    • we won’t talk about normalisation here, but basically the data has been transformed so that samples and/or genes can be compared
  • subsetting a matrix is done using the [row, column] notation
    • the function c is used to make a one-dimensional vector
    • the shortcut : can be used to stand for a sequence of consecutive numbers
eValues[c(1,2,3),c(1,2,3,4)]
              VDX_3     VDX_5     VDX_6     VDX_7
1007_s_at 11.965135 11.798593 11.777625 11.538577
1053_at    7.895424  7.885696  7.949535  7.481396
117_at     8.259272  7.052025  8.225930  8.382408
eValues[1:3,1:4]
              VDX_3     VDX_5     VDX_6     VDX_7
1007_s_at 11.965135 11.798593 11.777625 11.538577
1053_at    7.895424  7.885696  7.949535  7.481396
117_at     8.259272  7.052025  8.225930  8.382408
  • subsetting can be chained together
  • we can also omit certain rows or columns from the output by prefixing the indices with a -
eValues[1:3,1:4][,2:3]
              VDX_5     VDX_6
1007_s_at 11.798593 11.777625
1053_at    7.885696  7.949535
117_at     7.052025  8.225930
eValues[1:3,1:4][,-(2:3)]
              VDX_3     VDX_7
1007_s_at 11.965135 11.538577
1053_at    7.895424  7.481396
117_at     8.259272  8.382408
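
Besides numeric positions, rows and columns can be selected by name or by a logical test. The sketch below illustrates this on a small made-up matrix `m` standing in for eValues; the row and column names are hypothetical.

```r
# Toy matrix standing in for eValues; names and values are made up
m <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("p1", "p2", "p3"), c("A", "B", "C", "D"))
)

# Select by row and column name rather than by position
by_name <- m[c("p1", "p3"), c("B", "D")]

# Select rows with a logical test: rows whose mean exceeds 6
high <- m[rowMeans(m) > 6, ]

# A single row collapses to a vector unless drop = FALSE is given
as_vector <- m["p2", ]
as_matrix <- m["p2", , drop = FALSE]
```

Selecting by name is usually safer than selecting by position, because the result does not change if the rows or columns are reordered.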

Simple visualisations

The most basic plotting function in R is plot

  • using the plot function with a vector will plot the values of that vector against the index
    • what do you think is displayed in the plot below?
plot(eValues[1,])

  • one possible use is to compare the values in a vector with respect to a given factor
  • but we don’t know the clinical variables in our dataset yet (to come later)
  • a boxplot can also accept a matrix or data frame as an argument
  • what do you think the following plot shows?
boxplot(eValues,outline=FALSE)
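
To illustrate comparing values with respect to a factor, here is a minimal sketch using simulated data. Both the values and the two-level grouping are made up, since the clinical variables of the real dataset have not been introduced yet.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated log2 intensities for one probe across 20 samples (made up)
values <- c(rnorm(10, mean = 8), rnorm(10, mean = 10))

# Hypothetical two-level grouping of the same 20 samples
group <- factor(rep(c("groupA", "groupB"), each = 10))

# One box per level of the factor; the statistics are returned invisibly
bp <- boxplot(values ~ group, xlab = "Group", ylab = "log2 intensity")
bp$n  # number of samples in each box
```

The formula `values ~ group` reads as "values split by group", and the same notation is used by many other R plotting and modelling functions.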