About R

If you haven’t learned the basics of R prior to attending this course, you should check out our R crash course for an overview of R’s syntax. It’s also a great refresher if you feel it has been a while since you last worked with R.

About the practicals for this workshop

  • The traditional way to enter R commands is by opening a terminal, or by using the console in RStudio (the bottom-left panel when RStudio opens for the first time).
  • For this course we will instead be using a relatively new feature called R Notebooks.
  • An R notebook mixes plain text written in markdown with “chunks” of R code.

Markdown is a very simple way of writing a template to produce a PDF, HTML or Word document. For example, the compiled version of this document is available online and is more convenient to browse here.

  • “Chunks” of R code can be added using the insert option from the toolbar, or with the CTRL + ALT + I shortcut
  • Each line of R can be executed by clicking on the line and pressing CTRL + ENTER
  • Or you can execute the whole chunk by pressing CTRL + SHIFT + ENTER
  • Or you can press the green triangle on the right-hand side to run everything in the chunk
  • The code might have different options which dictate how the output is displayed in the compiled document (e.g. HTML)
    • e.g. you might see eval = FALSE or echo = FALSE
    • you don’t have to worry about this if stepping through the markdown interactively
print("Hello World")
## [1] "Hello World"

When viewing the R notebooks directly, rather than the compiled documents, sections may contain additional characters that format them nicely when compiled. This is markdown, a simple language that formats the text while, crucially, remaining readable in its raw state.

For example:

*This will be displayed in italic*

**This will be displayed in bold**

  • this
  • is
  • a
  • list
    • this is a sub-list

You can also add hyperlinks, images and tables.

Lastly, you can even embed chunks of code written in other programming languages.

a='Wow python'
print(a.split())
## ['Wow', 'python']

More help is available through RStudio Help -> Markdown Quick Reference or you can view a cheat sheet here.

To create markdown files for your own analysis, use File -> New File -> R Markdown…

About the Bioconductor project

Established in 2001, Bioconductor provides a convenient way to distribute tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, Bioconductor now has packages (read: software) to process data obtained from most modern data sources.

  • R is rarely used for the primary processing of modern data
    • R is far slower than many other programming languages because it is an interpreted language (Interpreted vs Compiled)
    • R is used extensively for visualisation, interpretation and inference once data has been parsed into a more manageable form, e.g., a csv file.

On the Bioconductor website, you will find

For this session, we will introduce the Bioconductor project as a means of analysing high-throughput data

Installing a package

All Bioconductor software packages are listed under

  • bioconductor.org -> Install -> Packages -> Analysis software packages
    • Many thousands of packages have been added over the years, so I would suggest just googling “bioconductor [package_name]”
    • e.g. edgeR landing page
  • installation instructions are given, which involve running the install function from the BiocManager package
    • this will install and update any additional dependencies
    • calling BiocManager::install directly avoids having to load the package manager into the working environment
  • you only need to run this procedure once for each version of R
## You don't need to run this, edgeR should already be installed for the course
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")
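
The package::function notation used above is not specific to BiocManager. As a minimal sketch, using the tools package that ships with R (chosen here purely for illustration, with a made-up file name), a function can be called straight from a package without a prior library call:

```r
# Call file_ext() from the 'tools' package without running library(tools).
# The package name goes before '::' and the function name after it.
ext <- tools::file_ext("counts.csv")  # "counts.csv" is a hypothetical file name
print(ext)
```

This style is handy for functions that are only needed once, such as an installer, because nothing extra is attached to the session.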

Once installed, a Bioconductor package can be loaded in the usual way with the library function. All packages are required to have a vignette which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as edgeR have very comprehensive user guides with lots of use-cases.
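
As a rough sketch of the loading step, assuming edgeR is installed (the guard below is an addition for safety and simply skips the example if it is not):

```r
# Check that edgeR is available before attaching it
has_edgeR <- requireNamespace("edgeR", quietly = TRUE)

if (has_edgeR) {
  library(edgeR)                       # attach the package in the usual way
  print(vignette(package = "edgeR"))   # list the vignettes shipped with the package
}
```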


Package documentation can also be accessed via the Help tab in RStudio, which can also be invoked in the console using “?”
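
For instance, taking the base function mean purely as a stand-in (a function from any loaded package is looked up the same way), a help page can be requested from the console in either of these ways:

```r
# Both lines request the documentation page for the mean function
?mean
help("mean")

# help() can also list everything documented in a package
help(package = "stats")
```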


Structures for data analysis

Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available via Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.

  • This data is from a microarray experiment. We will be concentrating on more modern technologies in this class, but most of the R techniques required will be exactly the same or at least very similar.
  • Experimental data packages are available through Bioconductor, and can be installed in the way we just described
    • the package should already be installed on your computer, so you won’t need to run this.
## No need to run this - for reference only!
BiocManager::install("breastCancerVDX")

To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for breastCancerVDX in RStudio, we find that it provides an object called vdx which we load into R’s memory using the data function.
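
The two steps described above might look something like the following sketch. It assumes breastCancerVDX is installed; the guard is an addition that skips the example otherwise.

```r
# Assumption: breastCancerVDX is installed; if not, nothing below is run
has_vdx <- requireNamespace("breastCancerVDX", quietly = TRUE)

if (has_vdx) {
  library(breastCancerVDX)  # attach the data package
  data(vdx)                 # load the vdx object into the workspace
  print(class(vdx))         # report what kind of object was loaded
}
```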


The object vdx is a representation of a breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don’t envisage that we will want to view the entire dataset at once, so they have provided a number of ways to interact with the data.

  • typing the name of the object provides a summary, e.g.,
    • how many genes in the dataset
    • how many samples
vdx
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 344 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
##   varLabels: samplename dataset ... e.os (21 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283
##     total)
##   fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 17420468 
## Annotation: hgu133a

Accessing expression values

The expression values can be obtained by the exprs function:

  • remember, <- is used for assignment to create a new variable
  • the data are stored in a matrix in R
    • it is a good idea to check the dimensions using dim, ncol, nrow etc.
eValues <- exprs(vdx) # also found at vdx@assayData$exprs
class(eValues)
## [1] "matrix"
dim(eValues)
## [1] 22283   344
ncol(eValues)
## [1] 344
nrow(eValues)
## [1] 22283
  • the row names are the manufacturer-assigned ID for a particular probe
  • the column names are the identifiers for each patient in the study
  • each entry is a normalised log2 intensity value for a particular gene in a given sample
    • we won’t talk about normalisation here, but basically the data has been transformed so that samples and/or genes can be compared
  • subsetting a matrix is done using the [row, column] notation
    • the function c is used to make a one-dimensional vector
    • the shortcut : can be used to stand for a sequence of consecutive numbers
eValues[c(1,2,3),c(1,2,3,4)]
##               VDX_3     VDX_5     VDX_6     VDX_7
## 1007_s_at 11.965135 11.798593 11.777625 11.538577
## 1053_at    7.895424  7.885696  7.949535  7.481396
## 117_at     8.259272  7.052025  8.225930  8.382408
eValues[1:3,1:4]
##               VDX_3     VDX_5     VDX_6     VDX_7
## 1007_s_at 11.965135 11.798593 11.777625 11.538577
## 1053_at    7.895424  7.885696  7.949535  7.481396
## 117_at     8.259272  7.052025  8.225930  8.382408
  • subsetting can be chained together
  • we can also omit certain rows or columns from the output by prefixing the indices with a -
eValues[1:3,1:4][,2:3]
##               VDX_5     VDX_6
## 1007_s_at 11.798593 11.777625
## 1053_at    7.885696  7.949535
## 117_at     7.052025  8.225930
eValues[1:3,1:4][,-(2:3)]
##               VDX_3     VDX_7
## 1007_s_at 11.965135 11.538577
## 1053_at    7.895424  7.481396
## 117_at     8.259272  8.382408
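
The same ideas can be tried on a small matrix that does not need any Bioconductor package. The sketch below uses made-up values and hypothetical row and column names, so it is only an illustration of the indexing rules and says nothing about the vdx data:

```r
# Toy matrix with made-up values: 4 "probes" (rows) by 5 "samples" (columns)
m <- matrix(1:20, nrow = 4,
            dimnames = list(paste0("probe", 1:4), paste0("sample", 1:5)))

dim(m)    # number of rows, then number of columns
nrow(m)
ncol(m)

# c() and : select the same rows and columns
first <- m[c(1, 2, 3), c(1, 2, 3, 4)]
same  <- m[1:3, 1:4]
identical(first, same)

# Chained subsetting: take a block, then keep columns 2 and 3 of that block
kept <- m[1:3, 1:4][, 2:3]

# A negative index drops columns 2 and 3 instead
dropped <- m[1:3, 1:4][, -c(2, 3)]
colnames(kept)
colnames(dropped)
```

Note that the second set of square brackets indexes the result of the first, so the column numbers refer to positions within the smaller block rather than within the full matrix.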

Simple visualisations

The most basic plotting function in R is plot.

  • using the plot function with a vector will plot the values of that vector against the index
    • what do you think is displayed in the plot below?