Reproducible Research

Mark Dunning

Last modified: 29 Jul 2015

Principles of Reproducible Research

[Image: step-two. Credit: Sidney Harris, New York Times]

Why should we do reproducible research?

Five selfish reasons - Florian Markowetz Blog and slides

  1. Avoid disaster
  2. Easier to write papers
  3. Easier to talk to reviewers
  4. Continuity of your work in the lab
  5. Reputation

It is a hot topic at the moment

[Image: nyt-article]

Hear the full account

What can we do about it?

Simple example in RStudio

[Image: compile]

What is going on?

library(knitr)
spin(hair="rna-seq.R",knit=FALSE)
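
A minimal sketch of what the `spin()` call above does, assuming the knitr package is installed. The script contents and the temporary file are hypothetical stand-ins for `rna-seq.R`. Lines starting with `#'` become narrative text and all other lines become code chunks; with `knit=FALSE` only the intermediate R Markdown file is written and nothing is executed.

```r
# Hypothetical short script standing in for rna-seq.R
script <- tempfile(fileext = ".R")
writeLines(c(
  "#' A short analysis",   # "#'" lines become ordinary text in the report
  "x <- c(1, 2, 3)",       # plain lines become R code chunks
  "mean(x)"
), script)

# spin() writes the .Rmd next to the script, so its path can be derived
rmd_path <- sub("\\.R$", ".Rmd", script)

if (requireNamespace("knitr", quietly = TRUE)) {
  # knit = FALSE: convert the script to R Markdown but do not run it
  knitr::spin(hair = script, knit = FALSE)
  cat(readLines(rmd_path), sep = "\n")
}
```

Printing the generated file is a quick way to see how comments and code were split before compiling the full report.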

Not quite enough for a reproducible document

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.1     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.7   knitr_1.10.5   
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7
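
The listing above records the R version, platform and package versions used. One possible way to keep that record next to the results is sketched below; the output file name is a hypothetical choice, not something the slides prescribe.

```r
# Hypothetical output file; in a real project this would sit beside the report
info_file <- file.path(tempdir(), "session-info.txt")

# capture.output() turns the printed sessionInfo() into a character vector
info_lines <- capture.output(sessionInfo())
writeLines(info_lines, info_file)

# The first line identifies the R build used for the analysis
print(info_lines[1])
```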

Defining chunks

Create a markdown file from scratch

File -> New File -> R Markdown

---
title: "Untitled"
author: "Mark Dunning"
date: "16/06/2015"
output: html_document
---

Format of the file

[Image: md-format]

Text formatting

See Markdown Quick Reference in RStudio

To be or not to be

Chunk options

Chunk options: eval

'''{r,eval=FALSE}
data <- read.delim("path.to.my.file")
'''
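
In an actual .Rmd file the chunk delimiters are three backticks, shown as ''' on the slide. The sketch below, assuming knitr is installed, compiles a hypothetical one-chunk document to illustrate that `eval=FALSE` shows the code without running it: the `stop()` call would abort the run if it were evaluated.

```r
# Build the chunk delimiter programmatically (three backticks)
fence <- strrep("`", 3)

# A hypothetical one-chunk document; the stop() call would fail if evaluated
rmd_lines <- c(
  paste0(fence, "{r, eval=FALSE}"),
  "stop('this line is shown but never run')",
  fence
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # With eval=FALSE the code is echoed into the output without being executed
  out <- knitr::knit(text = rmd_lines, quiet = TRUE)
  cat(out, sep = "\n")
}
```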

Chunk options: echo

Chunk options: results

for(i in 1:100){
  print(i)
}

Chunk options: message and warning

## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: locfit
## locfit 1.5-9.1    2013-03-22
## Loading required package: lattice
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
##     Welcome to 'DESeq'. For improved performance, usability and
##     functionality, please consider migrating to 'DESeq2'.
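
Start-up text like the listing above is emitted as messages rather than as results, which is why it can be switched off separately from the output. A rough plain-R analogy follows; it is not how knitr implements the option, and `noisy_load` is a hypothetical stand-in for loading a chatty package.

```r
# Hypothetical stand-in for a package that prints start-up messages
noisy_load <- function() {
  message("Loading required package: example")
  42  # the actual result of the call
}

# Messages are signalled separately from the return value,
# so they can be silenced without losing the result
result <- suppressMessages(noisy_load())
print(result)
```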

Chunk options: cache

Including plots

Control over plots

'''{r fig.align='right', fig.height=4, fig.width=9}
plot(1:10, jitter(1:10))
'''
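
The `fig.width` and `fig.height` options are given in inches. As a rough illustration outside knitr, the same plot can be written to a device of matching size; the output file name below is hypothetical.

```r
# Hypothetical output file for the sketch
plot_file <- file.path(tempdir(), "jitter-plot.pdf")

# Open a device 9 inches wide and 4 inches high, mirroring
# fig.width=9 and fig.height=4 in the chunk header
pdf(plot_file, width = 9, height = 4)
plot(1:10, jitter(1:10))
invisible(dev.off())  # close the device so the file is written
```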

Running R code from the main text

.....the sample population consisted of  'r table(gender)[1]' females and 'r table(gender)[2]' males.....

…..the sample population consisted of 47 females and 50 males…..

.....the p-value of the t-test is 'r pval', which indicates that.....

…..the p-value of the t-test is 0.05, which indicates that…..
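
A small sketch of what the inline expressions compute. The `gender` vector here is hypothetical, constructed only so that the counts match the rendered sentence above; in a real report it would come from the data.

```r
# Hypothetical sample, sized to match the rendered sentence above
gender <- c(rep("female", 47), rep("male", 50))

# table() orders categories alphabetically, so [1] is female and [2] is male
counts <- table(gender)

# Build the sentence the way the inline expressions would fill it in
sentence <- paste0(
  "the sample population consisted of ", counts[1],
  " females and ", counts[2], " males"
)
print(sentence)
```

Because the numbers are computed rather than typed, the sentence changes automatically whenever the underlying data change.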

Running R code from the main text

.....the sample population consisted of  'r table(gender)[1]' females and 'r table(gender)[2]' males.....

…..the sample population consisted of 41 females and 54 males…..

.....the p-value of the t-test is 'r pval', which indicates that.....

…..the p-value of the t-test is 0.1, which indicates that…..

Conditional output

pval <- 0.1
.....The statistical test was 'r ifelse(pval < 0.05, "", "not")' significant....

The statistical test was not significant

pval <- 0.01
.....The statistical test was 'r ifelse(pval < 0.05, "", "not")' significant....

The statistical test was significant
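
The inline `ifelse()` above inserts an empty string when the test is significant, which leaves two adjacent spaces in the source text. One alternative, sketched with a hypothetical helper named `significance_phrase`, is to return the whole phrase instead.

```r
# Hypothetical helper: returns the whole phrase, so no stray double space
# is left behind when the result is significant
significance_phrase <- function(pval, alpha = 0.05) {
  if (pval < alpha) "significant" else "not significant"
}

# Try both p-values used on the slide
for (pval in c(0.1, 0.01)) {
  cat("The statistical test was", significance_phrase(pval), "\n")
}
```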

Printing vectors

The months of the year are 'r month.name'

The months of the year are January, February, March, April, May, June, July, August, September, October, November, December
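
The comma-separated rendering above can be reproduced in plain R with `paste()`. The second variant, which ends the list with "and", is an optional refinement and not something the slides show.

```r
# Plain-R equivalent of the comma-separated rendering shown above
months_text <- paste(month.name, collapse = ", ")
cat("The months of the year are", months_text, "\n")

# A variant that ends the list with "and", for more natural prose
n <- length(month.name)
months_prose <- paste0(
  paste(month.name[-n], collapse = ", "), " and ", month.name[n]
)
cat("The months of the year are", months_prose, "\n")
```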

Exercise

References