One of the more challenging aspects of learning to work with data in R is getting to grips with subsetting and manipulating data tables. The package dplyr (Wickham et al. 2018) was developed to make this process more intuitive than it is with base R. It also makes use of the symbol %>%, called the “pipe”, which makes the code a bit tidier.

dplyr is one of a suite of related packages collectively known as the tidyverse.

This is a very brief introduction to the tidyverse way of writing R code. A more detailed introduction can be found in our online R course.

We are introducing this because it makes many of the processes we will look at later much simpler. Importantly, it also results in code that is much easier to read and understand.

The entire tidyverse suite can be loaded via the tidyverse package:

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
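
The conflicts listed above mean that, once dplyr is attached, a bare call to filter() or lag() resolves to the dplyr version rather than the one in stats. As a minimal sketch (the toy table and its values are made up for illustration and are not part of the course data), either version can still be called explicitly with the package::function syntax:

```r
library(dplyr)

# Hypothetical toy table, standing in for a real sample sheet
toy <- tibble(
  Sample   = c("S1", "S2", "S3"),
  CellType = c("luminal", "basal", "basal")
)

# Explicitly request dplyr's filter(): keeps rows matching a condition
basal_rows <- dplyr::filter(toy, CellType == "basal")

# The masked stats version is still reachable by its full name;
# here it applies a 3-point moving average to a numeric vector
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
```

Spelling out the namespace is optional in everyday use, but it can help when a script mixes functions from both packages.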

Let’s have a quick look at this by playing with our sampleinfo table.

# Read the sample information into a data frame
sampleinfo <- read_tsv("data/SampleInfo.txt")
sampleinfo
## # A tibble: 12 x 4
##    FileName    Sample  CellType Status  
##    <chr>       <chr>   <chr>    <chr>   
##  1 MCL1.DG.bam MCL1.DG luminal  virgin  
##  2 MCL1.DH.bam MCL1.DH basal    virgin  
##  3 MCL1.DI.bam MCL1.DI basal    pregnant
##  4 MCL1.DJ.bam MCL1.DJ basal    pregnant
##  5 MCL1.DK.bam MCL1.DK basal    lactate 
##  6 MCL1.DL.bam MCL1.DL basal    lactate 
##  7 MCL1.LA.bam MCL1.LA basal    virgin  
##  8 MCL1.LB.bam MCL1.LB luminal  virgin  
##  9 MCL1.LC.bam MCL1.LC luminal  pregnant
## 10 MCL1.LD.bam MCL1.LD luminal  pregnant
## 11 MCL1.LE.bam MCL1.LE luminal  lactate 
## 12 MCL1.LF.bam MCL1.LF luminal  lactate

Suppose we wanted a new sample table that:

  1. Just includes the “basal” samples
  2. Only has the columns “CellType” and “Status”
  3. Renames the “CellType” column as “Cell”

Manipulating the table in base R

With base R we would do something like this:

newTable <- sampleinfo
newTable <- newTable[newTable$CellType=="basal",]
newTable <- newTable[, c("CellType", "Status")]
colnames(newTable)[1] <- "Cell"
newTable
## # A tibble: 6 x 2
##   Cell  Status  
##   <chr> <chr>   
## 1 basal virgin  
## 2 basal pregnant
## 3 basal pregnant
## 4 basal lactate 
## 5 basal lactate 
## 6 basal virgin

dplyr

With dplyr we can use three new functions: filter, select and rename:

newTable <- sampleinfo
newTable <- filter(newTable, CellType=="basal")
newTable <- select(newTable, CellType, Status)
newTable <- rename(newTable, Cell=CellType)
newTable
## # A tibble: 6 x 2
##   Cell  Status  
##   <chr> <chr>   
## 1 basal virgin  
## 2 basal pregnant
## 3 basal pregnant
## 4 basal lactate 
## 5 basal lactate 
## 6 basal virgin

The idea is that the dplyr code is easier to read and interpret than the base R syntax.

There’s no need to quote the column names: dplyr evaluates the arguments it is passed in the context of the data table, so a bare name such as CellType is interpreted as a column of that table.
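
To illustrate how column names are interpreted, here is a minimal sketch using a small made-up table (toy is hypothetical and only mimics a few rows of sampleinfo). It suggests that select() treats bare and quoted column names the same way, and that filter() looks up CellType inside the table rather than in the workspace:

```r
library(dplyr)

# Hypothetical stand-in for the first few rows of sampleinfo
toy <- tibble(
  Sample   = c("MCL1.DG", "MCL1.DH", "MCL1.DI"),
  CellType = c("luminal", "basal", "basal"),
  Status   = c("virgin", "virgin", "pregnant")
)

# Bare column names are looked up inside the data frame
bare <- select(toy, CellType, Status)

# select() also accepts quoted names and picks the same columns
quoted <- select(toy, "CellType", "Status")

# Compare the column names returned by the two calls
identical(names(bare), names(quoted))

# filter() evaluates its condition using the columns of toy;
# no variable called CellType needs to exist in the workspace
basal_only <- filter(toy, CellType == "basal")
basal_only
```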

The Pipe

Rather than repeatedly reassigning newTable <- f(newTable) as above, we can use the “pipe” - %>%. This takes the output of one function and “pipes” it into the first argument of the next function so that we don’t have to keep specifying the object we are working with:

newTable <- sampleinfo %>%
    filter(CellType=="basal") %>%
    select(CellType, Status) %>%
    rename(Cell=CellType)
newTable
## # A tibble: 6 x 2
##   Cell  Status  
##   <chr> <chr>   
## 1 basal virgin  
## 2 basal pregnant
## 3 basal pregnant
## 4 basal lactate 
## 5 basal lactate 
## 6 basal virgin
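
To make the role of the pipe more concrete, the following sketch compares a piped chain with the equivalent nested function calls. It assumes a small made-up table (toy is hypothetical, not the course data); the point is only that x %>% f(y) is another way of writing f(x, y):

```r
library(dplyr)

# Hypothetical stand-in for a few rows of sampleinfo
toy <- tibble(
  Sample   = c("MCL1.DG", "MCL1.DH", "MCL1.DI"),
  CellType = c("luminal", "basal", "basal"),
  Status   = c("virgin", "virgin", "pregnant")
)

# Nested calls: the steps have to be read from the inside out
nested <- rename(
  select(
    filter(toy, CellType == "basal"),
    CellType, Status
  ),
  Cell = CellType
)

# Piped calls: the same steps, read from top to bottom
piped <- toy %>%
  filter(CellType == "basal") %>%
  select(CellType, Status) %>%
  rename(Cell = CellType)

# Check that both approaches give the same table
all.equal(nested, piped)
```

Both forms perform the same operations in the same order; the piped version simply lists them in the order they are applied.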

This is a fairly trivial example and the benefits may not be immediately obvious, but once you get used to using dplyr (and the other related “tidyverse” packages, such as stringr) you’ll find it more powerful and easier to use than base R.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.