One of the most complex aspects of learning to work with data in R is getting to grips with subsetting and manipulating data tables. The package dplyr (Wickham et al. 2018) was developed to make this process more intuitive than it is in base R. It also makes use of the symbol %>%, called the “pipe,” which lets us chain operations together and makes the code a bit tidier.

dplyr is one of a suite of related packages collectively known as the tidyverse.

This is a very brief introduction to the tidyverse way of writing R code. A more detailed introduction can be found in our online R course.

We are introducing this because it makes many of the processes we will look at later much simpler. Importantly, it also results in code that is much easier to read and understand.

The entire tidyverse suite can be loaded via the tidyverse package:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Let’s have a quick look at this by playing with our sampleinfo table.

# Read the sample information into a data frame
sampleinfo <- read_tsv("data/samplesheet.tsv")
sampleinfo
## # A tibble: 12 x 4
##    SampleName Replicate Status     TimePoint
##    <chr>          <dbl> <chr>      <chr>    
##  1 SRR7657878         1 Infected   d11      
##  2 SRR7657881         2 Infected   d11      
##  3 SRR7657880         3 Infected   d11      
##  4 SRR7657874         1 Infected   d33      
##  5 SRR7657882         2 Infected   d33      
##  6 SRR7657872         3 Infected   d33      
##  7 SRR7657877         1 Uninfected d11      
##  8 SRR7657876         2 Uninfected d11      
##  9 SRR7657879         3 Uninfected d11      
## 10 SRR7657883         1 Uninfected d33      
## 11 SRR7657873         2 Uninfected d33      
## 12 SRR7657875         3 Uninfected d33

Suppose we wanted a new sample table that:

  1. Just includes the “d11” samples
  2. Only has the columns “SampleName” and “Status”
  3. Renames the “Status” column as “SampleGroup”

Manipulating the table in base R

With base R we would do something like this:

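newTable <- sampleinfo
# Keep only the rows where TimePoint is "d11"
newTable <- newTable[newTable$TimePoint=="d11",]
# Keep only the SampleName and Status columns
newTable <- newTable[, c("SampleName", "Status")]
# Rename the second column (Status) to SampleGroup
colnames(newTable)[2] <- "SampleGroup"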
newTable
## # A tibble: 6 x 2
##   SampleName SampleGroup
##   <chr>      <chr>      
## 1 SRR7657878 Infected   
## 2 SRR7657881 Infected   
## 3 SRR7657880 Infected   
## 4 SRR7657877 Uninfected 
## 5 SRR7657876 Uninfected 
## 6 SRR7657879 Uninfected

dplyr

With dplyr we can use three new functions: filter, select and rename:

newTable <- sampleinfo
newTable <- filter(newTable, TimePoint=="d11")
newTable <- select(newTable, SampleName, Status)
newTable <- rename(newTable, SampleGroup=Status)
newTable
## # A tibble: 6 x 2
##   SampleName SampleGroup
##   <chr>      <chr>      
## 1 SRR7657878 Infected   
## 2 SRR7657881 Infected   
## 3 SRR7657880 Infected   
## 4 SRR7657877 Uninfected 
## 5 SRR7657876 Uninfected 
## 6 SRR7657879 Uninfected

The idea is that the dplyr code is easier to read and interpret than the base R syntax.

There’s no need to quote the column names, because dplyr evaluates the arguments it is passed in the context of the data table, so bare names are looked up among the table’s columns.
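As a minimal sketch of what this means in practice, the example below selects the same columns in three ways: with bare names, with quoted names, and with names held in a character vector via all_of(). The demo table and the wanted vector are made up for illustration and are not part of the course data. It assumes a dplyr version of 1.0.0 or later, where all_of() is available.

```r
library(dplyr)
library(tibble)

# Small stand-in for the sample sheet (illustrative values only)
demo <- tibble(
    SampleName = c("S1", "S2", "S3"),
    Status     = c("Infected", "Infected", "Uninfected"),
    TimePoint  = c("d11", "d33", "d11")
)

# Bare column names: dplyr looks them up among the table's columns
byName <- select(demo, SampleName, Status)

# Quoted column names are also accepted by select()
byString <- select(demo, "SampleName", "Status")

# Column names stored in a variable are wrapped in all_of()
wanted <- c("SampleName", "Status")
byVector <- select(demo, all_of(wanted))

# Check that the three approaches give the same table
identical(byName, byString)
identical(byName, byVector)
```

Note that only the column names are unquoted. Values being compared against, such as "d11" in filter(demo, TimePoint=="d11"), are ordinary character strings and still need quotes.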

The Pipe

Rather than repeatedly reassigning newTable <- f(newTable) as above, we can use the “pipe” operator, %>%. This takes the output of one function and “pipes” it into the first argument of the next function, so that we don’t have to keep specifying the object we are working with:

newTable <- sampleinfo %>%
    filter(TimePoint=="d11") %>%
    select(SampleName, Status) %>% 
    rename(SampleGroup=Status)
newTable
## # A tibble: 6 x 2
##   SampleName SampleGroup
##   <chr>      <chr>      
## 1 SRR7657878 Infected   
## 2 SRR7657881 Infected   
## 3 SRR7657880 Infected   
## 4 SRR7657877 Uninfected 
## 5 SRR7657876 Uninfected 
## 6 SRR7657879 Uninfected
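To make the behaviour of the pipe concrete, here is a small sketch showing that x %>% f(y) is another way of writing f(x, y). It runs the same three steps once as nested function calls and once as a piped chain, then compares the results. The demo table is a made-up stand-in so that the example runs on its own; it is not the course sample sheet.

```r
library(dplyr)
library(tibble)

# Small stand-in for the sample sheet (illustrative values only)
demo <- tibble(
    SampleName = c("S1", "S2", "S3", "S4"),
    Status     = c("Infected", "Uninfected", "Infected", "Uninfected"),
    TimePoint  = c("d11", "d11", "d33", "d33")
)

# Nested calls: these have to be read from the inside out
nested <- rename(
    select(
        filter(demo, TimePoint == "d11"),
        SampleName, Status
    ),
    SampleGroup = Status
)

# Piped calls: the same steps, read from top to bottom
piped <- demo %>%
    filter(TimePoint == "d11") %>%
    select(SampleName, Status) %>%
    rename(SampleGroup = Status)

# Check that both versions produce the same table
identical(nested, piped)
```

The piped version lists the steps in the order they are carried out, which is the main reason it tends to be easier to follow as the chain grows.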

This is a fairly trivial example and the benefits may not be immediately obvious, but once you get used to using dplyr (and the other related “tidyverse” packages, such as stringr) you’ll find it much more powerful and easier to use than base R.


References

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.