dplyr and tidyverse
One of the most complex aspects of learning to work with data in R is getting to grips with subsetting and manipulating data tables. The package dplyr (Wickham et al. 2018) was developed to make this process more intuitive than it is using standard base R processes. It also makes use of a new symbol, %>%, called the “pipe,” which makes the code a bit tidier.
dplyr is one of a suite of similar packages collectively known as the tidyverse.
This is a very brief introduction to the tidyverse way of writing R code. A more detailed introduction can be found in our online R course.
We are introducing this because it makes many of the processes we will look at later much simpler. Importantly, it also results in code that is much easier to read and understand.
The entire tidyverse suite can be loaded via the tidyverse package:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s have a quick look at this by playing with our sampleinfo table.
# Read the sample information into a data frame
sampleinfo <- read_tsv("data/samplesheet.tsv")
sampleinfo
## # A tibble: 12 x 4
## SampleName Replicate Status TimePoint
## <chr> <dbl> <chr> <chr>
## 1 SRR7657878 1 Infected d11
## 2 SRR7657881 2 Infected d11
## 3 SRR7657880 3 Infected d11
## 4 SRR7657874 1 Infected d33
## 5 SRR7657882 2 Infected d33
## 6 SRR7657872 3 Infected d33
## 7 SRR7657877 1 Uninfected d11
## 8 SRR7657876 2 Uninfected d11
## 9 SRR7657879 3 Uninfected d11
## 10 SRR7657883 1 Uninfected d33
## 11 SRR7657873 2 Uninfected d33
## 12 SRR7657875 3 Uninfected d33
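Suppose we wanted a new sample table that includes only the d11 samples, contains only the SampleName and Status columns, and renames the Status column to SampleGroup.
With base R we would do something like this: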
newTable <- sampleinfo
newTable <- newTable[newTable$TimePoint=="d11",]
newTable <- newTable[, c("SampleName", "Status")]
colnames(newTable)[2] <- "SampleGroup"
newTable
## # A tibble: 6 x 2
## SampleName SampleGroup
## <chr> <chr>
## 1 SRR7657878 Infected
## 2 SRR7657881 Infected
## 3 SRR7657880 Infected
## 4 SRR7657877 Uninfected
## 5 SRR7657876 Uninfected
## 6 SRR7657879 Uninfected
dplyr
With dplyr we can use three new functions: filter, select and rename:
newTable <- sampleinfo
newTable <- filter(newTable, TimePoint=="d11")
newTable <- select(newTable, SampleName, Status)
newTable <- rename(newTable, SampleGroup=Status)
newTable
## # A tibble: 6 x 2
## SampleName SampleGroup
## <chr> <chr>
## 1 SRR7657878 Infected
## 2 SRR7657881 Infected
## 3 SRR7657880 Infected
## 4 SRR7657877 Uninfected
## 5 SRR7657876 Uninfected
## 6 SRR7657879 Uninfected
The idea is that the dplyr code is easier to read and interpret than the base R syntax.
There’s no need to quote the column names as dplyr intelligently interprets the arguments it’s passed as belonging to the data table columns.
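As a small illustration of the point about unquoted column names, the sketch below compares the two styles side by side. It uses a hand-built table called toy (a hypothetical stand-in for sampleinfo, holding three of its rows) so that it runs without the sample sheet file.

```r
library(dplyr)

# Toy stand-in for sampleinfo (hypothetical; three rows copied from the table above)
toy <- tibble(
  SampleName = c("SRR7657878", "SRR7657874", "SRR7657877"),
  Status     = c("Infected", "Infected", "Uninfected"),
  TimePoint  = c("d11", "d33", "d11")
)

# Base R: columns are reached through the table (toy$...) or by quoted name
base_rows <- toy[toy$TimePoint == "d11", c("SampleName", "Status")]

# dplyr: bare column names are looked up inside the table
dplyr_rows <- select(filter(toy, TimePoint == "d11"), SampleName, Status)

# Compare the two results as plain data frames
all.equal(as.data.frame(base_rows), as.data.frame(dplyr_rows))
```

Note that the value being matched, "d11", is still quoted in both versions because it is a character string rather than a column name.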
Rather than repeatedly reassigning newTable <- f(newTable) as above, we can use the “pipe”, %>%. This takes the output of one function and “pipes” it into the first argument of the next function so that we don’t have to keep specifying the object we are working with:
newTable <- sampleinfo %>%
    filter(TimePoint=="d11") %>%
    select(SampleName, Status) %>%
    rename(SampleGroup=Status)
newTable
## # A tibble: 6 x 2
## SampleName SampleGroup
## <chr> <chr>
## 1 SRR7657878 Infected
## 2 SRR7657881 Infected
## 3 SRR7657880 Infected
## 4 SRR7657877 Uninfected
## 5 SRR7657876 Uninfected
## 6 SRR7657879 Uninfected
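To make the behaviour of the pipe more concrete, here is a minimal sketch showing that x %>% f(y) is evaluated as f(x, y), so a piped chain is just a nested call written in reading order. The table toy is a hypothetical stand-in for sampleinfo, built by hand from three of its rows so the example runs without the sample sheet file.

```r
library(dplyr)

# Toy stand-in for sampleinfo (hypothetical; three rows copied from the table above)
toy <- tibble(
  SampleName = c("SRR7657878", "SRR7657874", "SRR7657877"),
  Status     = c("Infected", "Infected", "Uninfected"),
  TimePoint  = c("d11", "d33", "d11")
)

# Nested calls: must be read from the inside out
nested <- rename(
  select(filter(toy, TimePoint == "d11"), SampleName, Status),
  SampleGroup = Status
)

# Piped calls: read from top to bottom, one step per line
piped <- toy %>%
  filter(TimePoint == "d11") %>%
  select(SampleName, Status) %>%
  rename(SampleGroup = Status)

# Check that both forms produce the same table
identical(nested, piped)
```

The nested form avoids the repeated reassignment too, but the order of operations is harder to follow, which is the main reason the piped form is usually preferred.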
This is a fairly trivial example and the benefits may not be immediately obvious, but once you get used to using dplyr (and the other related “tidyverse” packages, such as stringr) you’ll find it more powerful and easier to use than base R.