dplyr and tidyverse
One of the most complex aspects of learning to work with data in R is getting to grips with subsetting and manipulating data tables. The package dplyr (Wickham et al. 2018) was developed to make this process more intuitive than it is using standard base R processes. It also makes use of a new symbol, %>%, called the “pipe”, which makes the code a bit tidier.
dplyr is one of a suite of similar packages collectively known as the tidyverse.
This is a very brief introduction to the tidyverse way of writing R code. A more detailed introduction can be found in our online R course.
We are introducing this because it makes many of the processes we will look at later much simpler. Importantly, it also results in code that is much easier to read and understand.
The entire tidyverse suite can be loaded via the tidyverse package:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
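The “Conflicts” lines above mean that, once the tidyverse is loaded, a bare call to filter or lag refers to the dplyr version. A minimal sketch of how both versions remain reachable through a namespace prefix; the vector x and the table toy are made-up illustrations, not part of the course data:

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# stats::filter() is still available when called with its namespace prefix;
# here it computes a centred 3-point moving average of x
movingAverage <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a table that satisfy a condition
toy <- tibble(value = x)
largeValues <- dplyr::filter(toy, value > 3)

movingAverage
largeValues
```

Writing the prefix explicitly is optional for the dplyr functions, but it removes any doubt about which function is being called.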
Let’s have a quick look at this by playing with our sampleinfo table.
# Read the sample information into a data frame
sampleinfo <- read_tsv("data/SampleInfo.txt")
sampleinfo
## # A tibble: 12 x 4
## FileName Sample CellType Status
## <chr> <chr> <chr> <chr>
## 1 MCL1.DG.bam MCL1.DG luminal virgin
## 2 MCL1.DH.bam MCL1.DH basal virgin
## 3 MCL1.DI.bam MCL1.DI basal pregnant
## 4 MCL1.DJ.bam MCL1.DJ basal pregnant
## 5 MCL1.DK.bam MCL1.DK basal lactate
## 6 MCL1.DL.bam MCL1.DL basal lactate
## 7 MCL1.LA.bam MCL1.LA basal virgin
## 8 MCL1.LB.bam MCL1.LB luminal virgin
## 9 MCL1.LC.bam MCL1.LC luminal pregnant
## 10 MCL1.LD.bam MCL1.LD luminal pregnant
## 11 MCL1.LE.bam MCL1.LE luminal lactate
## 12 MCL1.LF.bam MCL1.LF luminal lactate
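Suppose we wanted a new sample table that only includes the “basal” samples, only has the columns “CellType” and “Status”, and renames the “CellType” column to “Cell”.
With base R we would do something like this: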
newTable <- sampleinfo
newTable <- newTable[newTable$CellType=="basal",]
newTable <- newTable[, c("CellType", "Status")]
colnames(newTable)[1] <- "Cell"
newTable
## # A tibble: 6 x 2
## Cell Status
## <chr> <chr>
## 1 basal virgin
## 2 basal pregnant
## 3 basal pregnant
## 4 basal lactate
## 5 basal lactate
## 6 basal virgin
dplyr
With dplyr we can use three new functions: filter, select and rename:
newTable <- sampleinfo
newTable <- filter(newTable, CellType=="basal")
newTable <- select(newTable, CellType, Status)
newTable <- rename(newTable, Cell=CellType)
newTable
## # A tibble: 6 x 2
## Cell Status
## <chr> <chr>
## 1 basal virgin
## 2 basal pregnant
## 3 basal pregnant
## 4 basal lactate
## 5 basal lactate
## 6 basal virgin
The idea is that the dplyr code is easier to read and interpret than the base R syntax.
There’s no need to quote the column names as dplyr intelligently interprets the arguments it’s passed as belonging to the data table columns.
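To make the point about quoting concrete, here is a small sketch that selects the same rows and columns both ways. The table toy is a hypothetical stand-in for sampleinfo, so the example runs without the course data files:

```r
library(dplyr)

# Small hypothetical table standing in for sampleinfo
toy <- tibble(
  Sample   = c("S1", "S2", "S3"),
  CellType = c("luminal", "basal", "basal"),
  Status   = c("virgin", "virgin", "pregnant")
)

# Base R: columns are referred to with the $ operator or quoted strings
baseResult <- toy[toy$CellType == "basal", c("CellType", "Status")]

# dplyr: bare column names are looked up inside the table itself
dplyrResult <- select(filter(toy, CellType == "basal"), CellType, Status)

baseResult
dplyrResult
```

Both results should contain the same two “basal” rows; only the way the columns are named in the code differs.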
Rather than repeatedly reassigning newTable <- f(newTable) as above, we can use the “pipe”, %>%. This takes the output of one function and “pipes” it into the first argument of the next function so that we don’t have to keep specifying the object we are working with:
newTable <- sampleinfo %>%
    filter(CellType=="basal") %>%
    select(CellType, Status) %>%
    rename(Cell=CellType)
newTable
## # A tibble: 6 x 2
## Cell Status
## <chr> <chr>
## 1 basal virgin
## 2 basal pregnant
## 3 basal pregnant
## 4 basal lactate
## 5 basal lactate
## 6 basal virgin
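As a rough illustration of what the pipe is doing, x %>% f(y) is evaluated as f(x, y). The sketch below writes the same three steps once as nested calls and once as a pipeline; toy is again a hypothetical table used only so that the example is self-contained:

```r
library(dplyr)

# Small hypothetical table standing in for sampleinfo
toy <- tibble(
  CellType = c("luminal", "basal", "basal"),
  Status   = c("virgin", "virgin", "pregnant")
)

# Nested calls: the steps have to be read from the inside out
nested <- rename(
  select(filter(toy, CellType == "basal"), CellType, Status),
  Cell = CellType
)

# Piped calls: the same steps, read from top to bottom
piped <- toy %>%
  filter(CellType == "basal") %>%
  select(CellType, Status) %>%
  rename(Cell = CellType)

nested
piped
```

The two objects should hold the same table; the piped version simply lists the steps in the order they are applied.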
This is a fairly trivial example and the benefits may not be immediately obvious, but once you get used to using dplyr (and the other related “tidyverse” packages, such as stringr) you’ll find it much more powerful and easier to use than base R.
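As one possible example of combining dplyr with stringr, the sketch below derives sample names from file names like those in the sample table. Note that mutate is a dplyr function not covered above, and the three-row table toy is a made-up subset used for illustration only:

```r
library(dplyr)
library(stringr)

# Hypothetical subset of the file names shown in the sample table
toy <- tibble(FileName = c("MCL1.DG.bam", "MCL1.DH.bam", "MCL1.LA.bam"))

# mutate() adds a new column; str_remove() strips the ".bam" suffix
# from each file name
withSample <- toy %>%
  mutate(Sample = str_remove(FileName, "\\.bam$"))

withSample
```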
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.