Introduction

In this session we will introduce the concept of ‘piping’ to help with creating workflows from chains of manipulations on our data. We’ll also look at a couple of other useful dplyr verbs.

  • Introducing piping
  • filter verb
  • arrange verb

Load the tidyverse

If you haven’t already done so, or are working in a new session, you’ll need to load the core packages from the tidyverse.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Piping

In the previous exercise we ended up making a series of manipulations to the patients dataset.

patients <- read_tsv("patient-data.txt")
patients <- mutate(patients, Smokes = Smokes %in% c("TRUE", "Yes"))
patients <- mutate(patients, Height = as.numeric(str_remove(Height, pattern = "cm$")))
patients <- mutate(patients, Weight = as.numeric(str_remove(Weight, pattern = "kg$")))
patients <- mutate(patients, BMI = Weight / (Height / 100) ** 2)
patients <- mutate(patients, Overweight = BMI > 25)

Each statement includes an assignment to overwrite the data frame on which we are operating. Surely this could be written in a more succinct and elegant manner.

The tidyverse imports a very useful operator, %>% from the magrittr package. This is the ‘pipe’ operator and works a bit like the Unix pipe operator allowing the output from one operation to be “piped” in as the input to another operation.

Let’s look at one of those cleaning operations on the patients dataset to see how piping works.

patients <- read_tsv("patient-data.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_character(),
##   Weight = col_character(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade_Level = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = "")
## )
mutate(patients, Height = as.numeric(str_remove(Height, pattern = "cm$")))
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade_Level
##    <chr> <chr> <chr> <chr>   <dbl> <chr>  <date>     <chr>       <dbl>
##  1 AC/A… Mich… Male  FALSE    183. 76.57… 1972-02-06 Geor…           2
##  2 AC/A… Derek Male  FALSE    179. 80.43… 1972-06-15 Colo…           2
##  3 AC/A… Todd  Male  FALSE    169. 75.48… 1972-07-09 New …           2
##  4 AC/A… Rona… Male  FALSE    176. 94.54… 1972-08-17 Colo…           1
##  5 AC/A… Chri… Fema… FALSE    164. 71.78… 1973-06-12 Geor…           2
##  6 AC/A… Dana  Fema… TRUE     158. 69.9kg 1973-07-01 Indi…           2
##  7 AC/A… Erin  Fema… FALSE    162. 68.85… 1972-03-26 New …           1
##  8 AC/A… Rach… Fema… No       166. 70.44… 1973-05-11 Colo…           1
##  9 AC/A… Rona… Male  FALSE    181. 76.9kg 1971-12-31 Geor…           1
## 10 AC/A… Bryan Male  FALSE    167. 79.06… 1973-07-19 New …           2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <dbl>,
## #   Date.Entered.Study <date>

Instead of passing the patients data frame into the mutate function as it’s first argument we could use the %>% operator as follows.

patients %>% mutate(Height = as.numeric(str_remove(Height, pattern = "cm$")))
## # A tibble: 100 x 12
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade_Level
##    <chr> <chr> <chr> <chr>   <dbl> <chr>  <date>     <chr>       <dbl>
##  1 AC/A… Mich… Male  FALSE    183. 76.57… 1972-02-06 Geor…           2
##  2 AC/A… Derek Male  FALSE    179. 80.43… 1972-06-15 Colo…           2
##  3 AC/A… Todd  Male  FALSE    169. 75.48… 1972-07-09 New …           2
##  4 AC/A… Rona… Male  FALSE    176. 94.54… 1972-08-17 Colo…           1
##  5 AC/A… Chri… Fema… FALSE    164. 71.78… 1973-06-12 Geor…           2
##  6 AC/A… Dana  Fema… TRUE     158. 69.9kg 1973-07-01 Indi…           2
##  7 AC/A… Erin  Fema… FALSE    162. 68.85… 1972-03-26 New …           1
##  8 AC/A… Rach… Fema… No       166. 70.44… 1973-05-11 Colo…           1
##  9 AC/A… Rona… Male  FALSE    181. 76.9kg 1971-12-31 Geor…           1
## 10 AC/A… Bryan Male  FALSE    167. 79.06… 1973-07-19 New …           2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <dbl>,
## #   Date.Entered.Study <date>

The basic form of a piped operation is:

x %>% f(y) is equivalent to f(x, y)

Piping becomes really useful when there are a number of steps involved in transforming a dataset.

patients <- read.delim("patient-data.txt") %>%
  as_tibble %>%
  mutate(Sex = as_factor(str_trim(Sex))) %>%
  mutate(Height = as.numeric(str_remove(Height, pattern = "cm$")))

The usual way of developing a workflow is to build it up one step at a time, testing the output produced at each stage.

Exercise: re-writing a workflow using pipes

See separate markdown document.

Filtering rows

The filter verb allows you to choose rows from a data frame that match some specified criteria. The criteria are based on values of variables and can make use of comparison operators such as ==, >, < and !=.

For example to filter the patients dataset so it only contains males.

patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )
filter(patients, Sex == "Male")
## # A tibble: 45 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
##  6 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
##  7 AC/A… Kenn… Male  Smoker   175.   92.2 1972-03-22 Colo…     3 FALSE
##  8 AC/A… Marc  Male  Non-S…   181.   72.3 1972-11-22 New …    NA TRUE 
##  9 AC/A… Bran… Male  Non-S…   169.   73.3 1971-11-22 New …     3 FALSE
## 10 AC/A… Trav… Male  Non-S…   176.   97.7 1973-04-14 New …     2 FALSE
## # … with 35 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

The equivalent in base R is much less intuitive.

patients[patients$Sex == "Male",]
## # A tibble: 45 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
##  6 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
##  7 AC/A… Kenn… Male  Smoker   175.   92.2 1972-03-22 Colo…     3 FALSE
##  8 AC/A… Marc  Male  Non-S…   181.   72.3 1972-11-22 New …    NA TRUE 
##  9 AC/A… Bran… Male  Non-S…   169.   73.3 1971-11-22 New …     3 FALSE
## 10 AC/A… Trav… Male  Non-S…   176.   97.7 1973-04-14 New …     2 FALSE
## # … with 35 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

We can also use the != operator.

filter(patients, Sex != "Female")
## # A tibble: 45 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
##  6 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
##  7 AC/A… Kenn… Male  Smoker   175.   92.2 1972-03-22 Colo…     3 FALSE
##  8 AC/A… Marc  Male  Non-S…   181.   72.3 1972-11-22 New …    NA TRUE 
##  9 AC/A… Bran… Male  Non-S…   169.   73.3 1971-11-22 New …     3 FALSE
## 10 AC/A… Trav… Male  Non-S…   176.   97.7 1973-04-14 New …     2 FALSE
## # … with 35 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

We can filter for a set of values using the %in% operator.

filter(patients, State %in% c("Florida", "Georgia", "Illinois"))
## # A tibble: 16 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  3 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
##  4 AC/A… Kimb… Fema… Non-S…   160.   72.4 1972-05-04 Geor…     2 TRUE 
##  5 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  6 AC/A… Aaron Male  Non-S…   166.   76.7 1972-07-15 Geor…     1 TRUE 
##  7 AC/A… Mich… Fema… Non-S…   158.   67.4 1972-05-12 Geor…     3 TRUE 
##  8 AC/A… Kris… Fema… Non-S…   163.   65.2 1973-05-23 Geor…     2 FALSE
##  9 AC/A… Linda Fema… Non-S…   165.   70.8 1972-02-07 Geor…    NA TRUE 
## 10 AC/S… Sher… Fema… Non-S…   159.   64.9 1973-02-04 Geor…     2 TRUE 
## 11 AC/S… Jimmy Male  Non-S…   171.   81.7 1973-08-11 Geor…     2 TRUE 
## 12 AC/S… Sarah Fema… Non-S…   160.   68.2 1972-04-21 Geor…     3 TRUE 
## 13 AC/S… Rich… Male  Non-S…   173.   67.6 1972-02-10 Geor…     1 FALSE
## 14 AC/S… Suza… Fema… Non-S…   159.   70.4 1973-10-06 Geor…     2 TRUE 
## 15 AC/S… Phil… Male  Non-S…   177.   88.7 1971-11-30 Geor…     3 TRUE 
## 16 AC/S… Wendy Fema… Non-S…   159.   67.0 1972-06-25 Geor…     1 TRUE 
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## #   Age <dbl>, BMI <dbl>, Overweight <lgl>

Partial matches can be made using the str_detect function from the stringr package. Note this is similar to the grepl function in base R.

For example, let’s select all the patients whose name begins with a ‘B’.

filter(patients, str_detect(Name, "^B"))
## # A tibble: 13 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
##  2 AC/A… Bran… Male  Non-S…   169.   73.3 1971-11-22 New …     3 FALSE
##  3 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  4 AC/A… Brett Male  Smoker   177.   74.0 1972-02-24 Indi…     2 FALSE
##  5 AC/A… Brent Male  Non-S…   175.   83.6 1973-05-07 New …     3 TRUE 
##  6 AC/A… Bran… Fema… Non-S…   160.   68.0 1972-12-08 New …     2 FALSE
##  7 AC/A… Barb… Fema… Non-S…   160.   65.9 1973-01-31 New …    NA TRUE 
##  8 AC/A… Bran… Male  Non-S…   179.   97.0 1973-01-18 Indi…     1 TRUE 
##  9 AC/S… Bran… Fema… Non-S…   157.   66.6 1972-04-11 Colo…     3 TRUE 
## 10 AC/S… Barb… Fema… Smoker   162.   66.5 1972-02-21 New …     3 FALSE
## 11 AC/S… Benj… Male  Non-S…   177.   90.8 1973-02-03 Indi…     3 FALSE
## 12 AC/S… Bren… Fema… Non-S…   158.   69.8 1972-05-21 Cali…     3 FALSE
## 13 AC/S… Barb… Fema… Smoker   163.   64.5 1973-06-25 Indi…     2 FALSE
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## #   Age <dbl>, BMI <dbl>, Overweight <lgl>

Note that the str_detect function returns a logical vector - this is important since the criterion for filtering must evaluate to TRUE or FALSE.

Also note that the second argument to str_detect is a regular expression. An alternative function from stringr we could have used in this case is str_starts; with this we no longer need to ‘^’ symbol in our regular expression.

filter(patients, str_starts(Name, "B"))
## # A tibble: 13 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
##  2 AC/A… Bran… Male  Non-S…   169.   73.3 1971-11-22 New …     3 FALSE
##  3 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  4 AC/A… Brett Male  Smoker   177.   74.0 1972-02-24 Indi…     2 FALSE
##  5 AC/A… Brent Male  Non-S…   175.   83.6 1973-05-07 New …     3 TRUE 
##  6 AC/A… Bran… Fema… Non-S…   160.   68.0 1972-12-08 New …     2 FALSE
##  7 AC/A… Barb… Fema… Non-S…   160.   65.9 1973-01-31 New …    NA TRUE 
##  8 AC/A… Bran… Male  Non-S…   179.   97.0 1973-01-18 Indi…     1 TRUE 
##  9 AC/S… Bran… Fema… Non-S…   157.   66.6 1972-04-11 Colo…     3 TRUE 
## 10 AC/S… Barb… Fema… Smoker   162.   66.5 1972-02-21 New …     3 FALSE
## 11 AC/S… Benj… Male  Non-S…   177.   90.8 1973-02-03 Indi…     3 FALSE
## 12 AC/S… Bren… Fema… Non-S…   158.   69.8 1972-05-21 Cali…     3 FALSE
## 13 AC/S… Barb… Fema… Smoker   163.   64.5 1973-06-25 Indi…     2 FALSE
## # … with 5 more variables: Count <dbl>, Date.Entered.Study <date>,
## #   Age <dbl>, BMI <dbl>, Overweight <lgl>

We can filter on logical variables straightforwardly.

filter(patients, !Died)
## # A tibble: 46 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  6 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  7 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  8 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
##  9 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## 10 AC/A… Pame… Fema… Non-S…   166.   67.3 1971-11-14 New …     1 FALSE
## # … with 36 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Also we can add extra conditions, separating them with a ,.

filter(patients, Sex == "Male", Died)
## # A tibble: 23 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Marc  Male  Non-S…   181.   72.3 1972-11-22 New …    NA TRUE 
##  2 AC/A… Jimmy Male  Non-S…   179.   75.5 1972-03-17 New …     2 TRUE 
##  3 AC/A… Thom… Male  Smoker   170.   90.6 1972-04-18 New …     1 TRUE 
##  4 AC/A… Geor… Male  Non-S…   168.   82.1 1972-11-04 New …     1 TRUE 
##  5 AC/A… Stev… Male  Non-S…   181.   83.9 1973-10-19 New …     3 TRUE 
##  6 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  7 AC/A… Aaron Male  Non-S…   166.   76.7 1972-07-15 Geor…     1 TRUE 
##  8 AC/A… Brent Male  Non-S…   175.   83.6 1973-05-07 New …     3 TRUE 
##  9 AC/A… Joel  Male  Non-S…   166.   76.8 1972-08-23 New …    NA TRUE 
## 10 AC/A… Aaron Male  Non-S…   181.   83.6 1972-03-30 Cali…     3 TRUE 
## # … with 13 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

In this example all males who have died are selected. The , is the equivalent of using the Boolean operator &.

filter(patients, Sex == "Male" & Died)
## # A tibble: 23 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Marc  Male  Non-S…   181.   72.3 1972-11-22 New …    NA TRUE 
##  2 AC/A… Jimmy Male  Non-S…   179.   75.5 1972-03-17 New …     2 TRUE 
##  3 AC/A… Thom… Male  Smoker   170.   90.6 1972-04-18 New …     1 TRUE 
##  4 AC/A… Geor… Male  Non-S…   168.   82.1 1972-11-04 New …     1 TRUE 
##  5 AC/A… Stev… Male  Non-S…   181.   83.9 1973-10-19 New …     3 TRUE 
##  6 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  7 AC/A… Aaron Male  Non-S…   166.   76.7 1972-07-15 Geor…     1 TRUE 
##  8 AC/A… Brent Male  Non-S…   175.   83.6 1973-05-07 New …     3 TRUE 
##  9 AC/A… Joel  Male  Non-S…   166.   76.8 1972-08-23 New …    NA TRUE 
## 10 AC/A… Aaron Male  Non-S…   181.   83.6 1972-03-30 Cali…     3 TRUE 
## # … with 13 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

The equivalent in regular R is more verbose and less easy to read.

patients[patients$Sex == "Male" & patients$Died,]
## # A tibble: 23 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Marc  Male  Non-S…   181.   72.3 1972-11-22 New …    NA TRUE 
##  2 AC/A… Jimmy Male  Non-S…   179.   75.5 1972-03-17 New …     2 TRUE 
##  3 AC/A… Thom… Male  Smoker   170.   90.6 1972-04-18 New …     1 TRUE 
##  4 AC/A… Geor… Male  Non-S…   168.   82.1 1972-11-04 New …     1 TRUE 
##  5 AC/A… Stev… Male  Non-S…   181.   83.9 1973-10-19 New …     3 TRUE 
##  6 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  7 AC/A… Aaron Male  Non-S…   166.   76.7 1972-07-15 Geor…     1 TRUE 
##  8 AC/A… Brent Male  Non-S…   175.   83.6 1973-05-07 New …     3 TRUE 
##  9 AC/A… Joel  Male  Non-S…   166.   76.8 1972-08-23 New …    NA TRUE 
## 10 AC/A… Aaron Male  Non-S…   181.   83.6 1972-03-30 Cali…     3 TRUE 
## # … with 13 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

We can use the | Boolean operator to select patients above a given weight or BMI.

filter(patients, Weight > 90 | BMI > 28)
## # A tibble: 22 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  2 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
##  3 AC/A… Kimb… Fema… Non-S…   160.   72.4 1972-05-04 Geor…     2 TRUE 
##  4 AC/A… Kenn… Male  Smoker   175.   92.2 1972-03-22 Colo…     3 FALSE
##  5 AC/A… Trav… Male  Non-S…   176.   97.7 1973-04-14 New …     2 FALSE
##  6 AC/A… Thom… Male  Smoker   170.   90.6 1972-04-18 New …     1 TRUE 
##  7 AC/A… Geor… Male  Non-S…   168.   82.1 1972-11-04 New …     1 TRUE 
##  8 AC/A… Jeff… Male  Non-S…   182.   96.9 1972-12-10 Colo…     2 FALSE
##  9 AC/A… Kris… Fema… Non-S…   160.   71.9 1973-09-28 New …     2 TRUE 
## 10 AC/A… Sher… Fema… Non-S…   157.   69.6 1973-07-20 Cali…     2 TRUE 
## # … with 12 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Mixing both Boolean operators and , is also possible.

filter(patients, Weight > 90 | BMI > 28, Sex == "Female")
## # A tibble: 4 x 15
##   ID    Name  Sex   Smokes Height Weight Birth      State Grade Died  Count
##   <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl> <dbl>
## 1 AC/A… Kimb… Fema… Non-S…   160.   72.4 1972-05-04 Geor…     2 TRUE  -2.41
## 2 AC/A… Kris… Fema… Non-S…   160.   71.9 1973-09-28 New …     2 TRUE   0.74
## 3 AC/A… Sher… Fema… Non-S…   157.   69.6 1973-07-20 Cali…     2 TRUE  -0.7 
## 4 AC/A… April Fema… Non-S…   161.   73.6 1972-02-14 Indi…     3 FALSE -0.82
## # … with 4 more variables: Date.Entered.Study <date>, Age <dbl>,
## #   BMI <dbl>, Overweight <lgl>

Exercise: filtering rows

See separate markdown document.

Sorting rows

Another dplyr verb that works on rows in a table is arrange. This is used to sort rows in a dataset based on one or more variables.

For example, let’s say we want to sort our patients by height.

arrange(patients, Height)
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/S… Bran… Fema… Non-S…   157.   66.6 1972-04-11 Colo…     3 TRUE 
##  2 AC/A… Sher… Fema… Non-S…   157.   69.6 1973-07-20 Cali…     2 TRUE 
##  3 AC/A… Mich… Fema… Non-S…   158.   67.4 1972-05-12 Geor…     3 TRUE 
##  4 AC/S… Bren… Fema… Non-S…   158.   69.8 1972-05-21 Cali…     3 FALSE
##  5 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  6 AC/S… Laura Fema… Non-S…   158.   69.7 1972-06-08 New …     3 TRUE 
##  7 AC/S… Mary  Fema… Non-S…   159.   65.1 1973-06-01 Colo…     1 FALSE
##  8 AC/S… Wendy Fema… Non-S…   159.   67.0 1972-06-25 Geor…     1 TRUE 
##  9 AC/S… Erica Fema… Non-S…   159.   70.0 1973-10-04 Cali…     2 TRUE 
## 10 AC/S… Sher… Fema… Non-S…   159.   64.9 1973-02-04 Geor…     2 TRUE 
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

This has arranged the rows in order of ascending height. What if we wanted descending order of height?

arrange(patients, desc(Height))
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Char… Male  Non-S…   185.   87.2 1973-07-19 New …     1 FALSE
##  2 AC/S… Doug… Male  Smoker   185.   73.6 1973-06-11 New …     3 FALSE
##  3 AC/S… Joel  Male  Non-S…   184.   90.4 1972-06-11 Cali…     3 TRUE 
##  4 AC/A… Stev… Male  Smoker   183.   83.4 1973-10-05 New …     2 FALSE
##  5 AC/S… Russ… Male  Non-S…   183.   82.5 1972-10-30 New …     3 TRUE 
##  6 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  7 AC/A… Jeff… Male  Non-S…   182.   96.9 1972-12-10 Colo…     2 FALSE
##  8 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Stev… Male  Non-S…   181.   83.9 1973-10-19 New …     3 TRUE 
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

We can use sort using multiple variables, e.g. first by Grade in descending order, then by Sex and then Smokes.

arrange(patients, desc(Grade), Sex, Smokes)
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Tere… Fema… Non-S…   163.   70.5 1973-04-22 Indi…     3 TRUE 
##  2 AC/A… Mich… Fema… Non-S…   158.   67.4 1972-05-12 Geor…     3 TRUE 
##  3 AC/A… Wendy Fema… Non-S…   164.   66.7 1971-12-29 New …     3 TRUE 
##  4 AC/A… April Fema… Non-S…   161.   73.6 1972-02-14 Indi…     3 FALSE
##  5 AC/S… Vale… Fema… Non-S…   162.   70.0 1972-04-09 Cali…     3 TRUE 
##  6 AC/S… Bran… Fema… Non-S…   157.   66.6 1972-04-11 Colo…     3 TRUE 
##  7 AC/S… Sarah Fema… Non-S…   160.   68.2 1972-04-21 Geor…     3 TRUE 
##  8 AC/S… Carr… Fema… Non-S…   164.   71.5 1973-03-18 New …     3 FALSE
##  9 AC/S… Bren… Fema… Non-S…   158.   69.8 1972-05-21 Cali…     3 FALSE
## 10 AC/S… Laura Fema… Non-S…   158.   69.7 1972-06-08 New …     3 TRUE 
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Sorting is commonly used in workflows usually as one of the last steps before presentation or writing out the resulting table to a file.

The following concise and easy-to-read workflow includes steps that use all of the dplyr verbs we have covered so far.

candidates <- patients %>%
  filter(!Died) %>%
  select(ID, Name, Sex, Smokes, Height, Weight) %>%
  mutate(BMI = Weight / (Height / 100) ** 2) %>%
  mutate(Overweight = BMI > 25) %>%
  arrange(BMI)
candidates
## # A tibble: 46 x 8
##    ID        Name    Sex    Smokes     Height Weight   BMI Overweight
##    <chr>     <chr>   <chr>  <chr>       <dbl>  <dbl> <dbl> <lgl>     
##  1 AC/SG/193 Douglas Male   Smoker       185.   73.6  21.4 FALSE     
##  2 AC/SG/099 Richard Male   Non-Smoker   173.   67.6  22.7 FALSE     
##  3 AC/AH/001 Michael Male   Non-Smoker   183.   76.6  22.9 FALSE     
##  4 AC/AH/210 Alicia  Female Smoker       170.   66.7  23.1 FALSE     
##  5 AC/AH/086 Jeffrey Male   Smoker       180.   75.7  23.3 FALSE     
##  6 AC/AH/045 Ronald  Male   Non-Smoker   181.   76.9  23.4 FALSE     
##  7 AC/AH/164 Brett   Male   Smoker       177.   74.0  23.6 FALSE     
##  8 AC/AH/114 Mark    Male   Non-Smoker   178.   74.8  23.7 FALSE     
##  9 AC/AH/077 Andrew  Male   Non-Smoker   174.   72.2  23.8 FALSE     
## 10 AC/SG/173 Barbara Female Smoker       163.   64.5  24.2 FALSE     
## # … with 36 more rows