Recap

Tidy data

A tidy dataset is a data frame (or table) for which the following are true:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell

dplyr basics

There are 5 key dplyr functions with which the vast majority of data manipulations can be achieved.

  • select – select variables (columns) of interest
  • filter – pick observations (rows) based on their values
  • arrange – reorder the observations based on one or more variables
  • mutate – modify existing or create new variables by applying functions to existing variables
  • summarize – collapse many values to a single summary

We’ve covered the first four and will focus now on summarizing.

Summarizing

library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
patients <- read_tsv("data/patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )
patients
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

How do you compute the mean height of patients in the dataset?

mean(patients$Height)
## [1] 167.9267

This has collapsed a vector of height values to a summary, i.e. the average value. The dplyr function or ‘verb’ for doing this is summarize (or summarise).

summarize(patients, mean(Height))
## # A tibble: 1 x 1
##   `mean(Height)`
##            <dbl>
## 1           168.

This is not particularly useful on it’s own but you can compute multiple summarized values at the same time.

summarize(patients, mean(Height), sd(Height))
## # A tibble: 1 x 2
##   `mean(Height)` `sd(Height)`
##            <dbl>        <dbl>
## 1           168.         8.04

We can rename the column headings.

summarize(patients, mean = mean(Height), standard_deviation = sd(Height))
## # A tibble: 1 x 2
##    mean standard_deviation
##   <dbl>              <dbl>
## 1  168.               8.04

We might also want to compute summaries on multiple columns. Here we can use the summarize_at or summarize_if functions that work in a very similar way to mutate_at and mutate_if that we’ve come across earlier

summarize_at(patients, vars(Height, Weight), mean)
## # A tibble: 1 x 2
##   Height Weight
##    <dbl>  <dbl>
## 1   168.   74.9

We can also use the same helper functions we’ve used before with select.

summarise_at(patients, vars(ends_with('eight')), mean)
## # A tibble: 1 x 3
##   Height Weight Overweight
##    <dbl>  <dbl>      <dbl>
## 1   168.   74.9       0.77

Something interesting and quite useful just happened with the Overweight variable. It is a logical yet still a mean value was computed. What does the resulting value represent? (answer: the proportion of patients that are overweight).

If we want to compute the mean of all columns containing numeric values we could use summarize_if.

summarize_if(patients, is.numeric, mean)
## # A tibble: 1 x 6
##   Height Weight Grade  Count   Age   BMI
##    <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   168.   74.9    NA -0.107  43.1  26.5

Why does the Grade column give a result of NA?

What do we need to do to fix this?

summarize_if(patients, is.numeric, mean, na.rm = TRUE)
## # A tibble: 1 x 6
##   Height Weight Grade  Count   Age   BMI
##    <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   168.   74.9  2.05 -0.107  43.1  26.5

We can compute multiple summarized values on multiple columns.

summarize_at(patients, vars(Height, Weight), list(mean, sd))
## # A tibble: 1 x 4
##   Height_fn1 Weight_fn1 Height_fn2 Weight_fn2
##        <dbl>      <dbl>      <dbl>      <dbl>
## 1       168.       74.9       8.04       8.56

And again but with more ascetically pleasing column headings.

summarize_at(patients, vars(Height, Weight), list(mean = mean, sd = sd))
## # A tibble: 1 x 4
##   Height_mean Weight_mean Height_sd Weight_sd
##         <dbl>       <dbl>     <dbl>     <dbl>
## 1        168.        74.9      8.04      8.56

Grouping

summarize only really becomes useful when working with subsets of the data, or groups. Here we use the group_by function and it really helps if we do so in the context of a workflow using the pipe mechanism (%>%).

For example, let’s compute the average height of males and females separately.

patients %>%
  summarize(mean(Height))
## # A tibble: 1 x 1
##   `mean(Height)`
##            <dbl>
## 1           168.
patients %>%
  group_by(Sex) %>%
  summarize(mean(Height))
## # A tibble: 2 x 2
##   Sex    `mean(Height)`
##   <chr>           <dbl>
## 1 Female           162.
## 2 Male             175.
patients %>%
  group_by(Sex) %>%
  summarize(`Average height` = mean(Height))
## # A tibble: 2 x 2
##   Sex    `Average height`
##   <chr>             <dbl>
## 1 Female             162.
## 2 Male               175.
patients %>%
  group_by(Sex) %>%
  summarize_at(vars(Height, Weight), mean)
## # A tibble: 2 x 3
##   Sex    Height Weight
##   <chr>   <dbl>  <dbl>
## 1 Female   162.   68.9
## 2 Male     175.   82.2
patients %>%
  group_by(Sex) %>%
  summarize_at(vars(Height, Weight), list(mean = mean, stddev = sd))
## # A tibble: 2 x 5
##   Sex    Height_mean Weight_mean Height_stddev Weight_stddev
##   <chr>        <dbl>       <dbl>         <dbl>         <dbl>
## 1 Female        162.        68.9          2.84          2.78
## 2 Male          175.        82.2          5.88          7.47

Grouping by multiple variables

patients %>%
  group_by(Sex, State) %>%
  summarize(
    Mean_height = mean(Height),
    Stddev_height = sd(Height)
  )
## # A tibble: 12 x 4
## # Groups:   Sex [2]
##    Sex    State      Mean_height Stddev_height
##    <chr>  <chr>            <dbl>         <dbl>
##  1 Female California        162.          4.41
##  2 Female Colorado          162.          3.28
##  3 Female Georgia           161.          2.61
##  4 Female Indiana           162.          1.53
##  5 Female New Jersey        162.          3.67
##  6 Female New York          161.          2.01
##  7 Male   California        180.          5.19
##  8 Male   Colorado          174.          5.34
##  9 Male   Georgia           176.          6.42
## 10 Male   Indiana           175.          4.24
## 11 Male   New Jersey        173.          5.49
## 12 Male   New York          177.          6.93

Change the order of the grouping variables if it makes more sense to group first by State and then by Sex.

patients %>%
  group_by(State, Sex) %>%
  summarize(
    Mean_height = mean(Height),
    Stddev_height = sd(Height)
  )
## # A tibble: 12 x 4
## # Groups:   State [6]
##    State      Sex    Mean_height Stddev_height
##    <chr>      <chr>        <dbl>         <dbl>
##  1 California Female        162.          4.41
##  2 California Male          180.          5.19
##  3 Colorado   Female        162.          3.28
##  4 Colorado   Male          174.          5.34
##  5 Georgia    Female        161.          2.61
##  6 Georgia    Male          176.          6.42
##  7 Indiana    Female        162.          1.53
##  8 Indiana    Male          175.          4.24
##  9 New Jersey Female        162.          3.67
## 10 New Jersey Male          173.          5.49
## 11 New York   Female        161.          2.01
## 12 New York   Male          177.          6.93

It would also be helpful to know how many observations of each group there are.

patients %>%
  group_by(State, Sex) %>%
  summarize(
    Mean_height = mean(Height),
    Stddev_height = sd(Height),
    N = n()
  )
## # A tibble: 12 x 5
## # Groups:   State [6]
##    State      Sex    Mean_height Stddev_height     N
##    <chr>      <chr>        <dbl>         <dbl> <int>
##  1 California Female        162.          4.41     7
##  2 California Male          180.          5.19     3
##  3 Colorado   Female        162.          3.28     7
##  4 Colorado   Male          174.          5.34     8
##  5 Georgia    Female        161.          2.61     9
##  6 Georgia    Male          176.          6.42     7
##  7 Indiana    Female        162.          1.53    12
##  8 Indiana    Male          175.          4.24     6
##  9 New Jersey Female        162.          3.67    10
## 10 New Jersey Male          173.          5.49    10
## 11 New York   Female        161.          2.01    10
## 12 New York   Male          177.          6.93    11

And finally we don’t really need all those decimal places.

patients %>%
  group_by(State, Sex) %>%
  summarize(
    Mean_height = mean(Height),
    Stddev_height = sd(Height),
    N = n()
  ) %>%
  mutate_at(vars(Mean_height, Stddev_height), round, digits = 1)
## # A tibble: 12 x 5
## # Groups:   State [6]
##    State      Sex    Mean_height Stddev_height     N
##    <chr>      <chr>        <dbl>         <dbl> <int>
##  1 California Female        162.           4.4     7
##  2 California Male          180.           5.2     3
##  3 Colorado   Female        162.           3.3     7
##  4 Colorado   Male          174.           5.3     8
##  5 Georgia    Female        161            2.6     9
##  6 Georgia    Male          176.           6.4     7
##  7 Indiana    Female        162.           1.5    12
##  8 Indiana    Male          175.           4.2     6
##  9 New Jersey Female        162.           3.7    10
## 10 New Jersey Male          173.           5.5    10
## 11 New York   Female        161.           2      10
## 12 New York   Male          177.           6.9    11

Exercise

Compute the median weight and BMI and the interquartile ranges for smokers and non-smokers.

patients %>%
  group_by(Smokes) %>%
  summarize_at(vars(Weight, BMI), list(median = median, IQR = IQR))
## # A tibble: 2 x 5
##   Smokes     Weight_median BMI_median Weight_IQR BMI_IQR
##   <chr>              <dbl>      <dbl>      <dbl>   <dbl>
## 1 Non-Smoker          71.8       26.6      13.1     2.54
## 2 Smoker              72.9       25.6       8.89    3.17

Compute the numbers of patients with a given Grade separately for each State.

patients %>%
  group_by(State, Grade) %>%
  summarize(N = n())
## # A tibble: 21 x 3
## # Groups:   State [6]
##    State      Grade     N
##    <chr>      <dbl> <int>
##  1 California     1     1
##  2 California     2     3
##  3 California     3     6
##  4 Colorado       1     5
##  5 Colorado       2     5
##  6 Colorado       3     5
##  7 Georgia        1     5
##  8 Georgia        2     7
##  9 Georgia        3     3
## 10 Georgia       NA     1
## # … with 11 more rows

There is a more concise way of doing this using the count function.

count(patients, State, Grade)
## # A tibble: 21 x 3
##    State      Grade     n
##    <chr>      <dbl> <int>
##  1 California     1     1
##  2 California     2     3
##  3 California     3     6
##  4 Colorado       1     5
##  5 Colorado       2     5
##  6 Colorado       3     5
##  7 Georgia        1     5
##  8 Georgia        2     7
##  9 Georgia        3     3
## 10 Georgia       NA     1
## # … with 11 more rows

Ungrouping

It is sometimes necessary to use ungroup when summarizing based on grouping by multiple variables particularly when the results are going to be further manipulated.

summary <- patients %>%
  group_by(Sex, State) %>%
  summarize(Height = mean(Height), Weight = mean(Weight))
summary
## # A tibble: 12 x 4
## # Groups:   Sex [2]
##    Sex    State      Height Weight
##    <chr>  <chr>       <dbl>  <dbl>
##  1 Female California   162.   69.7
##  2 Female Colorado     162.   70.0
##  3 Female Georgia      161.   68.7
##  4 Female Indiana      162.   69.4
##  5 Female New Jersey   162.   68.1
##  6 Female New York     161.   68.0
##  7 Male   California   180.   82.1
##  8 Male   Colorado     174.   87.1
##  9 Male   Georgia      176.   78.2
## 10 Male   Indiana      175.   84.0
## 11 Male   New Jersey   173.   80.6
## 12 Male   New York     177.   81.7
summary %>%
  summarize_if(is.numeric, median)
## # A tibble: 2 x 3
##   Sex    Height Weight
##   <chr>   <dbl>  <dbl>
## 1 Female   162.   69.0
## 2 Male     175.   81.9

What happened here?

It seems that the summary object is still grouped in some way.

groups(summary)
## [[1]]
## Sex
class(summary)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Let’s repeat the original operation but this time explicitly remove the grouping after sumarizing.

summary <- patients %>%
  group_by(Sex, State) %>%
  summarize(Height = mean(Height), Weight = mean(Weight)) %>%
  ungroup()
summary
## # A tibble: 12 x 4
##    Sex    State      Height Weight
##    <chr>  <chr>       <dbl>  <dbl>
##  1 Female California   162.   69.7
##  2 Female Colorado     162.   70.0
##  3 Female Georgia      161.   68.7
##  4 Female Indiana      162.   69.4
##  5 Female New Jersey   162.   68.1
##  6 Female New York     161.   68.0
##  7 Male   California   180.   82.1
##  8 Male   Colorado     174.   87.1
##  9 Male   Georgia      176.   78.2
## 10 Male   Indiana      175.   84.0
## 11 Male   New Jersey   173.   80.6
## 12 Male   New York     177.   81.7
groups(summary)
## NULL
class(summary)
## [1] "tbl_df"     "tbl"        "data.frame"
summary %>%
  summarize_if(is.numeric, median)
## # A tibble: 1 x 2
##   Height Weight
##    <dbl>  <dbl>
## 1   168.   74.1

Why do group_by and summarize behave like this, i.e. retaining all but the last grouping variable? There are some scenarios where this can helpful.

diabetes <- read_tsv("data/diabetes.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Date = col_date(format = ""),
##   Glucose = col_double(),
##   BP = col_double()
## )
diabetes
## # A tibble: 1,316 x 4
##    ID        Date       Glucose    BP
##    <chr>     <date>       <dbl> <dbl>
##  1 AC/AH/001 2011-03-07     100    98
##  2 AC/AH/001 2011-03-14     110    89
##  3 AC/AH/001 2011-03-24      94    88
##  4 AC/AH/001 2011-03-31     111    92
##  5 AC/AH/001 2011-04-03      94    83
##  6 AC/AH/001 2011-05-21     110    93
##  7 AC/AH/001 2011-06-24     105    79
##  8 AC/AH/001 2011-07-11      88    86
##  9 AC/AH/001 2011-07-11     101    92
## 10 AC/AH/001 2011-07-13     112    88
## # … with 1,306 more rows
diabetes <- diabetes %>%
  separate(Date, into = c("year", "month", "day"), sep = "-")
diabetes
## # A tibble: 1,316 x 6
##    ID        year  month day   Glucose    BP
##    <chr>     <chr> <chr> <chr>   <dbl> <dbl>
##  1 AC/AH/001 2011  03    07        100    98
##  2 AC/AH/001 2011  03    14        110    89
##  3 AC/AH/001 2011  03    24         94    88
##  4 AC/AH/001 2011  03    31        111    92
##  5 AC/AH/001 2011  04    03         94    83
##  6 AC/AH/001 2011  05    21        110    93
##  7 AC/AH/001 2011  06    24        105    79
##  8 AC/AH/001 2011  07    11         88    86
##  9 AC/AH/001 2011  07    11        101    92
## 10 AC/AH/001 2011  07    13        112    88
## # … with 1,306 more rows
daily <- group_by(diabetes, year, month, day)
groups(daily)
## [[1]]
## year
## 
## [[2]]
## month
## 
## [[3]]
## day
per_day <- summarize(daily, measurements = n())
groups(per_day)
## [[1]]
## year
## 
## [[2]]
## month
per_day
## # A tibble: 200 x 4
## # Groups:   year, month [8]
##    year  month day   measurements
##    <chr> <chr> <chr>        <int>
##  1 2011  02    16               5
##  2 2011  02    17              10
##  3 2011  02    18               7
##  4 2011  02    19               1
##  5 2011  02    20               4
##  6 2011  02    21               2
##  7 2011  02    22               6
##  8 2011  02    23               4
##  9 2011  02    24               4
## 10 2011  02    25               8
## # … with 190 more rows
per_month <- summarize(per_day, measurements = sum(measurements))
groups(per_month)
## [[1]]
## year
per_month
## # A tibble: 8 x 3
## # Groups:   year [1]
##   year  month measurements
##   <chr> <chr>        <int>
## 1 2011  02              71
## 2 2011  03             211
## 3 2011  04             205
## 4 2011  05             195
## 5 2011  06             208
## 6 2011  07             204
## 7 2011  08             206
## 8 2011  09              16
per_year <- summarize(per_month, measurements = sum(measurements))
groups(per_year)
## NULL
per_year
## # A tibble: 1 x 2
##   year  measurements
##   <chr>        <int>
## 1 2011          1316

Each time we peel away one layer of grouping.

Tip: always use ungroup after grouping on multiple variables to avoid potential problems further down the line.

Grouping with mutate

group_by is most commonly used with summarize but other functions such as mutate can be applied separately to subsets of the data in the same way.

As an example, we assign the rank of the individuals by height (tallest = 1, second tallest = 2, etc.) within each state. We need to group by State then sort into descending height order and then assign the rank. For the last part we use another helpful function, row_number.

ranked_by_height <- patients %>%
  group_by(State) %>%
  arrange(desc(Height)) %>%
  mutate(rank = row_number())
ranked_by_height
## # A tibble: 100 x 16
## # Groups:   State [6]
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Char… Male  Non-S…   185.   87.2 1973-07-19 New …     1 FALSE
##  2 AC/S… Doug… Male  Smoker   185.   73.6 1973-06-11 New …     3 FALSE
##  3 AC/S… Joel  Male  Non-S…   184.   90.4 1972-06-11 Cali…     3 TRUE 
##  4 AC/A… Stev… Male  Smoker   183.   83.4 1973-10-05 New …     2 FALSE
##  5 AC/S… Russ… Male  Non-S…   183.   82.5 1972-10-30 New …     3 TRUE 
##  6 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  7 AC/A… Jeff… Male  Non-S…   182.   96.9 1972-12-10 Colo…     2 FALSE
##  8 AC/A… Bran… Male  Smoker   182.   79.5 1972-05-09 Geor…     1 TRUE 
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Stev… Male  Non-S…   181.   83.9 1973-10-19 New …     3 TRUE 
## # … with 90 more rows, and 6 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>,
## #   rank <int>

Note that because we haven’t carried out a summarize operation the grouping is still intact.

class(ranked_by_height)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
groups(ranked_by_height)
## [[1]]
## State

We should probably remove that if we plan to do some further operations on this data frame.

ranked_by_height <- ungroup(ranked_by_height)
class(ranked_by_height)
## [1] "tbl_df"     "tbl"        "data.frame"
groups(ranked_by_height)
## NULL