A tidy dataset is a data frame (or table) for which the following are true:
There are 5 key dplyr functions with which the vast majority of data manipulations can be achieved.
select
– select variables (columns) of interestfilter
– pick observations (rows) based on their valuesarrange
– reorder the observations based on one or more variablesmutate
– modify existing or create new variables by applying functions to existing variablessummarize
– collapse many values to a single summaryWe’ve covered the first four and will focus now on summarizing.
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
patients <- read_tsv("data/patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Count = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
How do you compute the mean height of patients in the dataset?
mean(patients$Height)
## [1] 167.9267
This has collapsed a vector of height values to a summary, i.e. the average value. The dplyr
function or ‘verb’ for doing this is summarize
(or summarise
).
summarize(patients, mean(Height))
## # A tibble: 1 x 1
## `mean(Height)`
## <dbl>
## 1 168.
This is not particularly useful on it’s own but you can compute multiple summarized values at the same time.
summarize(patients, mean(Height), sd(Height))
## # A tibble: 1 x 2
## `mean(Height)` `sd(Height)`
## <dbl> <dbl>
## 1 168. 8.04
We can rename the column headings.
summarize(patients, mean = mean(Height), standard_deviation = sd(Height))
## # A tibble: 1 x 2
## mean standard_deviation
## <dbl> <dbl>
## 1 168. 8.04
We might also want to compute summaries on multiple columns. Here we can use the summarize_at
or summarize_if
functions that work in a very similar way to mutate_at
and mutate_if
that we’ve come across earlier
summarize_at(patients, vars(Height, Weight), mean)
## # A tibble: 1 x 2
## Height Weight
## <dbl> <dbl>
## 1 168. 74.9
We can also use the same helper functions we’ve used before with select
.
summarise_at(patients, vars(ends_with('eight')), mean)
## # A tibble: 1 x 3
## Height Weight Overweight
## <dbl> <dbl> <dbl>
## 1 168. 74.9 0.77
Something interesting and quite useful just happened with the Overweight variable. It is a logical yet still a mean value was computed. What does the resulting value represent? (answer: the proportion of patients that are overweight).
If we want to compute the mean of all columns containing numeric values we could use summarize_if
.
summarize_if(patients, is.numeric, mean)
## # A tibble: 1 x 6
## Height Weight Grade Count Age BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 NA -0.107 43.1 26.5
Why does the Grade column give a result of NA
?
What do we need to do to fix this?
summarize_if(patients, is.numeric, mean, na.rm = TRUE)
## # A tibble: 1 x 6
## Height Weight Grade Count Age BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 2.05 -0.107 43.1 26.5
We can compute multiple summarized values on multiple columns.
summarize_at(patients, vars(Height, Weight), list(mean, sd))
## # A tibble: 1 x 4
## Height_fn1 Weight_fn1 Height_fn2 Weight_fn2
## <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 8.04 8.56
And again but with more ascetically pleasing column headings.
summarize_at(patients, vars(Height, Weight), list(mean = mean, sd = sd))
## # A tibble: 1 x 4
## Height_mean Weight_mean Height_sd Weight_sd
## <dbl> <dbl> <dbl> <dbl>
## 1 168. 74.9 8.04 8.56
summarize
only really becomes useful when working with subsets of the data, or groups. Here we use the group_by
function and it really helps if we do so in the context of a workflow using the pipe mechanism (%>%
).
For example, let’s compute the average height of males and females separately.
patients %>%
summarize(mean(Height))
## # A tibble: 1 x 1
## `mean(Height)`
## <dbl>
## 1 168.
patients %>%
group_by(Sex) %>%
summarize(mean(Height))
## # A tibble: 2 x 2
## Sex `mean(Height)`
## <chr> <dbl>
## 1 Female 162.
## 2 Male 175.
patients %>%
group_by(Sex) %>%
summarize(`Average height` = mean(Height))
## # A tibble: 2 x 2
## Sex `Average height`
## <chr> <dbl>
## 1 Female 162.
## 2 Male 175.
patients %>%
group_by(Sex) %>%
summarize_at(vars(Height, Weight), mean)
## # A tibble: 2 x 3
## Sex Height Weight
## <chr> <dbl> <dbl>
## 1 Female 162. 68.9
## 2 Male 175. 82.2
patients %>%
group_by(Sex) %>%
summarize_at(vars(Height, Weight), list(mean = mean, stddev = sd))
## # A tibble: 2 x 5
## Sex Height_mean Weight_mean Height_stddev Weight_stddev
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 162. 68.9 2.84 2.78
## 2 Male 175. 82.2 5.88 7.47
patients %>%
group_by(Sex, State) %>%
summarize(
Mean_height = mean(Height),
Stddev_height = sd(Height)
)
## # A tibble: 12 x 4
## # Groups: Sex [2]
## Sex State Mean_height Stddev_height
## <chr> <chr> <dbl> <dbl>
## 1 Female California 162. 4.41
## 2 Female Colorado 162. 3.28
## 3 Female Georgia 161. 2.61
## 4 Female Indiana 162. 1.53
## 5 Female New Jersey 162. 3.67
## 6 Female New York 161. 2.01
## 7 Male California 180. 5.19
## 8 Male Colorado 174. 5.34
## 9 Male Georgia 176. 6.42
## 10 Male Indiana 175. 4.24
## 11 Male New Jersey 173. 5.49
## 12 Male New York 177. 6.93
Change the order of the grouping variables if it makes more sense to group first by State and then by Sex.
patients %>%
group_by(State, Sex) %>%
summarize(
Mean_height = mean(Height),
Stddev_height = sd(Height)
)
## # A tibble: 12 x 4
## # Groups: State [6]
## State Sex Mean_height Stddev_height
## <chr> <chr> <dbl> <dbl>
## 1 California Female 162. 4.41
## 2 California Male 180. 5.19
## 3 Colorado Female 162. 3.28
## 4 Colorado Male 174. 5.34
## 5 Georgia Female 161. 2.61
## 6 Georgia Male 176. 6.42
## 7 Indiana Female 162. 1.53
## 8 Indiana Male 175. 4.24
## 9 New Jersey Female 162. 3.67
## 10 New Jersey Male 173. 5.49
## 11 New York Female 161. 2.01
## 12 New York Male 177. 6.93
It would also be helpful to know how many observations of each group there are.
patients %>%
group_by(State, Sex) %>%
summarize(
Mean_height = mean(Height),
Stddev_height = sd(Height),
N = n()
)
## # A tibble: 12 x 5
## # Groups: State [6]
## State Sex Mean_height Stddev_height N
## <chr> <chr> <dbl> <dbl> <int>
## 1 California Female 162. 4.41 7
## 2 California Male 180. 5.19 3
## 3 Colorado Female 162. 3.28 7
## 4 Colorado Male 174. 5.34 8
## 5 Georgia Female 161. 2.61 9
## 6 Georgia Male 176. 6.42 7
## 7 Indiana Female 162. 1.53 12
## 8 Indiana Male 175. 4.24 6
## 9 New Jersey Female 162. 3.67 10
## 10 New Jersey Male 173. 5.49 10
## 11 New York Female 161. 2.01 10
## 12 New York Male 177. 6.93 11
And finally we don’t really need all those decimal places.
patients %>%
group_by(State, Sex) %>%
summarize(
Mean_height = mean(Height),
Stddev_height = sd(Height),
N = n()
) %>%
mutate_at(vars(Mean_height, Stddev_height), round, digits = 1)
## # A tibble: 12 x 5
## # Groups: State [6]
## State Sex Mean_height Stddev_height N
## <chr> <chr> <dbl> <dbl> <int>
## 1 California Female 162. 4.4 7
## 2 California Male 180. 5.2 3
## 3 Colorado Female 162. 3.3 7
## 4 Colorado Male 174. 5.3 8
## 5 Georgia Female 161 2.6 9
## 6 Georgia Male 176. 6.4 7
## 7 Indiana Female 162. 1.5 12
## 8 Indiana Male 175. 4.2 6
## 9 New Jersey Female 162. 3.7 10
## 10 New Jersey Male 173. 5.5 10
## 11 New York Female 161. 2 10
## 12 New York Male 177. 6.9 11
Compute the median weight and BMI and the interquartile ranges for smokers and non-smokers.
patients %>%
group_by(Smokes) %>%
summarize_at(vars(Weight, BMI), list(median = median, IQR = IQR))
## # A tibble: 2 x 5
## Smokes Weight_median BMI_median Weight_IQR BMI_IQR
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Non-Smoker 71.8 26.6 13.1 2.54
## 2 Smoker 72.9 25.6 8.89 3.17
Compute the numbers of patients with a given Grade separately for each State.
patients %>%
group_by(State, Grade) %>%
summarize(N = n())
## # A tibble: 21 x 3
## # Groups: State [6]
## State Grade N
## <chr> <dbl> <int>
## 1 California 1 1
## 2 California 2 3
## 3 California 3 6
## 4 Colorado 1 5
## 5 Colorado 2 5
## 6 Colorado 3 5
## 7 Georgia 1 5
## 8 Georgia 2 7
## 9 Georgia 3 3
## 10 Georgia NA 1
## # … with 11 more rows
There is a more concise way of doing this using the count
function.
count(patients, State, Grade)
## # A tibble: 21 x 3
## State Grade n
## <chr> <dbl> <int>
## 1 California 1 1
## 2 California 2 3
## 3 California 3 6
## 4 Colorado 1 5
## 5 Colorado 2 5
## 6 Colorado 3 5
## 7 Georgia 1 5
## 8 Georgia 2 7
## 9 Georgia 3 3
## 10 Georgia NA 1
## # … with 11 more rows
It is sometimes necessary to use ungroup
when summarizing based on grouping by multiple variables particularly when the results are going to be further manipulated.
summary <- patients %>%
group_by(Sex, State) %>%
summarize(Height = mean(Height), Weight = mean(Weight))
summary
## # A tibble: 12 x 4
## # Groups: Sex [2]
## Sex State Height Weight
## <chr> <chr> <dbl> <dbl>
## 1 Female California 162. 69.7
## 2 Female Colorado 162. 70.0
## 3 Female Georgia 161. 68.7
## 4 Female Indiana 162. 69.4
## 5 Female New Jersey 162. 68.1
## 6 Female New York 161. 68.0
## 7 Male California 180. 82.1
## 8 Male Colorado 174. 87.1
## 9 Male Georgia 176. 78.2
## 10 Male Indiana 175. 84.0
## 11 Male New Jersey 173. 80.6
## 12 Male New York 177. 81.7
summary %>%
summarize_if(is.numeric, median)
## # A tibble: 2 x 3
## Sex Height Weight
## <chr> <dbl> <dbl>
## 1 Female 162. 69.0
## 2 Male 175. 81.9
What happened here?
It seems that the summary
object is still grouped in some way.
groups(summary)
## [[1]]
## Sex
class(summary)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Let’s repeat the original operation but this time explicitly remove the grouping after sumarizing.
summary <- patients %>%
group_by(Sex, State) %>%
summarize(Height = mean(Height), Weight = mean(Weight)) %>%
ungroup()
summary
## # A tibble: 12 x 4
## Sex State Height Weight
## <chr> <chr> <dbl> <dbl>
## 1 Female California 162. 69.7
## 2 Female Colorado 162. 70.0
## 3 Female Georgia 161. 68.7
## 4 Female Indiana 162. 69.4
## 5 Female New Jersey 162. 68.1
## 6 Female New York 161. 68.0
## 7 Male California 180. 82.1
## 8 Male Colorado 174. 87.1
## 9 Male Georgia 176. 78.2
## 10 Male Indiana 175. 84.0
## 11 Male New Jersey 173. 80.6
## 12 Male New York 177. 81.7
groups(summary)
## NULL
class(summary)
## [1] "tbl_df" "tbl" "data.frame"
summary %>%
summarize_if(is.numeric, median)
## # A tibble: 1 x 2
## Height Weight
## <dbl> <dbl>
## 1 168. 74.1
Why do group_by
and summarize
behave like this, i.e. retaining all but the last grouping variable? There are some scenarios where this can helpful.
diabetes <- read_tsv("data/diabetes.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Date = col_date(format = ""),
## Glucose = col_double(),
## BP = col_double()
## )
diabetes
## # A tibble: 1,316 x 4
## ID Date Glucose BP
## <chr> <date> <dbl> <dbl>
## 1 AC/AH/001 2011-03-07 100 98
## 2 AC/AH/001 2011-03-14 110 89
## 3 AC/AH/001 2011-03-24 94 88
## 4 AC/AH/001 2011-03-31 111 92
## 5 AC/AH/001 2011-04-03 94 83
## 6 AC/AH/001 2011-05-21 110 93
## 7 AC/AH/001 2011-06-24 105 79
## 8 AC/AH/001 2011-07-11 88 86
## 9 AC/AH/001 2011-07-11 101 92
## 10 AC/AH/001 2011-07-13 112 88
## # … with 1,306 more rows
diabetes <- diabetes %>%
separate(Date, into = c("year", "month", "day"), sep = "-")
diabetes
## # A tibble: 1,316 x 6
## ID year month day Glucose BP
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 AC/AH/001 2011 03 07 100 98
## 2 AC/AH/001 2011 03 14 110 89
## 3 AC/AH/001 2011 03 24 94 88
## 4 AC/AH/001 2011 03 31 111 92
## 5 AC/AH/001 2011 04 03 94 83
## 6 AC/AH/001 2011 05 21 110 93
## 7 AC/AH/001 2011 06 24 105 79
## 8 AC/AH/001 2011 07 11 88 86
## 9 AC/AH/001 2011 07 11 101 92
## 10 AC/AH/001 2011 07 13 112 88
## # … with 1,306 more rows
daily <- group_by(diabetes, year, month, day)
groups(daily)
## [[1]]
## year
##
## [[2]]
## month
##
## [[3]]
## day
per_day <- summarize(daily, measurements = n())
groups(per_day)
## [[1]]
## year
##
## [[2]]
## month
per_day
## # A tibble: 200 x 4
## # Groups: year, month [8]
## year month day measurements
## <chr> <chr> <chr> <int>
## 1 2011 02 16 5
## 2 2011 02 17 10
## 3 2011 02 18 7
## 4 2011 02 19 1
## 5 2011 02 20 4
## 6 2011 02 21 2
## 7 2011 02 22 6
## 8 2011 02 23 4
## 9 2011 02 24 4
## 10 2011 02 25 8
## # … with 190 more rows
per_month <- summarize(per_day, measurements = sum(measurements))
groups(per_month)
## [[1]]
## year
per_month
## # A tibble: 8 x 3
## # Groups: year [1]
## year month measurements
## <chr> <chr> <int>
## 1 2011 02 71
## 2 2011 03 211
## 3 2011 04 205
## 4 2011 05 195
## 5 2011 06 208
## 6 2011 07 204
## 7 2011 08 206
## 8 2011 09 16
per_year <- summarize(per_month, measurements = sum(measurements))
groups(per_year)
## NULL
per_year
## # A tibble: 1 x 2
## year measurements
## <chr> <int>
## 1 2011 1316
Each time we peel away one layer of grouping.
Tip: always use ungroup
after grouping on multiple variables to avoid potential problems further down the line.
mutate
group_by
is most commonly used with summarize
but other functions such as mutate
can be applied separately to subsets of the data in the same way.
As an example, we assign the rank of the individuals by height (tallest = 1, second tallest = 2, etc.) within each state. We need to group by State then sort into descending height order and then assign the rank. For the last part we use another helpful function, row_number
.
ranked_by_height <- patients %>%
group_by(State) %>%
arrange(desc(Height)) %>%
mutate(rank = row_number())
ranked_by_height
## # A tibble: 100 x 16
## # Groups: State [6]
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Char… Male Non-S… 185. 87.2 1973-07-19 New … 1 FALSE
## 2 AC/S… Doug… Male Smoker 185. 73.6 1973-06-11 New … 3 FALSE
## 3 AC/S… Joel Male Non-S… 184. 90.4 1972-06-11 Cali… 3 TRUE
## 4 AC/A… Stev… Male Smoker 183. 83.4 1973-10-05 New … 2 FALSE
## 5 AC/S… Russ… Male Non-S… 183. 82.5 1972-10-30 New … 3 TRUE
## 6 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 7 AC/A… Jeff… Male Non-S… 182. 96.9 1972-12-10 Colo… 2 FALSE
## 8 AC/A… Bran… Male Smoker 182. 79.5 1972-05-09 Geor… 1 TRUE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Stev… Male Non-S… 181. 83.9 1973-10-19 New … 3 TRUE
## # … with 90 more rows, and 6 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>,
## # rank <int>
Note that because we haven’t carried out a summarize
operation the grouping is still intact.
class(ranked_by_height)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
groups(ranked_by_height)
## [[1]]
## State
We should probably remove that if we plan to do some further operations on this data frame.
ranked_by_height <- ungroup(ranked_by_height)
class(ranked_by_height)
## [1] "tbl_df" "tbl" "data.frame"
groups(ranked_by_height)
## NULL