Last week you cleaned up the patient-data.txt file using mutate and various stringr functions. This week we would like you to do the same again, but this time use the “pipe” (%>%) to do the clean up as a single workflow.
Assign the final cleaned data frame to a new object.
There are various ways that this can be achieved; the solution below is one example.
library(tidyverse)
cleaned_patients <- read.delim("data/patient-data.txt", stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  mutate_all(str_trim) %>% # we can remove the leading/trailing spaces from all columns at once
  mutate(Sex = as.factor(Sex)) %>%
  mutate(Smokes = Smokes %in% c("TRUE", "Yes")) %>%
  mutate_at(vars(Height, Weight), str_remove, pattern = "kg|cm") %>% # the `|` in the pattern means OR
  mutate_at(vars(Height, Weight), as.numeric) %>%
  mutate(Birth = as.Date(Birth)) %>% # base R `as.Date` converts the date strings to the `date` vector type
  mutate(State = str_replace(State, "Californa", "California")) %>%
  mutate(State = str_to_title(State)) %>% # fix "New york", "New jersey" and "indiana"
  mutate(State = as.factor(State)) %>%
  mutate(Grade_Level = na_if(Grade_Level, "99")) %>% # We'll assume the `99`s are supposed to be missing data
  mutate(Grade_Level = as.factor(Grade_Level)) %>% # although these are numbers, Grade is really a category
  mutate(Died = as.logical(Died)) %>%
  mutate(Date.Entered.Study = as.Date(Date.Entered.Study))
cleaned_patients
## # A tibble: 100 x 12
## ID Name Sex Smokes Height Weight Birth State Grade_Level
## <chr> <chr> <fct> <lgl> <dbl> <dbl> <date> <fct> <fct>
## 1 AC/A… Mich… Male FALSE 183. 76.6 1972-02-06 Geor… 2
## 2 AC/A… Derek Male FALSE 179. 80.4 1972-06-15 Colo… 2
## 3 AC/A… Todd Male FALSE 169. 75.5 1972-07-09 New … 2
## 4 AC/A… Rona… Male FALSE 176. 94.5 1972-08-17 Colo… 1
## 5 AC/A… Chri… Fema… FALSE 164. 71.8 1973-06-12 Geor… 2
## 6 AC/A… Dana Fema… TRUE 158. 69.9 1973-07-01 Indi… 2
## 7 AC/A… Erin Fema… FALSE 162. 68.8 1972-03-26 New … 1
## 8 AC/A… Rach… Fema… FALSE 166. 70.4 1973-05-11 Colo… 1
## 9 AC/A… Rona… Male FALSE 181. 76.9 1971-12-31 Geor… 1
## 10 AC/A… Bryan Male FALSE 167. 79.1 1973-07-19 New … 2
## # … with 90 more rows, and 3 more variables: Died <lgl>, Count <chr>,
## # Date.Entered.Study <date>
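In dplyr 1.0.0 and later, mutate_all and mutate_at are superseded by across(). As a rough sketch of how the multi-column steps above could look in that style, the following uses a small made-up table (toy and its values are hypothetical stand-ins, not the real patient data):

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for patient-data.txt
toy <- tibble(
  Name   = c(" Michael", "Dana "),
  Height = c("182.9cm", "158.3cm"),
  Weight = c("76.6kg", "69.9kg")
)

toy_clean <- toy %>%
  # equivalent of mutate_all(str_trim)
  mutate(across(everything(), str_trim)) %>%
  # equivalent of mutate_at(vars(Height, Weight), str_remove, pattern = "kg|cm")
  mutate(across(c(Height, Weight), ~ str_remove(.x, "kg|cm"))) %>%
  # equivalent of mutate_at(vars(Height, Weight), as.numeric)
  mutate(across(c(Height, Weight), as.numeric))

toy_clean
```

Both styles should give the same result here; across() simply keeps everything inside ordinary mutate calls.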
For this exercise use the clean data table patient-data-cleaned.txt.
We would like to output a subset of the data to a new file. To write to a file you will need to use a function we haven’t shown you, so you’ll have to check the readr package help pages to find the tidyverse write functions.
Use pipes to do this as a workflow.
The table should:
To choose the columns, use the select command.
Because we need the State column for step (d), it is best to leave the select until the end.
There is a rename function, but renaming can also be done during the select command.
Because the result is written straight to a file, there is no need to assign (<-) it to an object.
read_tsv("data/patient-data-cleaned.txt") %>%
  mutate(`Weight (g)` = Weight * 1000) %>%
  filter(str_detect(State, "New") & Smokes == "Non-Smoker") %>%
  select(ID, Name, Sex, Smokes, BMI, Grade, `Height (cm)` = Height, `Weight (g)`) %>%
  write_tsv("results/East_Coast_NonSmokers.txt")
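One way to convince yourself that a workflow like this wrote what you expected is to read the file straight back in. The sketch below does that with a small made-up table and a temporary file (toy, out_file and check are hypothetical names, and show_col_types assumes readr 2.0 or later):

```r
library(dplyr)
library(readr)
library(stringr)

# Hypothetical toy data, not the real patient table
toy <- tibble(
  ID     = c("P1", "P2", "P3"),
  State  = c("New York", "Georgia", "New Jersey"),
  Smokes = c("Non-Smoker", "Non-Smoker", "Smoker"),
  Height = c(170, 180, 165),
  Weight = c(70.5, 80.2, 60.1)
)

# Write to a temporary file so nothing in the project is overwritten
out_file <- tempfile(fileext = ".txt")

toy %>%
  mutate(`Weight (g)` = Weight * 1000) %>%
  filter(str_detect(State, "New") & Smokes == "Non-Smoker") %>%
  select(ID, Smokes, `Height (cm)` = Height, `Weight (g)`) %>% # rename while selecting
  write_tsv(out_file)

# Read the file back to inspect what was written
check <- read_tsv(out_file, show_col_types = FALSE)
check
```

Only the row that matches both filter conditions should survive, with the renamed column headers appearing in the file.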
For this exercise use the clean data table patient-data-cleaned.txt.
We would like you to modify the table and generate a scatter plot. Use pipes to do this as a workflow, including the ggplot as part of the workflow.
Create an xy scatter plot with Height on the x and Weight on the y where:
Overweight is already a logical vector, so we can filter directly; there is no need to use == TRUE.
labs in the ggplot?
read_tsv("data/patient-data-cleaned.txt") %>%
  mutate(`Height (m)` = Height * 0.01) %>%
  filter(State %in% c("Colorado", "Georgia")) %>%
  filter(Overweight) %>%
  mutate(Sex = as.factor(Sex)) %>%
  ggplot(aes(x = `Height (m)`, y = Weight)) +
  geom_point(aes(colour = Sex)) +
  facet_wrap(~ State)
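The point about filtering on a logical column can be checked with a tiny example. This is only a sketch on made-up data (toy, direct and explicit are hypothetical names); it compares filter(Overweight) with filter(Overweight == TRUE), including a missing value:

```r
library(dplyr)

# Hypothetical toy data with a logical column, including a missing value
toy <- tibble(
  ID         = c("P1", "P2", "P3"),
  Overweight = c(TRUE, FALSE, NA)
)

direct   <- toy %>% filter(Overweight)          # filter on the logical column itself
explicit <- toy %>% filter(Overweight == TRUE)  # the longer, redundant form

# Both forms keep only rows where the condition is TRUE (NA rows are dropped)
identical(direct, explicit)
```

Since the two forms select the same rows, the shorter one is usually the clearer choice.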