This exercise uses a more realistic dataset that builds on the patients data frame we’ve already been working with.
The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.
This part of the exercise combines grouping, summarisation and joining operations to connect the diabetes study data to the patients table.
Read the data from the file diabetes.txt into a new object.
library(tidyverse)
diabetes <- read_tsv("data/diabetes.txt")
diabetes
## # A tibble: 1,316 x 4
##    ID        Date       Glucose    BP
##    <chr>     <date>       <dbl> <dbl>
##  1 AC/AH/001 2011-03-07     100    98
##  2 AC/AH/001 2011-03-14     110    89
##  3 AC/AH/001 2011-03-24      94    88
##  4 AC/AH/001 2011-03-31     111    92
##  5 AC/AH/001 2011-04-03      94    83
##  6 AC/AH/001 2011-05-21     110    93
##  7 AC/AH/001 2011-06-24     105    79
##  8 AC/AH/001 2011-07-11      88    86
##  9 AC/AH/001 2011-07-11     101    92
## 10 AC/AH/001 2011-07-13     112    88
## # … with 1,306 more rows
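The column types shown above are guessed by read_tsv. As a minimal sketch, assuming you would rather state them explicitly, the col_types argument can be used. The small file written here is made up for illustration and is not the real diabetes.txt.

```r
library(tidyverse)

# Write a tiny, made-up tab-separated file to a temporary location
toy_file <- tempfile(fileext = ".txt")
write_tsv(
  tibble(
    ID = c("P1", "P1"),
    Date = c("2011-03-07", "2011-03-14"),
    Glucose = c(100, 110),
    BP = c(98, 89)
  ),
  toy_file
)

# Declare the expected type of each column rather than relying on guessing
toy_diabetes <- read_tsv(
  toy_file,
  col_types = cols(
    ID = col_character(),
    Date = col_date(),
    Glucose = col_double(),
    BP = col_double()
  )
)

toy_diabetes
```

Stating the types makes the import fail loudly if a column does not contain what you expect, which can be helpful before grouping and summarising.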
The goal is to compare the blood pressure of smokers and non-smokers.
First, calculate the average blood pressure for each individual in the diabetes data frame.
diabetes_av <- diabetes %>%
  group_by(ID) %>%
  summarize(MeanBP = mean(BP))
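To see what the grouped summary produces, here is a minimal sketch on a made-up tibble (the IDs and readings are invented and are not taken from diabetes.txt). It also counts the readings per patient with n(), which is an addition to the exercise solution rather than part of it.

```r
library(tidyverse)

# Toy data: two hypothetical patients with a few BP readings each
toy <- tibble(
  ID = c("P1", "P1", "P1", "P2", "P2"),
  BP = c(90, 80, 85, 70, 80)
)

# One row per ID, with the mean reading and the number of readings
toy_av <- toy %>%
  group_by(ID) %>%
  summarize(MeanBP = mean(BP), N = n())

toy_av
```

The result has one row per patient, so it can be joined to a table that also has one row per patient without duplicating anyone.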
Now use one of the join functions to combine these average blood pressure measurements with the patients data frame, which records whether each patient is a smoker.
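# Join on the shared ID column; naming it explicitly avoids relying on the automatic match
patients <- read_tsv("data/patient-data-cleaned.txt") %>%
  left_join(diabetes_av, by = "ID")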
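A left join keeps every row of the left-hand table, so a patient with no readings would end up with NA in MeanBP. The sketch below uses invented data (hypothetical IDs and values) to show this, and uses anti_join() to list any unmatched rows before averaging. Whether the real files contain unmatched patients is not stated in the exercise.

```r
library(tidyverse)

# Hypothetical patients table: P3 has no blood pressure readings
toy_patients <- tibble(
  ID = c("P1", "P2", "P3"),
  Smokes = c("Smoker", "Non-Smoker", "Smoker")
)
toy_av <- tibble(ID = c("P1", "P2"), MeanBP = c(85, 75))

# All three patients are kept; P3 gets NA for MeanBP
joined <- left_join(toy_patients, toy_av, by = "ID")

# Rows of toy_patients with no match in toy_av
unmatched <- anti_join(toy_patients, toy_av, by = "ID")

joined
unmatched
```

If unmatched rows do exist, mean(MeanBP, na.rm = TRUE) or an inner_join() are two possible ways to handle them.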
Finally, calculate the average blood pressure for smokers and non-smokers in the resulting combined data frame.
patients %>%
  group_by(Smokes) %>%
  summarize(MeanBP = mean(MeanBP))
## # A tibble: 2 x 2
##   Smokes     MeanBP
##   <chr>       <dbl>
## 1 Non-Smoker   82.0
## 2 Smoker       84.6
Can you write this whole operation as a single dplyr chain?
# Summarise per patient, join to the patients table on ID, then summarise per smoking status
read_tsv("data/diabetes.txt") %>%
  group_by(ID) %>%
  summarize(MeanBP = mean(BP)) %>%
  left_join(read_tsv("data/patient-data-cleaned.txt"), by = "ID") %>%
  group_by(Smokes) %>%
  summarize(MeanBP = mean(MeanBP))
## # A tibble: 2 x 2
##   Smokes     MeanBP
##   <chr>       <dbl>
## 1 Non-Smoker   82.0
## 2 Smoker       84.6
In these exercises we look at adjusting the scales.
Using the patient dataset from earlier, generate a scatter plot of Weight (y-axis) against BMI (x-axis).
patients <- read_tsv("data/patient-data-cleaned.txt")
scPlot <- patients %>%
  ggplot(aes(x = BMI, y = Weight)) +
  geom_point()
scPlot
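# Show BMI breaks at 20, 30 and 40 only (limits drop points outside 20-40),
# Weight breaks from 60 to 100 in steps of 5, and add units to the y-axis label
scPlot +
  scale_x_continuous(breaks = seq(20, 40, by = 10), limits = c(20, 40)) +
  scale_y_continuous(breaks = seq(60, 100, by = 5)) +
  labs(y = "Weight (kg)")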
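Setting limits on a continuous scale removes observations outside the range before plotting, whereas coord_cartesian() only zooms the view. The following sketch, on made-up values, builds both variants so they can be compared. The data and the object names are hypothetical.

```r
library(tidyverse)

# Made-up values for illustration only
toy <- tibble(
  BMI = c(18, 22, 30, 38, 45),
  Weight = c(55, 65, 80, 95, 110)
)

base_plot <- ggplot(toy, aes(x = BMI, y = Weight)) +
  geom_point()

# Option 1: limits on the scale; points with BMI outside 20-40 are dropped
limited <- base_plot +
  scale_x_continuous(breaks = seq(20, 40, by = 10), limits = c(20, 40))

# Option 2: zoom with coord_cartesian; no data are removed
zoomed <- base_plot +
  scale_x_continuous(breaks = seq(20, 40, by = 10)) +
  coord_cartesian(xlim = c(20, 40))
```

The difference matters most when a layer computes something from the data, such as a smoothed line, because dropped points no longer contribute to the calculation.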
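# Violin plot of BMI by Age, treating Age as categorical,
# with one manually chosen fill colour per Age level
patients %>%
  mutate(Age = factor(Age)) %>%
  ggplot(aes(x = Age, y = BMI)) +
  geom_violin(aes(fill = Age)) +
  scale_fill_manual(values = c("darkkhaki", "indianred3", "skyblue3"))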
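An unnamed vector of colours is matched to the factor levels by position. As a hedged alternative, a named vector ties each colour to a specific level. The sketch below uses simulated data with invented level names (A, B, C), not the real Age values.

```r
library(tidyverse)

# Simulated data: three hypothetical groups of 20 values each
set.seed(1)
toy <- tibble(
  Age = factor(rep(c("A", "B", "C"), each = 20)),
  BMI = rnorm(60, mean = 25, sd = 3)
)

# Names tie each colour to a specific factor level
age_colours <- c(A = "darkkhaki", B = "indianred3", C = "skyblue3")

violin_plot <- ggplot(toy, aes(x = Age, y = BMI)) +
  geom_violin(aes(fill = Age)) +
  scale_fill_manual(values = age_colours)

violin_plot
```

With names in place, reordering the factor levels does not change which colour each group receives.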
# Colour points by Height on a diverging scale centred on the mean height
patients %>%
  ggplot(aes(x = BMI, y = Weight)) +
  geom_point(aes(colour = Height)) +
  scale_colour_gradient2(low = "blue",
                         mid = "grey",
                         high = "yellow",
                         midpoint = mean(patients$Height))
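The midpoint is computed with mean(), which returns NA if any value is missing. A minimal sketch, assuming some heights could be missing (the exercise does not say whether they are), computes the midpoint with na.rm = TRUE first. The data and the name mid_height are invented for illustration.

```r
library(tidyverse)

# Made-up values; one height is deliberately missing
toy <- tibble(
  BMI = c(20, 24, 28, 32),
  Weight = c(60, 70, 85, 95),
  Height = c(160, 170, NA, 185)
)

# na.rm = TRUE keeps the midpoint defined even if some heights are missing
mid_height <- mean(toy$Height, na.rm = TRUE)

gradient_plot <- ggplot(toy, aes(x = BMI, y = Weight)) +
  geom_point(aes(colour = Height)) +
  scale_colour_gradient2(low = "blue", mid = "grey", high = "yellow",
                         midpoint = mid_height)

gradient_plot
```

Storing the midpoint in a variable also makes it easy to try another centre, such as median(toy$Height, na.rm = TRUE).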