1. Combining data tables

The exercise uses a more realistic dataset, building on the patients data frame we’ve already been working with.

The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.

This part of the exercise combines grouping, summarisation and joining operations to connect the diabetes study data to the patients table we’ve already been working with.

Read the data from the file diabetes.txt into a new object.

library(tidyverse)
diabetes <- read_tsv("data/diabetes.txt")
diabetes
## # A tibble: 1,316 x 4
##    ID        Date       Glucose    BP
##    <chr>     <date>       <dbl> <dbl>
##  1 AC/AH/001 2011-03-07     100    98
##  2 AC/AH/001 2011-03-14     110    89
##  3 AC/AH/001 2011-03-24      94    88
##  4 AC/AH/001 2011-03-31     111    92
##  5 AC/AH/001 2011-04-03      94    83
##  6 AC/AH/001 2011-05-21     110    93
##  7 AC/AH/001 2011-06-24     105    79
##  8 AC/AH/001 2011-07-11      88    86
##  9 AC/AH/001 2011-07-11     101    92
## 10 AC/AH/001 2011-07-13     112    88
## # … with 1,306 more rows

The goal is to compare the blood pressure of smokers and non-smokers.

First, calculate the average blood pressure for each individual in the diabetes data frame.

diabetes_av <- diabetes  %>%  
    group_by(ID) %>%  
    summarize(MeanBP=mean(BP))

Now use one of the join functions to combine these average blood pressure measurements with the patients data frame containing information on whether the patient is a smoker.

patients <- read_tsv("data/patient-data-cleaned.txt") %>%  
    left_join(diabetes_av)

Finally, calculate the average blood pressure for smokers and non-smokers on the resulting, combined data frame.

patients  %>%  
    group_by(Smokes) %>%  
    summarize(MeanBP=mean(MeanBP))
## # A tibble: 2 x 2
##   Smokes     MeanBP
##   <chr>       <dbl>
## 1 Non-Smoker   82.0
## 2 Smoker       84.6

Can you write this whole operation as a single dplyr chain?

read_tsv("data/diabetes.txt") %>%  
    group_by(ID) %>%  
    summarize(MeanBP=mean(BP)) %>%  
    left_join(read_tsv("data/patient-data-cleaned.txt")) %>%  
    group_by(Smokes) %>%  
    summarize(MeanBP=mean(MeanBP))
## # A tibble: 2 x 2
##   Smokes     MeanBP
##   <chr>       <dbl>
## 1 Non-Smoker   82.0
## 2 Smoker       84.6

2. Customising Scales

In these exercises we look at adjusting the scales.

Using the patient dataset from earlier, generate a scatter plot of BMI versus Weight

patients <- read_tsv("data/patient-data-cleaned.txt")

scPlot <- patients %>%  
    ggplot(aes(x = BMI, y = Weight)) + 
    geom_point()
scPlot

  1. With the plot above, from exercise 1:
    • Adjust the BMI axis to show only labels for 20, 30, 40
    • Adjust the weight axis to show breaks for 60 to 100 in steps of 5
    • Specify the units in y axis label.
scPlot +
    scale_x_continuous(breaks = seq(20, 40, by = 10), limits=c(20, 40)) +
    scale_y_continuous(breaks = seq(60, 100, by = 5)) +
    labs(y="Weight (kg)")

  1. Create a violin plot of BMI by Age where violins are filled with three colours of your choosing. There is pdf with all of the named colours in R here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
patients  %>%  
    mutate(Age=factor(Age))  %>%  
    ggplot(aes(x = Age, y = BMI)) +
        geom_violin(aes(fill = Age)) +
        scale_fill_manual(values = c("darkkhaki", "indianred3", "skyblue3"))

  1. Create a scatterplot of BMI versus Weight. Colour the points according to height. Adjust the colour gradient:
    • Set the midpoint to the mean Height
    • Set the midpoint colour to grey
    • Set the extremes to blue (low) and yellow (high).
patients %>%  
    ggplot(aes(x = BMI, y = Weight)) + 
    geom_point(aes(colour = Height)) +
    scale_colour_gradient2(low="blue", 
                           mid="grey",
                           high="yellow",
                           midpoint = mean(patients$Height))