These exercises require you to generate plots of various kinds. This document shows both the plots that you should obtain and the solutions.

Not everything was necessarily covered during the teaching session; you may need to use the R help pages, the Cheat Sheets, or a web search to figure out the answers.

Part I – geoms and aesthetics

These first few exercises run through some of the basic principles of creating a ggplot2 object and assigning aesthetic mappings and geoms.
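As a minimal sketch of these principles (not part of the original exercises), the following builds a plot in stages using R's built-in mtcars dataset rather than the patients data. The object names p, p_points and p_mixed are illustrative only.

```r
library(ggplot2)

# Step 1: a ggplot object holding the data and the global aesthetic
# mappings, but with no geoms (layers) yet.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: adding a geom creates a layer that inherits the global mappings.
p_points <- p + geom_point()

# Step 3: a mapping given inside a geom applies to that layer only,
# so the points are coloured by cyl but the fit line is not split by it.
p_mixed <- p +
  geom_point(mapping = aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the data a layer will draw: one row per car here.
n_points <- nrow(layer_data(p_points, 1))
n_points
```

Printing p_points or p_mixed at the console draws the plot.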

  1. Read in the cleaned patients dataset, patient-data-cleaned.txt, into a new object called patients.
library(tidyverse)
patients <- read_tsv("data/patient-data-cleaned.txt")
patients
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Scatterplots

  2. Generate a scatter plot of BMI versus Weight using the patients dataset.
ggplot(data = patients, mapping = aes(x = BMI, y = Weight)) +
  geom_point()

  3. Extending the plot from exercise 2, add a colour scale to the scatterplot based on the Height variable.
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point()

  4. Using an additional geom, add a fit line as an extra layer to the solution from exercise 3.
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point() +
  geom_smooth()

  5. Does the fit in exercise 4 look good? Look at the help page for geom_smooth and adjust the method to fit a straight line without standard error bounds.
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
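
To see what method = "lm" does, the sketch below fits the same straight line with base R's lm() and compares it with the line that geom_smooth draws. It uses simulated data, not the patients dataset: the toy data frame, its coefficients and the random seed are all assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the patients data (hypothetical values).
toy <- data.frame(BMI = runif(50, min = 18, max = 35))
toy$Weight <- 10 + 2.5 * toy$BMI + rnorm(50, sd = 3)

# Ordinary least-squares fit of Weight on BMI.
fit <- lm(Weight ~ BMI, data = toy)

p <- ggplot(data = toy, mapping = aes(x = BMI, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Extract the points of the smoothed line (layer 2) and compare them
# with predictions from the lm() fit at the same x positions.
line_data <- layer_data(p, 2)
predicted <- predict(fit, newdata = data.frame(BMI = line_data$x))
max_diff <- max(abs(line_data$y - predicted))
max_diff
```

If the two agree, max_diff should be zero up to floating-point error. Without method = "lm", geom_smooth chooses a smoother automatically (loess for small datasets), which is why the default fit is a curve.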

Boxplots and Violin plots

  6. Generate a boxplot of BMIs comparing smokers and non-smokers.
ggplot(data = patients, mapping = aes(x = Smokes, y = BMI)) +
  geom_boxplot()

  7. Following on from the boxplot comparing smokers and non-smokers in exercise 6, colour the boxplot edges by Sex.
ggplot(data = patients, mapping = aes(x = Smokes, y = BMI, colour = Sex)) +
  geom_boxplot()

  8. Produce a similar boxplot of BMIs but this time group data by Sex and colour the interior of the box (not the outline) by Age.

Note: Because the data were loaded using read_tsv, the Age column has been set to dbl (short for double, a numeric vector type), as it contains only numbers. This makes it a continuous variable. To split the boxplot by age and colour each box according to Age, Age must be changed to a categorical variable. We can do this by changing the Age column into a different vector type: a factor.

patients$Age <- factor(patients$Age)
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
  geom_boxplot()
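
The in-place conversion above overwrites the numeric Age column. The following sketch, using a small made-up data frame (toy and the AgeFactor column are hypothetical names), shows one way to keep the numeric column by storing the factor in a new column, and how a factor can be converted back to numbers if needed.

```r
# Made-up data standing in for the patients table.
toy <- data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Age = c(42, 43, 44, 42, 43, 44),
  BMI = c(24.1, 26.3, 27.8, 25.0, 23.4, 28.9)
)

# Age starts out as a numeric (continuous) column.
is.numeric(toy$Age)

# Store the categorical version in a new column, leaving Age untouched.
toy$AgeFactor <- factor(toy$Age)

# One level per distinct age.
levels(toy$AgeFactor)

# Going back: convert via character, because as.numeric() applied directly
# to a factor returns the internal level codes, not the original values.
age_again <- as.numeric(as.character(toy$AgeFactor))
```

With this approach the plot would map fill = AgeFactor in place of fill = Age.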

  9. Regenerate the solution to exercise 8 but this time using a violin plot.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
  geom_violin()

Histogram and Density plots

  10. Generate a histogram of BMIs with each bar coloured blue, choosing a suitable bin width.
ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_histogram(fill = "blue", binwidth = 0.5)
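
The exercise leaves the choice of bin width to judgement. One common rule of thumb, offered here only as a starting point and not as the method used in the course, is the Freedman-Diaconis rule: twice the interquartile range divided by the cube root of the number of observations. The sketch below applies it to simulated BMI values; the toy data frame, the seed and the name fd_width are assumptions.

```r
library(ggplot2)

set.seed(42)
# Simulated BMI values (hypothetical, not the patients data).
toy <- data.frame(BMI = rnorm(100, mean = 26, sd = 2.5))

# Freedman-Diaconis rule: 2 * IQR / n^(1/3).
fd_width <- 2 * IQR(toy$BMI) / nrow(toy)^(1 / 3)
fd_width

p <- ggplot(data = toy, mapping = aes(x = BMI)) +
  geom_histogram(fill = "blue", binwidth = fd_width)
```

It is still worth trying a few values around the suggested width and checking which one shows the shape of the distribution most clearly.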

  11. Instead of a histogram, generate a density plot of BMI.
ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_density()

  12. Generate density plots of BMIs coloured by Sex.

Hint: alpha can be used to control transparency.

ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_density(aes(fill = Sex), alpha = 0.5)

Line plots - time series data

Time series data is often represented using line graphs. Here we will look at the data in the diabetes.txt file.

  13. Read in the diabetes dataset, diabetes.txt, into a new object called diabetes. There are lots of patients in this study, so let’s look at just a few of them: subset the table to keep only patients with IDs AC/AH/001, AC/AH/017, and AC/AH/020.
diabetes <- read_tsv("data/diabetes.txt")
diabetes <- diabetes[diabetes$ID %in% c("AC/AH/001", "AC/AH/017", "AC/AH/020"), ]
diabetes
## # A tibble: 45 x 4
##    ID        Date       Glucose    BP
##    <chr>     <date>       <dbl> <dbl>
##  1 AC/AH/001 2011-03-07     100    98
##  2 AC/AH/001 2011-03-14     110    89
##  3 AC/AH/001 2011-03-24      94    88
##  4 AC/AH/001 2011-03-31     111    92
##  5 AC/AH/001 2011-04-03      94    83
##  6 AC/AH/001 2011-05-21     110    93
##  7 AC/AH/001 2011-06-24     105    79
##  8 AC/AH/001 2011-07-11      88    86
##  9 AC/AH/001 2011-07-11     101    92
## 10 AC/AH/001 2011-07-13     112    88
## # … with 35 more rows

Alternative methods for filtering:

Using subset and three condition checks combined with | (OR)

diabetes_subset <- subset(diabetes, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")

Using grepl, a regular expression (regex) and square brackets

to_match_pts <- "AC/AH/001|AC/AH/017|AC/AH/020"
diabetes <- diabetes[grepl(to_match_pts, diabetes$ID), ]

Using filter from dplyr - we’ll cover this later in the course

diabetes_subset <- filter(diabetes, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")
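
The filtering approaches above are intended to select the same rows. The sketch below checks this on a small made-up table and also illustrates one caveat: an unanchored regular expression matches partial strings. The toy data frame and the ID "AC/AH/0011" are hypothetical and chosen only to show the difference; dplyr is assumed to be installed.

```r
# Made-up data; "AC/AH/0011" is a hypothetical ID that contains "AC/AH/001".
toy <- data.frame(
  ID = c("AC/AH/001", "AC/AH/0011", "AC/AH/017", "AC/AH/020", "AC/AH/021"),
  Glucose = c(100, 95, 110, 105, 99),
  stringsAsFactors = FALSE
)
keep <- c("AC/AH/001", "AC/AH/017", "AC/AH/020")

# Exact matching: square brackets, subset() and dplyr::filter() all agree.
by_in     <- toy[toy$ID %in% keep, ]
by_subset <- subset(toy, ID %in% keep)
by_filter <- dplyr::filter(toy, ID %in% keep)

# Unanchored regex: also keeps "AC/AH/0011" because it contains "AC/AH/001".
pattern  <- paste(keep, collapse = "|")
by_grepl <- toy[grepl(pattern, toy$ID), ]

# Anchoring with ^ and $ restricts the regex to whole-string matches.
anchored          <- paste0("^(", pattern, ")$")
by_grepl_anchored <- toy[grepl(anchored, toy$ID), ]
```

When the IDs to keep are known exactly, %in% is usually the simplest choice; a regular expression is more useful when matching a pattern of IDs.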

  14. Create a line plot that allows us to examine the change in Glucose levels for each patient by Date.
ggplot(data = diabetes, mapping = aes(x = Date, y = Glucose, group = ID)) +
  geom_line(aes(colour = ID))

  15. Add points to the plot from exercise 14 to make the measurement time-points clearer. Instead of a dot, change the shape to a hollow circle and colour the interior of the point.
ggplot(data = diabetes, mapping = aes(x = Date, y = Glucose, group = ID)) +
  geom_line(aes(colour = ID)) +
  geom_point(aes(fill = ID), shape = 21)
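
The solution relies on the fact that point shapes 21 to 25 have both an outline, controlled by colour, and an interior, controlled by fill. The sketch below shows this on made-up data; the toy data frame, the black outline and the point size are illustrative choices, not part of the exercise.

```r
library(ggplot2)

# Made-up measurements for two hypothetical patients, A and B.
toy <- data.frame(
  x  = c(1, 2, 3, 1, 2, 3),
  y  = c(2, 4, 3, 5, 3, 6),
  ID = rep(c("A", "B"), each = 3)
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y, group = ID)) +
  geom_line(aes(colour = ID)) +
  # Shape 21: fill sets the interior, colour sets the outline.
  geom_point(aes(fill = ID), shape = 21, colour = "black", size = 3)

# The point layer (layer 2) carries one fill value per patient.
point_layer <- layer_data(p, 2)
unique(point_layer$fill)
```

For the default solid point shape, fill has no visible effect, which is why the shape has to be changed before the interior can be coloured separately.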