Session 2 exercises

Part I – ggplot - Facets

The first part requires you to create plots with faceting. Each of the plots you are asked to create is shown below so that you can compare the end result with your own.

Read in the cleaned patients dataset, patient-data-cleaned.txt.

library(tidyverse)
patients <- read_tsv("data/patient-data-cleaned.txt")
patients

## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Using the patients dataset, generate a scatter plot of BMI versus Weight, add a colour scale based on the Height variable, and split the plot into a grid of panels separated by Smoking status and Sex.

ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point() +
  facet_grid(Sex ~ Smokes)
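
In facet_grid the variable on the left of the ~ defines the rows of the grid and the variable on the right defines the columns. By default each strip shows only the value, which can be hard to read when the values are not self-explanatory. The following is a minimal sketch, using the mpg dataset that ships with ggplot2 as a stand-in for the patients data (an assumption made so that the example runs on its own), of adding the variable name to each strip with labeller = label_both.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the patients data here
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  # rows are split by drv, columns by year;
  # label_both writes strips as "drv: 4", "year: 1999" and so on
  facet_grid(drv ~ year, labeller = label_both)
p
```

The same labeller argument should work unchanged with facet_grid(Sex ~ Smokes) on the patients data.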

Generate a boxplot of BMIs comparing smokers and non-smokers, colour the boxes by Sex, and include a separate facet for each age.

ggplot(data = patients, mapping = aes(x = Smokes, y = BMI, fill = Sex)) +
  geom_boxplot() +
  facet_wrap(~ Age)

Produce a similar boxplot of BMIs but this time group data by Sex, colour by Age and facet by Smoking status.

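# Age is read in as a number; convert it to a factor so that
# fill = Age produces a separate box for each age
patients$Age <- factor(patients$Age)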
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
  geom_boxplot() +
  facet_wrap(~ Smokes)
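
The conversion to a factor matters because ggplot2 only forms groups from discrete variables, so a numeric fill does not split the boxes within each x category. The sketch below shows the same pattern on the mpg dataset bundled with ggplot2 (a stand-in chosen so the example is self-contained), where cyl is stored as an integer in much the same way that Age is stored as a number.

```r
library(ggplot2)

# Work on a copy so the original mpg data is left untouched
mpg_cat <- mpg

# cyl is an integer column; as a factor it becomes a discrete variable
mpg_cat$cyl <- factor(mpg_cat$cyl)

p <- ggplot(data = mpg_cat, mapping = aes(x = drv, y = hwy, fill = cyl)) +
  geom_boxplot() +   # one box per drv and cyl combination
  facet_wrap(~ year)
p
```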

Regenerate the solution to exercise 3 (the boxplot grouped by Sex, coloured by Age and faceted by Smoking status) but this time using a violin plot.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
  geom_violin() +
  facet_wrap(~ Smokes)

Generate density plots of BMIs coloured by Sex, and split the plot by Grade.

ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_density(aes(fill = Sex), alpha = 0.5) +
  facet_wrap(~ Grade)

Part II – tidyr - Tidying a dataset

Read in the simulated clinical dataset, clinical-data.txt.

library(tidyverse)
clinical_data <- read_tsv("data/clinical-data.txt")
clinical_data

## # A tibble: 10 x 7
##    Subject   Placebo.1 Placebo.2 Drug1.1 Drug1.2 Drug2.1 Drug2.2
##    <chr>         <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Patient1       49.8      53.8    48.4    48.4    40.8    38.3
##  2 Patient2       46.8      49.8    49.6    41.6    39.1    41.9
##  3 Patient3       48.7      48.1    40.5    49.2    40.3    35.1
##  4 Patient4       51.7      48.1    38.3    41.1    40.7    41.2
##  5 Patient5       48.9      48.3    43.1    39.4    43.3    34.9
##  6 Patient6       53.5      44.7    47.5    42.9    39.5    35.6
##  7 Patient7       53.6      47.0    49.2    46.4    37.4    38.8
##  8 Patient8       46.2      43.2    47.3    38.3    44.0    33.8
##  9 Patient9       50.5      56.2    43.4    48.6    41.6    34.4
## 10 Patient10      47.0      44.8    44.9    50.1    39.0    36.2

What are the variables in this data set?

Currently the columns are Placebo.1, Placebo.2, Drug1.1, Drug1.2, Drug2.1 and Drug2.2. However, these column names are values, not variables. In addition to Subject, there should really be two variables: one called something like Treatment, containing the values ‘Placebo.1’, ‘Drug1.1’ and so on, and another called something like Value or Measure, containing the numbers. Possibly the treatments are ‘Placebo’, ‘Drug1’ and ‘Drug2’, and the number after the ‘.’ indicates a replicate, but we don’t know this for sure.

Transform the data into a tidy form using the gather function from the tidyr package.

clinical_data <- gather(clinical_data, key = "Treatment", value = "Value", -Subject)
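
In tidyr 1.0.0 and later, gather is superseded by pivot_longer, which performs the same reshaping. A minimal sketch of the equivalent call follows. It rebuilds the first two rows of the table printed above so that it can run without the data file; the numbers are the rounded values from that printout, and the names wide and long are only illustrative.

```r
library(tidyr)
library(tibble)

# First two rows of clinical_data as printed above (values rounded by the display)
wide <- tribble(
  ~Subject,   ~Placebo.1, ~Placebo.2, ~Drug1.1, ~Drug1.2, ~Drug2.1, ~Drug2.2,
  "Patient1",       49.8,       53.8,     48.4,     48.4,     40.8,     38.3,
  "Patient2",       46.8,       49.8,     49.6,     41.6,     39.1,     41.9
)

# Every column except Subject is stacked into Treatment/Value pairs
long <- pivot_longer(wide, cols = -Subject,
                     names_to = "Treatment", values_to = "Value")
long
```

The two functions order the rows differently: gather stacks the data one column at a time, whereas pivot_longer keeps all the measurements for a subject together. The contents are otherwise the same.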

Display the range of values for each drug and placebo treatment as a box plot.

ggplot(clinical_data, mapping = aes(x = Treatment, y = Value)) +
  geom_boxplot()
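
If the number after the ‘.’ really is a replicate, which is an assumption and not something the dataset confirms, the Treatment column can be split in two so that the box plot shows one box per treatment. The sketch below uses the first two rows of the table printed earlier (rounded values) to stay self-contained; the Replicate column name is made up for illustration.

```r
library(tidyr)
library(tibble)
library(ggplot2)

# First two rows of clinical_data as printed earlier (values rounded by the display)
wide <- tribble(
  ~Subject,   ~Placebo.1, ~Placebo.2, ~Drug1.1, ~Drug1.2, ~Drug2.1, ~Drug2.2,
  "Patient1",       49.8,       53.8,     48.4,     48.4,     40.8,     38.3,
  "Patient2",       46.8,       49.8,     49.6,     41.6,     39.1,     41.9
)

long <- gather(wide, key = "Treatment", value = "Value", -Subject)

# ASSUMPTION: the suffix after "." is a replicate number.
# Split "Drug1.2" into Treatment = "Drug1" and Replicate = 2.
long <- separate(long, Treatment, into = c("Treatment", "Replicate"),
                 sep = "\\.", convert = TRUE)

p <- ggplot(long, mapping = aes(x = Treatment, y = Value)) +
  geom_boxplot()
p
```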

Session 2 exercises - Solutions

Last modified: 30 May 2019
