These exercises require you to generate plots of various kinds. This document shows both the plots that you should obtain and the solutions.
Not everything was necessarily covered during the teaching session, so you may need to use the R help pages, the Cheat Sheets, or a web search to figure out the answers.
These first few exercises run through some of the simple principles of creating a ggplot2 object and assigning aesthetic mappings and geoms.
Read the data file, patient-data-cleaned.txt, into a new object called patients.
library(tidyverse)
patients <- read_tsv("data/patient-data-cleaned.txt")
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
ggplot(data = patients, mapping = aes(x = BMI, y = Weight)) +
  geom_point()
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point()
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point() +
  geom_smooth()
Adjust the method used by geom_smooth to fit a straight line without standard error bounds.
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
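For reference, method = "lm" asks geom_smooth to fit a linear model of y on x, much as a direct call to lm() would. The sketch below uses a small made-up data frame (not the patients data) to show the intercept and slope of such a line. The values are assumptions chosen so that the fit is exact.

```r
# Toy data (made up for illustration, not the patients dataset):
# Weight is constructed as exactly 10 + 3 * BMI.
toy <- data.frame(BMI = c(20, 22, 24, 26, 28))
toy$Weight <- 10 + 3 * toy$BMI

# Fit the kind of model that geom_smooth(method = "lm") uses by default: y ~ x
fit <- lm(Weight ~ BMI, data = toy)

# Intercept and slope of the fitted straight line
line_coefs <- coef(fit)
line_coefs
```

Setting se = FALSE in geom_smooth only hides the shaded confidence band; it does not change the fitted line.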
ggplot(data = patients, mapping = aes(x = Smokes, y = BMI)) +
  geom_boxplot()
ggplot(data = patients, mapping = aes(x = Smokes, y = BMI, colour = Sex)) +
  geom_boxplot()
Note: Because the data were loaded using read_tsv, the Age column has been set to dbl (short for double, a numeric vector type), as it only contains numbers. This makes it a continuous variable. To split the boxplot by age and colour each one according to Age, it is necessary to change Age to a categorical variable. We can do this by converting the Age column into a different vector type: a factor.
patients$Age <- factor(patients$Age)
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
  geom_boxplot()
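The effect of the factor conversion described in the note can be checked on a small vector. This is a minimal sketch using made-up ages (an assumption, not values taken from the patients data). It shows that a numeric vector becomes a categorical one with one level per distinct value.

```r
# Toy ages (made up for illustration, not the patients data)
age <- c(42, 43, 42, 44, 43)

is.numeric(age)        # TRUE: treated as a continuous variable

# Convert to a factor, as done for patients$Age
age_factor <- factor(age)

is.factor(age_factor)  # TRUE: now a categorical variable
levels(age_factor)     # "42" "43" "44": one level per distinct age
```

Because each distinct age becomes its own level, ggplot2 can draw a separate box, with its own fill colour, for every age within each sex.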
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
  geom_violin()
ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_histogram(fill = "blue", binwidth = 0.5)
ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_density()
Hint: alpha can be used to control transparency.
ggplot(data = patients, mapping = aes(x = BMI)) +
  geom_density(aes(fill = Sex), alpha = 0.5)
Time series data is often represented using line graphs. Here we will look at the data in the diabetes.txt file.
Read the data file, diabetes.txt, into a new object called diabetes. There are lots of patients in this study, so let’s look at just a few of them: subset the table to keep only patients with IDs AC/AH/001, AC/AH/017, and AC/AH/020.
diabetes <- read_tsv("data/diabetes.txt")
diabetes <- diabetes[diabetes$ID %in% c("AC/AH/001", "AC/AH/017", "AC/AH/020"), ]
diabetes
## # A tibble: 45 x 4
## ID Date Glucose BP
## <chr> <date> <dbl> <dbl>
## 1 AC/AH/001 2011-03-07 100 98
## 2 AC/AH/001 2011-03-14 110 89
## 3 AC/AH/001 2011-03-24 94 88
## 4 AC/AH/001 2011-03-31 111 92
## 5 AC/AH/001 2011-04-03 94 83
## 6 AC/AH/001 2011-05-21 110 93
## 7 AC/AH/001 2011-06-24 105 79
## 8 AC/AH/001 2011-07-11 88 86
## 9 AC/AH/001 2011-07-11 101 92
## 10 AC/AH/001 2011-07-13 112 88
## # … with 35 more rows
Alternatively, using subset with three condition checks combined with | (OR):
diabetes_subset <- subset(diabetes, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")
Using grepl with a regular expression (regex) and square brackets:
to_match_pts <- "AC/AH/001|AC/AH/017|AC/AH/020"
diabetes <- diabetes[grepl(to_match_pts, diabetes$ID), ]
Using filter from dplyr (we’ll cover this later in the course):
diabetes_subset <- filter(diabetes, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")
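The subsetting approaches can be compared side by side on a small table. The sketch below uses a made-up data frame (an assumption, not the diabetes data) that includes a hypothetical ID, "AC/AH/0010", to illustrate one caveat. An unanchored regular expression matches any ID that merely contains one of the target strings, whereas %in% and == require an exact match.

```r
# Toy table (made up for illustration; "AC/AH/0010" is a hypothetical ID
# included to show how an unanchored regex can over-match)
toy <- data.frame(
  ID = c("AC/AH/001", "AC/AH/0010", "AC/AH/017", "AC/AH/020", "AC/AH/099"),
  Glucose = c(100, 95, 110, 94, 120),
  stringsAsFactors = FALSE
)
keep <- c("AC/AH/001", "AC/AH/017", "AC/AH/020")

# Exact matching with %in% and with subset()
by_in     <- toy[toy$ID %in% keep, ]
by_subset <- subset(toy, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")

# Unanchored pattern: also keeps "AC/AH/0010" because it contains "AC/AH/001"
by_grepl <- toy[grepl("AC/AH/001|AC/AH/017|AC/AH/020", toy$ID), ]

# Anchored pattern: ^ and $ force the whole ID to match
by_grepl_exact <- toy[grepl("^(AC/AH/001|AC/AH/017|AC/AH/020)$", toy$ID), ]

# Number of rows kept by each approach
c(nrow(by_in), nrow(by_subset), nrow(by_grepl), nrow(by_grepl_exact))
```

If the IDs in a dataset all have the same fixed format, the unanchored pattern gives the same result as the other approaches; the anchored form is simply the more defensive choice.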
ggplot(diabetes, aes(x = Date, y = Glucose, group = ID)) +
  geom_line(aes(colour = ID))
ggplot(diabetes, aes(x = Date, y = Glucose, group = ID)) +
  geom_line(aes(colour = ID)) +
  geom_point(aes(fill = ID), shape = 21)