These exercises require you to generate plots of various kinds. This document shows both the plots that you should obtain and the solutions.
Not everything was necessarily covered during the teaching session, so you may need to use the R help pages, the cheat sheets, or a web search to work out the answers.
These first few exercises run through some of the basic principles of creating a ggplot2 object and assigning aesthetic mappings and geoms.
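As a minimal sketch of those principles, the block below builds a plot in stages so that the roles of the data, the aesthetic mappings and the geom are visible separately. It uses the built-in mtcars data as a stand-in for the patients table, which is an assumption made only so that the example runs without any data file; the object names are illustrative.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the patients table here
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# No geom yet: the object holds data and mappings but has no layers
n_layers_before <- length(p$layers)

# Adding a geom appends a layer that inherits the x and y mappings
p_points <- p + geom_point()
n_layers_after <- length(p_points$layers)

# A mapping can also be supplied to a single geom only
p_coloured <- p + geom_point(mapping = aes(colour = hp))

print(p_points)
```

Mappings given in ggplot() are shared by every geom added afterwards, whereas mappings given inside a geom apply to that layer alone.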
Read patient-data-cleaned.txt into a new object called patients.
library(tidyverse)
patients <- read_tsv("data/patient-data-cleaned.txt")
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
ggplot(data = patients, mapping = aes(x = BMI, y = Weight)) +
geom_point()
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
geom_point()
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
geom_point() +
geom_smooth()
Adjust the method of geom_smooth to fit a straight line without standard error bounds.
ggplot(data = patients, mapping = aes(x = BMI, y = Weight, colour = Height)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
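To see what method = "lm" is doing, the following sketch compares the line computed by geom_smooth with a model fitted directly using lm. It uses mtcars rather than the patients data (an assumption for the sake of a self-contained example), and passes formula = y ~ x explicitly, which is the formula geom_smooth would otherwise report using.

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the values computed for a layer; layer 2 is the smooth
smooth <- layer_data(p, 2)

# Fit the same model directly and predict at the x positions ggplot2 used
fit <- lm(mpg ~ wt, data = mtcars)
direct <- unname(predict(fit, newdata = data.frame(wt = smooth$x)))

# Compare the two sets of fitted values
all.equal(smooth$y, direct)
```

Setting se = FALSE only suppresses the confidence band; the fitted line itself is unchanged.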
ggplot(data = patients, mapping = aes(x = Smokes, y = BMI)) +
geom_boxplot()
ggplot(data = patients, mapping = aes(x = Smokes, y = BMI, colour = Sex)) +
geom_boxplot()
Note: Because the data were loaded using read_tsv, the Age column has been set to dbl (short for double, a numeric vector type), as it contains only numbers. This makes it a continuous variable. To split the boxplots by age and fill each one according to Age, Age must be changed to a categorical variable. We can do this by converting the Age column to a different vector type: a factor.
patients$Age <- factor(patients$Age)
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
geom_boxplot()
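The effect of the conversion can be checked on a small vector. The sketch below uses made-up ages (hypothetical values, not taken from the patients table) and then shows one possible alternative, wrapping a variable in factor() inside aes() so that the underlying table is left unchanged; that second part uses mtcars as a stand-in dataset.

```r
library(ggplot2)

# Hypothetical ages, standing in for the Age column
age_numeric <- c(42, 43, 42, 44, 43)
is.numeric(age_numeric)

# factor() turns each distinct value into a category (a level)
age_factor <- factor(age_numeric)
levels(age_factor)
table(age_factor)

# Alternative: convert on the fly inside aes(), leaving the data as it is
p <- ggplot(mtcars, aes(x = factor(am), y = mpg, fill = factor(cyl))) +
  geom_boxplot()
print(p)
```

Converting inside aes() is convenient for a one-off plot; converting the column once, as in the solution above, avoids repeating the call in every plot.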
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Age)) +
geom_violin()
ggplot(data = patients, mapping = aes(x = BMI)) +
geom_histogram(fill = "blue", binwidth = 0.5)
ggplot(data = patients, mapping = aes(x = BMI)) +
geom_density()
Hint: alpha can be used to control transparency.
ggplot(data = patients, mapping = aes(x = BMI)) +
geom_density(aes(fill = Sex), alpha = 0.5)
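The solution above sets alpha outside aes(), so it is a constant applied to every density curve, while fill is mapped to a variable inside aes(). A small sketch of that distinction, again assuming mtcars as a stand-in for the patients data:

```r
library(ggplot2)

# fill is mapped to a variable; alpha is set to a constant outside aes()
p_set <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(aes(fill = factor(am)), alpha = 0.5)

# Inspect the values the layer will actually draw with
built <- layer_data(p_set)
unique(built$alpha)  # the single constant transparency
unique(built$fill)   # one fill colour per level of the mapped variable

print(p_set)
```

With overlapping densities, a value below 1 keeps the curve underneath visible.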
Time series data is often represented using line graphs. Here we will look at the data in the diabetes.txt file.
Read diabetes.txt into a new object called diabetes. There are lots of patients in this study, so let’s just look at a few of them: subset the table to keep only patients with IDs AC/AH/001, AC/AH/017, and AC/AH/020.
diabetes <- read_tsv("data/diabetes.txt")
diabetes <- diabetes[diabetes$ID %in% c("AC/AH/001", "AC/AH/017", "AC/AH/020"), ]
diabetes
## # A tibble: 45 x 4
## ID Date Glucose BP
## <chr> <date> <dbl> <dbl>
## 1 AC/AH/001 2011-03-07 100 98
## 2 AC/AH/001 2011-03-14 110 89
## 3 AC/AH/001 2011-03-24 94 88
## 4 AC/AH/001 2011-03-31 111 92
## 5 AC/AH/001 2011-04-03 94 83
## 6 AC/AH/001 2011-05-21 110 93
## 7 AC/AH/001 2011-06-24 105 79
## 8 AC/AH/001 2011-07-11 88 86
## 9 AC/AH/001 2011-07-11 101 92
## 10 AC/AH/001 2011-07-13 112 88
## # … with 35 more rows
Using subset and three condition checks combined with | (OR)
diabetes_subset <- subset(diabetes, diabetes$ID == "AC/AH/001" | diabetes$ID == "AC/AH/017" | diabetes$ID == "AC/AH/020")
Using grepl, a regular expression (regex) and square brackets
to_match_pts <- "AC/AH/001|AC/AH/017|AC/AH/020"
diabetes <- diabetes[grepl(to_match_pts, diabetes$ID), ]
Using filter from dplyr - we’ll cover this later in the course
diabetes_subset <- filter(diabetes, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")
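The approaches above can be compared side by side on a small table. The sketch below uses a hypothetical toy data frame (the Glucose values and the extra IDs AC/AH/002 and AC/AH/0201 are invented for illustration) and assumes dplyr is installed. It also shows one caveat: an unanchored regular expression matches substrings, so the anchored pattern used here is a variation on the grepl solution rather than a copy of it.

```r
library(dplyr)

# Hypothetical toy table, not the real diabetes.txt data
toy <- data.frame(
  ID = c("AC/AH/001", "AC/AH/002", "AC/AH/017", "AC/AH/020", "AC/AH/0201"),
  Glucose = c(100, 95, 110, 105, 99)
)
keep_ids <- c("AC/AH/001", "AC/AH/017", "AC/AH/020")

# Square brackets with %in%
by_in <- toy[toy$ID %in% keep_ids, ]

# subset() with three conditions joined by | (OR)
by_subset <- subset(toy, ID == "AC/AH/001" | ID == "AC/AH/017" | ID == "AC/AH/020")

# grepl() with an unanchored pattern also keeps "AC/AH/0201"
loose_pattern <- paste(keep_ids, collapse = "|")
by_grepl_loose <- toy[grepl(loose_pattern, toy$ID), ]

# Anchoring with ^ and $ forces a match on the whole ID
strict_pattern <- paste0("^(", loose_pattern, ")$")
by_grepl_strict <- toy[grepl(strict_pattern, toy$ID), ]

# dplyr::filter(), called with its namespace to avoid stats::filter
by_filter <- dplyr::filter(toy, ID %in% keep_ids)

by_in$ID
by_grepl_loose$ID
```

For matching a fixed list of IDs exactly, %in% is the simplest option; regular expressions are more useful when IDs need to be matched by pattern.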
ggplot(diabetes, aes(x = Date, y = Glucose, group = ID)) +
geom_line(aes(colour = ID))
ggplot(diabetes, aes(x = Date, y = Glucose, group = ID)) +
geom_line(aes(colour = ID)) +
geom_point(aes(fill = ID), shape = 21)
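Two details of the last plot are worth checking: the group aesthetic gives one line per patient, and shape 21 is a circle with both a border (colour) and an interior (fill), which is why fill is mapped for the points. The sketch below verifies both on a hypothetical toy table; the IDs P1 and P2 and their readings are invented for illustration.

```r
library(ggplot2)

# Hypothetical readings for two patients, not taken from diabetes.txt
toy <- data.frame(
  ID = rep(c("P1", "P2"), each = 3),
  Date = rep(as.Date(c("2011-03-07", "2011-03-14", "2011-03-24")), times = 2),
  Glucose = c(100, 110, 94, 90, 97, 105)
)

p <- ggplot(toy, aes(x = Date, y = Glucose, group = ID)) +
  geom_line(aes(colour = ID)) +
  geom_point(aes(fill = ID), shape = 21)

# Layer 1 is the line layer: one group, and hence one line, per patient
line_groups <- unique(layer_data(p, 1)$group)
length(line_groups)

# Layer 2 is the point layer: constant shape, fill varying by patient
point_data <- layer_data(p, 2)
unique(point_data$shape)
unique(point_data$fill)

print(p)
```

Without a grouping variable, geom_line would join all observations into a single line ordered by date, which is rarely what is wanted for several patients.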