Introduction

In this session we will use the popular ggplot2 package to visualize data that has already been transformed into a tidy tabular format that is suitable for use in creating plots.

A tidy dataset is a data frame (or table) for which the following are true:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell

Load the tidyverse

Load the core packages from the tidyverse, including ggplot2.

library(tidyverse)

[Instructors note: draw attention to the packages that are loaded, relate these to this and other sections of the course.]

Aside: function name conflicts (optional)

With the many hundreds of R packages available it is inevitable that there will be functions with the same name being used by more than one package.

For example, filter function in dplyr has masked filter function from the stats package that already loaded as part of base R.

?filter

Navigate to each help section and show how the signatures differ and explain how using the one you hadn’t intended will likely cause your code to fail.

Use R studio’s file completion to show that the filter function we get by default.

To use the filter function from the stats package we need to include the stats namespace prefix as follows:

stats::filter(presidents, rep(1, 3))

The patients dataset

Read in the patient data

Read in the patients data frame using read_tsv from the readr package.

patients <- read_tsv("data/patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )

read_tsv imports tab-delimited files (tsv = tab-separated values).

There is an equivalent function in readr called read_csv for importing CSV files.

Aside: readr vs base R equivalents

Why use the readr functions compared with the more familiar read.table, read.delim and read.csv functions from base R?

  • They are typically much faster (~10x) for large datasets
  • They produce tibbles, which have much nicer printing and other advantages when chaining operations
  • They do not convert character vectors to factors
  • They do not change column names, e.g. where column names contain spaces or special characters

Getting to know the patients dataset

Let’s have a look at the patients dataset before trying to visualize it.

View in RStudio by clicking on it in the Environment tab or by typing

View(patients)

It is a special type of data frame called a tibble.

class(patients)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Printing a tibble shows each of the column types.

Print the patients data frame at the console.

patients
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <chr> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

This dataset required quite a bit of cleaning to get into the state where we can do some proper exploratory data analysis and visualization.

Later we will look at issues with real-world datasets and the need for cleaning these in the part of the course on tidying data and some of the tools in the tidyverse collection of packges that help in doing so.

Reminder: a “tidy” dataset in one in which

  • each column is a variable
  • each row is an observation

What are the variables in the patients dataset?

What are the observations?

What types of variables do we have?

read_tsv has read these in as <chr>, <fct>, <dbl>, <date> and <lgl>. What do these mean?

class(patients$Sex)
## [1] "character"
class(patients$Height)
## [1] "numeric"
class(patients$Overweight)
## [1] "logical"

What types should each variable be?

  • Categorical: Sex, Smokes, State

  • Continuous: Height, Weight, BMI

Do any need to be changed?

What about Grade?

  • Categorial, but ordered, i.e. ordinal

What about Age?

  • Certainly discrete in this dataset (only whole numbers), again we might want to treat these as ordinal

We also have some columns read in as logical and date variables. Logical variables are treated by ggplot2 as categorial.

Most of the time ggplot2 will treat variables in the way you’d hope, e.g. creating a bar plot showing the number of patients in each grade even though this is currently an integer. But sometimes it will cause problems so it’s a good idea to sort this sort of thing out right from the outset.

R uses a special type of variable called a factor to handle categorical values. Factors are variables that have a fixed and known set of possible values – these are referred to as the levels of a factor, e.g. Female and Male for the Sex variable in the patients dataset.

Let’s turn some of our columns that should be categorical variables into factors.

patients$Sex <- factor(patients$Sex)
patients$Sex
##   [1] Male   Male   Male   Male   Female Female Female Female Male   Male  
##  [11] Female Female Male   Female Female Male   Male   Male   Male   Male  
##  [21] Male   Female Male   Female Female Male   Male   Male   Male   Male  
##  [31] Female Male   Male   Female Male   Female Male   Female Male   Female
##  [41] Male   Female Female Female Female Female Female Female Male   Female
##  [51] Female Female Female Female Male   Female Female Female Male   Male  
##  [61] Female Male   Female Male   Male   Female Male   Female Male   Female
##  [71] Female Male   Male   Female Male   Female Male   Male   Female Female
##  [81] Female Female Male   Female Female Female Female Male   Female Male  
##  [91] Male   Male   Female Male   Female Female Female Female Female Female
## Levels: Female Male
levels(patients$Sex)
## [1] "Female" "Male"
patients
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <fct> <chr>   <dbl>  <dbl> <date>     <chr> <dbl> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor…     2 FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo…     2 FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New …     2 FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo…     1 FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor…     2 TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi…     2 FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New …     1 FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo…     1 FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor…     1 FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New …     2 FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>

Rather than repeat the above line for each of several columns we can use the mutate_at function from dplyr to do this in a single operation. We will cover this later in the course.

patients <- mutate_at(patients, vars(Sex, Smokes, State, Grade, Age), factor)
patients
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <fct> <fct>   <dbl>  <dbl> <date>     <fct> <fct> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor… 2     FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo… 2     FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New … 2     FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo… 1     FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor… 2     TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi… 2     FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New … 1     FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo… 1     FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor… 1     FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New … 2     FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <fct>, BMI <dbl>, Overweight <lgl>

Our first ggplot - a scatter plot

First we’ll create a scatter plot of height vs weight.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight))

We specified 3 things to create this plot.

  1. The data – needs to be a data frame
  2. The type of plot – this is called a geom in ggplot2 terminology
  3. The mapping of variables in the data to visual properties (called aesthetics in ggplot2) of objects in the plot

In this case the type of plot is a geom_point, ggplot2’s function for a scatter plot, and the height and weight variables are mapped to the x and y coordinates, i.e. the positions of points on the scatter plot.

Other aesthetics could include the size, shape and colour of those points.

Look up the geom_point help in RStudio.

?geom_point

Aesthetic mappings

In the geom_point help page scroll down to the aesthetics section to see what aesthetics are actually required (x and y).

What about those other aesthetics?

Colours are usually fun – let’s try to use a colour aesthetic.

We need to set a mapping of a variable to colour just like we did for x and y.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, colour = Sex))

What about colouring based on a continuous variable?

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, colour = BMI))

Let’s try some of the other aesthetics.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, shape = Sex))

Some aesthetics can only be used with categorical variables, e.g. shape.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, shape = BMI))
## Error: A continuous variable can not be mapped to shape

Not all aesthetics are useful, at least not for this plot.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, size = Sex))
## Warning: Using size for a discrete variable is not advised.

The warning hints that we should be using a continuous variable for size instead.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, size = BMI))

The transparency of the points can be set to reflect a variable using the alpha aesthetic.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight, alpha = BMI))

Colours are more commonly used than shapes, sizes or transparency.

Virtually every aspect of the plots we’ve created can be customized. We will learn how to change the colours used a bit later.

ggplot2 grammar

The “gg” in ggplot2 stands for “grammar of graphics”. This grammar is a coherent system for describing and building graphs.

The basic template is:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

This template can be used to create any of the different types of graph possible with ggplot2.

Another plot type - bar chart

Look up ggplot2 package help in RStudio to see what other geoms are available (find the ggplot2 package in the Packages tab normally found in the bottom right hand corner).

Let’s create a bar chart of patient’s ages.

geom_bar is the geom we need and it requires a single aesthetic mapping of the categorical variable of interest to x.

ggplot(data = patients) +
  geom_bar(mapping = aes(x = Age))

The dark grey bars are a big ugly - what if we want each bar to be a different colour?

ggplot(data = patients) +
  geom_bar(mapping = aes(x = Age, colour = Age))

Colouring the edges wasn’t quite what we had in mind. Look at the help for geom_bar to see what other aesthetic we should have used.

ggplot(data = patients) +
  geom_bar(mapping = aes(x = Age, fill = Age))

What happens if we fill with something other than age, e.g. sex?

We get a stacked bar plot.

ggplot(data = patients) +
  geom_bar(mapping = aes(x = Age, fill = Sex))

Note the similarity in what we did here to what we did with the scatter plot - there is a common grammar.

What if want all the bars to be the same colour but not dark grey, e.g. blue?

ggplot(data = patients) +
  geom_bar(mapping = aes(x = Age, fill = "blue"))

That doesn’t look right - why not?

You can set the aesthetics to a fixed value but this needs to be outside the mapping.

ggplot(data = patients) +
  geom_bar(mapping = aes(x = Age), fill = "blue")

Setting this inside the aes() mapping told ggplot2 to map the colour aesthetic to some variable in the data frame that actually is a constant value “blue” for all observations.

Multiple layers

Consider again the ggplot2 grammar:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Why do we have the ggplot part and geom part joined by a “+” symbol?

The ‘+’ operator has been overridden by ggplot2 to add layers and other components (scales, themes, etc.) to the plot.

We can have multiple geoms in the same plot. Each is a different layer.

Let’s create a plot with two geoms.

ggplot(data = patients) +
  geom_point(mapping = aes(x = Height, y = Weight)) +
  geom_smooth(mapping = aes(x = Height, y = Weight))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What is geom_smooth doing? Look up the help page.

Note that the shaded area represents the standard error bounds on the fitted model.

There is some annoying duplication of code here. We can avoid this by putting the mappings in the ggplot function instead.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Aesthetics set in the ggplot function are treated as global mappings for the plot and are inherited by all geoms added to the plot.

Let’s say we still want to colour the points by sex.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Well, that wasn’t expected, although it’s pretty neat. We only wanted to apply the colour aesthetic to the points.

We can add to the aesthetics in the geom function to apply them just to that geom.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point(mapping = aes(colour = Sex)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Or if you have got your scatter plot just right, and you’d rather not interfere with it while trying out adding another layer, you can use set the inherit.aes option to FALSE.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  geom_smooth(mapping = aes(x = Height, y = Weight), inherit.aes = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Some other plot types (geoms)

Let’s explore some other types of plots.

Line plot

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_line()

Doesn’t make much sense for this dataset but good for time series data.

Use multiple geoms (layering) to have both points and lines.

ggplot(data = patients, mapping = aes(x = Birth, y = BMI)) +
  geom_point() +
  geom_line()

ggplot2 seems to be doing something clever with dates here.

ggplot(data = patients, mapping = aes(x = Birth, y = BMI, colour = Sex)) +
  geom_point() +
  geom_line()

Histogram

ggplot(data = patients) +
  geom_histogram(mapping = aes(x = Height))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The warning message hints at picking a more optimal number of bins.

ggplot(data = patients) +
  geom_histogram(mapping = aes(x = Height), binwidth = 2.5)

Or can set the number of bins.

ggplot(data = patients) +
  geom_histogram(mapping = aes(x = Height), bins = 15)

These histograms are not aesthetically very pleasing - how about some better aesthetics?

ggplot(data = patients) +
  geom_histogram(mapping = aes(x = Height), bins = 15, colour = "firebrick", fill = "grey70")

The help page for geom_histogram talks about an alternative called geom_freqpoly.

When might this be preferred?

ggplot(data = patients) +
  geom_freqpoly(mapping = aes(x = BMI, colour = Smokes), bins = 10)

Box plot

ggplot(data = patients) +
  geom_boxplot(mapping = aes(x = Smokes, y = Weight))

See geom_boxplot help to explain how the box and whiskers are constructed.

BMI is perhaps a better thing to compare because the Weight is correlated with Height and Sex.

ggplot(data = patients) +
  geom_boxplot(mapping = aes(x = Smokes, y = BMI))

Compare males and females separately.

ggplot(data = patients) +
  geom_boxplot(mapping = aes(x = Sex, y = BMI, colour = Smokes))

Can we see the points at the same time?

ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
  geom_boxplot() +
  geom_point()

The geom_point help points to geom_jitter as more suitable when one of the variables is categorical.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
  geom_boxplot() +
  geom_jitter()

Reduce the spread/jitter.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)

Violin plot

Variation on box plot in which density is shown along the sides of the “violin”.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
  geom_violin()

ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Smokes)) +
  geom_violin(trim = FALSE)

Density plot

Deliberately not covering this here as it is required for a couple of the exercises and we want you to work it out for yourself using the help pages and what you’ve learned so far about the common grammar for all ggplots.

The ggplot object - a peek under the hood

Let’s the build the plot up in stages.

plot <- ggplot(data = patients)

What is the plot object we’ve just created?

Look at the environment pane in RStudio (usually in top right corner). The plot object is a list of 9.

is.list(plot)
## [1] TRUE

It’s actually a special type of list.

class(plot)
## [1] "gg"     "ggplot"

What are the 9 things in the list?

names(plot)
## [1] "data"        "layers"      "scales"      "mapping"     "theme"      
## [6] "coordinates" "facet"       "plot_env"    "labels"

So far we’ve constructed ggplots from data, layers (geoms) and aesthetic mappings. We’ll be looking at how to customize plots later, including changing the labels, scales and theme.

plot$data
## # A tibble: 100 x 15
##    ID    Name  Sex   Smokes Height Weight Birth      State Grade Died 
##    <chr> <chr> <fct> <fct>   <dbl>  <dbl> <date>     <fct> <fct> <lgl>
##  1 AC/A… Mich… Male  Non-S…   183.   76.6 1972-02-06 Geor… 2     FALSE
##  2 AC/A… Derek Male  Non-S…   179.   80.4 1972-06-15 Colo… 2     FALSE
##  3 AC/A… Todd  Male  Non-S…   169.   75.5 1972-07-09 New … 2     FALSE
##  4 AC/A… Rona… Male  Non-S…   176.   94.5 1972-08-17 Colo… 1     FALSE
##  5 AC/A… Chri… Fema… Non-S…   164.   71.8 1973-06-12 Geor… 2     TRUE 
##  6 AC/A… Dana  Fema… Smoker   158.   69.9 1973-07-01 Indi… 2     FALSE
##  7 AC/A… Erin  Fema… Non-S…   162.   68.8 1972-03-26 New … 1     FALSE
##  8 AC/A… Rach… Fema… Non-S…   166.   70.4 1973-05-11 Colo… 1     FALSE
##  9 AC/A… Rona… Male  Non-S…   181.   76.9 1971-12-31 Geor… 1     FALSE
## 10 AC/A… Bryan Male  Non-S…   167.   79.1 1973-07-19 New … 2     FALSE
## # … with 90 more rows, and 5 more variables: Count <dbl>,
## #   Date.Entered.Study <date>, Age <fct>, BMI <dbl>, Overweight <lgl>

What does our plot look like at this point?

plot

No layers or mapping yet.

plot$mapping
## Aesthetic mapping: 
## <empty>
plot$layers
## list()

Lets add an aesthetic mapping.

plot <- ggplot(data = patients, mapping = aes(x = Height, y = Weight))
plot

ggplot2 has automatically added scales for x and y based on the range of height and weight values. Still nothing plotted as we haven’t yet specified the type of plot (geom) to add as a layer.

plot$mapping
## Aesthetic mapping: 
## * `x` -> `Height`
## * `y` -> `Weight`
plot$layers
## list()
plot <- plot + geom_point()
plot

plot$layers
## [[1]]
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

Each geom is associated with a statistical transformation, called a stat in ggplot2. This is an algorithm used to calculate new values for the graph. In this case, we are creating a scatter plot and the x and y values don’t need to be transformed, just plotted on the x and y axes, hence stat_identity which leaves the data unchanged.

But recall the bar charts we created earlier. How did ggplot2 know what height to make each bar? It counted the number of observations for each categorical value (level) using stat_count.

We’ll touch on statistical transformations again a bit later.

Facets

A useful feature of ggplot2 is faceting. This allows you to split your plot into subplots, or facets, based on one or more categorical variables. Each subplot displays a subset of the data.

There are two faceting functions, facet_grid and facet_wrap.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Smokes)) +
  geom_point()

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  facet_wrap(~ Smokes)

The ~ Smokes argument provided to facet_wrap is known as a formula in R. Formulae are used to build models in R.

Faceting is perhaps more useful when there are several categories helping to distinguish between them by seeing the data in separate subplots.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
  geom_point()

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
  geom_point() +
  facet_wrap(~ Grade)

We can tell ggplot2 to fit the subplots onto a single row.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
  geom_point() +
  facet_wrap(~ Grade, nrow = 1)

Or we can force it to use a specified number of columns.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
  geom_point() +
  facet_wrap(~ Grade, ncol = 3)

Here you can see that it ‘wraps’ plots across more than one row if necessary.

We can facet on more than one variable.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  facet_wrap(~ Sex + Smokes)

Faceting on two variables can also be done as a grid using facet_grid and arguably produces clearer labelling of the subplots.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  facet_grid(Sex ~ Smokes)

Compare this with using a single plot and aesthetics.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Smokes, shape = Sex)) +
  geom_point()

It is possible to only facet on one variable in either the rows or columns dimension using facet_grid.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  facet_grid(. ~ Sex)

This is the same as using facet_wrap with nrow = 1 so not terribly useful but if you want the plots to be stacked in a single column with labels down the sides then you can do the following.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  facet_grid(Sex ~ .)

It is also possible to facet on more more than two variables with facet_wrap although usually this leads to very cluttered plots; it is probably better to stick to using one or two variables for faceting.

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point() +
  facet_wrap(~ Sex + Smokes + Age)

We can use other aesthetic mappings for grouping data alongside faceting.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  facet_wrap(~ Smokes)

Exercises part I - geoms and aesthetics

See separate R markdown document.

Customization

Labels

ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
  geom_point(mapping = aes(colour = Sex)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Patients from ECSA12 study",
    subtitle = "Linear relationship between height and weight",
    x = "Height (cm)",
    y = "Weight (kg)"
  )

Scales

Take a look at the x and y scales in the above plot. ggplot2 has chosen the x and y scales and where to put breaks and ticks.

plot <- ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point(mapping = aes(colour = Sex))
names(plot)
## [1] "data"        "layers"      "scales"      "mapping"     "theme"      
## [6] "coordinates" "facet"       "plot_env"    "labels"

One of the components of the plot is called scales.

ggplot2 automatically adds default scales behind the scenes equivalent to the following:

plot <- ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

Note that we have three aesthetics and ggplot2 adds a scale for each.

plot$mapping
## Aesthetic mapping: 
## * `x`      -> `Height`
## * `y`      -> `Weight`
## * `colour` -> `Sex`

The x and y variables are continuous hence ggplot2 adds a continuous scale for each. Colour is a discrete variable in this case so ggplot2 adds a discrete scale for colour.

Generalizing, the scales that are required follow the naming scheme:

scale_<NAME_OF_AESTHETIC>_<NAME_OF_SCALE>

Look at the help pages to see what we can change about the y axis.

?scale_y_continuous

We’re going to modify the scales for the following plot.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_y_continuous()

First, we’ll change the breaks, i.e. where ggplot2 puts ticks and numeric labels, on the y axis.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_y_continuous(breaks = seq(60, 100, by = 5))

We can also adjust the extents of the y axis.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_y_continuous(limits = c(60, 100))

We can change the minor breaks, e.g. to add more lines that act as guides. These are shown as thin white lines when using the default theme (we’ll take a look at alternative themes a bit later).

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_y_continuous(limits = c(60, 100), minor_breaks = seq(60, 100, by = 1))

Or we can remove the minor breaks entirely.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_y_continuous(limits = c(60, 100), minor_breaks = NULL)

Maybe we don’t want any breaks at all.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_y_continuous(breaks = NULL)

By default the scales are expanded by 5% of the range on either side. We can add more space round the edges.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_x_continuous(expand = expand_scale(mult = 0.1)) +
  scale_y_continuous(expand = expand_scale(mult = 0.1))

We can move draw the axis on the other side of the plot – not sure why you’d want to do this but with ggplot2 just about anything is possible.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_x_continuous(position = "top")

Colours

Use scale_colour_manual to manually set the colours for a categorical variable.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_colour_manual(values = c("green", "purple"))

ggplot2 provides a set of colour palettes from ColorBrewer.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = State)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")

Look at the help page for scale_colour_brewer to see what other colour palettes are available, then try some.

?scale_colour_brewer

Interestingly you can set other attributes other than just the colours at the same time.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1", name = NULL, labels = c("women", "men"))

What you are doing here is setting your own set of mappings from levels in the data to aesthetic values.

You can select the colours used in a gradient for a continuous variable.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = BMI)) +
  geom_point() +
  scale_colour_gradient(low = "black", high = "blue")

In some cases it might make sense to specify two colour gradients either side of a mid-point.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = BMI)) +
  geom_point() +
  scale_colour_gradient2(low = "red", mid = "grey60", high = "blue", midpoint = median(patients$BMI))

As before we can override the default labels and other aspects of the BMI scale within the scale function.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = BMI)) +
  geom_point() +
  scale_colour_gradient(
    low = "lightblue", high = "darkblue",
    name = "Body mass index",
    breaks = seq(20, 30, by = 2.5)
  )

Order of categories

Quite often we want to reorder the categories. For example, we might want males to be displayed on the left hand side in the following plot.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.4) +
  scale_colour_brewer(palette = "Set1")

Categorical variables are treated as special types of variables in R known as factors. A factor has levels, e.g. Female and Male for the Sex variable in the patients dataset.

ggplot2 displays categories in the order given by the levels of the factor. If a character variable is presented to ggplot2 as if it were a factor, ggplot2 converts it into a factor temporarily in order to create the plot, and the categories will be in alphabetical order.

We can reorder the categories by explicitly creating a factor with the levels set in our preferred order.

patients$Sex
##   [1] Male   Male   Male   Male   Female Female Female Female Male   Male  
##  [11] Female Female Male   Female Female Male   Male   Male   Male   Male  
##  [21] Male   Female Male   Female Female Male   Male   Male   Male   Male  
##  [31] Female Male   Male   Female Male   Female Male   Female Male   Female
##  [41] Male   Female Female Female Female Female Female Female Male   Female
##  [51] Female Female Female Female Male   Female Female Female Male   Male  
##  [61] Female Male   Female Male   Male   Female Male   Female Male   Female
##  [71] Female Male   Male   Female Male   Female Male   Male   Female Female
##  [81] Female Female Male   Female Female Female Female Male   Female Male  
##  [91] Male   Male   Female Male   Female Female Female Female Female Female
## Levels: Female Male
patients$Sex <- factor(patients$Sex, levels = c("Male", "Female"))
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.4) +
  scale_colour_brewer(palette = "Set1")

The legend in this plot is not really necessary. We’ll see how to remove this in the next section on themes.

For plots where a legend is required, changing the order of levels in a factor will also change the ordering in the legend.

ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")

Themes

Themes can be used to customize non-data elements of your plot.

The black and white theme produces plots that are more suitable for articles that will be printed.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.4) +
  scale_colour_brewer(palette = "Set1") +
  theme_bw()

What other themes are there? Use RStudio auto-complete to see, try some of these and see how they look. The library ggthemes provides a few more.

library(ggthemes)
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.4) +
  scale_colour_brewer(palette = "Set1") +
  theme_gdocs()

The theme function can be used to modify individual components of a theme.

For example, the following will turn off the legend.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.4) +
  scale_colour_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "none")

Or we could place the legend somewhere else.

ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.4) +
  scale_colour_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "bottom")

Use the help page for the theme function to see what else can be changed (quite a lot!).

?theme
ggplot(patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1") +
  theme_bw() +
  theme(
    axis.title = element_text(size = 10, colour = "darkblue"),
    axis.text = element_text(size = 8),
    legend.title = element_blank()
  )

Statistical transformations

Many graphs, like scatterplots, plot the raw values from your dataset; other graphs, like bar charts, calculate new values to plot.

  • bar charts and histograms (geom_bar, geom_histogram, geom_freqpoly) bin your data and then plot bin counts
  • smoothing functions (geom_smooth) fit a model to your data and then plot predictions from the model
  • box plots (geom_boxplot) compute a robust summary of the distribution and display a specially formatted box

The algorithm used to calculate new values for a graph is called a stat

Take a look at the help page for geom_bar.

?geom_bar

Note that it uses a “count” stat.

Sometimes you might want to override a stat, e.g. when you want to display a bar chart but already have counts.

state_counts <- count(patients, State)
state_counts
## # A tibble: 6 x 2
##   State          n
##   <fct>      <int>
## 1 California    10
## 2 Colorado      15
## 3 Georgia       16
## 4 Indiana       18
## 5 New Jersey    20
## 6 New York      21
ggplot(data = state_counts, mapping = aes(x = State, y = n)) +
  geom_bar(stat = "identity")

Compare this with the following bar chart for which we have observations for each patient rather than a summary of the numbers of each state.

ggplot(data = patients, mapping = aes(x = State)) +
  geom_bar(stat = "count")

By default geom_bar applies the statistical transformation, stat_count, in much the same way as we did above to create the state_counts summary table.

Position adjustments

Earlier we created the following stacked bar chart by setting the fill aesthetic in a geom_bar to a categorical variable. Let’s revisit this.

ggplot(data = patients, mapping = aes(x = Age, fill = Sex)) +
  geom_bar()

We can control the way stacking is performed by specifying a position adjustment using the position argument.

For example, to place the bars for the separate categories side-by-side we can use position = "dodge".

ggplot(data = patients, mapping = aes(x = Age, fill = Sex)) +
  geom_bar(position = "dodge")

Try some of the other values for the position adjustment, e.g. "fill" and "identity",

Another type of position adjustment is "jitter". This is much more useful for scatter plots than for bar charts. We already used geom_jitter to separate points added to a box plot and using position = "jitter" in geom_point() is similar.

position = "jitter" adds a small amount of random noise to each point and is especially useful for data sets that contain multiple observations with the same x and y values, e.g. where these are discrete. The patients dataset isn’t really a good example of where this would be needed so let’s take a look at the mpg data frame that ggplot2 provides as a test dataset.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

The mpg data frame has 234 observations but there are clearly not 234 points on this scatter plot; there are more than one observation with the same value of displ and hwy.

position = "jitter" separates the overlapping points.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

Saving plots

Use ggsave to save the last plot you displayed.

ggsave("myplot.png")

Alternatively, ggsave can be passed an explicit plot object.

plot <- ggplot(data = patients, mapping = aes(x = Smokes, y = BMI, colour = Smokes)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.4, width = 0.25) +
  facet_wrap(~ Sex) +
  scale_colour_brewer(palette = "Set1") +
  theme(legend.position = "none")

ggsave("myplot.png", plot)

The type of plot file, e.g. pdf, svg, png, etc., and the dimensions can be specified, e.g.

ggsave("myplot.pdf", width = 8, height = 6)

Exercises part II - scales, statistical transformations and themes

See separate R markdown document.