What you will learn on this course

  • How to clean "messy" datasets to make them more amenable to exploratory data analysis

  • How to manipulate and transform tabular data in R using dplyr

  • How to visualize data using the popular ggplot2 package

  • Some of the Tidyverse collection of R packages designed for data science

Why not just use Excel?

Spreadsheets are a common entry point for many types of analysis and Excel is widely used, but:

  • it can be unwieldy and difficult to work with when handling large amounts of data

  • it is error prone (e.g. gene symbols turning into dates)

  • it is tedious and time consuming to repeatedly process multiple files

  • it is hard for you, or someone else, to repeat what you did several months or years down the line

Aim of the course

The course aims to translate how we think of data in spreadsheets into a series of operations that can be performed and chained together in R


The problem with R

There are many hundreds (thousands!) of functions for us to choose from to achieve our goals and everyone has their own set of favourites

e.g. joining data from two tables (data frames) based on a common variable or key

# base R
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

# data.table package
library(data.table)
dt1 <- data.table(df1, key = "CustomerId")
dt2 <- data.table(df2, key = "CustomerId")
dt2[dt1]  # X[Y] keeps every row of Y, so this keeps all rows of df1

# plyr package
library(plyr)
join(df1, df2, by = "CustomerId", type = "left")

# dplyr package
library(dplyr)
left_join(df1, df2, by = "CustomerId")
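
To make the comparison concrete, here is a minimal sketch that runs the base R and dplyr versions on two small made-up tables. The contents of df1 and df2 (and the Product and State columns) are illustrative assumptions, not course data.

```r
library(dplyr)

# Hypothetical example tables sharing the key column CustomerId
df1 <- data.frame(CustomerId = c(1, 2, 3),
                  Product = c("Toaster", "Kettle", "Radio"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(1, 3, 4),
                  State = c("Alabama", "Ohio", "Texas"),
                  stringsAsFactors = FALSE)

# Left join in base R: all.x = TRUE keeps unmatched rows of df1
base_result <- merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

# The same left join with dplyr
dplyr_result <- left_join(df1, df2, by = "CustomerId")

# Customer 2 has no match in df2, so its State is NA in both results
base_result
dplyr_result
```

Both calls keep all three rows of df1 and drop customer 4, who appears only in df2.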

You know what you want to do but how do you find the right function to use?


This course introduces an increasingly popular set of tools that can help us to explore data in a consistent and pipeline-able manner

    → the "tidyverse"

Tidyverse tools covered in this course


  • readr – reading tabular data into a data frame in R
  • tidyr – tools for creating tidy data frames
  • dplyr – a consistent set of verbs for solving most data manipulation challenges
  • ggplot2 – a system for declaratively creating plots based on the Grammar of Graphics
  • stringr – string matching, extraction, replacement and joining operations
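
As a small taste of one of these packages, the sketch below uses stringr to tidy up some inconsistent text values. The vectors sex and height are made up for illustration and are not taken from the course dataset.

```r
library(stringr)

# Hypothetical messy values, similar in spirit to an uncleaned dataset
sex <- c(" Male", "female ", "MALE")
height <- c("182cm", "165 cm", "170cm")

# Trim surrounding whitespace and standardise capitalisation
sex_clean <- str_to_title(str_trim(sex))

# Remove the unit suffix, trim any leftover space and convert to numbers
height_cm <- as.numeric(str_trim(str_remove(height, "cm")))

sex_clean
height_cm
```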

Course outline

Session 1 (26th April)

  • Visualization with ggplot2 - Grammar of Graphics, basic plots

Session 2 (3rd May)

  • Visualization with ggplot2 - Facetting
  • Tidying and transforming data - Tidy Data: tidyr intro and dplyr select

Session 3 (10th May)

  • Tidying and transforming data - Cleaning Data: stringr and dplyr mutate

Session 4 (17th May)

  • Workflows - piping and dplyr arrange, filter

Session 5 (24th May)

  • Summarizing, grouping and combining data

Session 6 (31st May)

  • Customizing plots

How we teach the course

  • "Live coding" in RStudio (no more slides!)

  • Exercises in R markdown documents combining narrative text and code chunks

  • Post-it notes

  • Feedback questionnaire
    • Really does help us improve the course for next time

The Patients dataset

Some data manipulations we will perform

  • Cleaning and tidying the very messy original form of the patients dataset

  • Selecting a subset of columns to create a smaller data frame

  • Creating new columns (variables) from existing ones, e.g. calculating body mass index (BMI) from height and weight

  • Sorting by specified variables

  • Filtering rows (observations)

  • Chaining operations together in workflows

  • Grouping and summarizing observations, e.g. calculating mean BMI for smokers and non-smokers

  • Combining data from two or more tables
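
Several of the manipulations listed above can be sketched in a single dplyr workflow. This is only an illustration under stated assumptions: patients_demo is a made-up stand-in for the real data, and the units (Height in cm, Weight in kg) and the "Yes"/"No" coding of Smokes are assumed for the example.

```r
library(dplyr)

# Hypothetical stand-in for the patients data
# (assumed units: Height in cm, Weight in kg)
patients_demo <- data.frame(
  ID = c("P1", "P2", "P3", "P4"),
  Smokes = c("Yes", "No", "Yes", "No"),
  Height = c(180, 165, 172, 158),
  Weight = c(81, 60, 90, 50),
  stringsAsFactors = FALSE
)

bmi_summary <- patients_demo %>%
  select(ID, Smokes, Height, Weight) %>%       # select a subset of columns
  mutate(BMI = Weight / (Height / 100)^2) %>%  # create a new column from existing ones
  filter(Height > 160) %>%                     # filter rows (observations)
  arrange(desc(BMI)) %>%                       # sort by a variable
  group_by(Smokes) %>%                         # group observations
  summarise(mean_BMI = mean(BMI))              # one summary row per group

bmi_summary
```

Each step takes a data frame and returns a data frame, which is what allows the operations to be chained with the pipe.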

Some of the plots we will create
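
As a rough idea of what a ggplot2 plot looks like in code, here is a minimal scatter plot sketch. The data frame demo and its values are invented for the example; the plots shown in the course use the patients dataset instead.

```r
library(ggplot2)

# Made-up data for illustration only
demo <- data.frame(
  Height = c(180, 165, 172, 158, 175, 169),
  Weight = c(81, 60, 90, 50, 77, 68),
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female")
)

# Map variables to aesthetics, then add a layer of points
p <- ggplot(demo, aes(x = Height, y = Weight, colour = Sex)) +
  geom_point() +
  labs(x = "Height", y = "Weight")

p  # printing the object draws the plot
```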

Getting started

Install the tidyverse packages

install.packages("tidyverse")

Load the core tidyverse packages

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
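
The conflicts message means that a bare call to filter() or lag() now refers to the dplyr version. The masked stats functions have not gone away; as a small sketch (with a made-up data frame df), they can still be reached with an explicit package prefix.

```r
library(dplyr)

# Made-up data for illustration
df <- data.frame(x = c(1, 5, 10))

# With dplyr loaded, filter() is dplyr::filter(), which keeps matching rows
big <- filter(df, x > 2)

# The masked function is still available via its namespace prefix;
# stats::filter() applies a linear filter, here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

big
smoothed
```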

Reading the patients dataset into R

patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   Name = col_character(),
##   Sex = col_character(),
##   Smokes = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Birth = col_date(format = ""),
##   State = col_character(),
##   Grade = col_double(),
##   Died = col_logical(),
##   Count = col_double(),
##   Date.Entered.Study = col_date(format = ""),
##   Age = col_double(),
##   BMI = col_double(),
##   Overweight = col_logical()
## )
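
The message above reports the column types that readr guessed. One option, sketched here, is to pass the specification explicitly through the col_types argument so the types are fixed in the code. The file written below is a tiny made-up example (hypothetical IDs and values) used only to keep the sketch self-contained; it covers just four of the columns listed above.

```r
library(readr)

# Write a small tab-separated file with made-up rows
demo_file <- tempfile(fileext = ".txt")
writeLines(c("ID\tHeight\tBirth\tDied",
             "P001\t182.87\t1972-02-06\tFALSE",
             "P002\t179.12\t1972-06-15\tTRUE"),
           demo_file)

# Declare the column types up front instead of relying on guessing
demo <- read_tsv(demo_file, col_types = cols(
  ID = col_character(),
  Height = col_double(),
  Birth = col_date(format = ""),
  Died = col_logical()
))

demo
```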

Resources