This document is work in progress

Introduction

Load the tidyverse

library(tidyverse)

Read the patients data into R environment

Read in the patients data frame using read_tsv from the readr package. read_tsv imports tab-delimited files (tsv = tab-separated values).

patients <- read_tsv("data/patient-data-cleaned.txt") %>%  
    select(Height, Weight, -BMI)

Writing our own functions

Basic syntax to generate a function

Lets look at a tidyverse function definition

top_n
## function (x, n, wt) 
## {
##     wt <- enquo(wt)
##     if (quo_is_missing(wt)) {
##         vars <- tbl_vars(x)
##         wt_name <- vars[length(vars)]
##         inform(glue("Selecting by ", wt_name))
##         wt <- sym(wt_name)
##     }
##     if (!is_scalar_integerish(n)) {
##         abort("`n` must be a scalar integer")
##     }
##     if (n > 0) {
##         quo <- quo(filter(x, min_rank(desc(!!wt)) <= !!n))
##     }
##     else {
##         quo <- quo(filter(x, min_rank(!!wt) <= !!abs(n)))
##     }
##     eval_tidy(quo)
## }
## <bytecode: 0x55a7531f5018>
## <environment: namespace:dplyr>
  • The command to create a function is function
  • This is followed by the arguments of the function in brackets
  • Finally the code that the function will execute is between curly brackets

So the general syntax is:

Function_Name <- function(arg1, arg2, arg3, ...){
    Code_Line_1
    Code_Line_2
    ...
}

Let’s create a simple function. This one has no arguments.

my_first_function  <- function(){
    message("Hello World!")
}

my_first_function()
## Hello World!

Let’s make it a bit more useful by adding an argument

sayHello <- function(name){
    message("Hello ", name, "!")
}

sayHello("Ash")
## Hello Ash!
sayHello("Matt")
## Hello Matt!

Functions with output

So far our function just printed a message to the console, but most functions return a value.

In R a function can only return a single object, this can be of any data type, e.g. data.frame or a vector or even a plot, but only one can be returned (there is a way around this using lists, but we won’t discuss them here).

Let’s create a function to calculate BMI.

BMI is defined as weight (kg) / square of height (m) –> kg/m2

bmi <- function(height, weight){
    weight / (height ^ 2)
}
bmi(1.75, 78)
## [1] 25.46939

Let’s try that with our patients table.

patients <- patients  %>%  
    mutate(BMI=bmi(Height, Weight))

This isn’t right because the height should be in metres. Let’s add some additional arguments to define the units and then some extra code to adjust the values depending on the units provided.

Defaults

When specifying the arguments we can set default values. This makes these arguments optional.

In this case we will create arguments for the units for the height and weight. We might like to the set the defaults to “m” and “kg”. Then the user only needs to specify these if they are providing measurements in different units.

bmi <- function(height, weight, height_units="m", weight_units="kg"){
    if(height_units == "cm"){ height <- height / 100 }
    if(weight_units == "g"){ weight <- weight / 1000 }
    weight / (height ^ 2)
}

bmi(1.75, 78)
## [1] 25.46939
patients  %>%  
    mutate(BMI=bmi(Height, Weight, height_units = "cm"))
## # A tibble: 100 x 3
##    Height Weight   BMI
##     <dbl>  <dbl> <dbl>
##  1   183.   76.6  22.9
##  2   179.   80.4  25.1
##  3   169.   75.5  26.4
##  4   176.   94.5  30.6
##  5   164.   71.8  26.5
##  6   158.   69.9  27.9
##  7   162.   68.8  26.3
##  8   166.   70.4  25.6
##  9   181.   76.9  23.4
## 10   167.   79.1  28.2
## # … with 90 more rows

Returned object

You’ll notice that we didn’t explicitly tell the function what object we wanted it to return. There is a function return to do this, however, it is not usually necessary to use it as in R the function will just return by default the output of the final line of code.

bmi <- function(height, weight, height_units="m", weight_units="kg"){
    if(height_units == "cm"){ height <- height / 100 }
    if(weight_units == "g"){ weight <- weight / 1000 }
    bmi <- weight / (height ^ 2)
    return(bmi)
}
bmi(165, 67)
## [1] 0.002460973

Generating a plot with a function

diabetes <- read_tsv("data/diabetes.txt")
diabetes %>%  
    filter(ID == "AC/AH/001") %>%  
    ggplot(aes(x = Date, y = Glucose)) +
        geom_line(colour = "blue") +
        labs(title = "Patient: AC/AH/001")

patientID <- "AC/AH/001"
plotTitle <- str_c("Patient: ", patientID)

diabetes %>%  
    filter(ID == patientID) %>%  
    ggplot(aes(x = Date, y = Glucose)) +
        geom_line(colour = "blue") +
        labs(title = plotTitle)

plotGlucose <- function(patientID, dat=diabetes, myColour="blue"){
    plotTitle <- str_c("Patient: ", patientID)
    dat %>%  
        filter(ID == patientID) %>%  
        ggplot(aes(x = Date, y = Glucose)) +
            geom_line(colour = myColour) +
            labs(title = plotTitle)
}

plotGlucose("AC/AH/104")

plot1 <- plotGlucose("AC/AH/017", myColour="purple")
plot1

Doing repetitive tasks with loops

R works nicely with vectors

x <- c(1, 2, 3, 4, 5)
log2(x)
## [1] 0.000000 1.000000 1.584963 2.000000 2.321928

but not always the behaviour we want

sum(x, 5)
## [1] 20

so we could do…

sum(1, 5)
## [1] 6
sum(2, 5)
## [1] 7
sum(3, 5)
## [1] 8
sum(4, 5)
## [1] 9
sum(5, 5)
## [1] 10

If you are repeating code like this, you should use a for loop instead. A for loop allows us to run the same piece of code multiple times, each time changing the value of one of the input variables.

Basic syntax

The basic syntax of a for loop is:

for(a_variable in a_vector){
    some code operating on a_variable
}

The loop will repeatedly run the code inside the curly brackets, each time replacing a_variable with a value from a_vector.

Let’s look at a simple concrete example:

for(a_number in c(11, 21, 16, 4, 500)){
    sum(a_number, 5)
}

In this case the code between the curly brackets has been run 5 times. The first time a_number was replaced with 11 - the first value in the vector, the next time a_number was replaced with 21 - the second value in the vector, and so on.

More commonly we would define the vector first:

some_data <- c(11, 21, 16, 4, 500)
for(a_number in some_data){
    sum(a_number, 5)
}

Very often we will use a sequence and extract the relevant value from the vector.

some_data <- c(11, 21, 16, 4, 500)
for(a_number in seq(1, 5)){
    new_number <- sum(some_data[a_number], 5)
    message("Result number ", a_number, " is ", new_number)
}
## Result number 1 is 16
## Result number 2 is 26
## Result number 3 is 21
## Result number 4 is 9
## Result number 5 is 505

A better way is to use seq_along

some_data <- c(11, 21, 16, 4, 500)
for(a_number in seq_along(some_data)){
    new_number <- sum(some_data[a_number], 5)
    message("Result number ", a_number, " is ", new_number)
}
## Result number 1 is 16
## Result number 2 is 26
## Result number 3 is 21
## Result number 4 is 9
## Result number 5 is 505

A more realistic example

Let’s say that with our patients data we want to export a separate table for each state, containing just the men, with ID, Name, Smokes, BMI and their their average BP from the diabetes table.

states <- unique(patients$State)

avg_bp  <- diabetes %>%  
    group_by(ID) %>%  
    summarise(AvgBP=mean(BP)) %>%  
    ungroup()

for(state in states){
    outputFile <- str_c("data/", state, "_male_patient_data.tsv") %>%  
        str_replace(" ", "_")
    patients  %>%  
        filter(State == state & Sex == "Male") %>%  
        left_join(avg_bp, by = "ID") %>%  
        select(ID, Name, Smokes, BMI, AvgBP) %>%  
        write_tsv(outputFile)
}

Emma’s problem - A loop with plotting

In the diabetes data set we have 100 patients. Lets say we want to plot a time line of glucose levels for each patients separately and output the plots in 4x4 grids to a pdf. We can generate a 4x4 grid of plots easily enough using facet_wrap in ggplot2, but we’ll need to do this repeatedly on subsets of 16 patients.

There are many ways to solve this problem; this is just one of them.

all_patients <- unique(diabetes$ID)
len <- length(all_patients)
start_index <- seq(1, len, by = 16)

plotGlucose <- function(dat){
    ggplot(dat, aes(x = Date, y = Glucose)) +
        geom_line(colour = "red") +
        geom_point(shape = 21, fill = "red") +
        facet_wrap(~ID, ncol=4)
}

pdf("data/My_plots.pdf")
for(stt in start_index){
    end <- min(stt + 15, 100)
    these_patients <- all_patients[seq(stt, end)]
    diabetes %>%  
        filter(ID %in% these_patients) %>%  
        plotGlucose() %>%  
        print()
}
dev.off()

The last page isn’t ideal perhaps. We could solve this with some more advanced plotting using gridExtra to create the grid of plots from a list of individual plots. We will also need a nested loop - a loop within a loop.

library(gridExtra)

pdf("data/My_plots.pdf")
for(stt in start_index){
    end <- min(stt + 15, 100)
    these_patients <- all_patients[seq(stt, end)]
    plotList <- list() 
    for(i in seq_along(these_patients)){
        glPlot <- diabetes %>%  
            filter(ID %in% these_patients[i]) %>%  
            plotGlucose()
        plotList[[i]] <- glPlot
    }
    grid.arrange(grobs = plotList, ncol = 4, nrow = 4)
}
dev.off()