This document is work in progress
library(tidyverse)
Read in the patients data frame using read_tsv
from the readr
package. read_tsv
imports tab-delimited files (tsv = tab-separated values).
patients <- read_tsv("data/patient-data-cleaned.txt") %>%
select(Height, Weight, -BMI)
Lets look at a tidyverse function definition
top_n
## function (x, n, wt)
## {
## wt <- enquo(wt)
## if (quo_is_missing(wt)) {
## vars <- tbl_vars(x)
## wt_name <- vars[length(vars)]
## inform(glue("Selecting by ", wt_name))
## wt <- sym(wt_name)
## }
## if (!is_scalar_integerish(n)) {
## abort("`n` must be a scalar integer")
## }
## if (n > 0) {
## quo <- quo(filter(x, min_rank(desc(!!wt)) <= !!n))
## }
## else {
## quo <- quo(filter(x, min_rank(!!wt) <= !!abs(n)))
## }
## eval_tidy(quo)
## }
## <bytecode: 0x55a7531f5018>
## <environment: namespace:dplyr>
function
So the general syntax is:
Function_Name <- function(arg1, arg2, arg3, ...){
Code_Line_1
Code_Line_2
...
}
Let’s create a simple function. This one has no arguments.
my_first_function <- function(){
message("Hello World!")
}
my_first_function()
## Hello World!
Let’s make it a bit more useful by adding an argument
sayHello <- function(name){
message("Hello ", name, "!")
}
sayHello("Ash")
## Hello Ash!
sayHello("Matt")
## Hello Matt!
So far our function just printed a message to the console, but most functions return a value.
In R a function can only return a single object, this can be of any data type, e.g. data.frame or a vector or even a plot, but only one can be returned (there is a way around this using lists, but we won’t discuss them here).
Let’s create a function to calculate BMI.
BMI is defined as weight (kg) / square of height (m) –> kg/m2
bmi <- function(height, weight){
weight / (height ^ 2)
}
bmi(1.75, 78)
## [1] 25.46939
Let’s try that with our patients table.
patients <- patients %>%
mutate(BMI=bmi(Height, Weight))
This isn’t right because the height should be in metres. Let’s add some additional arguments to define the units and then some extra code to adjust the values depending on the units provided.
When specifying the arguments we can set default values. This makes these arguments optional.
In this case we will create arguments for the units for the height and weight. We might like to the set the defaults to “m” and “kg”. Then the user only needs to specify these if they are providing measurements in different units.
bmi <- function(height, weight, height_units="m", weight_units="kg"){
if(height_units == "cm"){ height <- height / 100 }
if(weight_units == "g"){ weight <- weight / 1000 }
weight / (height ^ 2)
}
bmi(1.75, 78)
## [1] 25.46939
patients %>%
mutate(BMI=bmi(Height, Weight, height_units = "cm"))
## # A tibble: 100 x 3
## Height Weight BMI
## <dbl> <dbl> <dbl>
## 1 183. 76.6 22.9
## 2 179. 80.4 25.1
## 3 169. 75.5 26.4
## 4 176. 94.5 30.6
## 5 164. 71.8 26.5
## 6 158. 69.9 27.9
## 7 162. 68.8 26.3
## 8 166. 70.4 25.6
## 9 181. 76.9 23.4
## 10 167. 79.1 28.2
## # … with 90 more rows
You’ll notice that we didn’t explicitly tell the function what object we wanted it to return. There is a function return
to do this, however, it is not usually necessary to use it as in R the function will just return by default the output of the final line of code.
bmi <- function(height, weight, height_units="m", weight_units="kg"){
if(height_units == "cm"){ height <- height / 100 }
if(weight_units == "g"){ weight <- weight / 1000 }
bmi <- weight / (height ^ 2)
return(bmi)
}
bmi(165, 67)
## [1] 0.002460973
diabetes <- read_tsv("data/diabetes.txt")
diabetes %>%
filter(ID == "AC/AH/001") %>%
ggplot(aes(x = Date, y = Glucose)) +
geom_line(colour = "blue") +
labs(title = "Patient: AC/AH/001")
patientID <- "AC/AH/001"
plotTitle <- str_c("Patient: ", patientID)
diabetes %>%
filter(ID == patientID) %>%
ggplot(aes(x = Date, y = Glucose)) +
geom_line(colour = "blue") +
labs(title = plotTitle)
plotGlucose <- function(patientID, dat=diabetes, myColour="blue"){
plotTitle <- str_c("Patient: ", patientID)
dat %>%
filter(ID == patientID) %>%
ggplot(aes(x = Date, y = Glucose)) +
geom_line(colour = myColour) +
labs(title = plotTitle)
}
plotGlucose("AC/AH/104")
plot1 <- plotGlucose("AC/AH/017", myColour="purple")
plot1
R works nicely with vectors
x <- c(1, 2, 3, 4, 5)
log2(x)
## [1] 0.000000 1.000000 1.584963 2.000000 2.321928
but not always the behaviour we want
sum(x, 5)
## [1] 20
so we could do…
sum(1, 5)
## [1] 6
sum(2, 5)
## [1] 7
sum(3, 5)
## [1] 8
sum(4, 5)
## [1] 9
sum(5, 5)
## [1] 10
If you are repeating code like this, you should use a for
loop instead. A for
loop allows us to run the same piece of code multiple times, each time changing the value of one of the input variables.
The basic syntax of a for
loop is:
for(a_variable in a_vector){
some code operating on a_variable
}
The loop will repeatedly run the code inside the curly brackets, each time replacing a_variable
with a value from a_vector
.
Let’s look at a simple concrete example:
for(a_number in c(11, 21, 16, 4, 500)){
sum(a_number, 5)
}
In this case the code between the curly brackets has been run 5 times. The first time a_number
was replaced with 11
- the first value in the vector, the next time a_number
was replaced with 21
- the second value in the vector, and so on.
More commonly we would define the vector first:
some_data <- c(11, 21, 16, 4, 500)
for(a_number in some_data){
sum(a_number, 5)
}
Very often we will use a sequence and extract the relevant value from the vector.
some_data <- c(11, 21, 16, 4, 500)
for(a_number in seq(1, 5)){
new_number <- sum(some_data[a_number], 5)
message("Result number ", a_number, " is ", new_number)
}
## Result number 1 is 16
## Result number 2 is 26
## Result number 3 is 21
## Result number 4 is 9
## Result number 5 is 505
A better way is to use seq_along
some_data <- c(11, 21, 16, 4, 500)
for(a_number in seq_along(some_data)){
new_number <- sum(some_data[a_number], 5)
message("Result number ", a_number, " is ", new_number)
}
## Result number 1 is 16
## Result number 2 is 26
## Result number 3 is 21
## Result number 4 is 9
## Result number 5 is 505
Let’s say that with our patients data we want to export a separate table for each state, containing just the men, with ID, Name, Smokes, BMI and their their average BP from the diabetes table.
states <- unique(patients$State)
avg_bp <- diabetes %>%
group_by(ID) %>%
summarise(AvgBP=mean(BP)) %>%
ungroup()
for(state in states){
outputFile <- str_c("data/", state, "_male_patient_data.tsv") %>%
str_replace(" ", "_")
patients %>%
filter(State == state & Sex == "Male") %>%
left_join(avg_bp, by = "ID") %>%
select(ID, Name, Smokes, BMI, AvgBP) %>%
write_tsv(outputFile)
}
In the diabetes data set we have 100 patients. Lets say we want to plot a time line of glucose levels for each patients separately and output the plots in 4x4 grids to a pdf. We can generate a 4x4 grid of plots easily enough using facet_wrap
in ggplot2, but we’ll need to do this repeatedly on subsets of 16 patients.
There are many ways to solve this problem; this is just one of them.
all_patients <- unique(diabetes$ID)
len <- length(all_patients)
start_index <- seq(1, len, by = 16)
plotGlucose <- function(dat){
ggplot(dat, aes(x = Date, y = Glucose)) +
geom_line(colour = "red") +
geom_point(shape = 21, fill = "red") +
facet_wrap(~ID, ncol=4)
}
pdf("data/My_plots.pdf")
for(stt in start_index){
end <- min(stt + 15, 100)
these_patients <- all_patients[seq(stt, end)]
diabetes %>%
filter(ID %in% these_patients) %>%
plotGlucose() %>%
print()
}
dev.off()
The last page isn’t ideal perhaps. We could solve this with some more advanced plotting using gridExtra
to create the grid of plots from a list
of individual plots. We will also need a nested loop - a loop within a loop.
library(gridExtra)
pdf("data/My_plots.pdf")
for(stt in start_index){
end <- min(stt + 15, 100)
these_patients <- all_patients[seq(stt, end)]
plotList <- list()
for(i in seq_along(these_patients)){
glPlot <- diabetes %>%
filter(ID %in% these_patients[i]) %>%
plotGlucose()
plotList[[i]] <- glPlot
}
grid.arrange(grobs = plotList, ncol = 4, nrow = 4)
}
dev.off()