The week’s assignment will test your ability to create various different types of plot. It includes some extension exercises which you can consider optional.

library(tidyverse)
metabric <- read_csv("metabric_clinical_and_expression_data.csv")

1. Create a bar plot showing the number of patients with tumours at each of the different neoplasm histologic grades

ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Neoplasm_histologic_grade))
## Warning: Removed 72 rows containing non-finite values (stat_count).

Turn this into a stacked bar plot showing the proportions of patients with each histologic grade that had chemotherapy

ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Neoplasm_histologic_grade, fill = Chemotherapy))
## Warning: Removed 72 rows containing non-finite values (stat_count).

Extension

ggplot gives a warning about non-finite values, i.e. the missing values. Find out how to get rid of this warning.

ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Neoplasm_histologic_grade, fill = Chemotherapy), na.rm = TRUE)

Let’s say we are interested in displaying the missing values as a separate category and recreate the bar plot with an additional column.

metabric$Neoplasm_histologic_grade[is.na(metabric$Neoplasm_histologic_grade)] <- "ungraded"
#
# alternatively convert Neoplasm_histologic_grade to a factor
# metabric$Neoplasm_histologic_grade <- as.factor(metabric$Neoplasm_histologic_grade)
#
ggplot(data = metabric) +
  geom_bar(mapping = aes(x = Neoplasm_histologic_grade, fill = Chemotherapy))


2. Create a density plot of the estrogen receptor (ESR1) expression level

ggplot(data = metabric) +
  geom_density(mapping = aes(x = ESR1))

Hint: this is a new plot type – look up the geom you need by looking at the list of functions for the ggplot2 package in the Packages tab in RStudio.

Create separate but overlaying density plots for ESR1 expression for HER2 positive and negative patients. Try to ensure that both density plots are visible.

ggplot(data = metabric) +
  geom_density(mapping = aes(x = ESR1, fill = HER2_status), alpha = 0.5)


3. Create box plots for the survival time of patients with different tumour stages only including deceased patients

deceased <- metabric[metabric$Survival_status == "DECEASED", ]

# a better alternative would be to subset based on the Vital_status selecting
# only those patients that died of the disease rather than an unrelated cause
deceased <- metabric[metabric$Vital_status == "Died of Disease", ]

ggplot(data = deceased) +
  geom_boxplot(mapping = aes(x = Tumour_stage, y = Survival_time, group = Tumour_stage))
## Warning: Removed 154 rows containing missing values (stat_boxplot).

# alternatively convert Tumour_stage to a factor
deceased$Tumour_stage <- as.factor(deceased$Tumour_stage)

ggplot(data = deceased) +
  geom_boxplot(mapping = aes(x = Tumour_stage, y = Survival_time))
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Hint: if you get a warning message and only have a single box displayed consider the type of the tumour stage variable as read into R and what kind of variable is expected for the x aesthetic.

Convert this into a violin plot and display ER negative and ER positive patients separately.

ggplot(data = deceased) +
  geom_violin(mapping = aes(x = Tumour_stage, y = Survival_time, fill = ER_status))
## Warning: Removed 1 rows containing non-finite values (stat_ydensity).

Hint: look at the help for geom_boxplot.


4. Explore the correlation between FOXA1 and one other gene

A recently-published paper looks at associations of expression and genomic data in the METABRIC data set. It found a strong correlation between FOXA1 and one of the other genes included in our table. Can you find which one?

result <- cor.test(metabric$FOXA1, metabric$MLPH)
result
## 
##  Pearson's product-moment correlation
## 
## data:  metabric$FOXA1 and metabric$MLPH
## t = 89.263, df = 1902, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8894737 0.9068158
## sample estimates:
##       cor 
## 0.8984947

Hint: the stats package contains a correlation test.

Create a scatter plot of the expression of FOXA1 and this gene with a line of best fit.

ggplot(data = metabric, mapping = aes(x = FOXA1, y = MLPH)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Change the method used to draw the line of best fit to be a straight line.

ggplot(data = metabric, mapping = aes(x = FOXA1, y = MLPH)) +
  geom_point() +
  geom_smooth(method = "lm")

Extension

Fit a linear model to these data (again you’ll need to look up the appropriate function in R).

result <- lm(formula = MLPH ~ FOXA1, data = metabric)
result
## 
## Call:
## lm(formula = MLPH ~ FOXA1, data = metabric)
## 
## Coefficients:
## (Intercept)        FOXA1  
##      2.0273       0.8643

Pass the resulting object to the summary() function for information about the fit and capture the result in another object called summary. Print this out.

summary <- summary(result)
summary
## 
## Call:
## lm(formula = MLPH ~ FOXA1, data = metabric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2482 -0.4374  0.0143  0.4572  3.1692 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.027283   0.105949   19.13   <2e-16 ***
## FOXA1       0.864319   0.009683   89.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.741 on 1902 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.8072 
## F-statistic:  7968 on 1 and 1902 DF,  p-value: < 2.2e-16

Extract the R-squared value from the summary and assign this to another variable and print it out.

r2 <- summary$r.squared
r2
## [1] 0.8072928

Hint: what type of object is the summary?


5. Explore the Apple Mobility data on transport use during the coronavirus lockdown using line plots

On the weekend the Government included in their daily press conference a new set of data of transport use as measured by direction requests on Apple Maps.

Search for and download the UK Government data for the Sunday briefing (26 April 2020).

Hint: use the search terms “uk government coronavirus press conference data”.

Load the ‘Apple Mobility’ sheet into a tibble in R. You will need to specify the worksheet and a few other arguments as the data are not in the most convenient format (see how the transport use worksheet was read into R in the course materials).

library(readxl)
apple_mobility <- read_excel("2020-04-26_COVID-19_Press_Conference_Data.xlsx",
                             sheet = "Apple Mobility",
                             col_names = c("date", "transport_type", "percentage", "rolling_average"),
                             skip = 4,
                             n_max = 309)
apple_mobility
## # A tibble: 309 x 4
##    date                transport_type   percentage rolling_average
##    <dttm>              <chr>                 <dbl>           <dbl>
##  1 2020-01-13 00:00:00 Driving                100               NA
##  2 2020-01-13 00:00:00 Public Transport       100               NA
##  3 2020-01-13 00:00:00 Walking                100               NA
##  4 2020-01-14 00:00:00 Driving                105.              NA
##  5 2020-01-14 00:00:00 Public Transport       104.              NA
##  6 2020-01-14 00:00:00 Walking                106.              NA
##  7 2020-01-15 00:00:00 Driving                106.              NA
##  8 2020-01-15 00:00:00 Public Transport       105.              NA
##  9 2020-01-15 00:00:00 Walking                114.              NA
## 10 2020-01-16 00:00:00 Driving                104.              NA
## # … with 299 more rows

Plot the 7-day rolling average against date as a line graph with separate lines for each mode of transport.

ggplot(data = apple_mobility) +
  geom_line(mapping = aes(x = date, y = rolling_average, colour = transport_type))
## Warning: Removed 18 rows containing missing values (geom_path).

What is the main (and obvious) conclusion to draw from this plot?

Now plot the daily percentage instead of the 7-day rolling average.

ggplot(data = apple_mobility) +
  geom_line(mapping = aes(x = date, y = percentage, colour = transport_type))

What do you notice about daily fluctuations?

Extension

Create a new column for the weekday (taking values Sunday, Monday, Tuesday, etc.) in the Apple Mobility table and plot just the daily percentage for walking but now as separate lines for each day of the week.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
apple_mobility$day <- wday(apple_mobility$date, label = TRUE)

walking <- apple_mobility[apple_mobility$transport_type == "Walking", ]

ggplot(data = walking) +
  geom_line(mapping = aes(x = date, y = percentage, colour = day))

Hint: use a function from the lubridate package, another of the tidyverse packages (install if necessary).