The week’s assignment will test your ability to manipulate the METABRIC data set by changing the values of existing columns or adding new columns by computing new variables from existing ones.

We are expecting you to use the 5 main dplyr ‘verb’ functions: select(), filter(), arrange(), mutate() and summarize(). Please use the pipe operator, %>%, in cases where more than one operation is required to achieve the outcome requested.

library(tidyverse)
metabric <- read_csv("metabric_clinical_and_expression_data.csv")

1. Investigate the subset of long-surviving breast cancer patients that didn’t receive chemo or radiotherapy

First obtain the subset of patients that received neither chemotherapy or radiotherapy and survived for more than 20 years.

patients_of_interest <- filter(metabric, Chemotherapy == "NO", Radiotherapy == "NO", Survival_time / 12 > 20)
# alternative using separate steps
patients_of_interest <- metabric %>%
  filter(Chemotherapy == "NO") %>%
  filter(Radiotherapy == "NO") %>%
  mutate(Survival_time_years = Survival_time / 12) %>%
  filter(Survival_time_years > 20)
#
patients_of_interest
## # A tibble: 70 x 33
##    Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status
##    <chr>       <dbl>            <dbl>         <dbl> <chr>          
##  1 MB-0270         1             30.0          337. LIVING         
##  2 MB-2517         2             60.3          268. DECEASED       
##  3 MB-2610         2             59.0          281. LIVING         
##  4 MB-2763         2             70.3          276. LIVING         
##  5 MB-2770         2             60.3          274. LIVING         
##  6 MB-2793         2             67.6          267. LIVING         
##  7 MB-2797         2             50.1          253. LIVING         
##  8 MB-2819         2             62.0          271. LIVING         
##  9 MB-2838         2             67.1          270. LIVING         
## 10 MB-2840         2             60.5          269  LIVING         
## # … with 60 more rows, and 28 more variables: Vital_status <chr>,
## #   Chemotherapy <chr>, Radiotherapy <chr>, Tumour_size <dbl>,
## #   Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## #   Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## #   Cancer_type <chr>, ER_status <chr>, PR_status <chr>,
## #   HER2_status <chr>, HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>,
## #   `3-gene_classifier` <chr>, Nottingham_prognostic_index <dbl>,
## #   Cellularity <chr>, Integrative_cluster <chr>, Mutation_count <dbl>,
## #   ESR1 <dbl>, ERBB2 <dbl>, PGR <dbl>, TP53 <dbl>, PIK3CA <dbl>,
## #   GATA3 <dbl>, FOXA1 <dbl>, MLPH <dbl>, Survival_time_years <dbl>

Now look at the breakdown of these patients in terms of ER status. Count the numbers of ER positive and ER negative patients in this subset. Calculate the proportion that are ER positive.

table(patients_of_interest$ER_status)
## 
## Negative Positive 
##        8       62
mean(patients_of_interest$ER_status == "Positive")
## [1] 0.8857143

What does this tell us? Calculate the proportion of ER positive patients in the whole cohort by way of a comparison.

table(metabric$ER_status)
## 
## Negative Positive 
##      445     1459
mean(metabric$ER_status == "Positive")
## [1] 0.7662815

Extension

Create a contingency table of the numbers of ER positive and ER negative patients in the two groups (untreated, long-surviving patients and all others) and perform a Chi-squared test or a Fisher’s exact test (whichever is most appropriate) to determine if there is a significant difference.

# hard-coded values - almost certainly a better way of doing this
contingency_table <- matrix(c(8, 445 - 8, 62, 1459 - 62), ncol = 2)
contingency_table
##      [,1] [,2]
## [1,]    8   62
## [2,]  437 1397
fisher.test(contingency_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.01413
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1692691 0.8736367
## sample estimates:
## odds ratio 
##  0.4126121
#
# alternative using table() having created a categorical variable for our untreated, long-survival status
metabric <- mutate(metabric, untreated_long_survival = Chemotherapy == "NO" & Radiotherapy == "NO" & Survival_time / 12 > 20)
contingency_table <- table(metabric$untreated_long_survival, metabric$ER_status)
fisher.test(contingency_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.01413
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.144641 5.907751
## sample estimates:
## odds ratio 
##   2.423584
#
# or can call fisher.test() using the categorical variables directly
fisher.test(metabric$untreated_long_survival, metabric$ER_status)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  metabric$untreated_long_survival and metabric$ER_status
## p-value = 0.01413
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.144641 5.907751
## sample estimates:
## odds ratio 
##   2.423584

Hint: look at the help page for the function for the Chi-squared or Fisher’s exact test to see what form the contingency table needs to take


2. Which patients have spent the largest proportion of their lives dealing with breast cancer?

metabric %>%
  mutate(Survival_time_years = Survival_time / 12) %>%
  mutate(Age = Age_at_diagnosis + Survival_time_years) %>%
  mutate(Proportion_of_life_since_diagnosis = Survival_time_years / Age) %>%
  select(Patient_ID, Age, Age_at_diagnosis, Survival_time_years, Proportion_of_life_since_diagnosis) %>%
  arrange(desc(Proportion_of_life_since_diagnosis)) %>%
  mutate_if(is.numeric, round, digits = 2)
## # A tibble: 1,904 x 5
##    Patient_ID   Age Age_at_diagnosis Survival_time_ye… Proportion_of_life_…
##    <chr>      <dbl>            <dbl>             <dbl>                <dbl>
##  1 MB-0270     58.1             30.0              28.1                 0.48
##  2 MB-4332     60.0             34.4              25.7                 0.43
##  3 MB-6062     55.1             31.6              23.5                 0.43
##  4 MB-2753     53.8             31.0              22.9                 0.42
##  5 MB-2957     52.9             31.0              21.8                 0.41
##  6 MB-2823     58.5             36.0              22.5                 0.38
##  7 MB-2779     59.7             36.8              22.9                 0.38
##  8 MB-2853     58.7             36.2              22.5                 0.38
##  9 MB-4881     59.9             37.0              22.8                 0.38
## 10 MB-4375     64.3             40.0              24.3                 0.38
## # … with 1,894 more rows

3. Convert the expression values for each of the genes into standardized z-scores

Some genes are generally expressed at higher levels than others. This can make comparisons of changes between groups for a set of genes somewhat difficult, particularly if the expression for those genes are on very different scales. The expression values in our METABRIC are on a log2 scale which helps to reduce the range of values but another method for representing expression measurements is to standardize these to produce z-scores.

Standardization of a set of measurements involves subtracting the mean from each and dividing by the standard deviation. This will produce values with a mean of 0 and a standard deviation of 1.

Create a modified version of the metabric data frame containing a new column with the standardized expression values (z-scores) for the ESR1 gene.

metabric_z_scores <- mutate(metabric, ESR1_z_score = (ESR1 - mean(ESR1)) / sd(ESR1))

Check that you’ve done this correctly by calculating the mean and standard deviation of your new z-score variable.

summarise(metabric_z_scores, mean = mean(ESR1_z_score), sd = sd(ESR1_z_score))
## # A tibble: 1 x 2
##        mean    sd
##       <dbl> <dbl>
## 1 -2.31e-16 1.000

Add another column to your modified metabric data frame containing a z-score for GATA3 and then create a scatter plot of the z-scores of GATA3 against ESR1. Modify your plot to facet by the PAM50 classification.

metabric_z_scores <- mutate(metabric_z_scores, GATA3_z_score = (GATA3 - mean(GATA3)) / sd(GATA3))
ggplot(data = metabric_z_scores) +
  geom_point(mapping = aes(x = GATA3_z_score, y = ESR1_z_score)) +
  facet_wrap(vars(PAM50))

Extension 1

Standardize the expression values for all genes in a single operation using an anonymous function, overwriting their original values, and round the resulting values to 3 significant figures.

metabric_z_scores <- metabric %>%
  mutate_at(vars(ESR1:MLPH), ~ (. - mean(.)) / sd(.)) %>%
  mutate_at(vars(ESR1:MLPH), signif, digits = 3)
metabric_z_scores %>%
  select(Patient_ID, ESR1:MLPH)
## # A tibble: 1,904 x 9
##    Patient_ID   ESR1   ERBB2    PGR    TP53  PIK3CA  GATA3  FOXA1    MLPH
##    <chr>       <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
##  1 MB-0000    -0.318 -1.05   -0.545  0.35   -0.754  -1.71  -1.62  -0.967 
##  2 MB-0002     0.206 -0.763   1.24  -0.0136 -0.602   1.16   0.595  0.696 
##  3 MB-0005     0.203 -0.766   1.12   0.514   2.22   -0.142  0.512 -0.626 
##  4 MB-0006     0.373 -0.317   0.567  1.67    3.54   -0.556  0.606 -0.528 
##  5 MB-0008     0.782 -0.596   1.07   0.348  -0.432   0.144  0.47   0.474 
##  6 MB-0010     0.765 -0.755  -0.277 -1.94    0.434   0.189  0.765  0.0419
##  7 MB-0014     0.556 -1.1     1.45  -0.511   4.29   -0.757  0.389 -0.36  
##  8 MB-0022     0.39  -1.59   -0.632 -0.081   4.6    -1.08  -0.069 -0.84  
##  9 MB-0028     1.37  -0.0642 -0.893  0.0558  0.796   0.504  0.768 -0.253 
## 10 MB-0035    -0.971  0.552  -0.636  0.531   0.0515  0.472  1.14   1.25  
## # … with 1,894 more rows

Check you have done this correctly by computing the mean and standard deviation for each column, again using a single operation.

summarize_at(metabric_z_scores, vars(ESR1:PGR), list(mean = mean, sd = sd))
## # A tibble: 1 x 6
##    ESR1_mean ERBB2_mean   PGR_mean ESR1_sd ERBB2_sd PGR_sd
##        <dbl>      <dbl>      <dbl>   <dbl>    <dbl>  <dbl>
## 1 0.00000218  0.0000470 0.00000362    1.00     1.00  1.000

To be doubly certain, you could also download the expression z-score values for these genes from cBioPortal.

Extension 2

Create a plot comparing the distribution of standardized expression values for TP53 against a normal distribution.

ggplot(data = metabric_z_scores) +
  geom_density(mapping = aes(x = TP53)) +
  stat_function(fun = dnorm, colour = "blue")

Hint: use stat_function()


4. Which Star Wars characters need to go on a diet?

Compute the body mass index (BMI) of characters in the starwars tibble.

starwars <- starwars %>%
  mutate(height = height / 100) %>%
  mutate(BMI = mass / height ^ 2)

Filter for human characters that are overweight (BMI > 25) and display in decreasing order of BMI.

starwars %>%
  filter(species == "Human") %>%
  filter(BMI > 25) %>%
  arrange(desc(BMI)) %>%
  select(name, height, mass, BMI)
## # A tibble: 10 x 4
##    name               height  mass   BMI
##    <chr>               <dbl> <dbl> <dbl>
##  1 Owen Lars            1.78   120  37.9
##  2 Jek Tono Porkins     1.8    110  34.0
##  3 Darth Vader          2.02   136  33.3
##  4 Beru Whitesun lars   1.65    75  27.5
##  5 Wedge Antilles       1.7     77  26.6
##  6 Luke Skywalker       1.72    77  26.0
##  7 Palpatine            1.7     75  26.0
##  8 Lobot                1.75    79  25.8
##  9 Lando Calrissian     1.77    79  25.2
## 10 Biggs Darklighter    1.83    84  25.1