The week’s assignment will test your ability to manipulate the METABRIC data set by changing the values of existing columns or adding new columns by computing new variables from existing ones.
We are expecting you to use the 5 main dplyr ‘verb’ functions: select()
, filter()
, arrange()
, mutate()
and summarize()
. Please use the pipe operator, %>%
, in cases where more than one operation is required to achieve the outcome requested.
library(tidyverse)
metabric <- read_csv("metabric_clinical_and_expression_data.csv")
1. Investigate the subset of long-surviving breast cancer patients that didn’t receive chemo or radiotherapy
First obtain the subset of patients that received neither chemotherapy or radiotherapy and survived for more than 20 years.
patients_of_interest <- filter(metabric, Chemotherapy == "NO", Radiotherapy == "NO", Survival_time / 12 > 20)
# alternative using separate steps
patients_of_interest <- metabric %>%
filter(Chemotherapy == "NO") %>%
filter(Radiotherapy == "NO") %>%
mutate(Survival_time_years = Survival_time / 12) %>%
filter(Survival_time_years > 20)
#
patients_of_interest
## # A tibble: 70 x 33
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 MB-0270 1 30.0 337. LIVING
## 2 MB-2517 2 60.3 268. DECEASED
## 3 MB-2610 2 59.0 281. LIVING
## 4 MB-2763 2 70.3 276. LIVING
## 5 MB-2770 2 60.3 274. LIVING
## 6 MB-2793 2 67.6 267. LIVING
## 7 MB-2797 2 50.1 253. LIVING
## 8 MB-2819 2 62.0 271. LIVING
## 9 MB-2838 2 67.1 270. LIVING
## 10 MB-2840 2 60.5 269 LIVING
## # … with 60 more rows, and 28 more variables: Vital_status <chr>,
## # Chemotherapy <chr>, Radiotherapy <chr>, Tumour_size <dbl>,
## # Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>,
## # HER2_status <chr>, HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>,
## # `3-gene_classifier` <chr>, Nottingham_prognostic_index <dbl>,
## # Cellularity <chr>, Integrative_cluster <chr>, Mutation_count <dbl>,
## # ESR1 <dbl>, ERBB2 <dbl>, PGR <dbl>, TP53 <dbl>, PIK3CA <dbl>,
## # GATA3 <dbl>, FOXA1 <dbl>, MLPH <dbl>, Survival_time_years <dbl>
Now look at the breakdown of these patients in terms of ER status. Count the numbers of ER positive and ER negative patients in this subset. Calculate the proportion that are ER positive.
table(patients_of_interest$ER_status)
##
## Negative Positive
## 8 62
mean(patients_of_interest$ER_status == "Positive")
## [1] 0.8857143
What does this tell us? Calculate the proportion of ER positive patients in the whole cohort by way of a comparison.
table(metabric$ER_status)
##
## Negative Positive
## 445 1459
mean(metabric$ER_status == "Positive")
## [1] 0.7662815
Extension
Create a contingency table of the numbers of ER positive and ER negative patients in the two groups (untreated, long-surviving patients and all others) and perform a Chi-squared test or a Fisher’s exact test (whichever is most appropriate) to determine if there is a significant difference.
# hard-coded values - almost certainly a better way of doing this
contingency_table <- matrix(c(8, 445 - 8, 62, 1459 - 62), ncol = 2)
contingency_table
## [,1] [,2]
## [1,] 8 62
## [2,] 437 1397
fisher.test(contingency_table)
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.01413
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1692691 0.8736367
## sample estimates:
## odds ratio
## 0.4126121
#
# alternative using table() having created a categorical variable for our untreated, long-survival status
metabric <- mutate(metabric, untreated_long_survival = Chemotherapy == "NO" & Radiotherapy == "NO" & Survival_time / 12 > 20)
contingency_table <- table(metabric$untreated_long_survival, metabric$ER_status)
fisher.test(contingency_table)
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.01413
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.144641 5.907751
## sample estimates:
## odds ratio
## 2.423584
#
# or can call fisher.test() using the categorical variables directly
fisher.test(metabric$untreated_long_survival, metabric$ER_status)
##
## Fisher's Exact Test for Count Data
##
## data: metabric$untreated_long_survival and metabric$ER_status
## p-value = 0.01413
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.144641 5.907751
## sample estimates:
## odds ratio
## 2.423584
Hint: look at the help page for the function for the Chi-squared or Fisher’s exact test to see what form the contingency table needs to take
2. Which patients have spent the largest proportion of their lives dealing with breast cancer?
metabric %>%
mutate(Survival_time_years = Survival_time / 12) %>%
mutate(Age = Age_at_diagnosis + Survival_time_years) %>%
mutate(Proportion_of_life_since_diagnosis = Survival_time_years / Age) %>%
select(Patient_ID, Age, Age_at_diagnosis, Survival_time_years, Proportion_of_life_since_diagnosis) %>%
arrange(desc(Proportion_of_life_since_diagnosis)) %>%
mutate_if(is.numeric, round, digits = 2)
## # A tibble: 1,904 x 5
## Patient_ID Age Age_at_diagnosis Survival_time_ye… Proportion_of_life_…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MB-0270 58.1 30.0 28.1 0.48
## 2 MB-4332 60.0 34.4 25.7 0.43
## 3 MB-6062 55.1 31.6 23.5 0.43
## 4 MB-2753 53.8 31.0 22.9 0.42
## 5 MB-2957 52.9 31.0 21.8 0.41
## 6 MB-2823 58.5 36.0 22.5 0.38
## 7 MB-2779 59.7 36.8 22.9 0.38
## 8 MB-2853 58.7 36.2 22.5 0.38
## 9 MB-4881 59.9 37.0 22.8 0.38
## 10 MB-4375 64.3 40.0 24.3 0.38
## # … with 1,894 more rows
3. Convert the expression values for each of the genes into standardized z-scores
Some genes are generally expressed at higher levels than others. This can make comparisons of changes between groups for a set of genes somewhat difficult, particularly if the expression for those genes are on very different scales. The expression values in our METABRIC are on a log2 scale which helps to reduce the range of values but another method for representing expression measurements is to standardize these to produce z-scores.
Standardization of a set of measurements involves subtracting the mean from each and dividing by the standard deviation. This will produce values with a mean of 0 and a standard deviation of 1.
Create a modified version of the metabric
data frame containing a new column with the standardized expression values (z-scores) for the ESR1 gene.
metabric_z_scores <- mutate(metabric, ESR1_z_score = (ESR1 - mean(ESR1)) / sd(ESR1))
Check that you’ve done this correctly by calculating the mean and standard deviation of your new z-score variable.
summarise(metabric_z_scores, mean = mean(ESR1_z_score), sd = sd(ESR1_z_score))
## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -2.31e-16 1.000
Add another column to your modified metabric
data frame containing a z-score for GATA3 and then create a scatter plot of the z-scores of GATA3 against ESR1. Modify your plot to facet by the PAM50 classification.
metabric_z_scores <- mutate(metabric_z_scores, GATA3_z_score = (GATA3 - mean(GATA3)) / sd(GATA3))
ggplot(data = metabric_z_scores) +
geom_point(mapping = aes(x = GATA3_z_score, y = ESR1_z_score)) +
facet_wrap(vars(PAM50))
Extension 1
Standardize the expression values for all genes in a single operation using an anonymous function, overwriting their original values, and round the resulting values to 3 significant figures.
metabric_z_scores <- metabric %>%
mutate_at(vars(ESR1:MLPH), ~ (. - mean(.)) / sd(.)) %>%
mutate_at(vars(ESR1:MLPH), signif, digits = 3)
metabric_z_scores %>%
select(Patient_ID, ESR1:MLPH)
## # A tibble: 1,904 x 9
## Patient_ID ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MB-0000 -0.318 -1.05 -0.545 0.35 -0.754 -1.71 -1.62 -0.967
## 2 MB-0002 0.206 -0.763 1.24 -0.0136 -0.602 1.16 0.595 0.696
## 3 MB-0005 0.203 -0.766 1.12 0.514 2.22 -0.142 0.512 -0.626
## 4 MB-0006 0.373 -0.317 0.567 1.67 3.54 -0.556 0.606 -0.528
## 5 MB-0008 0.782 -0.596 1.07 0.348 -0.432 0.144 0.47 0.474
## 6 MB-0010 0.765 -0.755 -0.277 -1.94 0.434 0.189 0.765 0.0419
## 7 MB-0014 0.556 -1.1 1.45 -0.511 4.29 -0.757 0.389 -0.36
## 8 MB-0022 0.39 -1.59 -0.632 -0.081 4.6 -1.08 -0.069 -0.84
## 9 MB-0028 1.37 -0.0642 -0.893 0.0558 0.796 0.504 0.768 -0.253
## 10 MB-0035 -0.971 0.552 -0.636 0.531 0.0515 0.472 1.14 1.25
## # … with 1,894 more rows
Check you have done this correctly by computing the mean and standard deviation for each column, again using a single operation.
summarize_at(metabric_z_scores, vars(ESR1:PGR), list(mean = mean, sd = sd))
## # A tibble: 1 x 6
## ESR1_mean ERBB2_mean PGR_mean ESR1_sd ERBB2_sd PGR_sd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00000218 0.0000470 0.00000362 1.00 1.00 1.000
To be doubly certain, you could also download the expression z-score values for these genes from cBioPortal.
Extension 2
Create a plot comparing the distribution of standardized expression values for TP53 against a normal distribution.
ggplot(data = metabric_z_scores) +
geom_density(mapping = aes(x = TP53)) +
stat_function(fun = dnorm, colour = "blue")
Hint: use stat_function()
4. Which Star Wars characters need to go on a diet?
Compute the body mass index (BMI) of characters in the starwars
tibble.
starwars <- starwars %>%
mutate(height = height / 100) %>%
mutate(BMI = mass / height ^ 2)
Filter for human characters that are overweight (BMI > 25) and display in decreasing order of BMI.
starwars %>%
filter(species == "Human") %>%
filter(BMI > 25) %>%
arrange(desc(BMI)) %>%
select(name, height, mass, BMI)
## # A tibble: 10 x 4
## name height mass BMI
## <chr> <dbl> <dbl> <dbl>
## 1 Owen Lars 1.78 120 37.9
## 2 Jek Tono Porkins 1.8 110 34.0
## 3 Darth Vader 2.02 136 33.3
## 4 Beru Whitesun lars 1.65 75 27.5
## 5 Wedge Antilles 1.7 77 26.6
## 6 Luke Skywalker 1.72 77 26.0
## 7 Palpatine 1.7 75 26.0
## 8 Lobot 1.75 79 25.8
## 9 Lando Calrissian 1.77 79 25.2
## 10 Biggs Darklighter 1.83 84 25.1