This week’s assignment will involve summarizing values for groups of observations in the METABRIC data set and combining clinical and mutation data to perform exploratory analysis involving both.
The mutation data come from targeted sequencing of 2433 primary tumours from the METABRIC study using a panel of 173 of the most frequently mutated breast cancer genes.
Pereira et al., Nature Communications 7:11479, 2016
We’ll start by loading the two tables we’re going to need. The first is the clinical data that you should by now be quite familiar with. This table does not contain the additional mRNA expression columns we’ve used in previous assignments.
library(tidyverse)
metabric <- read_csv("metabric_clinical_data.csv")
# convert the Integrative_cluster variable into a categorical variable with
# levels in the correct order
metabric <- mutate(metabric, Integrative_cluster = factor(Integrative_cluster, levels = c("1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10")))
The second table contains the mutations that were detected in each patient tumour sample.
mutations <- read_csv("metabric_mutations.csv")
## Parsed with column specification:
## cols(
## Patient_ID = col_character(),
## Chromosome = col_character(),
## Start_Position = col_double(),
## End_Position = col_double(),
## Strand = col_character(),
## Reference_Allele = col_character(),
## Tumor_Seq_Allele1 = col_character(),
## Tumor_Seq_Allele2 = col_character(),
## Variant_Type = col_character(),
## Variant_Classification = col_character(),
## Gene = col_character(),
## Codon = col_double(),
## HGVSc = col_character(),
## HGVSp = col_character(),
## HGVSp_Short = col_character()
## )
mutations
## # A tibble: 17,272 x 15
## Patient_ID Chromosome Start_Position End_Position Strand
## <chr> <chr> <dbl> <dbl> <chr>
## 1 MTS-T0058 17 7579344 7579345 +
## 2 MTS-T0058 17 7579346 7579347 +
## 3 MTS-T0058 6 168299111 168299111 +
## 4 MTS-T0058 22 29999995 29999995 +
## 5 MTS-T0059 2 198288682 198288682 +
## 6 MTS-T0059 6 86195125 86195125 +
## 7 MTS-T0059 7 55241717 55241717 +
## 8 MTS-T0059 10 6556986 6556986 +
## 9 MTS-T0059 11 62300529 62300529 +
## 10 MTS-T0059 15 74912475 74912475 +
## # … with 17,262 more rows, and 10 more variables: Reference_Allele <chr>,
## # Tumor_Seq_Allele1 <chr>, Tumor_Seq_Allele2 <chr>, Variant_Type <chr>,
## # Variant_Classification <chr>, Gene <chr>, Codon <dbl>, HGVSc <chr>,
## # HGVSp <chr>, HGVSp_Short <chr>
The METABRIC cohort contains 2509 patients but not all patient tumour samples were sequenced and for some patient samples that were sequenced, no mutations were detected. For one of the exercises we need to know which patients were sequenced, so we’ll also read the case list of samples that were sequenced and add a column indicating this in the clinical data table.
cases_sequenced <- read_tsv("cases_nat_comm_2016.txt", col_names = "Patient_ID")
## Parsed with column specification:
## cols(
## Patient_ID = col_character()
## )
metabric <- mutate(metabric, Sequenced = Patient_ID %in% cases_sequenced$Patient_ID)
count(metabric, Sequenced)
## # A tibble: 2 x 2
## Sequenced n
## <lgl> <int>
## 1 FALSE 76
## 2 TRUE 2433
The clinical data, gene expression values and mutation data were all downloaded from cBioPortal.
1. Compare the average survival time and other attributes between ER-negative and ER-positive patients
Compute the average survival time for the ER-negative and ER-positive groups. Note that such a comparison only makes sense for those patients that are deceased so apply the appropriate filter first. Add a column for the number of patients in each group.
metabric %>%
filter(Survival_status == "DECEASED") %>%
group_by(ER_status) %>%
summarize(`Average survival time` = mean(Survival_time), N = n())
## # A tibble: 2 x 3
## ER_status `Average survival time` N
## <chr> <dbl> <int>
## 1 Negative 66.3 262
## 2 Positive 111. 882
Compute the average tumour size, number of positive lymph nodes and Nottingham prognostic index within ER-negative and ER-positive patients.
metabric %>%
group_by(ER_status) %>%
summarize_at(vars(Tumour_size, Lymph_nodes_examined_positive, Nottingham_prognostic_index), mean, na.rm = TRUE)
## # A tibble: 3 x 4
## ER_status Tumour_size Lymph_nodes_examined_posi… Nottingham_prognostic_i…
## <chr> <dbl> <dbl> <dbl>
## 1 Negative 28.3 2.62 4.60
## 2 Positive 25.4 1.73 3.86
## 3 <NA> 35 1.70 2.26
Compute the proportion of tumour samples that are ER positive within each of PAM50 groups.
metabric %>%
group_by(PAM50) %>%
summarize(Samples = n(), ER_positive_count = sum(ER_status == "Positive"), ER_positive_proportion = mean(ER_status == "Positive"))
## # A tibble: 8 x 4
## PAM50 Samples ER_positive_count ER_positive_proportion
## <chr> <int> <int> <dbl>
## 1 Basal 209 29 0.139
## 2 claudin-low 218 80 0.367
## 3 Her2 224 95 0.424
## 4 LumA 700 696 0.994
## 5 LumB 475 475 1
## 6 NC 6 6 1
## 7 Normal 148 125 0.845
## 8 <NA> 529 NA NA
2. Find the top 10 most frequently mutated genes
Count the numbers of mutations for each gene and display the top 10 most frequently mutated genes.
mutations %>%
count(Gene) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 x 2
## Gene n
## <chr> <int>
## 1 PIK3CA 1122
## 2 TP53 897
## 3 AHNAK2 859
## 4 MUC16 666
## 5 SYNE1 468
## 6 KMT2C 389
## 7 MAP3K1 372
## 8 AHNAK 327
## 9 GATA3 301
## 10 DNAH11 300
Some genes contain multiple mutations within a single tumour sample so we will be over-counting the numbers of samples containing a mutated gene. Modify your calculation to prevent this. The count should be the number of patient samples in which the gene is mutated.
mutations %>%
distinct(Patient_ID, Gene) %>%
count(Gene) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 x 2
## Gene n
## <chr> <int>
## 1 PIK3CA 980
## 2 TP53 869
## 3 AHNAK2 530
## 4 MUC16 527
## 5 SYNE1 397
## 6 KMT2C 338
## 7 GATA3 288
## 8 DNAH11 273
## 9 AHNAK 272
## 10 MAP3K1 264
Hint: use the distinct()
function
We’re usually only interested in non-silent mutations, i.e. those that may have some functional effect. One of the findings of the METABRIC study is that 7 genes harbour coding mutations in at least 10% of the samples. A total of 2433 patient samples were sequenced.
Calculate the percentages of samples containing non-silent mutations for each gene and filter those genes mutated in more than 10% of samples. Compare your gene list and percentages with those given in the METABRIC 2016 publication (second paragraph in the Results section).
number_of_cases_sequenced <- nrow(cases_sequenced)
mutations %>%
filter(Variant_Classification != "Silent") %>%
distinct(Patient_ID, Gene) %>%
count(Gene) %>%
mutate(percentage = 100.0 * n / number_of_cases_sequenced) %>%
filter(percentage >= 10) %>%
arrange(desc(n))
## # A tibble: 7 x 3
## Gene n percentage
## <chr> <int> <dbl>
## 1 PIK3CA 975 40.1
## 2 TP53 864 35.5
## 3 MUC16 409 16.8
## 4 AHNAK2 395 16.2
## 5 SYNE1 294 12.1
## 6 KMT2C 280 11.5
## 7 GATA3 269 11.1
3. Find mutation hotspots in PIK3CA
PIK3CA and TP53 are the most frequently mutated genes in breast cancers. But do the mutations occur randomly within these genes or are they usually found in specific locations?
Find the most frequently mutated codons in PIK3CA.
mutations %>%
filter(Gene == "PIK3CA") %>%
count(Codon) %>%
arrange(desc(n))
## # A tibble: 88 x 2
## Codon n
## <dbl> <int>
## 1 1047 471
## 2 545 192
## 3 542 101
## 4 345 64
## 5 726 28
## 6 546 26
## 7 420 19
## 8 111 13
## 9 1043 12
## 10 108 10
## # … with 78 more rows
These frequently mutated loci are known as “hotspots”.
What type of mutations are found at the most prevalent hotspot in PIK3CA, codon 1047? What changes are occurring at the DNA and protein level?
mutations %>%
filter(Gene == "PIK3CA", Codon == 1047) %>%
count(Variant_Classification, HGVSc, HGVSp, HGVSp_Short)
## # A tibble: 4 x 5
## Variant_Classificati… HGVSc HGVSp HGVSp_Short n
## <chr> <chr> <chr> <chr> <int>
## 1 Missense_Mutation ENST00000263967.3:c.3… p.His1047… p.H1047Y 4
## 2 Missense_Mutation ENST00000263967.3:c.3… p.His1047… p.H1047R 424
## 3 Missense_Mutation ENST00000263967.3:c.3… p.His1047… p.H1047L 42
## 4 Missense_Mutation ENST00000263967.3:c.3… p.His1047… p.H1047Q 1
Hint: take a look at what columns are available in the mutations
table and count the values
4. Compare the prevalance of TP53 mutations across the Integrative Cluster breast cancer subtypes
Figure 5a from the METABRIC 2016 paper compares the percentage of samples that have non-silent mutations for various genes within each of the Integrative Clusters. In this exercise, we’ll reproduce a version of one of these bar charts for the TP53 gene.
First, we’ll need to distinguish between those patients that have a non-silent TP53 mutation and those that don’t. Create a data frame containing the number of non-silent TP53 mutations in each patient sample. It should have two columns, Patient_ID
and TP53_mutation_count
.
tp53_mutation_counts <- mutations %>%
filter(Gene == "TP53", Variant_Classification != "Silent") %>%
count(Patient_ID, name = "TP53_mutation_count")
Now join this table to the metabric
data frame so that the latter has the added TP53_mutation_count
column.
metabric <- left_join(metabric, tp53_mutation_counts, by = "Patient_ID")
Create a new column called Has_TP53_mutation
containing just TRUE
or FALSE
values, i.e. a logical variable.
metabric <- metabric %>%
mutate(Has_TP53_mutation = !is.na(TP53_mutation_count))
Hint: use is.na()
function
Compute the proportion of patient samples within each Integrative Cluster with a TP53 mutation. Exclude those samples that have not been sequenced and those that have not been classified into one of the Integrative Cluster subtypes.
tp53_percentages <- metabric %>%
filter(Sequenced) %>%
filter(!is.na(Integrative_cluster)) %>%
group_by(Integrative_cluster) %>%
summarize(Percentage_TP53_mutation = 100 * mean(Has_TP53_mutation))
tp53_percentages
## # A tibble: 11 x 2
## Integrative_cluster Percentage_TP53_mutation
## <fct> <dbl>
## 1 1 28.8
## 2 2 26.4
## 3 3 11.3
## 4 4ER- 60.8
## 5 4ER+ 22.5
## 6 5 71.2
## 7 6 42.9
## 8 7 15.9
## 9 8 4.84
## 10 9 47.9
## 11 10 87.7
Finally, plot these percentages as a bar chart using geom_col()
. Look at the help page for geom_bar
and geom_col
to see why we’d want to use the latter in this case.
ggplot(data = tp53_percentages) +
geom_col(mapping = aes(x = Integrative_cluster, y = Percentage_TP53_mutation), show.legend = FALSE)
Customize your plot in the following ways:
scale_colour_viridis_d
)+ theme_bw()
to the plot+ theme(panel.grid.major.x = element_blank())
The final two touches make stylistic changes to the theme using customizations that we’ll cover next week.
ggplot(data = tp53_percentages) +
geom_col(mapping = aes(x = Integrative_cluster, y = Percentage_TP53_mutation, fill = Integrative_cluster), show.legend = FALSE) +
labs(
title = "Prevalance of TP53 mutations across Integrative Clusters",
x = "Integrative Cluster",
y = "% samples"
) +
scale_y_continuous(breaks = c(0, 20, 40, 60, 80), expand = expand_scale(mult = 0), limit = c(0, 100)) +
scale_fill_viridis_d() +
theme_bw() +
theme(panel.grid.major.x = element_blank())
Compare the plot with the equivalent plot in figure 5a from the METABRIC mutation paper. Do the bars look in roughly the correct proportions? Note that only 9 of the integrative clusters are shown in figure 5a.
5. Compute the tumour suppressor gene (TSG) score for each gene
The tumour suppressor gene (TSG) score is the proportion of inactivating mutations for each gene out of the total number of mutations for that gene.
Inactivating mutations are nonsense SNVs, frameshift substitutions and variants that affected splice sites.
Compute the TSG score for each gene and find the top 10 tumour suppressor genes, i.e. those with the highest TSG scores.
inactivating_mutation_types <- c("Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Region", "Splice_Site")
tsg_scores <- mutations %>%
mutate(Inactivating = Variant_Classification %in% inactivating_mutation_types) %>%
group_by(Gene) %>%
summarize(TSG_score = mean(Inactivating))
tsg_scores %>%
arrange(desc(TSG_score)) %>%
head(10)
## # A tibble: 10 x 2
## Gene TSG_score
## <chr> <dbl>
## 1 CDH1 0.815
## 2 GATA3 0.734
## 3 CBFB 0.655
## 4 MAP3K1 0.653
## 5 GPS2 0.649
## 6 ZFP36L1 0.606
## 7 PTEN 0.6
## 8 RUNX1 0.571
## 9 TBX3 0.553
## 10 MAP2K4 0.547
Hint: use %in%
to determine if a mutation is an inactivating mutation
Extension
Download Supplementary Data 3 and create a scatter plot comparing the TSG scores in the allSamples_TSGScore
column with those you’ve calculated.
library(readxl)
supplementary_table_3 <- read_excel("ncomms11479-s3.xlsx", skip = 1)
supplementary_table_3 %>%
select(HGNC_symbol, allSamples_TSGScore) %>%
left_join(tsg_scores, by = c("HGNC_symbol" = "Gene")) %>%
ggplot() +
geom_point(mapping = aes(x = allSamples_TSGScore, TSG_score))