Assignment 5

This week’s assignment will involve summarizing values for groups of observations in the METABRIC data set and combining clinical and mutation data to perform exploratory analysis involving both.

The mutation data come from targeted sequencing of 2433 primary tumours from the METABRIC study using a panel of 173 of the most frequently mutated breast cancer genes.

Pereira et al., Nature Communications 7:11479, 2016

We’ll start by loading the two tables we’re going to need. The first is the clinical data that you should by now be quite familiar with. This table does not contain the additional mRNA expression columns we’ve used in previous assignments.

library(tidyverse)
metabric <- read_csv("metabric_clinical_data.csv")
# convert the Integrative_cluster variable into a categorical variable with
# levels in the correct order
metabric <- mutate(metabric, Integrative_cluster = factor(Integrative_cluster, levels = c("1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10")))

The second table contains the mutations that were detected in each patient tumour sample.

mutations <- read_csv("metabric_mutations.csv")

## Parsed with column specification:
## cols(
##   Patient_ID = col_character(),
##   Chromosome = col_character(),
##   Start_Position = col_double(),
##   End_Position = col_double(),
##   Strand = col_character(),
##   Reference_Allele = col_character(),
##   Tumor_Seq_Allele1 = col_character(),
##   Tumor_Seq_Allele2 = col_character(),
##   Variant_Type = col_character(),
##   Variant_Classification = col_character(),
##   Gene = col_character(),
##   Codon = col_double(),
##   HGVSc = col_character(),
##   HGVSp = col_character(),
##   HGVSp_Short = col_character()
## )

mutations

## # A tibble: 17,272 x 15
##    Patient_ID Chromosome Start_Position End_Position Strand
##    <chr>      <chr>               <dbl>        <dbl> <chr> 
##  1 MTS-T0058  17                7579344      7579345 +     
##  2 MTS-T0058  17                7579346      7579347 +     
##  3 MTS-T0058  6               168299111    168299111 +     
##  4 MTS-T0058  22               29999995     29999995 +     
##  5 MTS-T0059  2               198288682    198288682 +     
##  6 MTS-T0059  6                86195125     86195125 +     
##  7 MTS-T0059  7                55241717     55241717 +     
##  8 MTS-T0059  10                6556986      6556986 +     
##  9 MTS-T0059  11               62300529     62300529 +     
## 10 MTS-T0059  15               74912475     74912475 +     
## # … with 17,262 more rows, and 10 more variables: Reference_Allele <chr>,
## #   Tumor_Seq_Allele1 <chr>, Tumor_Seq_Allele2 <chr>, Variant_Type <chr>,
## #   Variant_Classification <chr>, Gene <chr>, Codon <dbl>, HGVSc <chr>,
## #   HGVSp <chr>, HGVSp_Short <chr>

The METABRIC cohort contains 2509 patients but not all patient tumour samples were sequenced and for some patient samples that were sequenced, no mutations were detected. For one of the exercises we need to know which patients were sequenced, so we’ll also read the case list of samples that were sequenced and add a column indicating this in the clinical data table.

cases_sequenced <- read_tsv("cases_nat_comm_2016.txt", col_names = "Patient_ID")

## Parsed with column specification:
## cols(
##   Patient_ID = col_character()
## )

metabric <- mutate(metabric, Sequenced = Patient_ID %in% cases_sequenced$Patient_ID)
count(metabric, Sequenced)

## # A tibble: 2 x 2
##   Sequenced     n
##   <lgl>     <int>
## 1 FALSE        76
## 2 TRUE       2433

The clinical data, gene expression values and mutation data were all downloaded from cBioPortal.

1. Compare the average survival time and other attributes between ER-negative and ER-positive patients

Compute the average survival time for the ER-negative and ER-positive groups. Note that such a comparison only makes sense for those patients that are deceased so apply the appropriate filter first. Add a column for the number of patients in each group.

metabric %>%
  filter(Survival_status == "DECEASED") %>%
  group_by(ER_status) %>%
  summarize(`Average survival time` = mean(Survival_time), N = n())

## # A tibble: 2 x 3
##   ER_status `Average survival time`     N
##   <chr>                       <dbl> <int>
## 1 Negative                     66.3   262
## 2 Positive                    111.    882

Compute the average tumour size, number of positive lymph nodes and Nottingham prognostic index within ER-negative and ER-positive patients.

metabric %>%
  group_by(ER_status) %>%
  summarize_at(vars(Tumour_size, Lymph_nodes_examined_positive, Nottingham_prognostic_index), mean, na.rm = TRUE)

## # A tibble: 3 x 4
##   ER_status Tumour_size Lymph_nodes_examined_posi… Nottingham_prognostic_i…
##   <chr>           <dbl>                      <dbl>                    <dbl>
## 1 Negative         28.3                       2.62                     4.60
## 2 Positive         25.4                       1.73                     3.86
## 3 <NA>             35                         1.70                     2.26

Compute the proportion of tumour samples that are ER positive within each of PAM50 groups.

metabric %>%
  group_by(PAM50) %>%
  summarize(Samples = n(), ER_positive_count = sum(ER_status == "Positive"), ER_positive_proportion = mean(ER_status == "Positive"))

## # A tibble: 8 x 4
##   PAM50       Samples ER_positive_count ER_positive_proportion
##   <chr>         <int>             <int>                  <dbl>
## 1 Basal           209                29                  0.139
## 2 claudin-low     218                80                  0.367
## 3 Her2            224                95                  0.424
## 4 LumA            700               696                  0.994
## 5 LumB            475               475                  1    
## 6 NC                6                 6                  1    
## 7 Normal          148               125                  0.845
## 8 <NA>            529                NA                 NA

2. Find the top 10 most frequently mutated genes

Count the numbers of mutations for each gene and display the top 10 most frequently mutated genes.

mutations %>%
  count(Gene) %>%
  arrange(desc(n)) %>%
  head(10)

## # A tibble: 10 x 2
##    Gene       n
##    <chr>  <int>
##  1 PIK3CA  1122
##  2 TP53     897
##  3 AHNAK2   859
##  4 MUC16    666
##  5 SYNE1    468
##  6 KMT2C    389
##  7 MAP3K1   372
##  8 AHNAK    327
##  9 GATA3    301
## 10 DNAH11   300

Some genes contain multiple mutations within a single tumour sample so we will be over-counting the numbers of samples containing a mutated gene. Modify your calculation to prevent this. The count should be the number of patient samples in which the gene is mutated.

mutations %>%
  distinct(Patient_ID, Gene) %>%
  count(Gene) %>%
  arrange(desc(n)) %>%
  head(10)

## # A tibble: 10 x 2
##    Gene       n
##    <chr>  <int>
##  1 PIK3CA   980
##  2 TP53     869
##  3 AHNAK2   530
##  4 MUC16    527
##  5 SYNE1    397
##  6 KMT2C    338
##  7 GATA3    288
##  8 DNAH11   273
##  9 AHNAK    272
## 10 MAP3K1   264

Hint: use the distinct() function

We’re usually only interested in non-silent mutations, i.e. those that may have some functional effect. One of the findings of the METABRIC study is that 7 genes harbour coding mutations in at least 10% of the samples. A total of 2433 patient samples were sequenced.

Calculate the percentages of samples containing non-silent mutations for each gene and filter those genes mutated in more than 10% of samples. Compare your gene list and percentages with those given in the METABRIC 2016 publication (second paragraph in the Results section).

number_of_cases_sequenced <- nrow(cases_sequenced)
mutations %>%
  filter(Variant_Classification != "Silent") %>%
  distinct(Patient_ID, Gene) %>%
  count(Gene) %>%
  mutate(percentage = 100.0 * n / number_of_cases_sequenced) %>%
  filter(percentage >= 10) %>%
  arrange(desc(n))

## # A tibble: 7 x 3
##   Gene       n percentage
##   <chr>  <int>      <dbl>
## 1 PIK3CA   975       40.1
## 2 TP53     864       35.5
## 3 MUC16    409       16.8
## 4 AHNAK2   395       16.2
## 5 SYNE1    294       12.1
## 6 KMT2C    280       11.5
## 7 GATA3    269       11.1

3. Find mutation hotspots in PIK3CA

PIK3CA and TP53 are the most frequently mutated genes in breast cancers. But do the mutations occur randomly within these genes or are they usually found in specific locations?

Find the most frequently mutated codons in PIK3CA.

mutations %>%
  filter(Gene == "PIK3CA") %>%
  count(Codon) %>%
  arrange(desc(n))

## # A tibble: 88 x 2
##    Codon     n
##    <dbl> <int>
##  1  1047   471
##  2   545   192
##  3   542   101
##  4   345    64
##  5   726    28
##  6   546    26
##  7   420    19
##  8   111    13
##  9  1043    12
## 10   108    10
## # … with 78 more rows

These frequently mutated loci are known as “hotspots”.

What type of mutations are found at the most prevalent hotspot in PIK3CA, codon 1047? What changes are occurring at the DNA and protein level?

mutations %>%
  filter(Gene == "PIK3CA", Codon == 1047) %>%
  count(Variant_Classification, HGVSc, HGVSp, HGVSp_Short)

## # A tibble: 4 x 5
##   Variant_Classificati… HGVSc                  HGVSp      HGVSp_Short     n
##   <chr>                 <chr>                  <chr>      <chr>       <int>
## 1 Missense_Mutation     ENST00000263967.3:c.3… p.His1047… p.H1047Y        4
## 2 Missense_Mutation     ENST00000263967.3:c.3… p.His1047… p.H1047R      424
## 3 Missense_Mutation     ENST00000263967.3:c.3… p.His1047… p.H1047L       42
## 4 Missense_Mutation     ENST00000263967.3:c.3… p.His1047… p.H1047Q        1

Hint: take a look at what columns are available in the mutations table and count the values

4. Compare the prevalance of TP53 mutations across the Integrative Cluster breast cancer subtypes

Figure 5a from the METABRIC 2016 paper compares the percentage of samples that have non-silent mutations for various genes within each of the Integrative Clusters. In this exercise, we’ll reproduce a version of one of these bar charts for the TP53 gene.

First, we’ll need to distinguish between those patients that have a non-silent TP53 mutation and those that don’t. Create a data frame containing the number of non-silent TP53 mutations in each patient sample. It should have two columns, Patient_ID and TP53_mutation_count.

tp53_mutation_counts <- mutations %>%
  filter(Gene == "TP53", Variant_Classification != "Silent") %>%
  count(Patient_ID, name = "TP53_mutation_count")

Now join this table to the metabric data frame so that the latter has the added TP53_mutation_count column.

metabric <- left_join(metabric, tp53_mutation_counts, by = "Patient_ID")

Create a new column called Has_TP53_mutation containing just TRUE or FALSE values, i.e. a logical variable.

metabric <- metabric %>%
  mutate(Has_TP53_mutation = !is.na(TP53_mutation_count))

Hint: use is.na() function

Compute the proportion of patient samples within each Integrative Cluster with a TP53 mutation. Exclude those samples that have not been sequenced and those that have not been classified into one of the Integrative Cluster subtypes.

tp53_percentages <- metabric %>%
  filter(Sequenced) %>%
  filter(!is.na(Integrative_cluster)) %>%
  group_by(Integrative_cluster) %>%
  summarize(Percentage_TP53_mutation = 100 * mean(Has_TP53_mutation))
tp53_percentages

## # A tibble: 11 x 2
##    Integrative_cluster Percentage_TP53_mutation
##    <fct>                                  <dbl>
##  1 1                                      28.8 
##  2 2                                      26.4 
##  3 3                                      11.3 
##  4 4ER-                                   60.8 
##  5 4ER+                                   22.5 
##  6 5                                      71.2 
##  7 6                                      42.9 
##  8 7                                      15.9 
##  9 8                                       4.84
## 10 9                                      47.9 
## 11 10                                     87.7

Finally, plot these percentages as a bar chart using geom_col(). Look at the help page for geom_bar and geom_col to see why we’d want to use the latter in this case.

ggplot(data = tp53_percentages) +
  geom_col(mapping = aes(x = Integrative_cluster, y = Percentage_TP53_mutation), show.legend = FALSE)

Customize your plot in the following ways:

add a title
change the axis labels so they don’t contain underscores
set the limits on the y axis so that the entire 0 - 100% range is displayed
set breaks on the y axis to be at 20% intervals
reduce the amount of space between the bottom of the bars and the tick marks for each integrative cluster
apply the Viridis colour scheme for the bars (scale_colour_viridis_d)
choose the black and white theme by appending + theme_bw() to the plot
remove the vertical grid lines by appending + theme(panel.grid.major.x = element_blank())

The final two touches make stylistic changes to the theme using customizations that we’ll cover next week.

ggplot(data = tp53_percentages) +
  geom_col(mapping = aes(x = Integrative_cluster, y = Percentage_TP53_mutation, fill = Integrative_cluster), show.legend = FALSE) +
  labs(
    title = "Prevalance of TP53 mutations across Integrative Clusters",
    x = "Integrative Cluster",
    y = "% samples"
  ) +
  scale_y_continuous(breaks = c(0, 20, 40, 60, 80), expand = expand_scale(mult = 0), limit = c(0, 100)) +
  scale_fill_viridis_d() +
  theme_bw() +
  theme(panel.grid.major.x = element_blank())

Compare the plot with the equivalent plot in figure 5a from the METABRIC mutation paper. Do the bars look in roughly the correct proportions? Note that only 9 of the integrative clusters are shown in figure 5a.

5. Compute the tumour suppressor gene (TSG) score for each gene

The tumour suppressor gene (TSG) score is the proportion of inactivating mutations for each gene out of the total number of mutations for that gene.

Inactivating mutations are nonsense SNVs, frameshift substitutions and variants that affected splice sites.

Compute the TSG score for each gene and find the top 10 tumour suppressor genes, i.e. those with the highest TSG scores.

inactivating_mutation_types <- c("Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Region", "Splice_Site")

tsg_scores <- mutations %>%
  mutate(Inactivating = Variant_Classification %in% inactivating_mutation_types) %>%
  group_by(Gene) %>%
  summarize(TSG_score = mean(Inactivating))

tsg_scores %>%
  arrange(desc(TSG_score)) %>%
  head(10)

## # A tibble: 10 x 2
##    Gene    TSG_score
##    <chr>       <dbl>
##  1 CDH1        0.815
##  2 GATA3       0.734
##  3 CBFB        0.655
##  4 MAP3K1      0.653
##  5 GPS2        0.649
##  6 ZFP36L1     0.606
##  7 PTEN        0.6  
##  8 RUNX1       0.571
##  9 TBX3        0.553
## 10 MAP2K4      0.547

Hint: use %in% to determine if a mutation is an inactivating mutation

Extension

Download Supplementary Data 3 and create a scatter plot comparing the TSG scores in the allSamples_TSGScore column with those you’ve calculated.

library(readxl)
supplementary_table_3 <- read_excel("ncomms11479-s3.xlsx", skip = 1)
supplementary_table_3 %>%
  select(HGNC_symbol, allSamples_TSGScore) %>%
  left_join(tsg_scores, by = c("HGNC_symbol" = "Gene")) %>%
  ggplot() +
  geom_point(mapping = aes(x = allSamples_TSGScore, TSG_score))