This week’s involves transforming two data sets in order to ask a specific question of each. The first is a more complete set of data on the number of new tuberculosis cases each year from 1980 to 2013 in 219 countries, taken from the World Health Organization Global Tuberculosis Report.

For the second “real world” example we return to the METABRIC data set and introduce the other set of assay data obtained as part of the original 2012 study that concerns genomic copy number. We’ll transform the copy number data downloaded from cBioPortal so that we can look at the copy number states for some of our favourite genes and then combine that with the expression data to see how copy number influences gene expression.

The clinical data, gene expression values and mutation data were all downloaded from cBioPortal.

library(tidyverse)

1. Tidy the WHO tuberculosis data set

In this week’s materials we demonstrated the restructuring of different representations of a very small subset of the WHO tuberculosis data set (tables 1 to 4a and b). The more complete data set is also available in its original format in another data frame available as part of the tidyr package, named who.

who
## # A tibble: 7,240 x 60
##    country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>   <chr> <chr> <int>       <int>        <int>        <int>
##  1 Afghan… AF    AFG    1980          NA           NA           NA
##  2 Afghan… AF    AFG    1981          NA           NA           NA
##  3 Afghan… AF    AFG    1982          NA           NA           NA
##  4 Afghan… AF    AFG    1983          NA           NA           NA
##  5 Afghan… AF    AFG    1984          NA           NA           NA
##  6 Afghan… AF    AFG    1985          NA           NA           NA
##  7 Afghan… AF    AFG    1986          NA           NA           NA
##  8 Afghan… AF    AFG    1987          NA           NA           NA
##  9 Afghan… AF    AFG    1988          NA           NA           NA
## 10 Afghan… AF    AFG    1989          NA           NA           NA
## # … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## #   newrel_f65 <int>

What are the variables in this data set?

Create a bulleted list with your answer here.

Hint: see the help page for who to understand the meaning of the column headings.

Hint: the Markdown Quick Reference available from the Help menu in RStudio is a very useful guide to formatting text in R markdown documents.

Why isn’t this data set tidy and what kinds of analysis are not straightforward with the data in this format?

Transform the data into a tidy format.

Columns new_sp_m014 to newrel_f65 are values for a compound variable, i.e. multiple variables joined together - the restructured data frame should have columns for each of the separated values in these column headings.

Also remove all rows with missing values for the number of tuberculosis cases.

# the hints below lead to the following solution
who_tidy <- who %>%
  pivot_longer(new_sp_m014:newrel_f65, names_to = "group", values_to = "cases") %>%
  mutate(group = str_remove(group, "new_")) %>%
  mutate(group = str_remove(group, "new")) %>%
  separate(group, into = c("diagnosis", "group"), sep = "_") %>%
  separate(group, into = c("gender", "age_group"), sep = 1) %>%
  filter(!is.na(cases))

# this is how it could be done removing missing values as part of the pivot and
# using a more sophisticated regular expression to remove the new_? prefix in a
# single step
who_tidy <- who %>%
  pivot_longer(starts_with("new"), names_to = "group", values_to = "cases", values_drop_na = TRUE) %>%
  mutate(group = str_remove(group, "^new_?")) %>%
  separate(group, into = c("diagnosis", "group"), sep = "_") %>%
  separate(group, into = c("gender", "age_group"), sep = 1)

# alternative using str_replace_all to get groups into suitable state for single
# separate operation
who_tidy <- who %>%
  pivot_longer(starts_with("new"), names_to = "group", values_to = "cases", values_drop_na = TRUE) %>%
  mutate(group = str_replace_all(group, c("newrel" = "rel", "new_" = "", "_m" = "_m_", "_f" = "_f_"))) %>%
  separate(group, into = c("diagnosis", "gender", "age_group"), sep = "_")

# alternative using names_pattern in pivot_longer (need to know regular
# expressions for this) avoiding need for separate operation
who_tidy <- who %>%
  pivot_longer(
    starts_with("new"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_to = c("diagnosis", "gender", "age_group"),
    values_to = "cases",
    values_drop_na = TRUE
  )

# convert the age group into a factor and change the levels to a more
# human-readable form
# (could also use factor or recode_factor for this)
who_tidy <- mutate(who_tidy, age_group = fct_recode(age_group, "0-14" = "014", "15-24" = "1524", "25-34" = "2534", "35-44" = "3544", "45-54" = "4554", "55-64" = "5564", "65+" = "65"))

who_tidy
## # A tibble: 76,046 x 8
##    country     iso2  iso3   year diagnosis gender age_group cases
##    <chr>       <chr> <chr> <int> <chr>     <chr>  <fct>     <int>
##  1 Afghanistan AF    AFG    1997 sp        m      0-14          0
##  2 Afghanistan AF    AFG    1997 sp        m      15-24        10
##  3 Afghanistan AF    AFG    1997 sp        m      25-34         6
##  4 Afghanistan AF    AFG    1997 sp        m      35-44         3
##  5 Afghanistan AF    AFG    1997 sp        m      45-54         5
##  6 Afghanistan AF    AFG    1997 sp        m      55-64         2
##  7 Afghanistan AF    AFG    1997 sp        m      65+           0
##  8 Afghanistan AF    AFG    1997 sp        f      0-14          5
##  9 Afghanistan AF    AFG    1997 sp        f      15-24        38
## 10 Afghanistan AF    AFG    1997 sp        f      25-34        36
## # … with 76,036 more rows

Hint 1: the column headers from new_sp_m014 onwards contain a superfluous prefix “new” or “new_” that you will want to remove; this can be done using one of the stringr functions in a single step if you know about regular expressions but is also achievable in two steps if you don’t use any regular expression magic.

Hint 2: separating two of the variables is a bit tricky because there is no separator character; try looking at the help for the separate function to see how this can be done.


2. Recreate table1 from your tidy version of the WHO tuberculosis data set

Recreate the first 3 columns of table1, containing a small subset of the WHO tuberculosis data set, from your tidy version of who.

select(table1, 1:3)
## # A tibble: 6 x 3
##   country      year  cases
##   <chr>       <int>  <int>
## 1 Afghanistan  1999    745
## 2 Afghanistan  2000   2666
## 3 Brazil       1999  37737
## 4 Brazil       2000  80488
## 5 China        1999 212258
## 6 China        2000 213766
who_tidy %>%
  filter(country %in% c("Afghanistan", "Brazil", "China")) %>%
  filter(year %in% c(1999, 2000)) %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases))
## # A tibble: 6 x 3
## # Groups:   country [3]
##   country      year  cases
##   <chr>       <int>  <int>
## 1 Afghanistan  1999    745
## 2 Afghanistan  2000   2666
## 3 Brazil       1999  37737
## 4 Brazil       2000  80488
## 5 China        1999 212258
## 6 China        2000 213766

Hint: this will involve some filtering, grouping and summarization.


3. Create time series plots for cases of tuberculosis

Create a time series plot showing the number of tuberculosis cases in the United Kingdom with separate lines for each age group. The year should be on the x axis and the number of cases on the y axis.

who_tidy <- who_tidy %>%
  group_by(country, year, age_group) %>%
  summarise(cases = sum(cases))

who_tidy %>%
  filter(country == "United Kingdom of Great Britain and Northern Ireland") %>%
  ggplot() +
  geom_line(mapping = aes(x = year, y = cases, colour = age_group))

Hint: we’re not distinguishing between diagnoses and genders in this plot so you’ll need to sum the numbers of cases for each country, year and age group.

Use faceting in ggplot2 to create a series of plots showing the same information for France, Brazil, Japan and Uganda.

The numbers of cases in these countries are on quite different scales. Look at help page for facet_wrap() and try changing the scales argument so we can more easily compare the overall patterns between countries.

who_tidy %>%
  filter(country %in% c("France", "Brazil", "Japan", "Uganda")) %>%
  ggplot() +
  geom_line(mapping = aes(x = year, y = cases, colour = age_group)) +
  facet_wrap(vars(country), scales = "free_y")


4. Relate METABRIC copy number and expression data

For this exercise we’ll read in the METABRIC mRNA expression data and copy number states into separate data frames. The aim is to see how the gene expression is affected by the copy number state of various genes through a series of box plot.

expression <- read_tsv("metabric_mrna_expression.txt")

The expression data is in a matrix-like format with a row for each sample and a column for each gene. This is the format used when downloading data from cBioPortal after selecting a set of genes.

expression
## # A tibble: 2,509 x 10
##    STUDY_ID      SAMPLE_ID  ESR1 ERBB2   PGR  TP53 PIK3CA GATA3 FOXA1  MLPH
##    <chr>         <chr>     <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 brca_metabric MB-0000    8.93  9.33  5.68  6.34   5.70  6.93  7.95  9.73
##  2 brca_metabric MB-0002   10.0   9.73  7.51  6.19   5.76 11.3  11.8  12.5 
##  3 brca_metabric MB-0005   10.0   9.73  7.38  6.40   6.75  9.29 11.7  10.3 
##  4 brca_metabric MB-0006   10.4  10.3   6.82  6.87   7.22  8.67 11.9  10.5 
##  5 brca_metabric MB-0008   11.3   9.96  7.33  6.34   5.82  9.72 11.6  12.2 
##  6 brca_metabric MB-0010   11.2   9.74  5.95  5.42   6.12  9.79 12.1  11.4 
##  7 brca_metabric MB-0014   10.8   9.28  7.72  5.99   7.48  8.37 11.5  10.8 
##  8 brca_metabric MB-0020   NA    NA    NA    NA     NA    NA    NA    NA   
##  9 brca_metabric MB-0022   10.4   8.61  5.59  6.17   7.59  7.87 10.7   9.95
## 10 brca_metabric MB-0025   NA    NA    NA    NA     NA    NA    NA    NA   
## # … with 2,499 more rows

Convert this data frame into a tidy format with a row-per-sample-per-gene. Remove the rather uninteresting STUDY_ID column and filter out missing expression values (for those samples that weren’t assayed).

expression <- expression %>%
  pivot_longer(ESR1:MLPH, names_to = "Gene", values_to = "Expression", values_drop_na = TRUE) %>%
  select(-STUDY_ID)
expression
## # A tibble: 15,232 x 3
##    SAMPLE_ID Gene   Expression
##    <chr>     <chr>       <dbl>
##  1 MB-0000   ESR1         8.93
##  2 MB-0000   ERBB2        9.33
##  3 MB-0000   PGR          5.68
##  4 MB-0000   TP53         6.34
##  5 MB-0000   PIK3CA       5.70
##  6 MB-0000   GATA3        6.93
##  7 MB-0000   FOXA1        7.95
##  8 MB-0000   MLPH         9.73
##  9 MB-0002   ESR1        10.0 
## 10 MB-0002   ERBB2        9.73
## # … with 15,222 more rows

The copy number data are also in a matrix-like format but this time we have a row for each gene and a column for each sample. It is a very wide table containing 2175 columns (only the first 10 displayed below). This is the format of the copy number table you obtain if you download the complete METABRIC dataset from cBioPortal. The table read in is a subset containing rows for the 8 genes we’ve been using up till now.

copy_number_states <- read_tsv("metabric_cna.txt")
select(copy_number_states, 1:10)
## # A tibble: 8 x 10
##   Hugo_Symbol Entrez_Gene_Id `MB-0000` `MB-0039` `MB-0045` `MB-0046`
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 ERBB2                 2064         0         0         0         2
## 2 ESR1                  2099         0         0        -1         0
## 3 FOXA1                 3169         0         0         0         0
## 4 GATA3                 2625         0         0         0         0
## 5 MLPH                 79083         0         0         0         0
## 6 PGR                   5241         0        -1         0         0
## 7 PIK3CA                5290         0         0         2         0
## 8 TP53                  7157         0         0        -1        -1
## # … with 4 more variables: `MB-0048` <dbl>, `MB-0050` <dbl>,
## #   `MB-0053` <dbl>, `MB-0062` <dbl>

Convert this data frame into a tidy format with a row-per-sample-per-gene where the values are the copy number states:

Remove the Entrez_Gene_id column and convert the copy number state variable into a factor using the more human-readable names for each state.

# using a range of column names
# copy_number_states <- copy_number_states %>%
#   pivot_longer(`MB-0000`:`MB-7188`, names_to = "Sample", values_to = "Copy_number_state") %>%
#   select(-Entrez_Gene_Id) %>%
#   mutate(Copy_number_state = as_factor(Copy_number_state))

# using numeric indexes for columns instead
# copy_number_states <- copy_number_states %>%
#   pivot_longer(3:2175, names_to = "Sample", values_to = "Copy_number_state") %>%
#   select(-Entrez_Gene_Id) %>%
#   mutate(Copy_number_state = as_factor(Copy_number_state))

# can instead select those columns not to use
# we've also relabeled the copy number state levels (could instead use
# fct_recode or recode_factor)
copy_number_states <- copy_number_states %>%
  select(-Entrez_Gene_Id) %>%
  pivot_longer(-Hugo_Symbol, names_to = "Sample", values_to = "Copy_number_state") %>%
  mutate(Copy_number_state = factor(Copy_number_state, levels = c("-2", "-1", "0", "1", "2"), labels = c("deletion", "loss", "neutral", "gain", "amplification")))

copy_number_states
## # A tibble: 17,384 x 3
##    Hugo_Symbol Sample  Copy_number_state
##    <chr>       <chr>   <fct>            
##  1 ERBB2       MB-0000 neutral          
##  2 ERBB2       MB-0039 neutral          
##  3 ERBB2       MB-0045 neutral          
##  4 ERBB2       MB-0046 amplification    
##  5 ERBB2       MB-0048 amplification    
##  6 ERBB2       MB-0050 loss             
##  7 ERBB2       MB-0053 neutral          
##  8 ERBB2       MB-0062 loss             
##  9 ERBB2       MB-0064 neutral          
## 10 ERBB2       MB-0066 neutral          
## # … with 17,374 more rows

Hint: look at the help page for factor() to see how to set the labels for each copy number state or use fct_recode() from the forcats package.

Count the total number of occurrences of each copy number state.

count(copy_number_states, Copy_number_state)
## # A tibble: 5 x 2
##   Copy_number_state     n
##   <fct>             <int>
## 1 deletion             10
## 2 loss               3469
## 3 neutral           12077
## 4 gain               1156
## 5 amplification       672

Create a combined data set containing the expression value and copy number state for each patient and gene pairing. The data frame should contain the columns Sample, Gene, Expression and Copy_number_state.

combined <- expression %>%
  rename(Sample = SAMPLE_ID) %>%
  inner_join(copy_number_states, by = c("Sample", "Gene" = "Hugo_Symbol"))
combined
## # A tibble: 15,232 x 4
##    Sample  Gene   Expression Copy_number_state
##    <chr>   <chr>       <dbl> <fct>            
##  1 MB-0000 ESR1         8.93 neutral          
##  2 MB-0000 ERBB2        9.33 neutral          
##  3 MB-0000 PGR          5.68 neutral          
##  4 MB-0000 TP53         6.34 neutral          
##  5 MB-0000 PIK3CA       5.70 neutral          
##  6 MB-0000 GATA3        6.93 neutral          
##  7 MB-0000 FOXA1        7.95 neutral          
##  8 MB-0000 MLPH         9.73 neutral          
##  9 MB-0002 ESR1        10.0  loss             
## 10 MB-0002 ERBB2        9.73 loss             
## # … with 15,222 more rows

Create a series of box plots for each of the genes with a box-and-whiskers showing the range of expression values for each copy number state.

ggplot(data = combined, mapping = aes(x = Copy_number_state, y = Expression)) +
  geom_boxplot() +
  facet_wrap(vars(Gene), scales = "free_y")

Customize this plot by changing the labels, scales, colours and theme as you like – be creative!

Save the plot as a PDF using ggsave() and attach with the assignment when handing in.

There will be chocolate-based prizes awarded to any assignment handed in with a plot that is prettier than mine (bear in mind that I’m also the judge and jury – Matt :-)

ggplot(data = combined, mapping = aes(x = Copy_number_state, y = Expression, fill = Copy_number_state)) +
  geom_boxplot(outlier.size = 0.2) +
  facet_wrap(vars(Gene), ncol = 4, scales = "free_y") +
  labs(x = NULL) +
  scale_x_discrete(labels = c("deletion", "loss", "neutral", "gain", "amplification")) +
  scale_fill_brewer(palette = "RdPu") +
  labs(title = "Relationship between expression and copy number state", fill = NULL) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    plot.title = element_text(size = 10),
    axis.title.y = element_text(size = 8),
    axis.text.y = element_text(size = 7),
    axis.text.x = element_blank(),
    strip.text = element_text(size = 8, face = "italic"),
    legend.text = element_text(size = 7),
    legend.position = "bottom"
  )

ggsave("expression_vs_copy_number_state.pdf", width = 20, height = 12, units = "cm")

5. Extension: Mutual exclusivity of ERBB2 amplifications and AKT1 mutations

Seriously, stop here if you haven’t got the time or inclination to attempt this final, bonus extension exercise (but please still hand in your assignment).

In the METABRIC 2016 paper, Pereira et al. (Nature Communications 7:11479, 2016) considered the association between somatic mutations, exploring patterns of co-mutation and mutual exclusivity. One of their findings was the mutual exclusivity between AKT1 mutations and amplification of ERBB2. For the final exercise we will reproduce this observation.

We’ll first need to read the mutation data into R and we’re only interested in those cases where the patient tumour sample was sequenced so we’ll read those in as well.

mutations <- read_csv("metabric_mutations.csv")
cases_sequenced <- read_tsv("cases_nat_comm_2016.txt", col_names = "Patient_ID")

Filter the mutations table to only contain the non-silent mutations in AKT1.

akt1_mutations <- mutations %>%
  filter(Gene == "AKT1", Variant_Classification != "Silent")

Add a logical column to the cases_sequenced data frame called AKT1_mutated that indicates whether there was a non-silent AKT1 mutation detected in each patient tumour sample.

Count the number of patients with and without a non-silent AKT1 mutation.

cases_sequenced <- cases_sequenced %>%
  mutate(AKT1_mutated = Patient_ID %in% akt1_mutations$Patient_ID)
count(cases_sequenced, AKT1_mutated)
## # A tibble: 2 x 2
##   AKT1_mutated     n
##   <lgl>        <int>
## 1 FALSE         2333
## 2 TRUE           100

Hint: use a mutate() with %in%.

Filter the copy number states table to only include entries for ERBB2.

Count the number of samples for each ERBB2 copy number state.

erbb2_copy_number_states <- copy_number_states %>%
  filter(Hugo_Symbol == "ERBB2") %>%
  select(Sample, ERBB2_copy_number_state = Copy_number_state)
count(erbb2_copy_number_states, ERBB2_copy_number_state)
## # A tibble: 4 x 2
##   ERBB2_copy_number_state     n
##   <fct>                   <int>
## 1 loss                      378
## 2 neutral                  1270
## 3 gain                      183
## 4 amplification             342

Join the ERBB2 copy number state table to the table we created before identifying which patient has an AKT1 mutation. The combined data frame should only contain entries for patients that have been sequenced and for which copy number data are available.

erbb2_akt1_association <- inner_join(cases_sequenced, erbb2_copy_number_states, by = c("Patient_ID" = "Sample"))
erbb2_akt1_association
## # A tibble: 2,097 x 3
##    Patient_ID AKT1_mutated ERBB2_copy_number_state
##    <chr>      <lgl>        <fct>                  
##  1 MB-0000    FALSE        neutral                
##  2 MB-0002    FALSE        loss                   
##  3 MB-0005    FALSE        neutral                
##  4 MB-0006    FALSE        neutral                
##  5 MB-0008    FALSE        neutral                
##  6 MB-0010    FALSE        loss                   
##  7 MB-0014    FALSE        neutral                
##  8 MB-0022    FALSE        loss                   
##  9 MB-0025    FALSE        amplification          
## 10 MB-0028    FALSE        gain                   
## # … with 2,087 more rows

Count the numbers of patients with different combinations of ERBB2 copy number state and mutated AKT1.

count(erbb2_akt1_association, ERBB2_copy_number_state, AKT1_mutated)
## # A tibble: 8 x 3
##   ERBB2_copy_number_state AKT1_mutated     n
##   <fct>                   <lgl>        <int>
## 1 loss                    FALSE          355
## 2 loss                    TRUE            21
## 3 neutral                 FALSE         1155
## 4 neutral                 TRUE            62
## 5 gain                    FALSE          174
## 6 gain                    TRUE             2
## 7 amplification           FALSE          327
## 8 amplification           TRUE             1

Does this reflect the observation of mutual exclusivity of ERBB2 amplification and AKT1 mutations?

Create a logical column called ERBB2_amplified indicated which patient samples have amplifications (copy number state 2) in ERBB2.

erbb2_akt1_association <- mutate(erbb2_akt1_association, ERBB2_amplified = ERBB2_copy_number_state == "amplification")
erbb2_akt1_association
## # A tibble: 2,097 x 4
##    Patient_ID AKT1_mutated ERBB2_copy_number_state ERBB2_amplified
##    <chr>      <lgl>        <fct>                   <lgl>          
##  1 MB-0000    FALSE        neutral                 FALSE          
##  2 MB-0002    FALSE        loss                    FALSE          
##  3 MB-0005    FALSE        neutral                 FALSE          
##  4 MB-0006    FALSE        neutral                 FALSE          
##  5 MB-0008    FALSE        neutral                 FALSE          
##  6 MB-0010    FALSE        loss                    FALSE          
##  7 MB-0014    FALSE        neutral                 FALSE          
##  8 MB-0022    FALSE        loss                    FALSE          
##  9 MB-0025    FALSE        amplification           TRUE           
## 10 MB-0028    FALSE        gain                    FALSE          
## # … with 2,087 more rows

Create a 2-by-2 contingency table using table() and the AKT1_mutated and ERBB2_amplified columns.

contingency_table <- erbb2_akt1_association %>%
  select(ERBB2_amplified, AKT1_mutated) %>%
  table()
contingency_table
##                AKT1_mutated
## ERBB2_amplified FALSE TRUE
##           FALSE  1684   85
##           TRUE    327    1

Perform a Fisher exact test on this contingency table.

fisher.test(contingency_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 1.128e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.001515843 0.349760112
## sample estimates:
## odds ratio 
## 0.06061971