This week’s involves transforming two data sets in order to ask a specific question of each. The first is a more complete set of data on the number of new tuberculosis cases each year from 1980 to 2013 in 219 countries, taken from the World Health Organization Global Tuberculosis Report.
For the second “real world” example we return to the METABRIC data set and introduce the other set of assay data obtained as part of the original 2012 study that concerns genomic copy number. We’ll transform the copy number data downloaded from cBioPortal so that we can look at the copy number states for some of our favourite genes and then combine that with the expression data to see how copy number influences gene expression.
The clinical data, gene expression values and mutation data were all downloaded from cBioPortal.
library(tidyverse)
1. Tidy the WHO tuberculosis data set
In this week’s materials we demonstrated the restructuring of different representations of a very small subset of the WHO tuberculosis data set (tables 1 to 4a and b). The more complete data set is also available in its original format in another data frame available as part of the tidyr package, named who
.
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Afghan… AF AFG 1980 NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA
## 5 Afghan… AF AFG 1984 NA NA NA
## 6 Afghan… AF AFG 1985 NA NA NA
## 7 Afghan… AF AFG 1986 NA NA NA
## 8 Afghan… AF AFG 1987 NA NA NA
## 9 Afghan… AF AFG 1988 NA NA NA
## 10 Afghan… AF AFG 1989 NA NA NA
## # … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## # new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## # new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## # newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## # newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## # newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## # newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## # newrel_f65 <int>
What are the variables in this data set?
Create a bulleted list with your answer here.
Hint: see the help page for who
to understand the meaning of the column headings.
Hint: the Markdown Quick Reference available from the Help menu in RStudio is a very useful guide to formatting text in R markdown documents.
Why isn’t this data set tidy and what kinds of analysis are not straightforward with the data in this format?
Transform the data into a tidy format.
Columns new_sp_m014
to newrel_f65
are values for a compound variable, i.e. multiple variables joined together - the restructured data frame should have columns for each of the separated values in these column headings.
Also remove all rows with missing values for the number of tuberculosis cases.
# the hints below lead to the following solution
who_tidy <- who %>%
pivot_longer(new_sp_m014:newrel_f65, names_to = "group", values_to = "cases") %>%
mutate(group = str_remove(group, "new_")) %>%
mutate(group = str_remove(group, "new")) %>%
separate(group, into = c("diagnosis", "group"), sep = "_") %>%
separate(group, into = c("gender", "age_group"), sep = 1) %>%
filter(!is.na(cases))
# this is how it could be done removing missing values as part of the pivot and
# using a more sophisticated regular expression to remove the new_? prefix in a
# single step
who_tidy <- who %>%
pivot_longer(starts_with("new"), names_to = "group", values_to = "cases", values_drop_na = TRUE) %>%
mutate(group = str_remove(group, "^new_?")) %>%
separate(group, into = c("diagnosis", "group"), sep = "_") %>%
separate(group, into = c("gender", "age_group"), sep = 1)
# alternative using str_replace_all to get groups into suitable state for single
# separate operation
who_tidy <- who %>%
pivot_longer(starts_with("new"), names_to = "group", values_to = "cases", values_drop_na = TRUE) %>%
mutate(group = str_replace_all(group, c("newrel" = "rel", "new_" = "", "_m" = "_m_", "_f" = "_f_"))) %>%
separate(group, into = c("diagnosis", "gender", "age_group"), sep = "_")
# alternative using names_pattern in pivot_longer (need to know regular
# expressions for this) avoiding need for separate operation
who_tidy <- who %>%
pivot_longer(
starts_with("new"),
names_pattern = "new_?(.*)_(.)(.*)",
names_to = c("diagnosis", "gender", "age_group"),
values_to = "cases",
values_drop_na = TRUE
)
# convert the age group into a factor and change the levels to a more
# human-readable form
# (could also use factor or recode_factor for this)
who_tidy <- mutate(who_tidy, age_group = fct_recode(age_group, "0-14" = "014", "15-24" = "1524", "25-34" = "2534", "35-44" = "3544", "45-54" = "4554", "55-64" = "5564", "65+" = "65"))
who_tidy
## # A tibble: 76,046 x 8
## country iso2 iso3 year diagnosis gender age_group cases
## <chr> <chr> <chr> <int> <chr> <chr> <fct> <int>
## 1 Afghanistan AF AFG 1997 sp m 0-14 0
## 2 Afghanistan AF AFG 1997 sp m 15-24 10
## 3 Afghanistan AF AFG 1997 sp m 25-34 6
## 4 Afghanistan AF AFG 1997 sp m 35-44 3
## 5 Afghanistan AF AFG 1997 sp m 45-54 5
## 6 Afghanistan AF AFG 1997 sp m 55-64 2
## 7 Afghanistan AF AFG 1997 sp m 65+ 0
## 8 Afghanistan AF AFG 1997 sp f 0-14 5
## 9 Afghanistan AF AFG 1997 sp f 15-24 38
## 10 Afghanistan AF AFG 1997 sp f 25-34 36
## # … with 76,036 more rows
Hint 1: the column headers from new_sp_m014 onwards contain a superfluous prefix “new” or “new_” that you will want to remove; this can be done using one of the stringr functions in a single step if you know about regular expressions but is also achievable in two steps if you don’t use any regular expression magic.
Hint 2: separating two of the variables is a bit tricky because there is no separator character; try looking at the help for the separate function to see how this can be done.
2. Recreate table1
from your tidy version of the WHO tuberculosis data set
Recreate the first 3 columns of table1
, containing a small subset of the WHO tuberculosis data set, from your tidy version of who
.
select(table1, 1:3)
## # A tibble: 6 x 3
## country year cases
## <chr> <int> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
who_tidy %>%
filter(country %in% c("Afghanistan", "Brazil", "China")) %>%
filter(year %in% c(1999, 2000)) %>%
group_by(country, year) %>%
summarise(cases = sum(cases))
## # A tibble: 6 x 3
## # Groups: country [3]
## country year cases
## <chr> <int> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Hint: this will involve some filtering, grouping and summarization.
3. Create time series plots for cases of tuberculosis
Create a time series plot showing the number of tuberculosis cases in the United Kingdom with separate lines for each age group. The year should be on the x axis and the number of cases on the y axis.
who_tidy <- who_tidy %>%
group_by(country, year, age_group) %>%
summarise(cases = sum(cases))
who_tidy %>%
filter(country == "United Kingdom of Great Britain and Northern Ireland") %>%
ggplot() +
geom_line(mapping = aes(x = year, y = cases, colour = age_group))
Hint: we’re not distinguishing between diagnoses and genders in this plot so you’ll need to sum the numbers of cases for each country, year and age group.
Use faceting in ggplot2 to create a series of plots showing the same information for France, Brazil, Japan and Uganda.
The numbers of cases in these countries are on quite different scales. Look at help page for facet_wrap()
and try changing the scales
argument so we can more easily compare the overall patterns between countries.
who_tidy %>%
filter(country %in% c("France", "Brazil", "Japan", "Uganda")) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = cases, colour = age_group)) +
facet_wrap(vars(country), scales = "free_y")
4. Relate METABRIC copy number and expression data
For this exercise we’ll read in the METABRIC mRNA expression data and copy number states into separate data frames. The aim is to see how the gene expression is affected by the copy number state of various genes through a series of box plot.
expression <- read_tsv("metabric_mrna_expression.txt")
The expression data is in a matrix-like format with a row for each sample and a column for each gene. This is the format used when downloading data from cBioPortal after selecting a set of genes.
expression
## # A tibble: 2,509 x 10
## STUDY_ID SAMPLE_ID ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 brca_metabric MB-0000 8.93 9.33 5.68 6.34 5.70 6.93 7.95 9.73
## 2 brca_metabric MB-0002 10.0 9.73 7.51 6.19 5.76 11.3 11.8 12.5
## 3 brca_metabric MB-0005 10.0 9.73 7.38 6.40 6.75 9.29 11.7 10.3
## 4 brca_metabric MB-0006 10.4 10.3 6.82 6.87 7.22 8.67 11.9 10.5
## 5 brca_metabric MB-0008 11.3 9.96 7.33 6.34 5.82 9.72 11.6 12.2
## 6 brca_metabric MB-0010 11.2 9.74 5.95 5.42 6.12 9.79 12.1 11.4
## 7 brca_metabric MB-0014 10.8 9.28 7.72 5.99 7.48 8.37 11.5 10.8
## 8 brca_metabric MB-0020 NA NA NA NA NA NA NA NA
## 9 brca_metabric MB-0022 10.4 8.61 5.59 6.17 7.59 7.87 10.7 9.95
## 10 brca_metabric MB-0025 NA NA NA NA NA NA NA NA
## # … with 2,499 more rows
Convert this data frame into a tidy format with a row-per-sample-per-gene. Remove the rather uninteresting STUDY_ID
column and filter out missing expression values (for those samples that weren’t assayed).
expression <- expression %>%
pivot_longer(ESR1:MLPH, names_to = "Gene", values_to = "Expression", values_drop_na = TRUE) %>%
select(-STUDY_ID)
expression
## # A tibble: 15,232 x 3
## SAMPLE_ID Gene Expression
## <chr> <chr> <dbl>
## 1 MB-0000 ESR1 8.93
## 2 MB-0000 ERBB2 9.33
## 3 MB-0000 PGR 5.68
## 4 MB-0000 TP53 6.34
## 5 MB-0000 PIK3CA 5.70
## 6 MB-0000 GATA3 6.93
## 7 MB-0000 FOXA1 7.95
## 8 MB-0000 MLPH 9.73
## 9 MB-0002 ESR1 10.0
## 10 MB-0002 ERBB2 9.73
## # … with 15,222 more rows
The copy number data are also in a matrix-like format but this time we have a row for each gene and a column for each sample. It is a very wide table containing 2175 columns (only the first 10 displayed below). This is the format of the copy number table you obtain if you download the complete METABRIC dataset from cBioPortal. The table read in is a subset containing rows for the 8 genes we’ve been using up till now.
copy_number_states <- read_tsv("metabric_cna.txt")
select(copy_number_states, 1:10)
## # A tibble: 8 x 10
## Hugo_Symbol Entrez_Gene_Id `MB-0000` `MB-0039` `MB-0045` `MB-0046`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ERBB2 2064 0 0 0 2
## 2 ESR1 2099 0 0 -1 0
## 3 FOXA1 3169 0 0 0 0
## 4 GATA3 2625 0 0 0 0
## 5 MLPH 79083 0 0 0 0
## 6 PGR 5241 0 -1 0 0
## 7 PIK3CA 5290 0 0 2 0
## 8 TP53 7157 0 0 -1 -1
## # … with 4 more variables: `MB-0048` <dbl>, `MB-0050` <dbl>,
## # `MB-0053` <dbl>, `MB-0062` <dbl>
Convert this data frame into a tidy format with a row-per-sample-per-gene where the values are the copy number states:
Remove the Entrez_Gene_id
column and convert the copy number state variable into a factor using the more human-readable names for each state.
# using a range of column names
# copy_number_states <- copy_number_states %>%
# pivot_longer(`MB-0000`:`MB-7188`, names_to = "Sample", values_to = "Copy_number_state") %>%
# select(-Entrez_Gene_Id) %>%
# mutate(Copy_number_state = as_factor(Copy_number_state))
# using numeric indexes for columns instead
# copy_number_states <- copy_number_states %>%
# pivot_longer(3:2175, names_to = "Sample", values_to = "Copy_number_state") %>%
# select(-Entrez_Gene_Id) %>%
# mutate(Copy_number_state = as_factor(Copy_number_state))
# can instead select those columns not to use
# we've also relabeled the copy number state levels (could instead use
# fct_recode or recode_factor)
copy_number_states <- copy_number_states %>%
select(-Entrez_Gene_Id) %>%
pivot_longer(-Hugo_Symbol, names_to = "Sample", values_to = "Copy_number_state") %>%
mutate(Copy_number_state = factor(Copy_number_state, levels = c("-2", "-1", "0", "1", "2"), labels = c("deletion", "loss", "neutral", "gain", "amplification")))
copy_number_states
## # A tibble: 17,384 x 3
## Hugo_Symbol Sample Copy_number_state
## <chr> <chr> <fct>
## 1 ERBB2 MB-0000 neutral
## 2 ERBB2 MB-0039 neutral
## 3 ERBB2 MB-0045 neutral
## 4 ERBB2 MB-0046 amplification
## 5 ERBB2 MB-0048 amplification
## 6 ERBB2 MB-0050 loss
## 7 ERBB2 MB-0053 neutral
## 8 ERBB2 MB-0062 loss
## 9 ERBB2 MB-0064 neutral
## 10 ERBB2 MB-0066 neutral
## # … with 17,374 more rows
Hint: look at the help page for factor()
to see how to set the labels for each copy number state or use fct_recode()
from the forcats
package.
Count the total number of occurrences of each copy number state.
count(copy_number_states, Copy_number_state)
## # A tibble: 5 x 2
## Copy_number_state n
## <fct> <int>
## 1 deletion 10
## 2 loss 3469
## 3 neutral 12077
## 4 gain 1156
## 5 amplification 672
Create a combined data set containing the expression value and copy number state for each patient and gene pairing. The data frame should contain the columns Sample, Gene, Expression and Copy_number_state.
combined <- expression %>%
rename(Sample = SAMPLE_ID) %>%
inner_join(copy_number_states, by = c("Sample", "Gene" = "Hugo_Symbol"))
combined
## # A tibble: 15,232 x 4
## Sample Gene Expression Copy_number_state
## <chr> <chr> <dbl> <fct>
## 1 MB-0000 ESR1 8.93 neutral
## 2 MB-0000 ERBB2 9.33 neutral
## 3 MB-0000 PGR 5.68 neutral
## 4 MB-0000 TP53 6.34 neutral
## 5 MB-0000 PIK3CA 5.70 neutral
## 6 MB-0000 GATA3 6.93 neutral
## 7 MB-0000 FOXA1 7.95 neutral
## 8 MB-0000 MLPH 9.73 neutral
## 9 MB-0002 ESR1 10.0 loss
## 10 MB-0002 ERBB2 9.73 loss
## # … with 15,222 more rows
Create a series of box plots for each of the genes with a box-and-whiskers showing the range of expression values for each copy number state.
ggplot(data = combined, mapping = aes(x = Copy_number_state, y = Expression)) +
geom_boxplot() +
facet_wrap(vars(Gene), scales = "free_y")
Customize this plot by changing the labels, scales, colours and theme as you like – be creative!
Save the plot as a PDF using ggsave()
and attach with the assignment when handing in.
There will be chocolate-based prizes awarded to any assignment handed in with a plot that is prettier than mine (bear in mind that I’m also the judge and jury – Matt :-)
ggplot(data = combined, mapping = aes(x = Copy_number_state, y = Expression, fill = Copy_number_state)) +
geom_boxplot(outlier.size = 0.2) +
facet_wrap(vars(Gene), ncol = 4, scales = "free_y") +
labs(x = NULL) +
scale_x_discrete(labels = c("deletion", "loss", "neutral", "gain", "amplification")) +
scale_fill_brewer(palette = "RdPu") +
labs(title = "Relationship between expression and copy number state", fill = NULL) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
plot.title = element_text(size = 10),
axis.title.y = element_text(size = 8),
axis.text.y = element_text(size = 7),
axis.text.x = element_blank(),
strip.text = element_text(size = 8, face = "italic"),
legend.text = element_text(size = 7),
legend.position = "bottom"
)
ggsave("expression_vs_copy_number_state.pdf", width = 20, height = 12, units = "cm")
5. Extension: Mutual exclusivity of ERBB2 amplifications and AKT1 mutations
Seriously, stop here if you haven’t got the time or inclination to attempt this final, bonus extension exercise (but please still hand in your assignment).
In the METABRIC 2016 paper, Pereira et al. (Nature Communications 7:11479, 2016) considered the association between somatic mutations, exploring patterns of co-mutation and mutual exclusivity. One of their findings was the mutual exclusivity between AKT1 mutations and amplification of ERBB2. For the final exercise we will reproduce this observation.
We’ll first need to read the mutation data into R and we’re only interested in those cases where the patient tumour sample was sequenced so we’ll read those in as well.
mutations <- read_csv("metabric_mutations.csv")
cases_sequenced <- read_tsv("cases_nat_comm_2016.txt", col_names = "Patient_ID")
Filter the mutations table to only contain the non-silent mutations in AKT1.
akt1_mutations <- mutations %>%
filter(Gene == "AKT1", Variant_Classification != "Silent")
Add a logical column to the cases_sequenced
data frame called AKT1_mutated
that indicates whether there was a non-silent AKT1 mutation detected in each patient tumour sample.
Count the number of patients with and without a non-silent AKT1 mutation.
cases_sequenced <- cases_sequenced %>%
mutate(AKT1_mutated = Patient_ID %in% akt1_mutations$Patient_ID)
count(cases_sequenced, AKT1_mutated)
## # A tibble: 2 x 2
## AKT1_mutated n
## <lgl> <int>
## 1 FALSE 2333
## 2 TRUE 100
Hint: use a mutate()
with %in%
.
Filter the copy number states table to only include entries for ERBB2.
Count the number of samples for each ERBB2 copy number state.
erbb2_copy_number_states <- copy_number_states %>%
filter(Hugo_Symbol == "ERBB2") %>%
select(Sample, ERBB2_copy_number_state = Copy_number_state)
count(erbb2_copy_number_states, ERBB2_copy_number_state)
## # A tibble: 4 x 2
## ERBB2_copy_number_state n
## <fct> <int>
## 1 loss 378
## 2 neutral 1270
## 3 gain 183
## 4 amplification 342
Join the ERBB2 copy number state table to the table we created before identifying which patient has an AKT1 mutation. The combined data frame should only contain entries for patients that have been sequenced and for which copy number data are available.
erbb2_akt1_association <- inner_join(cases_sequenced, erbb2_copy_number_states, by = c("Patient_ID" = "Sample"))
erbb2_akt1_association
## # A tibble: 2,097 x 3
## Patient_ID AKT1_mutated ERBB2_copy_number_state
## <chr> <lgl> <fct>
## 1 MB-0000 FALSE neutral
## 2 MB-0002 FALSE loss
## 3 MB-0005 FALSE neutral
## 4 MB-0006 FALSE neutral
## 5 MB-0008 FALSE neutral
## 6 MB-0010 FALSE loss
## 7 MB-0014 FALSE neutral
## 8 MB-0022 FALSE loss
## 9 MB-0025 FALSE amplification
## 10 MB-0028 FALSE gain
## # … with 2,087 more rows
Count the numbers of patients with different combinations of ERBB2 copy number state and mutated AKT1.
count(erbb2_akt1_association, ERBB2_copy_number_state, AKT1_mutated)
## # A tibble: 8 x 3
## ERBB2_copy_number_state AKT1_mutated n
## <fct> <lgl> <int>
## 1 loss FALSE 355
## 2 loss TRUE 21
## 3 neutral FALSE 1155
## 4 neutral TRUE 62
## 5 gain FALSE 174
## 6 gain TRUE 2
## 7 amplification FALSE 327
## 8 amplification TRUE 1
Does this reflect the observation of mutual exclusivity of ERBB2 amplification and AKT1 mutations?
Create a logical column called ERBB2_amplified
indicated which patient samples have amplifications (copy number state 2) in ERBB2.
erbb2_akt1_association <- mutate(erbb2_akt1_association, ERBB2_amplified = ERBB2_copy_number_state == "amplification")
erbb2_akt1_association
## # A tibble: 2,097 x 4
## Patient_ID AKT1_mutated ERBB2_copy_number_state ERBB2_amplified
## <chr> <lgl> <fct> <lgl>
## 1 MB-0000 FALSE neutral FALSE
## 2 MB-0002 FALSE loss FALSE
## 3 MB-0005 FALSE neutral FALSE
## 4 MB-0006 FALSE neutral FALSE
## 5 MB-0008 FALSE neutral FALSE
## 6 MB-0010 FALSE loss FALSE
## 7 MB-0014 FALSE neutral FALSE
## 8 MB-0022 FALSE loss FALSE
## 9 MB-0025 FALSE amplification TRUE
## 10 MB-0028 FALSE gain FALSE
## # … with 2,087 more rows
Create a 2-by-2 contingency table using table()
and the AKT1_mutated
and ERBB2_amplified
columns.
contingency_table <- erbb2_akt1_association %>%
select(ERBB2_amplified, AKT1_mutated) %>%
table()
contingency_table
## AKT1_mutated
## ERBB2_amplified FALSE TRUE
## FALSE 1684 85
## TRUE 327 1
Perform a Fisher exact test on this contingency table.
fisher.test(contingency_table)
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 1.128e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.001515843 0.349760112
## sample estimates:
## odds ratio
## 0.06061971