We commonly work with gene symbols but these are not a very precise way of referring to specific genes, with the same symbol used in some cases as an alias for more than one gene.
There are lots of ways of mapping gene symbols to gene entries in popular databases or resouces, such as Ensembl or RefSeq, but in this exercise we will use a resource provided by the HUGO Gene Nomenclature Committee.
The aim is to create a lookup that gives Ensembl and RefSeq accession ids for gene symbols.
In a web browser, search for and navigate to the HGNC website. Go to the ‘Statistics and download files’ page from the ‘Downloads’ drop-down menu. Towards the bottom of the page is a section called ‘Complete dataset download links’. Download the Complete HGNC dataset as a tab-separated text file (‘TXT’ button). You may need to right click and choose ‘Save link as’ depending on what brower you’re using. The file is called hgnc_complete_set.txt
.
hgnc_complete_set.txt
file into a data framelibrary(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
genes <- read_tsv("hgnc_complete_set.txt")
## Parsed with column specification:
## cols(
## .default = col_character(),
## gene_family = col_logical(),
## gene_family_id = col_logical(),
## date_approved_reserved = col_date(format = ""),
## date_symbol_changed = col_date(format = ""),
## date_name_changed = col_date(format = ""),
## date_modified = col_date(format = ""),
## entrez_id = col_double(),
## mirbase = col_logical(),
## homeodb = col_double(),
## snornabase = col_logical(),
## bioparadigms_slc = col_logical(),
## orphanet = col_double(),
## horde_id = col_logical(),
## imgt = col_logical(),
## kznf_gene_catalog = col_logical(),
## `mamit-trnadb` = col_logical(),
## intermediate_filament_db = col_logical(),
## rna_central_ids = col_logical(),
## gtrnadb = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 5662 parsing failures.
## row col expected actual file
## 2200 kznf_gene_catalog 1/0/T/F/TRUE/FALSE 404 'hgnc_complete_set.txt'
## 2210 kznf_gene_catalog 1/0/T/F/TRUE/FALSE 341 'hgnc_complete_set.txt'
## 2211 kznf_gene_catalog 1/0/T/F/TRUE/FALSE 90 'hgnc_complete_set.txt'
## 2276 intermediate_filament_db 1/0/T/F/TRUE/FALSE HGNC:1040 'hgnc_complete_set.txt'
## 2277 intermediate_filament_db 1/0/T/F/TRUE/FALSE HGNC:1041 'hgnc_complete_set.txt'
## .... ........................ .................. ......... .......................
## See problems(...) for more details.
These are the HGNC id, the RefSeq and Ensembl accession numbers, and 3 columns containing gene symbols - one for the primary symbol recognized by HGNC, another for symbols that were used as the primary symbol in previous releases of the resource, and one final column containing aliases.
It will be useful at this point to rename the 3 gene symbol columns to ‘primary’, ‘previous’ and ‘alias’
genes <- genes %>%
select(
hgnc_id,
refseq_accession,
ensembl_gene_id,
primary = symbol,
previous = prev_symbol,
alias = alias_symbol
)
Use the gather
function to create ‘symbol’ and ‘class’ columns.
It would be very helpful here to retain the primary HGNC gene symbol column so apply a mutate
to copy this column to a new column called ‘hgnc_symbol’ before calling gather
.
genes <- genes %>%
mutate(hgnc_symbol = primary) %>%
gather(key = class, value = symbol, primary, previous, alias, na.rm = TRUE)
Genes with no previously used symbol or aliases will have rows with NA
values in the symbol column. Remove these.
genes <- genes %>%
filter(!is.na(symbol))
Note that this step is not necessary if na.rm = TRUE
used in gather
previously.
Some genes have multiple gene symbol aliases that are separated by the ‘|’ character. These need to be split into separate rows. Look up the list of functions provided by the tidyr package using the Help tab in RStudio and find the function you need to do this.
The separator you will need to provide is a regular expression and because the ‘|’ character means something special within regular expression you will need to tell it that you really mean the ‘|’ character by prefixing it with “\”, i.e. sep = "\\|"
).
genes <- genes %>%
separate_rows(symbol, sep = "\\|")
Some gene symbols map to multiple genes. A primary gene symbol should take precedence over a previously used symbol and an alias should be lower priority still. To enable us to prioritize gene symbol mappings based on this ordering we need to turn our ‘class’ column into a factor with the levels defining the order or precedence.
genes <- genes %>%
mutate(class = factor(class, levels = c("primary", "previous", "alias")))
Each gene symbol needs to be treated separately, i.e. as a group. Within each group the rows need to be sorted by the order of precedence of the type of symbol and then a rank or priority assigned where 1 would be our favoured mapping, 2 our next preferred, etc.
Hint: use the row_number
function
lookup <- genes %>%
group_by(symbol) %>%
arrange(class) %>%
mutate(ranking = row_number()) %>%
ungroup() %>%
select(symbol, ranking, class, everything()) %>%
arrange(symbol, ranking)
Check if you need to remove grouping in your data frame after adding the ranking.
We now have our gene symbol lookup table and can use it. See which genes the following symbols map to.
lookup %>%
filter(symbol %in% c("KRAS", "MTCP1", "CASP", "HSR1", "TP53"))
## # A tibble: 10 x 7
## symbol ranking class hgnc_id refseq_accession ensembl_gene_id
## <chr> <int> <fct> <chr> <chr> <chr>
## 1 CASP 1 alias HGNC:2… NM_006371 ENSG00000170275
## 2 CASP 2 alias HGNC:2… NM_001913 ENSG00000257923
## 3 CASP 3 alias HGNC:9… NM_004288 ENSG00000115165
## 4 HSR1 1 prim… HGNC:3… <NA> <NA>
## 5 HSR1 2 alias HGNC:4… NM_005275 ENSG00000204590
## 6 KRAS 1 prim… HGNC:6… NM_033360 ENSG00000133703
## 7 MTCP1 1 prim… HGNC:7… NM_001018025 ENSG00000214827
## 8 MTCP1 2 prev… HGNC:3… NM_001018024.2 ENSG00000182712
## 9 MTCP1 3 alias HGNC:1… NM_002351 ENSG00000183918
## 10 TP53 1 prim… HGNC:1… NM_000546 ENSG00000141510
## # … with 1 more variable: hgnc_symbol <chr>
Note that using the row_number
function means that matches for symbols with the same class are still given different rankings.
lookup %>%
filter(symbol == "CYP2A")
## # A tibble: 4 x 7
## symbol ranking class hgnc_id refseq_accession ensembl_gene_id hgnc_symbol
## <chr> <int> <fct> <chr> <chr> <chr> <chr>
## 1 CYP2A 1 prim… HGNC:2… <NA> <NA> CYP2A
## 2 CYP2A 2 alias HGNC:2… NM_000762 ENSG00000255974 CYP2A6
## 3 CYP2A 3 alias HGNC:2… NM_030589 ENSG00000198077 CYP2A7
## 4 CYP2A 4 alias HGNC:2… NM_000766 ENSG00000197838 CYP2A13
We may instead want to have ties for those matches with the same class and can use the min_rank
function for this.
lookup <- genes %>%
group_by(symbol) %>%
mutate(ranking = min_rank(class)) %>%
ungroup() %>%
select(symbol, ranking, class, everything()) %>%
arrange(symbol, ranking)
lookup %>%
filter(symbol == "CYP2A")
## # A tibble: 4 x 7
## symbol ranking class hgnc_id refseq_accession ensembl_gene_id hgnc_symbol
## <chr> <int> <fct> <chr> <chr> <chr> <chr>
## 1 CYP2A 1 prim… HGNC:2… <NA> <NA> CYP2A
## 2 CYP2A 2 alias HGNC:2… NM_000762 ENSG00000255974 CYP2A6
## 3 CYP2A 2 alias HGNC:2… NM_030589 ENSG00000198077 CYP2A7
## 4 CYP2A 2 alias HGNC:2… NM_000766 ENSG00000197838 CYP2A13
lookup %>%
filter(ranking == 1) %>%
group_by(class) %>%
summarize(n = n(), percent = 100 * mean(!is.na(ensembl_gene_id)))
## # A tibble: 3 x 3
## class n percent
## <fct> <int> <dbl>
## 1 primary 43053 88.1
## 2 previous 14366 92.4
## 3 alias 40798 96.6