Session 5 exercise: building a gene symbol lookup

We commonly work with gene symbols but these are not a very precise way of referring to specific genes, with the same symbol used in some cases as an alias for more than one gene.

There are lots of ways of mapping gene symbols to gene entries in popular databases or resouces, such as Ensembl or RefSeq, but in this exercise we will use a resource provided by the HUGO Gene Nomenclature Committee.

The aim is to create a lookup that gives Ensembl and RefSeq accession ids for gene symbols.

Download the HGNC complete gene set

In a web browser, search for and navigate to the HGNC website. Go to the ‘Statistics and download files’ page from the ‘Downloads’ drop-down menu. Towards the bottom of the page is a section called ‘Complete dataset download links’. Download the Complete HGNC dataset as a tab-separated text file (‘TXT’ button). You may need to right click and choose ‘Save link as’ depending on what brower you’re using. The file is called hgnc_complete_set.txt.

Load the hgnc_complete_set.txt file into a data frame

library(tidyverse)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

genes <- read_tsv("hgnc_complete_set.txt")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   gene_family = col_logical(),
##   gene_family_id = col_logical(),
##   date_approved_reserved = col_date(format = ""),
##   date_symbol_changed = col_date(format = ""),
##   date_name_changed = col_date(format = ""),
##   date_modified = col_date(format = ""),
##   entrez_id = col_double(),
##   mirbase = col_logical(),
##   homeodb = col_double(),
##   snornabase = col_logical(),
##   bioparadigms_slc = col_logical(),
##   orphanet = col_double(),
##   horde_id = col_logical(),
##   imgt = col_logical(),
##   kznf_gene_catalog = col_logical(),
##   `mamit-trnadb` = col_logical(),
##   intermediate_filament_db = col_logical(),
##   rna_central_ids = col_logical(),
##   gtrnadb = col_logical()
## )

## See spec(...) for full column specifications.

## Warning: 5662 parsing failures.
##  row                      col           expected    actual                    file
## 2200 kznf_gene_catalog        1/0/T/F/TRUE/FALSE 404       'hgnc_complete_set.txt'
## 2210 kznf_gene_catalog        1/0/T/F/TRUE/FALSE 341       'hgnc_complete_set.txt'
## 2211 kznf_gene_catalog        1/0/T/F/TRUE/FALSE 90        'hgnc_complete_set.txt'
## 2276 intermediate_filament_db 1/0/T/F/TRUE/FALSE HGNC:1040 'hgnc_complete_set.txt'
## 2277 intermediate_filament_db 1/0/T/F/TRUE/FALSE HGNC:1041 'hgnc_complete_set.txt'
## .... ........................ .................. ......... .......................
## See problems(...) for more details.

Select the columns of interest

These are the HGNC id, the RefSeq and Ensembl accession numbers, and 3 columns containing gene symbols - one for the primary symbol recognized by HGNC, another for symbols that were used as the primary symbol in previous releases of the resource, and one final column containing aliases.

It will be useful at this point to rename the 3 gene symbol columns to ‘primary’, ‘previous’ and ‘alias’

genes <- genes %>%
  select(
    hgnc_id,
    refseq_accession,
    ensembl_gene_id,
    primary = symbol,
    previous = prev_symbol,
    alias = alias_symbol
  )

Gather the 3 gene symbol columns into a single column

Use the gather function to create ‘symbol’ and ‘class’ columns.

It would be very helpful here to retain the primary HGNC gene symbol column so apply a mutate to copy this column to a new column called ‘hgnc_symbol’ before calling gather.

genes <- genes %>%
  mutate(hgnc_symbol = primary) %>%
  gather(key = class, value = symbol, primary, previous, alias, na.rm = TRUE)

Filter for missing values.

Genes with no previously used symbol or aliases will have rows with NA values in the symbol column. Remove these.

genes <- genes %>%
  filter(!is.na(symbol))

Note that this step is not necessary if na.rm = TRUE used in gather previously.

Split multiple gene symbols into separate rows.

Some genes have multiple gene symbol aliases that are separated by the ‘|’ character. These need to be split into separate rows. Look up the list of functions provided by the tidyr package using the Help tab in RStudio and find the function you need to do this.

The separator you will need to provide is a regular expression and because the ‘|’ character means something special within regular expression you will need to tell it that you really mean the ‘|’ character by prefixing it with “\”, i.e. sep = "\\|").

genes <- genes %>%
  separate_rows(symbol, sep = "\\|")

Assign an order of precedence base on the class of the symbol

Some gene symbols map to multiple genes. A primary gene symbol should take precedence over a previously used symbol and an alias should be lower priority still. To enable us to prioritize gene symbol mappings based on this ordering we need to turn our ‘class’ column into a factor with the levels defining the order or precedence.

genes <- genes %>%
  mutate(class = factor(class, levels = c("primary", "previous", "alias")))

Assign a ranking to the mapping for each gene symbol based on the order of precedence

Each gene symbol needs to be treated separately, i.e. as a group. Within each group the rows need to be sorted by the order of precedence of the type of symbol and then a rank or priority assigned where 1 would be our favoured mapping, 2 our next preferred, etc.

Hint: use the row_number function

lookup <- genes %>%
  group_by(symbol) %>%
  arrange(class) %>%
  mutate(ranking = row_number()) %>%
  ungroup() %>%
  select(symbol, ranking, class, everything()) %>%
  arrange(symbol, ranking)

Check if you need to remove grouping in your data frame after adding the ranking.

Use the gene symbol lookup table

We now have our gene symbol lookup table and can use it. See which genes the following symbols map to.

KRAS
MTCP1
CASP
HSR1
your favourite gene

lookup %>%
  filter(symbol %in% c("KRAS", "MTCP1", "CASP", "HSR1", "TP53"))

## # A tibble: 10 x 7
##    symbol ranking class hgnc_id refseq_accession ensembl_gene_id
##    <chr>    <int> <fct> <chr>   <chr>            <chr>          
##  1 CASP         1 alias HGNC:2… NM_006371        ENSG00000170275
##  2 CASP         2 alias HGNC:2… NM_001913        ENSG00000257923
##  3 CASP         3 alias HGNC:9… NM_004288        ENSG00000115165
##  4 HSR1         1 prim… HGNC:3… <NA>             <NA>           
##  5 HSR1         2 alias HGNC:4… NM_005275        ENSG00000204590
##  6 KRAS         1 prim… HGNC:6… NM_033360        ENSG00000133703
##  7 MTCP1        1 prim… HGNC:7… NM_001018025     ENSG00000214827
##  8 MTCP1        2 prev… HGNC:3… NM_001018024.2   ENSG00000182712
##  9 MTCP1        3 alias HGNC:1… NM_002351        ENSG00000183918
## 10 TP53         1 prim… HGNC:1… NM_000546        ENSG00000141510
## # … with 1 more variable: hgnc_symbol <chr>

Note that using the row_number function means that matches for symbols with the same class are still given different rankings.

lookup %>%
  filter(symbol == "CYP2A")

## # A tibble: 4 x 7
##   symbol ranking class hgnc_id refseq_accession ensembl_gene_id hgnc_symbol
##   <chr>    <int> <fct> <chr>   <chr>            <chr>           <chr>      
## 1 CYP2A        1 prim… HGNC:2… <NA>             <NA>            CYP2A      
## 2 CYP2A        2 alias HGNC:2… NM_000762        ENSG00000255974 CYP2A6     
## 3 CYP2A        3 alias HGNC:2… NM_030589        ENSG00000198077 CYP2A7     
## 4 CYP2A        4 alias HGNC:2… NM_000766        ENSG00000197838 CYP2A13

We may instead want to have ties for those matches with the same class and can use the min_rank function for this.

lookup <- genes %>%
  group_by(symbol) %>%
  mutate(ranking = min_rank(class)) %>%
  ungroup() %>%
  select(symbol, ranking, class, everything()) %>%
  arrange(symbol, ranking)

lookup %>%
  filter(symbol == "CYP2A")

## # A tibble: 4 x 7
##   symbol ranking class hgnc_id refseq_accession ensembl_gene_id hgnc_symbol
##   <chr>    <int> <fct> <chr>   <chr>            <chr>           <chr>      
## 1 CYP2A        1 prim… HGNC:2… <NA>             <NA>            CYP2A      
## 2 CYP2A        2 alias HGNC:2… NM_000762        ENSG00000255974 CYP2A6     
## 3 CYP2A        2 alias HGNC:2… NM_030589        ENSG00000198077 CYP2A7     
## 4 CYP2A        2 alias HGNC:2… NM_000766        ENSG00000197838 CYP2A13

Using just the preferred mappings, compute the number of symbols in each class and the percentage of those that map to a HGNC gene with an Ensembl cross-reference

lookup %>%
  filter(ranking == 1) %>%
  group_by(class) %>%
  summarize(n = n(), percent = 100 * mean(!is.na(ensembl_gene_id)))

## # A tibble: 3 x 3
##   class        n percent
##   <fct>    <int>   <dbl>
## 1 primary  43053    88.1
## 2 previous 14366    92.4
## 3 alias    40798    96.6

Session 5 exercise: building a gene symbol lookup

Last modified: 31 May 2019