RNA-seq Analysis in R

# load the packages used below
library(biomaRt)
library(tidyverse)
# load data
load("../../Course_Materials/Robjects/DE.RData")
# set up connection to the Ensembl database
ensembl <- useEnsembl(biomart = "genes")
# specify a data set to use
ensembl <- useDataset("mmusculus_gene_ensembl", mart = ensembl)

Challenge 1

That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.

Search the attributes and add the following to our list of attributes:

The gene description

The gene biotype

# check the available "attributes" - things you can retrieve
listAttributes(ensembl) %>%
    filter(str_detect(name, "description"))

##                          name                                 description
## 1                 description                            Gene description
## 2       phenotype_description                       Phenotype description
## 3      goslim_goa_description                      GOSlim GOA Description
## 4             mgi_description                             MGI description
## 5      entrezgene_description NCBI gene (formerly Entrezgene) description
## 6        wikigene_description                        WikiGene description
## 7          family_description                  Ensembl Family Description
## 8  interpro_short_description                  Interpro Short Description
## 9        interpro_description                        Interpro Description
## 10                description                            Gene description
## 11                description                            Gene description
## 12                description                            Gene description
## 13         source_description                  Variant source description
## 14                description                            Gene description
##            page
## 1  feature_page
## 2  feature_page
## 3  feature_page
## 4  feature_page
## 5  feature_page
## 6  feature_page
## 7  feature_page
## 8  feature_page
## 9  feature_page
## 10    structure
## 11     homologs
## 12          snp
## 13          snp
## 14    sequences

listAttributes(ensembl) %>%
    filter(str_detect(name, "biotype"))

##                 name     description         page
## 1       gene_biotype       Gene type feature_page
## 2 transcript_biotype Transcript type feature_page
## 3       gene_biotype       Gene type    structure
## 4       gene_biotype       Gene type    sequences
## 5 transcript_biotype Transcript type    sequences
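
The two searches above can also be combined into a single regular expression. The sketch below runs on a small hand-made table that mimics the columns returned by `listAttributes()`; the `toyAttributes` object is a hypothetical stand-in, not real BioMart output, so the example runs without a network connection.

```r
library(dplyr)
library(stringr)

# hypothetical stand-in for listAttributes(ensembl)
toyAttributes <- data.frame(
    name = c("ensembl_gene_id", "description",
             "gene_biotype", "transcript_biotype"),
    description = c("Gene stable ID", "Gene description",
                    "Gene type", "Transcript type"),
    page = "feature_page",
    stringsAsFactors = FALSE
)

# keep attributes whose name matches either pattern
matchedAttributes <- toyAttributes %>%
    filter(str_detect(name, "description|biotype"))

matchedAttributes$name
```

With the real mart, the same `filter()` call would be applied to `listAttributes(ensembl)` instead of the toy table.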

Query BioMart using all of the genes in our results table (resLvV)

# set attributes
attributeNames <- c('ensembl_gene_id',
                    'entrezgene_id',
                    'external_gene_name',
                    'description',
                    'gene_biotype')

# Set the filter type and values
ourFilterType <- "ensembl_gene_id"

# set the values for the filter
filterValues <- rownames(resLvV)

# run the query
annot <- getBM(attributes=attributeNames,
               filters = ourFilterType,
               values = filterValues,
               mart = ensembl)
head(annot)

##      ensembl_gene_id entrezgene_id external_gene_name
## 1 ENSMUSG00000000134        209446               Tfe3
## 2 ENSMUSG00000000194        277463             Gpr107
## 3 ENSMUSG00000000223         13497               Drp2
## 4 ENSMUSG00000000247         16870               Lhx2
## 5 ENSMUSG00000000266         23947               Mid2
## 6 ENSMUSG00000000305         12561               Cdh4
##                                                          description
## 1          transcription factor E3 [Source:MGI Symbol;Acc:MGI:98511]
## 2 G protein-coupled receptor 107 [Source:MGI Symbol;Acc:MGI:2139054]
## 3    dystrophin related protein 2 [Source:MGI Symbol;Acc:MGI:107432]
## 4           LIM homeobox protein 2 [Source:MGI Symbol;Acc:MGI:96785]
## 5                      midline 2 [Source:MGI Symbol;Acc:MGI:1344333]
## 6                       cadherin 4 [Source:MGI Symbol;Acc:MGI:99218]
##     gene_biotype
## 1 protein_coding
## 2 protein_coding
## 3 protein_coding
## 4 protein_coding
## 5 protein_coding
## 6 protein_coding

How many Ensembl genes have multiple Entrez IDs associated with them?

# count Ensembl gene IDs that appear on more than one row of annot
annot %>%
  add_count(ensembl_gene_id) %>% 
  filter(n>1) %>% 
  distinct(ensembl_gene_id) %>% 
  nrow()

## [1] 64
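
The solution above counts rows per Ensembl ID. As a cross-check, one could instead count the distinct, non-missing Entrez IDs per gene, which also shows which genes are affected. This is a minimal sketch on a hypothetical table (`toyAnnot` and its gene names are made up for illustration); it is an alternative formulation, not the course's own method.

```r
library(dplyr)

# hypothetical annotation table: geneB maps to two Entrez IDs
toyAnnot <- data.frame(
    ensembl_gene_id = c("geneA", "geneB", "geneB", "geneC"),
    entrezgene_id = c(101, 202, 203, NA),
    stringsAsFactors = FALSE
)

# number of distinct non-missing Entrez IDs per Ensembl gene,
# keeping only genes with more than one
multiMapped <- toyAnnot %>%
    group_by(ensembl_gene_id) %>%
    summarise(nEntrez = n_distinct(entrezgene_id, na.rm = TRUE)) %>%
    filter(nEntrez > 1)

multiMapped
```

Replacing `toyAnnot` with `annot` and calling `nrow(multiMapped)` should give the same kind of count as the solution, provided the repeated rows differ only in their Entrez ID.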

How many Ensembl genes in resLvV don’t have any annotation?

length(filterValues) - length(unique(annot$ensembl_gene_id))

## [1] 142
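
The subtraction above works because the row names of the results table are unique and every returned ID was part of the query. To see which IDs are missing rather than just how many, `setdiff()` can be used. The sketch below uses hypothetical vectors (`toyFilterValues` and `toyReturnedIds` are invented for illustration).

```r
# hypothetical query IDs and the IDs that came back from the query
toyFilterValues <- c("geneA", "geneB", "geneC", "geneD")
toyReturnedIds <- c("geneA", "geneB", "geneB")

# IDs that were queried but are absent from the annotation
missingIds <- setdiff(toyFilterValues, toyReturnedIds)

length(missingIds)
missingIds
```

With the real objects, the equivalent call would be `setdiff(filterValues, annot$ensembl_gene_id)`.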

Note: The answers to the last two questions, (c) and (d), may change depending on the Ensembl release that BioMart is serving. This HTML was generated with release 101.

Annotation and Visualisation of RNA-seq results - Solutions

Stephane Ballereau, Dominique-Laurent Couturier, Mark Dunning, Abbi Edwards, Ashley Sawle

Last modified: 13 Nov 2020