RNA-seq Analysis in R

Challenge 1

That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.

Search the attributes and add the following to our list of attributes:

The gene description

The gene biotype

Query BioMart using all of the genes in our results table (resLvV)

How many Ensembl genes have multipe Entrez IDs associated with them?

filterValues <- rownames(resLvV)

# check the available "attributes" - things you can retreive
listAttributes(ensembl) %>%
    filter(str_detect(name, "description"))

##                          name                description         page
## 1                 description           Gene description feature_page
## 2       phenotype_description      Phenotype description feature_page
## 3      goslim_goa_description     GOSlim GOA Description feature_page
## 4             mgi_description            MGI description feature_page
## 5      entrezgene_description      NCBI gene description feature_page
## 6        wikigene_description       WikiGene description feature_page
## 7          family_description Ensembl Family Description feature_page
## 8  interpro_short_description Interpro Short Description feature_page
## 9        interpro_description       Interpro Description feature_page
## 10                description           Gene description    structure
## 11                description           Gene description     homologs
## 12                description           Gene description          snp
## 13         source_description Variant source description          snp
## 14                description           Gene description    sequences

listAttributes(ensembl) %>%
    filter(str_detect(name, "biotype"))

##                 name     description         page
## 1       gene_biotype       Gene type feature_page
## 2 transcript_biotype Transcript type feature_page
## 3       gene_biotype       Gene type    structure
## 4       gene_biotype       Gene type    sequences
## 5 transcript_biotype Transcript type    sequences

attributeNames <- c('ensembl_gene_id',
                    'entrezgene_id',
                    'external_gene_name',
                    'description',
                    'gene_biotype')

# run the query
annot <- getBM(attributes=attributeNames,
               filters = ourFilterType,
               values = filterValues,
               mart = ensembl)

# dulicate ids
annot %>%
  add_count(ensembl_gene_id) %>% 
  filter(n>1) %>% 
  distinct(ensembl_gene_id) %>% 
  nrow()

## [1] 97

# missing genes
missingGenes <- !rownames(resLvV)%in%annot$ensembl_gene_id
rownames(resLvV)[missingGenes]

## character(0)

Challenge 2

Use the log2 fold change (logFC) on the x-axis, and use -log10(FDR) on the y-axis. (This >-log10 transformation is commonly used for p-values as it means that more significant genes have a >higher scale)

Create a column of -log10(FDR) values

Create a plot with points coloured by if FDR < 0.05

# first remove the filtered genes (FDR=NA) and create a -log10(FDR) column
filtTab <- shrinkLvV %>% 
    filter(!is.na(FDR)) %>% 
    mutate(`-log10(FDR)` = -log10(FDR))

ggplot(filtTab, aes(x = logFC, y=`-log10(FDR)`)) + 
    geom_point(aes(colour=FDR < 0.05), size=1)

Challenge 3 {.challenge} - In Supplementary Materials

Use the txMm to retrieve the exon coordinates for the genes:

ENSMUSG00000021604

ENSMUSG00000022146

ENSMUSG00000040118

keyList <- c("ENSMUSG00000021604", "ENSMUSG00000022146", "ENSMUSG00000040118")

AnnotationDbi::select(txMm, 
                      keys=keyList, 
                      keytype = "GENEID", 
                      columns=c("TXNAME", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND", "TXTYPE"))

##                GENEID             TXNAME     TXTYPE TXCHROM TXSTRAND
## 1  ENSMUSG00000021604 ENSMUST00000176684 transcript      13        +
## 2  ENSMUSG00000021604 ENSMUST00000022095 transcript      13        +
## 3  ENSMUSG00000022146 ENSMUST00000022746 transcript      15        -
## 4  ENSMUSG00000022146 ENSMUST00000176826 transcript      15        -
## 5  ENSMUSG00000022146 ENSMUST00000176554 transcript      15        -
## 6  ENSMUSG00000022146 ENSMUST00000175862 transcript      15        -
## 7  ENSMUSG00000022146 ENSMUST00000177478 transcript      15        -
## 8  ENSMUSG00000022146 ENSMUST00000177263 transcript      15        -
## 9  ENSMUSG00000040118 ENSMUST00000167946 transcript       5        +
## 10 ENSMUSG00000040118 ENSMUST00000101581 transcript       5        +
## 11 ENSMUSG00000040118 ENSMUST00000039370 transcript       5        +
## 12 ENSMUSG00000040118 ENSMUST00000180204 transcript       5        +
## 13 ENSMUSG00000040118 ENSMUST00000199704 transcript       5        +
## 14 ENSMUSG00000040118 ENSMUST00000078272 transcript       5        +
## 15 ENSMUSG00000040118 ENSMUST00000115281 transcript       5        +
## 16 ENSMUSG00000040118 ENSMUST00000196750 transcript       5        +
## 17 ENSMUSG00000040118 ENSMUST00000200270 transcript       5        +
## 18 ENSMUSG00000040118 ENSMUST00000200158 transcript       5        +
## 19 ENSMUSG00000040118 ENSMUST00000200294 transcript       5        +
## 20 ENSMUSG00000040118 ENSMUST00000199236 transcript       5        +
##     TXSTART    TXEND
## 1  73260479 73269608
## 2  73260497 73269608
## 3   6813577  6874969
## 4   6815037  6874969
## 5   6820637  6824595
## 6   6836758  6874268
## 7   6843969  6874257
## 8   6854987  6874296
## 9  15934691 16374511
## 10 15934788 16371051
## 11 15934788 16371069
## 12 15934788 16371069
## 13 15934788 16371069
## 14 15934788 16374504
## 15 15934829 16370727
## 16 15934911 16089022
## 17 16025714 16268604
## 18 16325985 16329883
## 19 16326151 16341059
## 20 16361395 16362326

RNA-seq Analysis in R

Annotation and Visualisation of RNA-seq results - Solutions

Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle

Last modified: 17 Jul 2019

Challenge 1

Challenge 2

Challenge 3 {.challenge} - In Supplementary Materials