Preamble
Load a DESeq2 results table to retrieve gene ids from
Retrieve annotation using biomaRt
- Set up connection to ensembl database
- Retrieve the complete annotation
Fix the various One-to-Many relationships

Preamble

This document provides the code used to create the rds file containing the annotation table used in the course. It uses biomaRt to retrieve the annotation from Ensembl. The list of Ensembl IDs to annotate is taken from an existing rds object containing the DESeq2 differential expression results.

This document is not intended as a tutorial; for a detailed explanation of how to use biomaRt, please see the supplementary materials, Annotation with biomaRt. Nor is it intended as a guide for creating your own annotations: some entirely arbitrary decisions were made in order to save effort.

library(DESeq2)
library(biomaRt)
library(tidyverse)

Load a DESeq2 results table to retrieve gene ids from

results.interaction.11 <- readRDS("RObjects/DESeqResults.interaction_d11.rds")

Retrieve annotation using biomaRt

Set up connection to ensembl database

ensembl <- useEnsembl(biomart = 'genes', 
                      dataset = 'mmusculus_gene_ensembl',
                      version = 102)

Retrieve the complete annotation

filterType <- "ensembl_gene_id"
filterValues <- rownames(results.interaction.11)
attributeNames <- c("ensembl_gene_id",
                    "entrezgene_id",
                    "external_gene_name",
                    "description",
                    "gene_biotype",
                    "chromosome_name",
                    "start_position",
                    "end_position",
                    "strand",
                    "entrezgene_accession")

annot <- getBM(attributes=attributeNames,
               filters = filterType,
               values = filterValues,
               mart = ensembl)

This is the complete annotation with all the one-to-many relationships. Export it, so that we won’t have to run the query again if we wish to come back to it.

saveRDS(annot, file="Full_annotation_with_duplicates.rds")
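As a minimal sketch of that reuse, assuming a hypothetical helper `cachedRDS()` (it is not part of biomaRt or of the original workflow), a slow query can be wrapped so that it only runs when the saved file is missing. The stand-in `slowQuery()` and the temporary file are invented for the demonstration.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run `compute()`, save the result and return it.
cachedRDS <- function(path, compute) {
    if (file.exists(path)) {
        return(readRDS(path))
    }
    result <- compute()
    saveRDS(result, file = path)
    result
}

# Toy demonstration with a temporary file and a stand-in for the query
calls <- 0
slowQuery <- function() {
    calls <<- calls + 1
    data.frame(ensembl_gene_id = c("geneA", "geneB"))
}
cacheFile <- tempfile(fileext = ".rds")
first <- cachedRDS(cacheFile, slowQuery)   # runs the stand-in query
second <- cachedRDS(cacheFile, slowQuery)  # read back from disk
```

In this document the `compute` argument would be a function wrapping the `getBM()` call, and `path` would be "Full_annotation_with_duplicates.rds".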

Fix the various One-to-Many relationships

Assess the one-to-many relationships related to duplicated Ensembl IDs

Some Ensembl IDs have multiple Entrez IDs:

annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1) %>% 
    count()

##     n
## 1 128

So 128 rows belong to Ensembl IDs with more than one entry; note that this counts rows, not distinct IDs.
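The pipeline above counts rows rather than genes. A minimal sketch on invented toy data (the `toy` table and its gene labels are made up for illustration) shows the difference between the two counts, using `n_distinct()` from dplyr:

```r
library(dplyr)

# Toy annotation: geneA has three rows, geneB two, geneC one
toy <- data.frame(
    ensembl_gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB", "geneC"),
    entrezgene_id   = c(1, 2, 3, 4, 5, 6)
)

multi <- toy %>%
    add_count(ensembl_gene_id) %>%
    filter(n > 1)

nRows <- nrow(multi)                        # rows belonging to duplicated IDs
nIDs  <- n_distinct(multi$ensembl_gene_id)  # distinct duplicated IDs
```

Here `nRows` is 5 while `nIDs` is 2, because each duplicated gene contributes all of its rows to the first count.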

annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1) %>% 
    select(ensembl_gene_id, external_gene_name, entrezgene_accession)

##        ensembl_gene_id external_gene_name entrezgene_accession
## 1   ENSMUSG00000000562             Adora3               Adora3
## 2   ENSMUSG00000000562             Adora3               Tmigd3
## 3   ENSMUSG00000004455             Ppp1cc               Ppp1cc
## 4   ENSMUSG00000004455             Ppp1cc              Ppp1ccb
## 5   ENSMUSG00000015290              Ubl4a                Ubl4a
## 6   ENSMUSG00000015290              Ubl4a              Gm44504
## 7   ENSMUSG00000021451             Sema4d               Sema4d
## 8   ENSMUSG00000021451             Sema4d         LOC115488166
## 9   ENSMUSG00000022820             Ndufb4               Ndufb4
## 10  ENSMUSG00000022820             Ndufb4              Ndufb4c
## 11  ENSMUSG00000022820             Ndufb4              Ndufb4b
## 12  ENSMUSG00000023156              Rpp14                Rpp14
## 13  ENSMUSG00000023156              Rpp14                 Htd2
## 14  ENSMUSG00000027022              Xirp2                Xirp2
## 15  ENSMUSG00000027022              Xirp2               Gm1322
## 16  ENSMUSG00000032750               Gab3                 Gab3
## 17  ENSMUSG00000032750               Gab3              Gm38425
## 18  ENSMUSG00000038209              Itln1                Itln1
## 19  ENSMUSG00000038209              Itln1                Itlnb
## 20  ENSMUSG00000038244             Mical2               Mical2
## 21  ENSMUSG00000038244             Mical2              Micalcl
## 22  ENSMUSG00000039997             Ifi203               Ifi203
## 23  ENSMUSG00000039997             Ifi203         LOC102641031
## 24  ENSMUSG00000041841              Rpl37                Rpl37
## 25  ENSMUSG00000041841              Rpl37              Rpl37rt
## 26  ENSMUSG00000044783              Hjurp        A730008H23Rik
## 27  ENSMUSG00000044783              Hjurp                Hjurp
## 28  ENSMUSG00000047675               Rps8                 Rps8
## 29  ENSMUSG00000047675               Rps8              Gm15501
## 30  ENSMUSG00000047844               Bex4                 Bex4
## 31  ENSMUSG00000047844               Bex4         LOC115487124
## 32  ENSMUSG00000052305             Hbb-bs               Hbb-b1
## 33  ENSMUSG00000052305             Hbb-bs               Hbb-bs
## 34  ENSMUSG00000052749            Trim30b              Trim30b
## 35  ENSMUSG00000052749            Trim30b              Gm38525
## 36  ENSMUSG00000054128              H2-T3               H2-T18
## 37  ENSMUSG00000054128              H2-T3                H2-T3
## 38  ENSMUSG00000056116             H2-T22               H2-T22
## 39  ENSMUSG00000056116             H2-T22                H2-T9
## 40  ENSMUSG00000056457             Prl2c3               Prl2c3
## 41  ENSMUSG00000056457             Prl2c3               Prl2c4
## 42  ENSMUSG00000056629              Fkbp2                Fkbp2
## 43  ENSMUSG00000056629              Fkbp2         LOC114841036
## 44  ENSMUSG00000060550              H2-Q7                H2-Q7
## 45  ENSMUSG00000060550              H2-Q7                H2-Q9
## 46  ENSMUSG00000062006              Rpl34                Rpl34
## 47  ENSMUSG00000062006              Rpl34            Rpl34-ps1
## 48  ENSMUSG00000062270            Morf4l1              Morf4l1
## 49  ENSMUSG00000062270            Morf4l1             Morf4l1b
## 50  ENSMUSG00000062783              Csprs                Csprs
## 51  ENSMUSG00000062783              Csprs              Gm38510
## 52  ENSMUSG00000063316              Rpl27                Rpl27
## 53  ENSMUSG00000063316              Rpl27         LOC108167922
## 54  ENSMUSG00000063480              Snu13                Snu13
## 55  ENSMUSG00000063480              Snu13         LOC100862468
## 56  ENSMUSG00000070645               Ren1                 Ren1
## 57  ENSMUSG00000070645               Ren1                 Ren2
## 58  ENSMUSG00000071415              Rpl23                Rpl23
## 59  ENSMUSG00000071415              Rpl23         LOC100044627
## 60  ENSMUSG00000071415              Rpl23         LOC100862455
## 61  ENSMUSG00000072674             Plac9b               Gm9780
## 62  ENSMUSG00000072674             Plac9b               Plac9b
## 63  ENSMUSG00000073888             Ccl27a               Ccl27a
## 64  ENSMUSG00000073888             Ccl27a              Gm13306
## 65  ENSMUSG00000073888             Ccl27a               Ccl27b
## 66  ENSMUSG00000073888             Ccl27a         LOC100861978
## 67  ENSMUSG00000074141              Il4i1                Il4i1
## 68  ENSMUSG00000074141              Il4i1               Il4i1b
## 69  ENSMUSG00000074417            Gm14548               Pira11
## 70  ENSMUSG00000074417            Gm14548                Pira6
## 71  ENSMUSG00000074417            Gm14548                Pira7
## 72  ENSMUSG00000074417            Gm14548              Gm14548
## 73  ENSMUSG00000074419            Gm15448                Pira6
## 74  ENSMUSG00000074419            Gm15448              Gm15448
## 75  ENSMUSG00000075046              Duxf3                Duxf3
## 76  ENSMUSG00000075046              Duxf3                  Dux
## 77  ENSMUSG00000078452             Raet1d               Raet1a
## 78  ENSMUSG00000078452             Raet1d               Raet1b
## 79  ENSMUSG00000078452             Raet1d               Raet1c
## 80  ENSMUSG00000078452             Raet1d               Raet1d
## 81  ENSMUSG00000078485            Plekhn1              Plekhn1
## 82  ENSMUSG00000078485            Plekhn1               Klhl17
## 83  ENSMUSG00000078688               Mup2                 Mup2
## 84  ENSMUSG00000078688               Mup2                Mup13
## 85  ENSMUSG00000078817             Nlrp12               Nlrp12
## 86  ENSMUSG00000078817             Nlrp12         LOC115489386
## 87  ENSMUSG00000078878            Gm14305              Gm14432
## 88  ENSMUSG00000078878            Gm14305              Gm14305
## 89  ENSMUSG00000078899             Gm4631              Gm14430
## 90  ENSMUSG00000078899             Gm4631         LOC102633156
## 91  ENSMUSG00000078941                Ak6                 Taf9
## 92  ENSMUSG00000078941                Ak6                  Ak6
## 93  ENSMUSG00000079033              Mef2b                Mef2b
## 94  ENSMUSG00000079033              Mef2b              Gm45929
## 95  ENSMUSG00000079036             Alkbh1               Alkbh1
## 96  ENSMUSG00000079036             Alkbh1                  Nrp
## 97  ENSMUSG00000087408              Cers1                 Gdf1
## 98  ENSMUSG00000087408              Cers1                Cers1
## 99  ENSMUSG00000089756             Zfp966               Zfp966
## 100 ENSMUSG00000089756             Zfp966               Zfp968
## 101 ENSMUSG00000091537               Tma7                 Tma7
## 102 ENSMUSG00000091537               Tma7              Tma7-ps
## 103 ENSMUSG00000091563             Gm8108               Gm8108
## 104 ENSMUSG00000091563             Gm8108               Gm3752
## 105 ENSMUSG00000091617             Gm3752               Gm3005
## 106 ENSMUSG00000091617             Gm3752               Gm3752
## 107 ENSMUSG00000092165             Gm5624               Gm5930
## 108 ENSMUSG00000092165             Gm5624              Gm21154
## 109 ENSMUSG00000092349             Smim40               Smim40
## 110 ENSMUSG00000092349             Smim40         LOC115489363
## 111 ENSMUSG00000092349             Smim40         LOC115489383
## 112 ENSMUSG00000093803            Ppp2r3d              Ppp2r3d
## 113 ENSMUSG00000093803            Ppp2r3d         LOC108167320
## 114 ENSMUSG00000093803            Ppp2r3d         LOC108167694
## 115 ENSMUSG00000093979             Gm2237               Gm2237
## 116 ENSMUSG00000093979             Gm2237               Gm3667
## 117 ENSMUSG00000093996           Fam205a3             Fam205a4
## 118 ENSMUSG00000093996           Fam205a3             Fam205a3
## 119 ENSMUSG00000095199             Zfp967               Zfp968
## 120 ENSMUSG00000095199             Zfp967               Zfp967
## 121 ENSMUSG00000095304             Plac9a               Plac9a
## 122 ENSMUSG00000095304             Plac9a               Gm9780
## 123 ENSMUSG00000095545             Zfp969               Zfp968
## 124 ENSMUSG00000095545             Zfp969               Gm4724
## 125 ENSMUSG00000096488            Gm10409               Gm3667
## 126 ENSMUSG00000096488            Gm10409               Gm3752
## 127 ENSMUSG00000111375              Btbd8        A830010M20Rik
## 128 ENSMUSG00000111375              Btbd8                Btbd8

Deduplicate using the `entrezgene_accession` and the `external_gene_name`

Collect the duplicated gene ids into a new object.

dups <- annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1)

Fix as many as possible by keeping those where the two gene symbols match.

fixedDuplicates <- dups %>% 
    select(-n) %>% 
    filter(entrezgene_accession==external_gene_name)

Check that this has no duplicates.

fixedDuplicates %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1)

##  [1] ensembl_gene_id      entrezgene_id        external_gene_name  
##  [4] description          gene_biotype         chromosome_name     
##  [7] start_position       end_position         strand              
## [10] entrezgene_accession n                   
## <0 rows> (or 0-length row.names)
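To make the matching rule concrete, here is a minimal sketch on toy data. The gene labels `geneA` and `geneB` are invented; the symbols are copied from the listing above purely as example values.

```r
library(dplyr)

toyDups <- data.frame(
    ensembl_gene_id      = c("geneA", "geneA", "geneB", "geneB"),
    external_gene_name   = c("Adora3", "Adora3", "Gm4631", "Gm4631"),
    entrezgene_accession = c("Adora3", "Tmigd3", "Gm14430", "LOC102633156")
)

# Keep only rows where the Entrez symbol agrees with the Ensembl symbol
kept <- toyDups %>%
    filter(entrezgene_accession == external_gene_name)

# IDs with no matching row are left for a separate decision
unresolved <- setdiff(toyDups$ensembl_gene_id, kept$ensembl_gene_id)
```

`geneA` is resolved to its single matching row, while `geneB` has no matching symbol and stays unresolved. Rows with a missing `entrezgene_accession` would also be dropped, since `filter()` discards rows where the comparison is `NA`.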

Create a new annotation with all the unique entries from the full annotation plus the fixedDuplicates.

annot2 <- annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n==1) %>% 
    select(-n) %>% 
    bind_rows(fixedDuplicates)

nrow(annot2)

## [1] 20087

length(unique(annot$ensembl_gene_id))

## [1] 20091

A pragmatic solution for the remainder

Four Ensembl IDs remain unresolved (20091 unique IDs against 20087 rows in annot2).

dups %>% 
    filter(!ensembl_gene_id%in%annot2$ensembl_gene_id)

##      ensembl_gene_id entrezgene_id external_gene_name
## 1 ENSMUSG00000078899        627914             Gm4631
## 2 ENSMUSG00000078899     102633156             Gm4631
## 3 ENSMUSG00000092165        546250             Gm5624
## 4 ENSMUSG00000092165     100861708             Gm5624
## 5 ENSMUSG00000095545     100043914             Zfp969
## 6 ENSMUSG00000095545     100043915             Zfp969
## 7 ENSMUSG00000096488     100042100            Gm10409
## 8 ENSMUSG00000096488     115488284            Gm10409
##                                                   description   gene_biotype
## 1     predicted gene 4631 [Source:MGI Symbol;Acc:MGI:3782813] protein_coding
## 2     predicted gene 4631 [Source:MGI Symbol;Acc:MGI:3782813] protein_coding
## 3     predicted gene 5624 [Source:MGI Symbol;Acc:MGI:3646247] protein_coding
## 4     predicted gene 5624 [Source:MGI Symbol;Acc:MGI:3646247] protein_coding
## 5 zinc finger protein 969 [Source:MGI Symbol;Acc:MGI:3782422] protein_coding
## 6 zinc finger protein 969 [Source:MGI Symbol;Acc:MGI:3782422] protein_coding
## 7    predicted gene 10409 [Source:MGI Symbol;Acc:MGI:3710610] protein_coding
## 8    predicted gene 10409 [Source:MGI Symbol;Acc:MGI:3710610] protein_coding
##   chromosome_name start_position end_position strand entrezgene_accession n
## 1               2      175321955    175338197     -1              Gm14430 2
## 2               2      175321955    175338197     -1         LOC102633156 2
## 3              14       44556795     44627938     -1               Gm5930 2
## 4              14       44556795     44627938     -1              Gm21154 2
## 5               2      175692223    175703646     -1               Zfp968 2
## 6               2      175692223    175703646     -1               Gm4724 2
## 7              14        3412614      3433800      1               Gm3667 2
## 8              14        3412614      3433800      1               Gm3752 2

We could spend time looking at the databases, but it’s really not important for the course, so we’ll just make an arbitrary decision to keep the first entry of each.

fixedDuplicates <- dups %>% 
    filter(!ensembl_gene_id%in%annot2$ensembl_gene_id) %>% 
    distinct(ensembl_gene_id, .keep_all = TRUE) %>%
    select(-n)

annotUn <- bind_rows(annot2, fixedDuplicates)
nrow(annotUn)

## [1] 20091

length(unique(annot$ensembl_gene_id))

## [1] 20091

all(filterValues%in%annotUn$ensembl_gene_id)

## [1] TRUE
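The arbitrary choice above relies on `distinct()` keeping the first row it meets for each ID. A minimal sketch on toy data (gene labels invented; the Entrez IDs are copied from the listing above purely as example values):

```r
library(dplyr)

toy <- data.frame(
    ensembl_gene_id = c("geneA", "geneA", "geneB", "geneB"),
    entrezgene_id   = c(627914, 102633156, 546250, 100861708)
)

# .keep_all = TRUE retains the other columns of the first row per ID
firstOnly <- toy %>%
    distinct(ensembl_gene_id, .keep_all = TRUE)
```

The result therefore depends on row order: sorting the table differently beforehand would keep a different Entrez ID.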

Check for duplicated Entrez IDs

annotUn %>% 
    filter(!is.na(entrezgene_id)) %>% 
    add_count(entrezgene_id) %>% 
    filter(n>1) %>% 
    count(entrezgene_id)

##    entrezgene_id n
## 1          11877 2
## 2          14356 2
## 3          15953 2
## 4          18752 2
## 5          18861 2
## 6          19243 2
## 7          26912 2
## 8          27366 2
## 9          50518 2
## 10         66461 2
## 11         66990 2
## 12         67118 2
## 13         67149 2
## 14         67238 2
## 15         68646 2
## 16         70579 2
## 17         74477 2
## 18         78797 2
## 19         94089 2
## 20         99889 2
## 21        102115 2
## 22        207728 2
## 23        218734 2
## 24        280287 2
## 25        637515 3
## 26        664987 2
## 27        677884 3
## 28     100039796 2
## 29     100041194 2
## 30     100042100 3
## 31     100043914 2

There are 31 Entrez IDs that match multiple Ensembl IDs. Resolving which of these is correct would require more extensive research in the databases. We need a pragmatic solution for the course. Let’s have a look at the significance of these genes.

A pragmatic solution for the Entrez IDs

as.data.frame(results.interaction.11) %>% 
    rownames_to_column("ensembl_gene_id") %>% 
    left_join(annotUn, by = "ensembl_gene_id") %>% 
    filter(!is.na(entrezgene_id)) %>% 
    add_count(entrezgene_id) %>% 
    filter(n>1) %>% 
    select(ensembl_gene_id, entrezgene_id, padj, chromosome_name) %>% 
    arrange(padj)

##       ensembl_gene_id entrezgene_id         padj          chromosome_name
## 1  ENSMUSG00000115886     100039796 2.188768e-77 CHR_WSB_EIJ_MMCHR11_CTG1
## 2  ENSMUSG00000078920         15953 3.437499e-60                       11
## 3  ENSMUSG00000078921     100039796 2.667329e-34                       11
## 4  ENSMUSG00000096488     100042100 5.880162e-02                       14
## 5  ENSMUSG00000022253         68646 9.627923e-02                       15
## 6  ENSMUSG00000072812     100041194 9.939717e-02                       12
## 7  ENSMUSG00000116158        280287 1.922501e-01                        1
## 8  ENSMUSG00000070390        637515 1.962265e-01                       11
## 9  ENSMUSG00000089945        677884 2.536130e-01                        4
## 10 ENSMUSG00000116275         70579 3.226737e-01                        1
## 11 ENSMUSG00000110195        207728 3.388073e-01                        7
## 12 ENSMUSG00000118332         67238 3.523385e-01                        5
## 13 ENSMUSG00000074513         99889 3.679077e-01                        3
## 14 ENSMUSG00000111527         68646 3.795302e-01         CHR_MG4288_PATCH
## 15 ENSMUSG00000078532         67149 3.894284e-01                        4
## 16 ENSMUSG00000096793     100042100 4.522771e-01                       14
## 17 ENSMUSG00000102976         70579 5.774588e-01                        1
## 18 ENSMUSG00000116408         94089 6.618999e-01 CHR_WSB_EIJ_MMCHR11_CTG1
## 19 ENSMUSG00000063235         66461 6.767142e-01                        2
## 20 ENSMUSG00000083012         67238 7.066279e-01                        5
## 21 ENSMUSG00000000325         11877 7.075947e-01                       16
## 22 ENSMUSG00000022684         67118 7.175289e-01                       16
## 23 ENSMUSG00000106964         67149 7.280606e-01         CHR_MG4266_PATCH
## 24 ENSMUSG00000078816         18752 7.361723e-01                        7
## 25 ENSMUSG00000079737         67118 7.361723e-01                       16
## 26 ENSMUSG00000107877         74477 7.381156e-01                       11
## 27 ENSMUSG00000114378        218734 7.461406e-01                       14
## 28 ENSMUSG00000117310         19243 7.567281e-01                        1
## 29 ENSMUSG00000075569         18861 7.639096e-01                        5
## 30 ENSMUSG00000079109         18861 7.680020e-01                        5
## 31 ENSMUSG00000024571         27366 7.924897e-01                       18
## 32 ENSMUSG00000116254        637515 8.323231e-01 CHR_CAST_EI_MMCHR11_CTG4
## 33 ENSMUSG00000084897         50518 8.461295e-01                        2
## 34 ENSMUSG00000089847         14356 8.554819e-01                        7
## 35 ENSMUSG00000078905        664987 8.595182e-01                        2
## 36 ENSMUSG00000115074         78797 8.648626e-01                        2
## 37 ENSMUSG00000057130         27366 8.678565e-01                       18
## 38 ENSMUSG00000110860         66461 8.826651e-01          CHR_MG191_PATCH
## 39 ENSMUSG00000114004        102115 8.858318e-01                       10
## 40 ENSMUSG00000006378         26912 8.969835e-01                       15
## 41 ENSMUSG00000006471         78797 8.969835e-01                        2
## 42 ENSMUSG00000026064         19243 9.024507e-01                        1
## 43 ENSMUSG00000097394         66990 9.087617e-01          CHR_MG153_PATCH
## 44 ENSMUSG00000090053        677884 9.161891e-01                        4
## 45 ENSMUSG00000078898     100043914 9.177512e-01                        2
## 46 ENSMUSG00000090691     100042100 9.327098e-01                       14
## 47 ENSMUSG00000033111        218734 9.334572e-01                       14
## 48 ENSMUSG00000097449         18752 9.357120e-01         CHR_MG4151_PATCH
## 49 ENSMUSG00000040350         94089 9.519887e-01                       11
## 50 ENSMUSG00000116378         26912 9.537884e-01                       15
## 51 ENSMUSG00000110234         14356 9.544490e-01                        7
## 52 ENSMUSG00000102805         99889 9.688617e-01                        3
## 53 ENSMUSG00000078440        102115 9.729195e-01                       10
## 54 ENSMUSG00000020807         74477 9.754994e-01                       11
## 55 ENSMUSG00000098615         11877 9.792462e-01  CHR_MG3833_MG4220_PATCH
## 56 ENSMUSG00000024845         66990 9.818681e-01                       19
## 57 ENSMUSG00000030653        207728 9.829405e-01                        7
## 58 ENSMUSG00000090093        664987 9.883529e-01                        2
## 59 ENSMUSG00000038729        677884 9.908806e-01                        4
## 60 ENSMUSG00000115958        280287 9.960817e-01                        1
## 61 ENSMUSG00000027596         50518           NA                        2
## 62 ENSMUSG00000095545     100043914           NA                        2
## 63 ENSMUSG00000116073        637515           NA CHR_PWK_PHJ_MMCHR11_CTG2
## 64 ENSMUSG00000116330         15953           NA CHR_WSB_EIJ_MMCHR11_CTG1
## 65 ENSMUSG00000116477     100041194           NA         CHR_MG3490_PATCH

These genes are mostly non-significant, so the choice will have little effect on the results of the downstream analyses. Seeing as this is only a teaching exercise, we’ll arbitrarily keep the Entrez ID for the first entry of each set and set it to NA for the others. Some of the duplicates are on patch scaffolds; we’ll arrange by chromosome name so that, where possible, it is the scaffold entries that get set to NA.

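# Rows whose Entrez ID occurs more than once (rows with a missing Entrez ID
# are also grouped together here, which is harmless as they are already NA)
dupEntrez <- annotUn %>% 
    add_count(entrezgene_id) %>% 
    filter(n>1) %>% 
    select(-n) %>% 
    arrange(entrezgene_id, chromosome_name)
# Keep the Entrez ID on the first row of each set and blank the rest
dupEntrez$entrezgene_id[duplicated(dupEntrez$entrezgene_id)] <- NA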

annotFinal <- annotUn %>% 
    add_count(entrezgene_id) %>% 
    filter(n==1) %>% 
    select(-n) %>% 
    bind_rows(dupEntrez)
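The sort-then-blank step can be sketched on toy data. The gene labels are invented; the Entrez ID and chromosome names are copied from the table above purely as example values.

```r
library(dplyr)

toy <- data.frame(
    ensembl_gene_id = c("geneA", "geneB", "geneC"),
    entrezgene_id   = c(637515, 637515, 637515),
    chromosome_name = c("CHR_CAST_EI_MMCHR11_CTG4", "11",
                        "CHR_PWK_PHJ_MMCHR11_CTG2")
)

# Character sorting puts "11" before the "CHR_..." scaffold names
toySorted <- toy %>%
    arrange(entrezgene_id, chromosome_name)

# duplicated() flags every occurrence after the first
toySorted$entrezgene_id[duplicated(toySorted$entrezgene_id)] <- NA
```

Only `geneB`, on chromosome 11, keeps its Entrez ID. The ordering works because names starting with a digit sort before "CHR_"; a gene on "X", "Y" or "MT" would sort after a scaffold name, so this is a convenience rather than a general rule.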

Final checks

dim(annotFinal)

## [1] 20091    10

annotFinal %>% 
    filter(!is.na(entrezgene_id)) %>% 
    add_count(entrezgene_id) %>% 
    filter(n>1)

##  [1] ensembl_gene_id      entrezgene_id        external_gene_name  
##  [4] description          gene_biotype         chromosome_name     
##  [7] start_position       end_position         strand              
## [10] entrezgene_accession n                   
## <0 rows> (or 0-length row.names)

all(filterValues%in%annotFinal$ensembl_gene_id)

## [1] TRUE
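If these checks were to be run unattended, they could be turned into assertions that stop the script on failure. A minimal sketch in base R, assuming a hypothetical helper `assertUnique()` and an invented toy table:

```r
# Hypothetical helper: stop if a vector holds repeated non-missing values
assertUnique <- function(x, label) {
    x <- x[!is.na(x)]
    if (anyDuplicated(x) > 0) {
        stop(label, " contains duplicated values")
    }
    invisible(TRUE)
}

toyFinal <- data.frame(
    ensembl_gene_id = c("geneA", "geneB", "geneC"),
    entrezgene_id   = c(11877, NA, NA)
)

okEnsembl <- assertUnique(toyFinal$ensembl_gene_id, "ensembl_gene_id")
okEntrez  <- assertUnique(toyFinal$entrezgene_id, "entrezgene_id")

# A vector with a repeated value triggers the error
failed <- inherits(try(assertUnique(c(1, 1), "x"), silent = TRUE), "try-error")
```

Missing values are ignored on purpose, since several genes are expected to have no Entrez ID.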

Final table

ensemblAnnot <- rownames(results.interaction.11) %>%  
    enframe(name = NULL, value = "ensembl_gene_id")  %>%  
    left_join(annotFinal, by = "ensembl_gene_id") %>%
    dplyr::select(GeneID="ensembl_gene_id", Entrez="entrezgene_id",
                  Symbol="external_gene_name", Description="description",
                  Biotype="gene_biotype", Chr="chromosome_name",
                  Start="start_position", End="end_position",
                  Strand="strand")

saveRDS(ensemblAnnot, file="RObjects/Ensembl_annotations.rds")

RNA-seq analysis in R

Making the annotation table for the course

Last modified: 19 Mar 2021
