RNA-seq analysis in R

Challenge 1

Use pathview to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01

logFC <- shrinkLvV %>% 
  drop_na(FDR, Entrez) %>% 
  filter(FDR < 0.01) %>% 
  select(Entrez, logFC) %>% 
  deframe()

pathview(gene.data = logFC, 
         pathway.id = "mmu04060", 
         species = "mmu", 
         limit = list(gene=5, cpd=1))

## Info: Downloading xml files for mmu04060, 1/1 pathways..

## Info: Downloading png files for mmu04060, 1/1 pathways..

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/cri.camres.org/sawle01/Documents/training/RNAseq_November_2020_remote/Course_Materials/solutions

## Info: Writing image file mmu04060.pathview.png

mmu04060.pathview.png:

mmu04060 -Cytokine-cytokine receptor interaction

Challenge 2 - GO term enrichment analysis

clusterProfiler can also perform over-representation analysis on GO terms. using the commmand enrichGO. Look at the help page for the command enrichGO (?enrichGO) and have a look at the instructions in the clusterProfiler book.

Run the over-representation analysis for GO terms

Use genes that have an adjusted p-value (FDR) of less than 0.01 and an absolute fold change greater than 2.

For this analysis you can use Ensembl IDs rather then Entrez

You’ll need to provide the background (universe) genes, this should be all the genes in our analysis.

The mouse database package is called org.Mm.eg.db. You’ll need to load it using library before running the analysis.

As we are using Ensembl IDs, you’ll need to set the keyType parameter in the enrichGO command to indicate this.

Only test terms in the “Molecular Function” ontology

Use the dotplot function to visualise the results.

library(org.Mm.eg.db)
sigGenes <- shrinkLvV %>% 
    drop_na(FDR) %>% 
    filter(FDR < 0.01 & abs(logFC) > 1) %>% 
    pull(GeneID)

universe <- shrinkLvV$GeneID

ego <- enrichGO(gene          = sigGenes, 
                universe      = universe,
                OrgDb         = org.Mm.eg.db,
                keyType       = "ENSEMBL",
                ont           = "MF",
                pvalueCutoff  = 0.01,
                readable      = TRUE)
dotplot(ego)

Challenge 3

Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.

Rank the genes by statisical significance - you will need to create a new ranking value using -log10({p value}) * sign({Fold Change})

Load the “C2” pathways from the the Robjects/mouse_c2_v5.RData file

Run fgsea using the new ranked genes and the C2 pathways

ranks <- -log10(gseaDat$pvalue) * sign(gseaDat$logFC) # or gseaDat$stat
names(ranks) <- gseaDat$Entrez  

load("Robjects/mouse_c2_v5.RData")

fgseaResC2 <- fgsea(Mm.c2, ranks, minSize=15, maxSize = 500)
head(fgseaResC2[order(padj, -abs(NES)), ], n=10)

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.08% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

## # A tibble: 10 x 8
##    pathway                   pval    padj log2err     ES   NES  size leadingEdge
##    <chr>                    <dbl>   <dbl>   <dbl>  <dbl> <dbl> <int> <list>     
##  1 REACTOME_INFLUENZA_…  1.00e-10 2.15e-8      NA -0.859 -2.70   128 <chr [80]> 
##  2 REACTOME_NONSENSE_M…  1.00e-10 2.15e-8      NA -0.887 -2.70   100 <chr [74]> 
##  3 REACTOME_3_UTR_MEDI…  1.00e-10 2.15e-8      NA -0.886 -2.70   100 <chr [68]> 
##  4 REACTOME_PEPTIDE_CH…  1.00e-10 2.15e-8      NA -0.908 -2.70    79 <chr [68]> 
##  5 REACTOME_INFLUENZA_…  1.00e-10 2.15e-8      NA -0.884 -2.69    95 <chr [69]> 
##  6 KEGG_RIBOSOME         1.00e-10 2.15e-8      NA -0.897 -2.66    81 <chr [66]> 
##  7 REACTOME_METABOLISM…  1.00e-10 2.15e-8      NA -0.774 -2.57   202 <chr [87]> 
##  8 REACTOME_TRANSLATION  1.00e-10 2.15e-8      NA -0.800 -2.56   141 <chr [71]> 
##  9 REACTOME_SRP_DEPEND…  1.00e-10 2.15e-8      NA -0.843 -2.55   103 <chr [67]> 
## 10 REACTOME_METABOLISM…  1.00e-10 2.15e-8      NA -0.755 -2.54   244 <chr [97]>

Run fgsea using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?

fgseaResH <- fgsea(Mm.H, ranks, minSize=15, maxSize = 500)
head(fgseaResH[order(padj, -abs(NES)), ], n=20)

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.08% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

## # A tibble: 20 x 8
##    pathway                  pval     padj log2err     ES   NES  size leadingEdge
##    <chr>                   <dbl>    <dbl>   <dbl>  <dbl> <dbl> <int> <list>     
##  1 HALLMARK_MYC_TARG…   1.00e-10  5.00e-9  NA     -0.683 -2.24   201 <chr [56]> 
##  2 HALLMARK_OXIDATIV…   5.12e- 9  1.28e-7   0.761  0.648  2.07   196 <chr [128]>
##  3 HALLMARK_E2F_TARG…   8.51e- 7  1.42e-5   0.659 -0.562 -1.85   201 <chr [49]> 
##  4 HALLMARK_TGF_BETA…   2.25e- 6  2.82e-5   0.627 -0.742 -2.05    54 <chr [15]> 
##  5 HALLMARK_ADIPOGEN…   5.70e- 6  5.70e-5   0.611  0.573  1.83   199 <chr [79]> 
##  6 HALLMARK_MYC_TARG…   1.05e- 5  8.76e-5   0.593 -0.701 -1.96    58 <chr [25]> 
##  7 HALLMARK_CHOLESTE…   3.10e- 5  2.21e-4   0.557  0.701  1.94    73 <chr [29]> 
##  8 HALLMARK_FATTY_AC…   4.63e- 5  2.57e-4   0.557  0.576  1.78   155 <chr [54]> 
##  9 HALLMARK_MTORC1_S…   4.52e- 5  2.57e-4   0.557  0.550  1.76   203 <chr [47]> 
## 10 HALLMARK_G2M_CHEC…   7.54e- 5  3.77e-4   0.538 -0.506 -1.66   201 <chr [29]> 
## 11 HALLMARK_ESTROGEN…   4.44e- 4  2.02e-3   0.498 -0.482 -1.57   193 <chr [54]> 
## 12 HALLMARK_PEROXISO…   9.93e- 4  4.14e-3   0.455  0.609  1.75    95 <chr [39]> 
## 13 HALLMARK_ALLOGRAF…   1.08e- 3  4.14e-3   0.455 -0.460 -1.52   202 <chr [41]> 
## 14 HALLMARK_WNT_BETA…   2.16e- 3  7.70e-3   0.432 -0.658 -1.73    40 <chr [18]> 
## 15 HALLMARK_NOTCH_SI…   2.48e- 3  7.76e-3   0.432 -0.696 -1.73    31 <chr [15]> 
## 16 HALLMARK_TNFA_SIG…   2.44e- 3  7.76e-3   0.432 -0.449 -1.47   199 <chr [60]> 
## 17 HALLMARK_DNA_REPA…   2.99e- 3  8.78e-3   0.432 -0.459 -1.46   150 <chr [27]> 
## 18 HALLMARK_BILE_ACI…   5.90e- 3  1.64e-2   0.407  0.548  1.57    98 <chr [37]> 
## 19 HALLMARK_ESTROGEN…   1.09e- 2  2.86e-2   0.381 -0.416 -1.36   197 <chr [39]> 
## 20 HALLMARK_PROTEIN_…   1.71e- 2  4.27e-2   0.352  0.499  1.44    96 <chr [56]>

RNA-seq analysis in R

Gene Set Testing for RNA-seq - Solutions

Stephane Ballereau, Dominique-Laurent Couturier, Mark Dunning, Abbi Edwards, Ashley Sawle

Last modified: 20 Nov 2020

Challenge 1

Challenge 2 - GO term enrichment analysis

Challenge 3