Introduction

In this section we will look at how to work with very large data sets that it would not be feasible to have loaded completely into memory. We will instead store the data in a relational database and retrieve just the data we need to display within the Shiny application.

Following on from the breast cancer expression data example, we will rework the Shiny app such that the large expression data matrix is stored in a SQLite database.

SQLite is a lightweight relational database management system that is very convenient to use when building and testing Shiny applications. If you are using a Mac or Linux machine you’re quite likely to already have this installed. Try typing which sqlite3 at the command line.

We will also be using the tidyverse package dplyr to filter and combine expression data and ER status for displaying as a boxplot and the piping mechanism using the %>% operator. If you are unfamiliar with these, take a look at the Data Manipulation and Visualization Using R course materials.

One of the most powerful features of dplyr is that it can work almost seamlessly with a database back end with minimal changes to code that performs filtering and combining (or join) operations. Other databases supported include MySQL and PostgreSQL.

When

Consider using a database when

  • Working with a very large data set
  • You only need to process/display a subset of the data at any one time

Why

  • Reduce memory footprint by only loading what is needed
  • Improved responsiveness

Convert data to tidy format

We first need to convert the NKI expression data from the familiar expresssion matrix format (wide format) to a tidy or narrow format that is more suitable for database queries and more generally for filtering or subsetting operations.

Wide format

probe NKI_4 NKI_6 NKI_7 NKI_8 NKI_9
Contig45645_RC -0.215 0.071 0.182 -0.343 -0.134
Contig44916_RC -0.207 0.055 0.077 0.302 0.051
D25272 -0.158 -0.010 0.059 0.169 -0.007
J00129 -0.819 -0.391 -0.624 -0.528 -0.811
Contig29982_RC -0.267 -0.310 -0.120 -0.447 -0.536

Narrow or tidy format

probe sample expression
Contig45645_RC NKI_4 -0.215
Contig44916_RC NKI_4 -0.207
D25272 NKI_4 -0.158
J00129 NKI_4 -0.819
Contig29982_RC NKI_4 -0.267
Contig45645_RC NKI_6 0.071
Contig44916_RC NKI_6 0.055
D25272 NKI_6 -0.010
J00129 NKI_6 -0.391
Contig29982_RC NKI_6 -0.310

Note that we’ll also be using ggplot2 to create our boxplot and that also requires the data to be in the narrow, or tidy, format.

We’ll also create data frames for the probe/gene mapping and sample ER status containing just the information we need for the Shiny application.

library(breastCancerNKI)
library(Biobase)
library(tidyverse)

data(nki)
erStatus <- pData(nki) %>%
  select(sample = samplename, er)

head(erStatus)
##        sample er
## NKI_4   NKI_4  1
## NKI_6   NKI_6  1
## NKI_7   NKI_7  0
## NKI_8   NKI_8  0
## NKI_9   NKI_9  1
## NKI_11 NKI_11  1
probes <- fData(nki) %>%
  select(probe, gene = HUGO.gene.symbol)

head(probes)
##                         probe   gene
## Contig45645_RC Contig45645_RC  GREM2
## Contig44916_RC Contig44916_RC  SUHW2
## D25272                 D25272   <NA>
## J00129                 J00129    FGB
## Contig29982_RC Contig29982_RC SCARA5
## Contig26811       Contig26811   <NA>
expressionData <- exprs(nki) %>%
  as.data.frame %>%
  rownames_to_column("probe") %>%
  gather(sample, expression, -probe) %>%
  filter(!is.na(expression))

head(expressionData)
##            probe sample expression
## 1 Contig45645_RC  NKI_4     -0.215
## 2 Contig44916_RC  NKI_4     -0.207
## 3         D25272  NKI_4     -0.158
## 4         J00129  NKI_4     -0.819
## 5 Contig29982_RC  NKI_4     -0.267
## 6    Contig26811  NKI_4      0.229

Create a SQLite database

We create a new SQLite database with the src_sqlite function. Then we use copy_to to create a database table for each of the data frames. Note that we create indexes on each of the columns that we might need to query or use when performing join operations to improve performance.

# create a new SQLite3 database
db <- src_sqlite("breastCancerNKI.sqlite", create = TRUE)

# create tables for each of the erStatus, probes and expressionData data frames
# with indexes for efficient searches/filtering
copy_to(db, erStatus, indexes = list("sample", "er"), temporary = FALSE, overwrite = TRUE)
copy_to(db, probes, indexes = list("probe", "gene"), temporary = FALSE, overwrite = TRUE)
copy_to(db, expressionData, indexes = list("probe", "sample"), temporary = FALSE, overwrite = TRUE)

Test retrieval from the database

First we create the database connection using src_sqlite as before. Then we use the tbl function to create a dplyr table for each of our database tables.

db <- src_sqlite("breastCancerNKI.sqlite")

erStatus <- tbl(db, "erStatus")
probes <- tbl(db, "probes")
expressionData <- tbl(db, "expressionData")

At this point no data have been loaded into memory. We can load the contents of an entire table using collect.

head(collect(erStatus))
## # A tibble: 6 x 2
##   sample    er
##    <chr> <int>
## 1  NKI_4     1
## 2  NKI_6     1
## 3  NKI_7     0
## 4  NKI_8     0
## 5  NKI_9     1
## 6 NKI_11     1

While this might be appropriate for smaller tables, we do not want to do this for our large expression data table. Instead we would use collect to bring the results of a database query, i.e. some subset of the data that we need to access, into memory.

The following shows how to retrieve the expression data and ER status for the PTEN gene.

probes %>%
  filter(Gene == "PTEN") %>%
  left_join(expressionData, by = "probe") %>%
  left_join(erStatus, by = "sample") %>%
  collect %>%
  tbl_df
## # A tibble: 668 x 5
##       probe  gene sample expression    er
##       <chr> <chr>  <chr>      <dbl> <int>
##  1 AF019083  PTEN  NKI_4     -0.062     1
##  2 AF019083  PTEN  NKI_6     -0.072     1
##  3 AF019083  PTEN  NKI_7      0.064     0
##  4 AF019083  PTEN  NKI_8      0.132     0
##  5 AF019083  PTEN  NKI_9      0.018     1
##  6 AF019083  PTEN NKI_11     -0.003     1
##  7 AF019083  PTEN NKI_12     -0.265     0
##  8 AF019083  PTEN NKI_13      0.046     1
##  9 AF019083  PTEN NKI_14     -0.028     1
## 10 AF019083  PTEN NKI_17     -0.105     1
## # ... with 658 more rows

The filter operation for selecting the probes for the PTEN gene and the left_join operations to retrieve the expression data and ER status for each sample are standard dplyr operations that produce the same result when applied to the original in-memory data frames.

You can use the show_query function to see the SQL query that this being run on the SQLite database.

probes %>%
  filter(Gene == "PTEN") %>%
  left_join(expressionData, by = "probe") %>%
  left_join(erStatus, by = "sample") %>%
  show_query
## <SQL>
## SELECT `TBL_LEFT`.`probe` AS `probe`, `TBL_LEFT`.`gene` AS `gene`, `TBL_LEFT`.`sample` AS `sample`, `TBL_LEFT`.`expression` AS `expression`, `TBL_RIGHT`.`er` AS `er`
##   FROM (SELECT `TBL_LEFT`.`probe` AS `probe`, `TBL_LEFT`.`gene` AS `gene`, `TBL_RIGHT`.`sample` AS `sample`, `TBL_RIGHT`.`expression` AS `expression`
##   FROM (SELECT *
## FROM `probes`
## WHERE (`Gene` = 'PTEN')) AS `TBL_LEFT`
##   LEFT JOIN `expressionData` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`probe` = `TBL_RIGHT`.`probe`)
## ) AS `TBL_LEFT`
##   LEFT JOIN `erStatus` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`sample` = `TBL_RIGHT`.`sample`)

Update the shiny application

library(shiny)
library(tidyverse)

db <- src_sqlite("breastCancerNKI.sqlite")
erStatus <- tbl(db, "erStatus")
probes <- tbl(db, "probes")
expressionData <- tbl(db, "expressionData")

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("thegene", "Gene", choices = c("ESR1", "ERBB2", "PTEN"), selected = "ESR1"),
      width = 3
    ),
    mainPanel(
      plotOutput("boxplot"),
      width = 9
    )
  )
)

server <- function(input, output) {
  output$boxplot <- renderPlot({
    probes %>%
      filter(Gene == input$thegene) %>%
      left_join(expressionData, by = "probe") %>%
      left_join(erStatus, by = "sample") %>%
      select(probe, expression, er) %>%
      collect %>%
      mutate(er = as.factor(ifelse(er == 1, "positive", "negative"))) %>%
      ggplot(aes(x = er, y = expression, fill = probe)) +
      geom_boxplot()
  })
}

shinyApp(ui = ui, server = server)
Shiny applications not supported in static R Markdown documents

Note that PTEN has two probes. In the previous part of the course we showed only the expression values for the first matching probe but here we display the data for all probes, taking advantage of our reformatted data and plotting with ggplot2.