PLEASE DO NOT RUN THE CODE IN THIS DOCUMENT ON THE COURSE MACHINES
Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. This tutorial explains how to download the raw data files from the NCBI Sequence Read Archive public repository, how to QC the reads with FASTQC and finally how to align the reads to the reference genome.
The data for this course comes from a Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (Fu et al. 2015). Both the raw data (sequence reads) can be downloaded from SRA under SRP045534 and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE60450.
Raw reads from sequencing experiments tend to be distributed through the Sequence Read Archive SRA). SRA provide command line tools for downloading and processing the archive files as the SRA toolkit.
Alternatively the (SRAdb)[http://bioconductor.org/packages/release/bioc/html/SRAdb.html] Bioconductor package can be used to query and download files that are hosted in SRA from within R.
We will download the data using the SRA toolkit in the Terminal.
You will need to select the correct version from the website above for your operating system, in this case we are on a CentOS Linux machine.
# download the gzip file
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.1-1/sratoolkit.2.9.1-1-centos_linux64.tar.gz
# unzip the file
tar -xzvf sratoolkit.2.9.1-1-centos_linux64.tar.gz
# add the 'bin' directory to the PATH - note the you will need to do this
# everytime you start a new terminal and wish to use the toolkit
export PATH=$PWD/sratoolkit.2.9.1-1-centos_linux64/bin/:${PATH}
# create a directory to which to download the sra files
mkdir sraDir
# use the vdb-config tool to set the download directory
vdb-config -i # this pops up an interactive window instructions below
Use the vdb-config window to set the Default Import Path to the new sra
directory we just created. Use tab to navigate to Change
under the Set Default Import Path
(the highlighting indicates the active field the arrow keys and tab to navigate to the correct directory. When you have changed the directory Save
and Exit
We can now directly download each sra
file. The sra
file is SRA's own archive format, but we can extract the raw reads in the more common .fastq
format in the next step.
To download the sra file we need their accessions numbers. Go to the SRA Run Selector and enter the project number SRP045534. This will give us the 8 SRR
run numbers that we need to download.
Use the prefetch
tool from the sra toolkit to download each file. This will take
prefetch SRR1552444
prefetch SRR1552445
prefetch SRR1552446
prefetch SRR1552447
prefetch SRR1552448
prefetch SRR1552449
prefetch SRR1552450
prefetch SRR1552451
prefetch SRR1552452
prefetch SRR1552453
prefetch SRR1552454
prefetch SRR1552455
Using the SRA Toolkit command-line utility from NCBI we can generate the fastq
files from these archive files. We can do this within a Terminal (i.e. not within RStudio) with the following, making sure your working directory contains the .sra
files.
mkdir fastq
for sraFile in sraDir/sra/*.sra; do
echo "Extracting fastq from "${sraFile}
fastq-dump \
--origfmt \
--gzip \
--outdir fastq \
${sraFile}
done
After each fastq file has been extracted, you should see a message to report have many reads (spots) are contained in the file. Note that this process may take several hours.
Fu, Nai Yang, Anne C Rios, Bhupinder Pal, Rina Soetanto, Aaron T L Lun, Kevin Liu, Tamara Beck, et al. 2015. “EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival.” Nature Cell Biology 17 (4): 365–75. doi:10.1038/ncb3117.