Analysing an RNAseq experiment begins with sequencing reads. This tutorial explains how to begin by downloading the raw data files from the NCBI Sequence Read Archive public repository.
The data for this course comes from a Frontiers in Microbiology paper, *Transcriptomic Profiling of Mouse Brain During Acute and Chronic Infections by Toxoplasma gondii Oocysts* (Hu et al. 2020). The raw data (sequence reads) can be downloaded from SRA under the bio-project number PRJNA483261.
Raw reads from sequencing experiments tend to be distributed through the Sequence Read Archive (SRA). The SRA provides command line tools for downloading and processing the archive files, known as the SRA toolkit.
Alternatively, the [SRAdb](http://bioconductor.org/packages/release/bioc/html/SRAdb.html) Bioconductor package can be used to query and download files that are hosted in SRA from within R.
We will download the data using the SRA toolkit in the Terminal.
You will need to select the correct version for your operating system from the SRA toolkit download page; in this case we are on a CentOS Linux machine. There are other versions for Windows and macOS, so be sure to download the correct one for your system.
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.9/sratoolkit.2.10.9-centos_linux64.tar.gz
tar -xzvf sratoolkit.2.10.9-centos_linux64.tar.gz
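If you are unsure which build matches your machine, the kernel name and CPU architecture reported by `uname` are a reasonable starting point. The sketch below is only illustrative: `toolkit_platform` is a hypothetical helper, and its labels are coarse platform names rather than the toolkit's actual archive names.

```shell
# Map a kernel name (as printed by `uname -s`) to a coarse platform label.
# The labels are illustrative; they are NOT the toolkit's archive names.
toolkit_platform() {
    case "$1" in
        Linux*)               echo "linux" ;;
        Darwin*)              echo "mac" ;;
        CYGWIN*|MINGW*|MSYS*) echo "windows" ;;
        *)                    echo "unknown" ;;
    esac
}

# Report the platform and CPU architecture of the current machine.
echo "Platform: $(toolkit_platform "$(uname -s)") ($(uname -m))"
```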
Add the bin directory to the PATH
The tools are located in the bin directory. Adding them to the PATH allows us to use them by name on the command line without having to provide the full path to the file every time.
NOTE: You will need to do this every time you start a new terminal and wish to use the toolkit, because export only affects the current shell session.
export PATH=$PWD/sratoolkit.2.10.9-centos_linux64/bin/:${PATH}
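Because the export has to be repeated in every new terminal, it can be convenient to wrap it in a small helper that does nothing when the directory is already on the PATH. This is a minimal sketch; `add_to_path` is a hypothetical name and is not part of the SRA toolkit.

```shell
# Prepend a directory to PATH unless it is already listed.
# add_to_path is a hypothetical helper, not part of the SRA toolkit.
add_to_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                          # already present, nothing to do
        *) PATH="$1:${PATH}"; export PATH ;;  # otherwise put it at the front
    esac
}

# Usage, from the directory where the archive was unpacked:
#   add_to_path "$PWD/sratoolkit.2.10.9-centos_linux64/bin"
#   command -v prefetch   # prints the tool's location if it is now found
```

If your login shell is bash, placing the export line (with an absolute path instead of $PWD) in ~/.bashrc is one way to avoid retyping it in each session.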
We want to direct the toolkit to download the data to a directory we specify.
mkdir sra
vdb-config -i
The last command will open an interactive window:
Use the vdb-config window to set the import path to the sra directory we just created. Once the sra directory is selected, press ‘Tab’ so that the red indicator moves to ‘OK’, then press ‘Enter’.
We can now directly download the sra files. The sra file is SRA’s own archive format, but we can extract the raw reads in the more common .fastq format in the next step.
To download the sra files we need their accession numbers. Go to the SRA Run Selector and enter the project number PRJNA483261.
From the Run Selector, download two files: “Metadata” –> SraRunTable.txt and “Accession List” –> SRR_Acc_List.txt
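Before starting the downloads, it can be worth a quick sanity check of the accession list. The sketch below assumes the list holds one run accession per line and that run accessions look like SRR followed by digits; `count_accessions` and `invalid_accessions` are hypothetical helper names.

```shell
# Count the non-empty lines in an accession list.
# Carriage returns (Windows line endings) are stripped first.
count_accessions() {
    tr -d '\r' < "$1" | grep -c .
}

# Print every non-empty line that is not "SRR" followed by digits only.
invalid_accessions() {
    tr -d '\r' < "$1" | grep -v '^$' | grep -v -E '^SRR[0-9]+$'
}

# Usage:
#   count_accessions SRR_Acc_List.txt     # number of runs to download
#   invalid_accessions SRR_Acc_List.txt   # prints nothing if every line looks valid
```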
Use the prefetch tool from the SRA toolkit to download each file.
# Loop over the run accessions, one prefetch call per accession
for srrAcc in $(cat SRR_Acc_List.txt); do
    prefetch "${srrAcc}"
done
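Downloads of many large files can fail part way through, so a slightly more defensive version of the loop may be useful. The following is a sketch rather than part of the toolkit: `download_accessions` and the default log name failed_accessions.txt are hypothetical, and the only toolkit call used is `prefetch` with a single accession, as above.

```shell
# Run prefetch for every accession in a list file and record the failures.
# download_accessions and failed_accessions.txt are hypothetical names.
download_accessions() {
    list_file="$1"
    failed_file="${2:-failed_accessions.txt}"
    : > "${failed_file}"                           # start with an empty failure log
    while IFS= read -r acc || [ -n "${acc}" ]; do  # also handles a missing final newline
        acc=$(printf '%s' "${acc}" | tr -d '\r')   # tolerate Windows line endings
        [ -z "${acc}" ] && continue                # skip blank lines
        # Detach stdin so the command cannot consume the rest of the list.
        if ! prefetch "${acc}" < /dev/null; then
            echo "${acc}" >> "${failed_file}"
        fi
    done < "${list_file}"
}

# Usage:
#   download_accessions SRR_Acc_List.txt
#   cat failed_accessions.txt   # accessions worth retrying, if any
```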
This will download the sra files into the sra directory. There will be one file for each SRR run number in the SRR_Acc_List.txt file; these correspond to samples.
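To confirm that every run in the list has a matching archive, the list can be compared against the files on disk. This sketch assumes each archive is named after its accession with a .sra extension; it searches recursively because the exact directory layout depends on how the toolkit is configured. `missing_sra_files` is a hypothetical helper name.

```shell
# Print the accessions that have no matching .sra file below a directory.
# missing_sra_files is a hypothetical helper, not part of the SRA toolkit.
missing_sra_files() {
    list_file="$1"
    search_dir="$2"
    while IFS= read -r acc || [ -n "${acc}" ]; do
        acc=$(printf '%s' "${acc}" | tr -d '\r')   # tolerate Windows line endings
        [ -z "${acc}" ] && continue                # skip blank lines
        # Recursive search, so nested download directories are also covered.
        if [ -z "$(find "${search_dir}" -type f -name "${acc}.sra" 2>/dev/null)" ]; then
            echo "${acc}"
        fi
    done < "${list_file}"
}

# Usage:
#   missing_sra_files SRR_Acc_List.txt sra   # prints nothing if all runs are present
```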
We can extract reads from these archive files to fastq format files using the fasterq-dump tool.
mkdir fastq
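# Adjust sraDir if your downloaded .sra files are stored somewhere else
for sraFile in sraDir/sra/*.sra; do
    # -O: output directory, -e: number of threads, --split-files: write reads into separate files
    fasterq-dump -O fastq -e 8 --split-files "${sraFile}"
done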
After each fastq file has been extracted, you should see a message reporting how many reads are contained in the file.
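The reported read counts can be cross-checked against the extracted files. The sketch below assumes uncompressed fastq files with exactly four lines per read; `count_fastq_reads` is a hypothetical helper name.

```shell
# Count the reads in an uncompressed fastq file.
# Assumes the standard layout of four lines per read.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Usage: print the read count for every extracted file.
#   for fq in fastq/*.fastq; do
#       echo "${fq}: $(count_fastq_reads "${fq}")"
#   done
```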