1 Introduction
2 Downloading data from SRA
3 Checking read quality using FastQC
4 Quality based trimming using Cutadapt
5 Checking the quality of trimmed reads using FastQC

1 Introduction

The data-set for this practical is a publicly available dataset downloaded from the NCBI GEO repository with the accession: GSE15780. It looks at the genome wide binding of tp53 and tp73 (TAp73beta isoform) transcription factors in the human osteosarcoma cell line Saos-2.

Genomic alignments can be time consuming and not realistic to do in the short time we have on this course. Therefore, we downloaded and preprocessed a single chromosome from the above dataset to save time. This preprocessing step included aligning to a GRCh38 genome with a sponge database (which removes artefacts and non-chromosomal sequences) and then regenerating the chr3 fastq files.

In this tutorial first we will see an example on how to download fastq files, then we generate a quality report about a single sample, tp73_rep2 (that we preprocessed for you following the above mentioned method) using FastQC, based on this report we will decide what type of trimming is needed in order to improve the quality of our reads before alignment. We will use Cutadapt to trim the reads, and finally we will generate a new quality report from the trimmed reads and compare it with the original one.

2 Downloading data from SRA

There are multiple ways of downloading data from SRA (Sequence Read Archive).

Browse the online SRA dababase and download the data.

https://www.ncbi.nlm.nih.gov/sra

Use SRA toolkit. You need to install and configure this first. SRA toolkit is not installed on your computers, so DO NOT RUN the code below! For later reference, you can find the user manual and a simple command to download an sra file and convert it into fastq file below:

https://ncbi.github.io/sra-tools/

prefetch SRR098986
fastq-dump --gzip SRR098986.sra

Use the SRAdb Bioconductor package to download the fastq files.

https://bioconductor.org/packages/release/bioc/html/SRAdb.html

Now try this in R:

library(SRAdb)

setwd("~/Course_Materials/")
sqlfile <- 'SRAmetadb.sqlite'

## this line would download the sqlite database if it wasn't in the folder
if(!file.exists('SRAmetadb.sqlite')) sqlfile <<- getSRAdbFile()

## establish a connection to the database
sra_con <- dbConnect(SQLite(),sqlfile)

## download the SRR file
getSRAfile(in_acc = c("SRR098986"), sra_con = sra_con, fileType='fastq',
           destDir = "~/Course_Materials/Introduction/SS_DB/RawData/")

3 Checking read quality using FastQC

Let’s open a terminal window and change the directory to:

~/Course_Materials/Introduction/SS_DB/RawData/ChIPseq/:

cd ~/Course_Materials/Introduction/SS_DB/

We will use FastQC to check the quality of our sequence reads. Here we will use the command line version of FastQC. To check what kind of options this tool has, type:

fastqc --help

This command will display in your terminal window all the parameters you can use when running FastQC. Now we will run a fairly simple command:

fastqc -o ~/Course_Materials/Introduction/SS_DB/QC_Trimming/ --noextract -f fastq RawData/ChIPseq/tp53_r2.fastq.gz

The options we were using:

-o FOLDER_NAME: you can define this way in which folder you want FastQC to output its results,
–noextract: will tell FastQC not to uncompress the output file after creating it,
-f fastq: here we define that our input file is a fastq file (valid formats are bam,sam,bam_mapped,sam_mapped and fastq).

Open the generated .html report (that you find in ~/Course_Materials/Introduction/SS_DB/QC_Trimming/ folder) and go through each section carefully.

4 Quality based trimming using Cutadapt

Once you had a closer look at the quality report you can realize that the data quality is not toot bad, however we still might be able to improve the quality with a quality based trimming since the quality drops towards the end of the reads:

FastQC report of raw reads

We will use Cutadapt for trimming, so let’s have a look at its help page:

cutadapt --help

As you can see Cutadapt has many options for:

trimming based on quality threshold,
trimming some bases from the 5’ or 3’ ends of reads,
removing adapter contaminations.

In our case all we want to do is to remove low quality bases from our reads. We can use the following command to do this:

cutadapt -m 10 -q 20 -o QC_Trimming/tp53_r2.fastq_trimmed.fastq.gz RawData/ChIPseq/tp53_r2.fastq.gz

Let’s go through the parameters we are using in the command above:

-m 10: will discard all reads that would be shorter than a read length of 10 after the trimming,
-q 20: will trim low-quality bases from the 3’ end of the reads; if two comma-separated cutoffs are given, the 5’ end is trimmed with the first cutoff, the 3’ end with the second,
-o FILE_NAME: the output file name.

5 Checking the quality of trimmed reads using FastQC

Once the trimming has finished we will want to check the quality of our trimmed reads as well to make sure, we are happy with its results: the trimming improved the quality and it didn’t introduce new artefacts. So let’s run FastQC again on our trimmed .fastq file with the following command:

fastqc -o QC_Trimming/ --noextract -f fastq QC_Trimming/tp53_r2.fastq_trimmed.fastq.gz

Now open the generated report. As you can see after trimming:

the quality doesn’t drop that much at the end of the reads anymore,
we see that a new warning appeared under “Sequence Length Distribution” which was expected since our read length is not constant anymore,
if you check the “Basic Statistics” section you will see that we didn’t actually loose much data, the amount of sequences decreased, but only slightly.

Quality control and artefact removal

12 July, 2019

Contents

1 Introduction

2 Downloading data from SRA

3 Checking read quality using FastQC

4 Quality based trimming using Cutadapt

5 Checking the quality of trimmed reads using FastQC