The dataset for this practical is publicly available from the NCBI GEO repository under the accession GSE15780. It looks at the genome-wide binding of the tp53 and tp73 (TAp73beta isoform) transcription factors in the human osteosarcoma cell line Saos-2.
We downloaded the dataset (FASTQ files) from the Sequence Read Archive using the SRA Toolkit. There are multiple ways of doing this; see the following resources:
https://www.ncbi.nlm.nih.gov/sra
https://ncbi.github.io/sra-tools/install_config.html
https://bioconductor.org/packages/release/bioc/html/SRAdb.html
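As one possible route, the sketch below wraps the SRA Toolkit commands prefetch and fasterq-dump in a small shell function. The function name fetch_run and its arguments are hypothetical and not part of the course material. Note that GSE15780 is a GEO series accession, so the individual run accessions (SRR...) would first have to be looked up on the SRA page linked above.

```shell
# Sketch only: fetch one SRA run as gzipped FASTQ.
# fetch_run is a hypothetical helper; it assumes sra-tools is installed.
fetch_run() {
  acc=$1
  outdir=${2:-.}
  # Run accessions start with SRR, ERR or DRR followed by a digit;
  # a GEO series accession such as GSE15780 is rejected here.
  case "$acc" in
    [SED]RR[0-9]*) ;;
    *) echo "not a run accession: $acc" >&2; return 1 ;;
  esac
  command -v fasterq-dump >/dev/null 2>&1 || {
    echo "sra-tools not found on PATH" >&2
    return 127
  }
  prefetch "$acc" &&                      # download the .sra archive
    fasterq-dump "$acc" -O "$outdir" &&   # convert it to FASTQ in $outdir
    gzip "$outdir"/"$acc"*.fastq          # compress the resulting file(s)
}
```

The function is called with a run accession and an optional output directory. The accession check is deliberately loose and only guards against passing a GEO accession by mistake.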
Genomic alignments are time consuming and not realistic to run in the short time we have. To save time, we therefore preprocessed the above dataset and kept the reads from a single chromosome. This preprocessing step included aligning the reads to the GRCh38 genome with a sponge database (which removes artefacts and non-chromosomal sequences) and then regenerating the chr3 FASTQ files.
In this tutorial we will first generate a quality report for a single sample (tp53_r2) from this dataset using FastQC. Based on this report, we will decide what type of trimming is needed to improve the quality of our reads before alignment. We will then use Cutadapt to trim the reads, and finally we will generate a new quality report from the trimmed reads and compare it with the original one.
Let’s open a terminal window and change the directory to ~/Course_Materials/Introduction/SS_DB/RawData/ChIPseq/:
cd ~/Course_Materials/Introduction/SS_DB/RawData/ChIPseq/
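Before running any QC tool, it can help to look at the raw file directly. The two helper functions below are a minimal sketch (the names peek_fastq and count_reads are made up for this example). They assume a gzip-compressed FASTQ file in which every read occupies exactly four lines.

```shell
# Print the first N reads (default 2) of a gzipped FASTQ file.
peek_fastq() {
  gzip -cd "$1" | head -n $(( ${2:-2} * 4 ))
}

# Count reads, assuming exactly four lines per read.
count_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

For example, count_reads tp53_r2.fastq.gz prints the number of reads in the sample, and peek_fastq tp53_r2.fastq.gz 1 shows the header, sequence, separator and quality line of the first read.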
We will use FastQC to check the quality of our sequence reads. Here we will use the command line version of FastQC. To check what kind of options this tool has, type:
fastqc --help
This command will display in your terminal window all the parameters you can use when running FastQC. Now we will run a fairly simple command:
fastqc -o ~/Course_Materials/Introduction/SS_DB/QC/ --noextract -f fastq tp53_r2.fastq.gz
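The options we used:
-o ~/Course_Materials/Introduction/SS_DB/QC/: write all output files to this directory (the directory must already exist)
--noextract: do not uncompress the zipped output file after creating it
-f fastq: treat the input as FASTQ instead of detecting the format automatically
tp53_r2.fastq.gz: the input file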
Open the generated .html report (which you will find in the ~/Course_Materials/Introduction/SS_DB/QC/ folder) and go through each section carefully.
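Alongside the .html report, FastQC writes a .zip archive that contains a tab-separated summary.txt with one PASS/WARN/FAIL line per module. The sketch below lists only the modules that did not pass. The function name flag_modules is hypothetical, and the archive name used in the note after the code (tp53_r2_fastqc.zip) is an assumption based on FastQC's usual naming.

```shell
# List every FastQC module whose status is not PASS.
# Expects the tab-separated summary.txt: STATUS <tab> MODULE <tab> FILE
flag_modules() {
  awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}
```

Assuming the archive name above, the summary can be extracted with unzip -p ~/Course_Materials/Introduction/SS_DB/QC/tp53_r2_fastqc.zip tp53_r2_fastqc/summary.txt > summary.txt and then passed to flag_modules summary.txt. This gives a quick overview, but it does not replace reading the full report.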
Once you have had a closer look at the quality report, you will see that the data quality is not too bad. However, we might still be able to improve it with quality-based trimming, since the quality drops towards the end of the reads.
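The drop in quality along the read can also be approximated on the command line. The following sketch computes the mean Phred score at each read position. The function name mean_quality_by_position is invented for this example, and the code assumes Phred+33 encoded qualities and four lines per read.

```shell
# Mean Phred quality per read position for a gzipped FASTQ file.
# Assumes Phred+33 encoding (quality = ASCII code - 33).
mean_quality_by_position() {
  gzip -cd "$1" | awk '
    BEGIN {
      # Lookup table from printable character to ASCII code.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                       # every fourth line is a quality line
      n = length($0)
      for (p = 1; p <= n; p++) {
        sum[p] += ord[substr($0, p, 1)] - 33
        cnt[p]++
      }
      if (n > maxlen) maxlen = n
    }
    END {
      for (p = 1; p <= maxlen; p++) printf "%d\t%.1f\n", p, sum[p] / cnt[p]
    }'
}
```

Running mean_quality_by_position tp53_r2.fastq.gz prints one line per position. Falling values in the second column towards the last positions correspond to the trend seen in the FastQC per-base quality plot.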
We will use Cutadapt for trimming, so let’s have a look at its help page:
cutadapt --help
As you can see Cutadapt has many options for:
In our case all we want to do is to remove low quality bases from our reads. We can use the following command to do this:
cutadapt -m 10 -q 20 -o tp53_r2.fastq_trimmed.fastq.gz tp53_r2.fastq.gz
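Let’s go through the parameters we are using in the command above:
-m 10: discard reads that are shorter than 10 bases after trimming
-q 20: trim low-quality bases from the 3' end of each read, using a quality cutoff of 20
-o tp53_r2.fastq_trimmed.fastq.gz: write the trimmed reads to this file (the output is compressed because the name ends in .gz)
tp53_r2.fastq.gz: the input file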
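To make the effect of the quality cutoff more concrete, here is a small sketch of the running-sum approach that the Cutadapt documentation describes for 3' quality trimming. It is not Cutadapt's own code. It works on a plain list of numeric Phred scores instead of a FASTQ file, and the function name quality_trim_len is hypothetical.

```shell
# Print how many bases are kept after 3' quality trimming.
# Usage: quality_trim_len CUTOFF Q1 Q2 ... Qn
quality_trim_len() {
  cutoff=$1
  shift
  printf '%s\n' "$*" | awk -v cutoff="$cutoff" '{
    keep = NF; s = 0; best = 0
    # Walk from the 3'"'"' end, summing (cutoff - quality).
    for (i = NF; i >= 1; i--) {
      s += cutoff - $i
      if (s < 0) break                  # good bases outweigh the bad ones: stop
      if (s > best) { best = s; keep = i - 1 }   # best cut point so far
    }
    print keep
  }'
}

# Ten bases with falling quality, cutoff 10.
quality_trim_len 10 42 40 26 27 8 7 11 4 2 3   # prints 4
```

In this example the score 11 at position 7 is above the cutoff, but the read is still cut after position 4, because the surrounding low-quality bases outweigh it in the running sum. A read whose qualities are all above the cutoff is left at full length.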
Once the trimming has finished, we will check the quality of the trimmed reads to make sure we are happy with the results: the trimming should have improved the quality without introducing new artefacts. So let’s run FastQC again on our trimmed .fastq file with the following command:
fastqc -o ~/Course_Materials/Introduction/SS_DB/QC/ --noextract -f fastq tp53_r2.fastq_trimmed.fastq.gz
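In addition to the FastQC report, a quick command-line check can show how trimming changed the read count and the read lengths. The helper below is a sketch with an invented name (fastq_length_summary); it assumes a gzip-compressed FASTQ file with four lines per read.

```shell
# Print the read count and the min/max/mean read length of a gzipped FASTQ file.
fastq_length_summary() {
  gzip -cd "$1" | awk '
    NR % 4 == 2 {                       # second line of each record is the sequence
      n = length($0); total += n; reads++
      if (reads == 1 || n < min) min = n
      if (n > max) max = n
    }
    END {
      if (reads == 0) { print "reads=0"; exit }
      printf "reads=%d min=%d max=%d mean=%.1f\n", reads, min, max, total / reads
    }'
}
```

Running it on tp53_r2.fastq.gz and on tp53_r2.fastq_trimmed.fastq.gz gives two lines that can be compared directly. Because of -m 10, the minimum length in the trimmed file should be at least 10, and the read count can be lower than in the original file if some reads were discarded.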
Now open the generated report. As you can see after trimming: