This document gives a very brief introduction to read trimming. Read trimming may be desirable to remove adapter sequence or poor quality sequence from reads prior to analysis.
Whilst most aligners and the kmer quantification methods can cope with adapter contamination without trimming the reads, it can still be helpful to trim reads in order to get a better idea of the quality of the remaining sequence.
There are a number of tools that can be used for read trimming e.g.:
They offer varying ranges of clipping and trimming features, but for simple removal of adapter sequences they all perform similarly. The usage is different for each.
fastp in particular has an extensive set of options for trimming and processing reads in various ways.
In this example we will be using Trimmomatic (Bolger, Lohse, and Usadel 2014).
We have provided a toy data set which features adapter contamination: fastq/Test_adapter_contamination.fq.gz.
First run fastqc on the sample:
mkdir QC
fastqc -o QC fastq/Test_adapter_contamination.fq.gz
If you open the resulting FASTQC report you should see that a number of the plots show problems with the data. In particular the “Per sequence GC content” plot and the “Adapter Content” plot:
This shows that there is significant contamination with “Illumina Universal Adapter.”
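As a rough cross-check of the FastQC plot, you can count the reads that contain the start of the adapter directly. The sketch below assumes the motif AGATCGGAAGAGC, which is the prefix shared by the two adapter sequences that Trimmomatic reports in its log later in this document. The function name count_motif_reads and the three toy reads are made up for illustration.

```shell
# Count reads whose sequence line contains a given motif.
# Usage: count_motif_reads <fastq.gz> <motif>
count_motif_reads() {
    # In a FASTQ file the sequence is the 2nd line of every 4-line record.
    gzip -dc "$1" | awk -v motif="$2" \
        'NR % 4 == 2 && index($0, motif) { n++ } END { print n + 0 }'
}

# Toy input (invented for this example): three reads, two carrying the motif.
motif_demo_dir=$(mktemp -d)
printf '%s\n' \
    '@r1' 'ACGTAGATCGGAAGAGCAC' '+' 'IIIIIIIIIIIIIIIIIII' \
    '@r2' 'ACGTACGTACGT'        '+' 'IIIIIIIIIIII' \
    '@r3' 'AGATCGGAAGAGCGTCG'   '+' 'IIIIIIIIIIIIIIIII' \
    | gzip > "$motif_demo_dir/toy.fq.gz"

adapter_hits=$(count_motif_reads "$motif_demo_dir/toy.fq.gz" AGATCGGAAGAGC)
echo "Reads containing the motif: $adapter_hits"
```

On the real data the call would be count_motif_reads fastq/Test_adapter_contamination.fq.gz AGATCGGAAGAGC. An exact string match misses adapters that contain sequencing errors or are cut off at the end of the read, so treat the count as a lower bound.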
To trim the adapter we need to provide Trimmomatic with a fasta file containing the adapters we want to remove. For common Illumina adapters, these are provided in the Trimmomatic directory under adapters. Here we have single end data, so we will use the fasta file Trimmomatic-0.39/adapters/TruSeq3-SE.fa.
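There are various trimming steps that Trimmomatic can apply. We will only use two:
ILLUMINACLIP: cut adapter and other Illumina-specific sequences from the read.
MINLEN: drop the read if it is shorter than a specified length.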
Details of all the parameters can be found in the documentation on the Trimmomatic website.
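The ILLUMINACLIP step takes several colon-separated settings after the step name. As we read the Trimmomatic manual, these are the adapter fasta file, the seed mismatches, the palindrome clip threshold (only relevant for paired end data) and the simple clip threshold. The sketch below just splits a step string into those fields so that each value can be seen next to its name. The variable names are our own labels, not Trimmomatic options.

```shell
# Split an ILLUMINACLIP step string into its colon-separated settings.
# Variable names are illustrative labels, following our reading of the manual.
step='ILLUMINACLIP:trimmomatic/adapters/TruSeq3-SE.fa:2:30:7'

IFS=: read -r step_name adapter_fasta seed_mismatches palindrome_threshold simple_threshold <<EOF
$step
EOF

echo "step:                     $step_name"
echo "adapter fasta:            $adapter_fasta"
echo "seed mismatches:          $seed_mismatches"
echo "palindrome clip threshold: $palindrome_threshold"
echo "simple clip threshold:    $simple_threshold"
```

For single end data such as ours, only the seed mismatches and the simple clip threshold should affect the result. Lowering the simple clip threshold makes adapter matching more permissive.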
The command we need to use is:
java -jar trimmomatic/trimmomatic-0.39.jar \
SE \
-phred33 \
-trimlog fastq/trimlog.txt \
fastq/Test_adapter_contamination.fq.gz \
fastq/Test_adapter_contamination.trimmed.fastq \
ILLUMINACLIP:trimmomatic/adapters/TruSeq3-SE.fa:2:30:7 \
MINLEN:15
You should see the following message:
TrimmomaticSE: Started with arguments:
-phred33 -trimlog fastq/trimlog.txt fastq/Test_adapter_contamination.fq.gz fastq/Test_adapter_contamination.trimmed.fastq ILLUMINACLIP:trimmomatic/adapters/TruSeq3-SE.fa:2:30:7 MINLEN:15
Automatically using 4 threads
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Reads: 31469 Surviving: 26069 (82.84%) Dropped: 5400 (17.16%)
TrimmomaticSE: Completed successfully
Trimmomatic has run successfully. ~17% of reads have been discarded because, after trimming, they are < 15 nucleotides in length.
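If you trim many samples it can be handy to pull these numbers out of the message automatically. The following is a minimal sketch that assumes you have the summary line in a shell variable (here it is copied from the message above). It recomputes the percentages from the read counts using awk.

```shell
# Recompute the surviving/dropped percentages from a Trimmomatic summary line.
summary_line='Input Reads: 31469 Surviving: 26069 (82.84%) Dropped: 5400 (17.16%)'

survival_stats=$(printf '%s\n' "$summary_line" | awk '{
    input = $3; surviving = $5; dropped = $8   # whitespace-separated fields
    printf "surviving=%.2f%% dropped=%.2f%%", 100 * surviving / input, 100 * dropped / input
}')

echo "$survival_stats"
```

The recomputed values agree with the percentages that Trimmomatic prints in brackets.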
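The file fastq/trimlog.txt contains a log of what has happened to each and every read. It contains 5 columns:
the read name
the surviving sequence length
the location of the first surviving base, i.e. the amount trimmed from the start
the location of the last surviving base in the original read
the amount trimmed from the end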
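The trim log can be summarised with awk. This is a sketch under two assumptions: that the last four whitespace-separated fields on each line are the numeric columns (read names may themselves contain spaces, so we count fields from the end), and that a discarded read is logged with a surviving length of 0. The function name summarise_trimlog and the toy log lines are invented for illustration.

```shell
# Summarise a Trimmomatic-style trim log: reads dropped, clipped or left untouched.
# Usage: summarise_trimlog <trimlog.txt>
summarise_trimlog() {
    awk '{
        surviving = $(NF - 3)        # surviving sequence length
        trimmed = $(NF - 2) + $NF    # bases removed from the start plus the end
        if (surviving == 0)   dropped++
        else if (trimmed > 0) clipped++
        else                  untouched++
    }
    END { printf "dropped=%d clipped=%d untouched=%d\n", dropped, clipped, untouched }' "$1"
}

# Toy log (made up): one untouched read, one clipped read, one dropped read.
trimlog_demo_dir=$(mktemp -d)
printf '%s\n' \
    'read1 1:N:0 50 0 50 0' \
    'read2 1:N:0 30 0 30 20' \
    'read3 1:N:0 0 0 0 0' > "$trimlog_demo_dir/trimlog.txt"

trimlog_summary=$(summarise_trimlog "$trimlog_demo_dir/trimlog.txt")
echo "$trimlog_summary"
```

On the real log the call would be summarise_trimlog fastq/trimlog.txt, and the dropped count should match the number reported in the Trimmomatic message.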
For full size fastq files, these log files will be very large. We recommend that if you wish to keep them, you should compress them with zip or gzip.
Also, note that the fastq files that Trimmomatic outputs should also be compressed to save disk space.
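Compression with gzip can be sketched as follows. The example works on a small stand-in file in a temporary directory so that it can be run safely. The file contents are made up.

```shell
# Compress a log file with gzip and check that the archive is intact.
gz_demo_dir=$(mktemp -d)
printf '%s\n' 'read1 50 0 50 0' 'read2 0 0 0 0' > "$gz_demo_dir/trimlog.txt"

gzip "$gz_demo_dir/trimlog.txt"               # replaces the file with trimlog.txt.gz
gzip -t "$gz_demo_dir/trimlog.txt.gz" && gz_status=ok   # -t tests archive integrity

# The contents can still be read without unpacking to disk.
gz_lines=$(gzip -dc "$gz_demo_dir/trimlog.txt.gz" | wc -l)
echo "integrity: $gz_status, lines: $gz_lines"
```

For the files in this example the equivalent would be gzip fastq/trimlog.txt fastq/Test_adapter_contamination.trimmed.fastq. As far as we are aware, Trimmomatic will also write compressed output directly if the output file name ends in .gz, which avoids the separate step for the fastq file.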
Run FASTQC on the new trimmed reads.
fastqc -o QC fastq/Test_adapter_contamination.trimmed.fastq
You should now see that the Illumina Universal Adapter has been removed:
If you look at the “Overrepresented sequences” table, you may observe that there are other contaminants that remain. You may wish to modify the adapter fasta file to include these so that they are also removed.
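Extending the adapter file amounts to adding another fasta record to a copy of it. The sketch below uses a stand-in for TruSeq3-SE.fa built from the two adapter sequences shown in the Trimmomatic message above. The record names and the extra contaminant sequence are placeholders, so substitute the sequence reported in your own “Overrepresented sequences” table.

```shell
# Build a custom adapter fasta by appending a record to a copy of the original.
custom_dir=$(mktemp -d)

# Stand-in for trimmomatic/adapters/TruSeq3-SE.fa (record names are placeholders).
printf '%s\n' \
    '>adapter_1' 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' \
    '>adapter_2' 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' > "$custom_dir/TruSeq3-SE.fa"

# Work on a copy so that the file shipped with Trimmomatic stays unchanged.
cp "$custom_dir/TruSeq3-SE.fa" "$custom_dir/custom_adapters.fa"

# Append the additional contaminant (made-up sequence for illustration).
printf '%s\n' '>extra_contaminant' 'ACGTACGTACGTACGTACGT' >> "$custom_dir/custom_adapters.fa"

n_records=$(grep -c '^>' "$custom_dir/custom_adapters.fa")
echo "Records in custom adapter file: $n_records"
```

The new file would then be given to the ILLUMINACLIP step in place of the original adapter fasta.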