CRUKCI Cluster Transition - Hands-on training
Once you have had a closer look at the quality report, you may realise that the data quality is not too bad. However, we might still be able to improve it with quality-based trimming, since quality usually drops towards the end of the reads. We will use Cutadapt for trimming.
As the current Bioinformatics core installation is broken, the easiest option is to install it in your home directory on the cluster:
pip install --user --upgrade cutadapt
Let’s have a look at the help page:
~/.local/bin/cutadapt --help
In some cases, all we want to do is remove low-quality bases from our reads. We can use the following command to do this:
~/.local/bin/cutadapt -m 10 -q 20 -o my_file_trimmed.fastq.gz my_file.fastq.gz
The parameters are:
- -m 10: discards all reads that would be shorter than 10 bases after trimming
- -q 20: trims low-quality bases from the 3’ end of the reads; if two comma-separated cutoffs are given, the 5’ end is trimmed with the first cutoff and the 3’ end with the second
- -o FILE_NAME: the output file name

Once the trimming has finished, we will want to check the quality of our trimmed reads as well, to make sure we are happy with the results: the trimming improved the quality and did not introduce new artefacts.
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node.
- Navigate to your project data.
- Create a new job_cutadapt.sh to run Cutadapt
- Send the job to the cluster
- Check the output while running, and check your job using sacct on the cluster
- Update your README.txt file with what you’ve done
- Rerun FastQC, view the new html report in a web browser and let’s spot the differences
Congratulations!
You did it!
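A job_cutadapt.sh for the exercise above might look like the sketch below. This is a minimal example, not the course's official script: the SLURM options (job name, output file, memory, wall time) and the file names are assumptions to adapt to your own data. Submit it with sbatch job_cutadapt.sh.

```shell
#!/bin/bash
#SBATCH --job-name=cutadapt          # job name shown by squeue/sacct
#SBATCH --output=cutadapt.%j.out     # %j is replaced by the job ID
#SBATCH --mem=4G                     # assumed memory request
#SBATCH --time=01:00:00              # assumed wall-time limit

# Quality-trim the reads as described above (file names are placeholders)
~/.local/bin/cutadapt -m 10 -q 20 \
    -o my_file_trimmed.fastq.gz \
    my_file.fastq.gz
```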
Let’s prepare a small fastQ file so that we don’t have to wait too long for the alignment to run.
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node.
- Navigate to your project data.
- Combine zcat and head to extract 100,000 reads from a fastQ file
- Redirect the output into a new filename e.g. SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq
- Compress your output using gzip
Congratulations!
You did it!
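The extraction step above can be sketched as follows. Since a fastQ record is four lines, 100,000 reads correspond to the first 400,000 lines. To keep the sketch self-contained it generates a tiny file first (demo.fq.gz is a made-up name) and extracts just 3 reads:

```shell
# Build a tiny gzipped fastQ file: 5 reads, 4 lines per read
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 4 5 > demo.fq
gzip -f demo.fq

# Combine zcat and head to extract the first 3 reads (3 x 4 = 12 lines),
# then compress the output with gzip.
# For 100,000 reads you would use: head -n 400000
zcat demo.fq.gz | head -n 12 | gzip > demo.small.fq.gz

# Check: the new file holds 12 lines, i.e. 3 reads
zcat demo.small.fq.gz | wc -l
```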
BWA can map sequences against a large reference genome, such as the human genome. BWA MEM can map longer sequences (70bp to 1Mbp) and is generally recommended for high-quality queries as it is faster and more accurate.
Our installed version is located in /home/bioinformatics/software/bwa/bwa-0.7.15/bwa. To get specific help for BWA MEM from the command line, run /home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem; for general help, run /home/bioinformatics/software/bwa/bwa-0.7.15/bwa.
For 75 bp reads against the GRCh38 reference genome, we are going to run bwa mem:
/home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem -M -t 4 -R "@RG\tID:1\tLB:SLX-14572.i706_i517\tSM:SLX-14572\tPU:HHMJ3BGX3.1" \
/scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/bwa/hsa.GRCh38 \
/scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.gz \
> /scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.sam
- -M: leaves the best (longest) alignment for a read as is but marks additional alignments for the read as secondary
- -t 4: number of processor cores
- -R "@RG\tID:1\tLB:SLX-14572.i706_i517\tSM:SLX-14572\tPU:HHMJ3BGX3.1": adds a read group header to identify your aligned reads, which will help when merging bam files later; ID for identifier, LB for library identifier (SLX-ID) and barcode, SM for sample name and PU for platform unit including flow cell and lane number

The output of BWA is a SAM file. To convert it to bam format we will use samtools view -b; our installed version is located here /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools. We are going to pipe the output of bwa mem into samtools to avoid writing multiple files on disk:
/home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem -M -t 4 -R "@RG\tID:1\tLB:SLX-14572.i706_i517\tSM:SLX-14572\tPU:HHMJ3BGX3.1" \
/scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/bwa/hsa.GRCh38 \
/scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.gz \
| /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools view -b \
> /scratchb/xxlab/my_username/SLX-14572/alignment/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
To view the header of your aligned reads, you can use samtools view -H my_file.bam, or to view some aligned reads use samtools view my_file.bam | tail -10.
If you have short sequence reads (< 70bp), you will need to run two steps: bwa aln followed by bwa samse for single-end or bwa sampe for paired-end data.
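For reference, the two-step short-read workflow can be sketched as below, using the same installed bwa and reference index as above; the read file names are placeholders, and this single-end example is a sketch rather than a command to copy verbatim:

```shell
BWA=/home/bioinformatics/software/bwa/bwa-0.7.15/bwa
REF=/scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/bwa/hsa.GRCh38

# Step 1: compute the alignment coordinates of the reads
$BWA aln $REF my_short_reads.fq.gz > my_short_reads.sai

# Step 2: convert the coordinates to SAM alignments (single-end)
$BWA samse $REF my_short_reads.sai my_short_reads.fq.gz > my_short_reads.sam
```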
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node. Keep two terminal windows open onto the cluster when possible.
- Navigate to your project data.
- Create an alignment directory
- Create a new job.sh to run bwa mem on the small fastQ file created earlier, SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq
- Send the job to the cluster
- Check the output while running, and check your job using sacct on the cluster
- Update your README.txt file with what you’ve done
Congratulations!
You did it!
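A job.sh for this exercise might look like the following sketch, wrapping the bwa mem | samtools view -b pipeline shown above; the SLURM options are assumptions to adjust for your run:

```shell
#!/bin/bash
#SBATCH --job-name=bwa_mem
#SBATCH --output=bwa_mem.%j.out
#SBATCH --cpus-per-task=4            # matches the -t 4 given to bwa mem
#SBATCH --mem=16G                    # assumed memory request

/home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem -M -t 4 \
    -R "@RG\tID:1\tLB:SLX-14572.i706_i517\tSM:SLX-14572\tPU:HHMJ3BGX3.1" \
    /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/bwa/hsa.GRCh38 \
    /scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.gz \
    | /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools view -b \
    > /scratchb/xxlab/my_username/SLX-14572/alignment/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
```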
If your job is failing, you may wish to check that you have enough disk space on your scratch space using:
lfs quota -h /scratchb/xxlab/
Our installed version of TopHat is located in /home/bioinformatics/software/tophat/tophat-2.1.1/tophat:
tophat \
--GTF $GeneAnnotationFile \
--bowtie1 --min-anchor "3" --num-threads "4" \
--tmp-dir $TempDirectory \
--output $OutputDirectory \
$TopHatIndexPrefix \
$FastqFile
where:
- $GeneAnnotationFile = /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/hg38/annotation/hsa.hg38.gtf
- $TopHatIndexPrefix = /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/hg38/tophat/hsa.hg38
You will also need bowtie on your path; add it by creating three symbolic links from the bowtie installation directory to your ~/bin/ using the command ln -s (this works because ~/bin/ is on our path; try echo $PATH). This way you could also add all the other tools, to have them conveniently on your path without having to type the full path to access them.
mkdir ~/bin
ln -s /home/bioinformatics/software/bowtie/bowtie-1.2.1.1/bowtie ~/bin/.
ln -s /home/bioinformatics/software/bowtie/bowtie-1.2.1.1/bowtie-build ~/bin/.
ln -s /home/bioinformatics/software/bowtie/bowtie-1.2.1.1/bowtie-inspect ~/bin/.
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node.
- Navigate to your project data.
- Create a new alignment_tophat directory
- Create a new job.sh to run tophat on the small fastQ file, SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq
- Send the job to the cluster
- Check the output while running, and check your job using sacct on the cluster
- Update your README.txt file with what you’ve done while the job is running
Congratulations!
You did it!
Explore the alignment data file format in the SAM/BAM and related specifications, in particular the SAM Format Specification.
Our installed version is located here /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools. To add it to your path:
ln -s /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools ~/bin/.
An extract of a BAM read:
NS500222:320:HHMJ3BGX3:1:11110:23779:10874 16 8 79810356 60 75M * 0 0 CAAGGATGCAGCTGTTAGCCCCAATATTTTTTTTATCTTCTCTTGGCACTTTCCTTCACATCTCCTCAGTTGTGC EEE<EEEEEEAEAEEEEEEEEEAEEEEEEEEEAEEEEAAEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:0 MD:Z:75 AS:i:75 XS:i:21 RG:Z:1
Each aligned read has 11 mandatory fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL, followed by optional tags such as NM, MD, AS and RG above.
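In the extract above the second column is the FLAG (here 16, meaning the read aligned to the reverse strand). A minimal sketch of decoding a FLAG value with shell bitwise arithmetic, checking three of the bits defined in the SAM specification:

```shell
flag=16   # FLAG value taken from the BAM extract above

# Each bit encodes one property of the alignment (see the SAM specification)
if (( flag & 4 ));   then echo "read unmapped";                      fi
if (( flag & 16 ));  then echo "read aligned to the reverse strand"; fi
if (( flag & 256 )); then echo "secondary alignment";                fi
```

Running this prints only "read aligned to the reverse strand", since 16 has only the 0x10 bit set.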
Before being able to visualise your aligned reads, we need to sort them and create an index; we can use samtools sort and samtools index. Then download IGV to visualise your BAM files.
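The two commands can be sketched as below, using the installed samtools; the sorted output file name is an assumption:

```shell
SAMTOOLS=/home/bioinformatics/software/samtools/samtools-1.6/bin/samtools

# Sort the alignments by coordinate into a new BAM file
$SAMTOOLS sort -o SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.sorted.bam \
    SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam

# Index the sorted BAM; this creates a matching .bai file next to it
$SAMTOOLS index SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.sorted.bam
```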
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node.
- Navigate to your project data and into your alignment directory.
- Create a new job_samtools.sh to run samtools sort and samtools index on the small bam file generated previously, SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
- Send the job to the cluster
- Check the output while running, and check your job using sacct on the cluster
- Update your README.txt file with what you’ve done while the job is running
- Install IGV on your local computer
- Mount your scratch space onto your local computer and visualise your aligned reads in IGV
Congratulations!
You did it!
Our installed version is located here /home/bioinformatics/software/picard/picard-2.14.0/picard.jar; you need java installed.
java -jar PICARDJAR SortSam \
I=InputBam \
O=OutputBam \
SORT_ORDER=coordinate   # one of {unsorted, queryname, coordinate, duplicate, unknown}
java -jar PICARDJAR MarkDuplicates \
I=InputBam \
O=OutputBam \
M=OutputMetricsFile \
REMOVE_DUPLICATES=false
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node.
- Navigate to your project data and into your alignment directory.
- Create a new job_picard.sh to run SortSam and MarkDuplicates on the small bam file generated previously, SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
- Send the job to the cluster
- Check the output while running, and check your job using sacct on the cluster
- Update your README.txt file with what you’ve done while the job is running
Congratulations!
You did it!
As before, we use /home/bioinformatics/software/picard/picard-2.14.0/picard.jar; you need java installed.
java -jar PICARDJAR CollectAlignmentSummaryMetrics \
I=InputBam \
O=OutputMetricsFile \
R=GenomeReferenceInFASTA
java -jar PICARDJAR CollectInsertSizeMetrics \
I=InputBam \
O=OutputMetricsFile \
H=OutputHistogramPlotPDF \
VALIDATION_STRINGENCY=SILENT
EXERCISE Go to your Terminal window, or open a new one, and log in onto the cluster head node.
- Navigate to your project data and into your alignment directory.
- Create a new job_picardmetrics.sh to run CollectAlignmentSummaryMetrics and CollectInsertSizeMetrics on the small bam file generated previously, SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
- Send the job to the cluster
- Check the output while running, and check your job using sacct on the cluster
- Update your README.txt file with what you’ve done while the job is running
Congratulations!
You did it!