crukci-cluster-transition

CRUKCI Cluster Transition - Hands-on training

View the Project on GitHub

Session 4: Analysis steps by steps

Learning Objectives

Artefact removal in reads

Once you had a closer look at the quality report you may realise that the data quality is not too bad, however we still might be able to improve it with a quality based trimming since the quality usually drops towards the end of the reads. We will use Cutadapt for trimming.

As the current Bioinformatics’ core installation is broken, the easiest is to install it in your home directory on the cluster:

pip install --user --upgrade cutadapt

Let’s have a look at the help page

~/.local/bin/cutadapt --help

In some case, all we want to do is to remove low quality bases from our reads. We can use the following command to do this:

~/.local/bin/cutadapt -m 10 -q 20 -o my_file_trimmed.fastq.gz my_file.fastq.gz

The parameters are:

Once the trimming has finished we will want to check the quality of our trimmed reads as well to make sure, we are happy with its results: the trimming improved the quality and it didn’t introduce new artefacts.

:computer: EXERCISE Go to your Terminal window, or open a new one and log in onto the cluster head node.

  • Navigate to your project data.
  • Create a new job_cutadapt.sh to run Cutadapt
  • Send job to the cluster
  • Check the output while running and your job using sacct on the cluster
  • Update your README.txt file with what you’ve done
  • Rerun FastQC and view the new html report in a web browser and let’s spot the differences

:tada: Congratulations! :thumbsup: You did it! :wink:

Short sequencing read alignment to a reference genome

Let’s prepare a small fastQ file to avoid waiting for too long for the alignment to run.

:computer: EXERCISE Go to your Terminal window, or open a new one and log in onto the cluster head node.

  • Navigate to your project data.
  • Combine zcat and head to extract 100,000 reads from a fastQ file
  • Redirect the output into a new filename e.g. SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq
  • Compress your output using gzip

:tada: Congratulations! :thumbsup: You did it! :wink:

ChipSeq - alignment with BWA

BWA can map sequences against a large reference genome, such as the human genome. BWA MEM can map longer sequences (70bp to 1Mbp) and is generally recommended for high-quality queries as it is faster and more accurate.

Our installed version is located in /home/bioinformatics/software/bwa/bwa-0.7.15/bwa, to get specific help from the command line for BWA MEM run /home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem or for general help /home/bioinformatics/software/bwa/bwa-0.7.15/bwa.

For 75 bp reads against GRCh38 reference genome, we are going to run bwa mem:

/home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem -M -t 4 -R "@RG\tID:1\tLB:SLX-14572.i706_i517\tSM:SLX-14572\tPU:HHMJ3BGX3.1" \
    /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/bwa/hsa.GRCh38 \
    /scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.gz \
    > /scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.sam

The output of BWA is a SAM file, samtools to convert it to bam format, we will be using samtools view -b, our installed version is located here /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools. We are going to pipe the output of bwa mem into samtools to avoid writing multiple files on disk:

/home/bioinformatics/software/bwa/bwa-0.7.15/bwa mem -M -t 4 -R "@RG\tID:1\tLB:SLX-14572.i706_i517\tSM:SLX-14572\tPU:HHMJ3BGX3.1" \    
    /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/bwa/hsa.GRCh38 \
    /scratchb/xxlab/my_username/SLX-14572/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.gz \
    | /home/bioinformatics/software/samtools/samtools-1.6/bin/samtools view -b \
    > /scratchb/xxlab/my_username/SLX-14572/alignment/SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam

To view the header of your aligned reads, you can use samtools view -H my_file.bam or to view some aligned reads use samtools view my_file.bam | tail -10.

If you have short sequence reads (< 70bp), you will need to run two steps bwa aln followed by bwa samse for single end or bwa sampe for paired end data.

:computer: EXERCISE Go to your Terminal window, or open a new one and log in onto the cluster head node. Keep two terminal windows open onto the cluster when possible.

  • Navigate to your project data.
  • Create an alignment directory
  • Create a new job.sh to run bwa mem on the small fastQ file created earlier SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq
  • Send job to the cluster
  • Check the output while running and your job using sacct on the cluster
  • Update your README.txt file with what you’ve done

:tada: Congratulations! :thumbsup: You did it! :wink:

If you’re getting failed job, you may wish to check that you have enough disk space on your scratch space using:

lfs quota -h /scratchb/xxlab/

RNASeq - alignment with TopHat

:computer: EXERCISE Go to your Terminal window, or open a new one and log in onto the cluster head node.

  • Navigate to your project data.
  • Create an new alignment_tophat directory
  • Create a new job.sh to run tophat on the small fastQ file SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq
  • Send job to the cluster
  • Check the output while running and your job using sacct on the cluster
  • Update your README.txt file with what you’ve done while job is running

:tada: Congratulations! :thumbsup: You did it! :wink:

Viewing alignment data

Explore alignment data file format from SAM/BAM and related specifications and the SAM Format Specification.

To view the header of your aligned reads, you can use samtools view -H my_file.bam or to view some aligned reads use samtools view my_file.bam | tail -10.

An extract of a BAM read:

NS500222:320:HHMJ3BGX3:1:11110:23779:10874	16	8	79810356	60	75M	*	0	0	CAAGGATGCAGCTGTTAGCCCCAATATTTTTTTTATCTTCTCTTGGCACTTTCCTTCACATCTCCTCAGTTGTGC	EEE<EEEEEEAEAEEEEEEEEEAEEEEEEEEEAEEEEAAEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	NM:i:0	MD:Z:75	AS:i:75	XS:i:21	RG:Z:1

Each aligned reads have:

Before being able to visualised your aligned reads, we will need to sort them and create an index, we can use samtools sort and samtools index. Then download IGV to visualised your BAM files.

:computer: EXERCISE Go to your Terminal window, or open a new one and log in onto the cluster head node.

  • Navigate to your project data and into your alignment directory.
  • Create a new job_samtools.sh to run samtools sort and samtools index on the small bam file generated previously SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
  • Send job to the cluster
  • Check the output while running and your job using sacct on the cluster
  • Update your README.txt file with what you’ve done while job is running
  • Install IGV on you local computer
  • Mount your scratch space onto your local computer and visualise your aligned reads in IGV

:tada: Congratulations! :thumbsup: You did it! :wink:

Sort and mark duplicates using Picard tools

:computer: EXERCISE Go to your Terminal window, or open a new one and log in onto the cluster head node.

  • Navigate to your project data and into your alignment directory.
  • Create a new job_picard.sh to run SortSam and MarkDuplicates on the small bam file generated previously SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
  • Send job to the cluster
  • Check the output while running and your job using sacct on the cluster
  • Update your README.txt file with what you’ve done while job is running

:tada: Congratulations! :thumbsup: You did it! :wink:

Alignment quality metrics using Picard tools

Reference materials