The raw reads, in fastq files, have been aligned using HISAT2. The alignment process produces a set of BAM files, where each file contains the read alignments for each library. The BAM files containing the aligned reads can be found in the data
directory under the CourseMaterials
Once our reads have been aligned against the genome, we need to summarise the information across genes or exons. In the BAM file, there is a chromosomal location for every read that mapped uniquely. We can determine if the region each read is aligned to corresponds to a particular gene or exon and then summarise across the entire BAM file to get total read counts for each gene or exon.
We will use featureCounts
(Liao, Smyth, and Shi 2014) programme from the subRead package to do the counting. In addition to the BAM files, we also need to provide featureCounts
with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF format can be found at UCSC website). featureCounts
can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.
GTF/GFF files define genomic regions covered by different types of genomic features, e.g. genes, transcripts, exons, or UTRs. The necessary GTF is already in the directory Course_Materials/data
. Instruction on where and how to download the GTF can be found in the Supplementary Materials.
When using a GTF/GFF file we need to tell featureCounts
what feature type to use to count reads, and what attribute type to summarise the results at. For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level.
Lets have a quick look at the top of a GTF file so we can see what data it contains and what feature type and attribute type mean:
cd ~/CourseMaterials/data/counts
head Mus_musculus.GRCm38.97.gtf
The code below uses featureCounts
to count reads in a BAM file against a GTF for the mouse GRCm38 genome assembly.
featureCounts \
-t exon \
-g gene_id \
--primary \
-a Mus_musculus.GRCm38.97.gtf \
-o MCL1_DG.featureCounts \
MCL1_DG.bam
-t exon
- the feature type to count reads against, in this case exons-g gene_id
- the attribute type to summarise counts by, in this case the gene ID--primary
- only count primary alignmentfeatureCounts
has many additional options that can be used to alter the ways in which it does the counting.
featureCounts --help
Running featureCounts generates two output files. A summary statistics table (MCL1.DG.featureCounts.summary
) and a full table of counts (MCL1.DG.featureCounts
) for each feature (gene in this case). Let take a look at each file.
cat MCL1.DG.featureCounts.summary
The summary table reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. See subread documentation (‘Program output’ section).
head MCL1.DG.featureCounts
The full results table begins with a line containing the command used to generate the counts. It then has a table of 7 columns. The first column is the gene identifier, this will vary depending on the GTF file used, in our case this is an Ensembl gene id. The second to fifth columns describe the genes location, and the sixth column is the length of the gene. The final column contains the number of reads assigned to the gene. Note that featureCounts
outputs a row for every gene in the GTF, even the ones with no reads assigned, and the row order is determined by the order in the GTF. This means that if featureCounts is used on mutliple samples with same GTF file, the separate files can be combined easily as the rows always refer to the same gene.
In fact we can have featureCounts
do this for us by running all of our libraries together in one command. To save time for this we will use versions of our BAM files that only include alignments to chromosome 15. You can find these in small_bams
directory.
featureCounts \
-t exon \
-g gene_id \
--primary \
-a small_bams/Mus_musculus.GRCm38.97.chr15.gtf \
-o GSE60450_Lactation.chr15.featureCounts \
small_bams/MCL*.bam
This gives us a combined counts table as an output with the first 6 columns as before but now the folowing columns contain the counts for all of the samples.
head GSE60450_Lactation.chr15.featureCounts
Challenge
- Redo the counting over the exons, rather than the genes. Use
featureCounts --help
to find the option you need to use. Make sure featureCounts outputs the results to a new file.- Redo the counting over genes, allowing for multimapping reads. Compare the results to our intial counts.
Notes
Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics (Oxford, England) 30 (7): 923–30. doi:10.1093/bioinformatics/btt656.