The raw reads, in fastq files, have been aligned using HISAT2. The alignment process produces a set of BAM files, where each file contains the read alignments for each library. The BAM files containing the aligned reads can be found in the bam
directory under the CourseMaterials
In the BAM file, there is a chromosomal location for every read that has been aligned. By matching the genomic location of each read to a gene annotation that provided the genomic location of the gene, we can determine if the region the read is aligned to corresponds to a particular gene and then summarise across the entire BAM file to get total read counts for each gene.
We will use the featureCounts
(Liao, Smyth, and Shi 2014) programme from the subRead package to do the counting.
In addition to the BAM files, we also need to provide featureCounts
with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF format can be found at UCSC website). featureCounts
can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.
GTF/GFF files define genomic regions covered by different types of genomic features, e.g. genes, transcripts, exons, or UTRs. The necessary GTF is already in the directory Course_Materials/references
. Instruction on where and how to download the GTF can be found in the Supplementary Materials.
We can view how many of each feature type there are in the GTF by counting the entries in the 3rd column of the GTF:
cd ~/CourseMaterials/ tail -n +6 references/Mus_musculus.GRCm38.97.gtf | cut -f 3 | sort | uniq -c
N.B. The
|
symbol is known as the “pipe”, it “pipes” the output of the first command into the next command.The above commands are:
tail -n +6
- start on the 6th line (skipping the first 5 lines which contain the GTF headers)cut -f 3
- extract (cut out) the third column (it assumes the columns are separated by tabs; this can be changed)sort
- sort the lines alphabeticallyuniq -c
- collapse consecutive lines that are same and (-c
) report how many there were.Q. How many genes are there in the in the GTF? Is this roughly what you would expect?
featureCounts
When using a GTF/GFF file we need to tell featureCounts
what feature type to use to count reads, and what attribute type to summarise the results at. For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level.
Running featureCounts -h
gives the full help page. At the top is the generalised usage:
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
This means that at minimum we need to specify an annotation file -a
, and output file -o
and then follow the command with a series of input files.
There are many other options we could specify, including the feature to count over -t
and the attribute to summarise to -g
.
Run code below to count reads in the bam file
MCL1.DL.sorted.bam
against the GTF for the mouse GRCm38 genome assembly.mkdir counts featureCounts \ -t exon \ -g gene_id \ --primary \ -a references/Mus_musculus.GRCm38.97.gtf \ -o counts/MCL1.DL.featureCounts \ bam/MCL1.DL.sorted.bam
-t exon
- the feature type to count reads against, in this case exons-g gene_id
- the attribute type to summarise counts by, in this case the gene ID--primary
- only count primary alignments-a
- the gene annotation reference file-o
- the name for the output filesQ. Check the help page again, did we need to specify
-t
and-g
?
This should have generated two output files. A summary statistics table (MCL1.DG.featureCounts.summary
) and a full table of counts (MCL1.DG.featureCounts
) for each feature (gene in this case).
To view the summary table: cat counts/MCL1.DL.featureCounts.summary
The summary table reports the numbers of unassigned reads and the reasons why they are not assigned (e.g. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. See subread documentation (‘Program output’ section).
To view the first few lines of the main counts output: head counts/MCL1.DL.featureCounts
The full results table begins with a line containing the command used to generate the counts. It then has a table of 7 columns:
Note that featureCounts
outputs a row for every gene in the GTF, even the ones with no reads assigned, and the row order is determined by the order in the GTF. This means that if featureCounts is used on multiple samples with same GTF file, the separate files can be combined easily as the rows always refer to the same gene.
featureCounts
on multiple samplesWe can also run featureCounts
on all of our BAM together in one command. As explained by the usage in the help page, we can specify multiple input files (bam files in our case) one by one after the command.
If we had many bam files to include it would be very time consuming to type out each file name and would result in a very long and difficult to read command. Instead, there is a quick way to refer to many files at once on the command line using “wild cards”.
Suppose we had a directory called myData
containing:
myFile_A01.txt
myFile_B02.txt
myFile_C03.txt
myFile_D04.txt
myFile_E15.txt
We could refer to all of these files by replacing the part of the file name that changes with an asterix:
myData/myFile_*.txt
The asterix means essentially means ‘replace this with anything’. So myFile_*.txt
will be expanded to a list of all the files that could match.
In the metrics directory we have an RNAseq metrics file for each bam file called MCL1.DG.rna_metrics.txt
, MCL1.DG.rna_metrics.txt
, MCL1.DG.rna_metrics.txt
… MCL1.LE.rna_metrics.txt
.
Try head -n 1 metrics/MCL1.*.rna_metrics.txt
to look at the first line of each RNAseq metrics file in the metrics
directory.
Rerun featureCounts on all the bam files at once.
- To save time we will use versions of our BAM files that only include reads aligned to chromosome 15. You can find these in
small_bams
directory.- There is also a gtf that only contains chr 15 in the reference directory, use this instead of the full gtf.
- output the results to a new file called
counts/GSE60450_Lactation.chr15.featureCounts
- Specify the input (bam) files using the * wildcard.
The output should be a combined a summary file and the counts table with the initial columns as before, but now there should be one results/counts column for each bam file.
featureCounts
Some mRNAseq library preparation protocols return reads on the same genomic strand as the transcript was read from, some return reads on the opposite strand, and others return reads from both strands. When working with data from a stranded protocol, specifying strandness will avoid counting reads that are aligned to the wrong strand - these may be due to overlapping genes or due to sequencing/mapping errors.
featureCounts
has a parameter that allows us to perform strand-specific read counting by specifying the “strandness” of the library prep method used. The default is to assume an unstranded protocol.
Rerun
featureCounts
onbam/MCL1.DL.sorted.bam
, but this time specify “reversely stranded”. Use the help to find out which option you need to set to do this:featureCounts -h
Output the results to a new file called
counts/MCL1.DL.reverse.featureCounts
.Q. Compare the summary of the read count assignments to our intial results where we used the default settings (unstrandned). Which type of protocol do you think was used to generate this library?
featureCounts
now includes the option “–extraAttributes” that enables us to extract additional information from the attributes column for each gene and include it in the output counts table.
Rerun
featureCounts
onbam/MCL1.DL.sorted.bam
, but this time also extract the “gene_biotype” from the GTF file and output to a new file calledcounts/MCL1.DL.gene_biotype.featureCounts
.Q. What proporion of the genes are protein coding? (Hint: To count the bioyptes, see the section above where we counted the feature types in the GTF file)
Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics (Oxford, England) 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.