Introduction and data import

The raw reads, in fastq files, have been aligned using HISAT2. The alignment process produces a set of BAM files, where each file contains the read alignments for each library. The BAM files containing the aligned reads can be found in the bam directory under the CourseMaterials

In the BAM file, there is a chromosomal location for every read that has been aligned. By matching the genomic location of each read to a gene annotation that provided the genomic location of the gene, we can determine if the region the read is aligned to corresponds to a particular gene and then summarise across the entire BAM file to get total read counts for each gene.

We will use the featureCounts (Liao, Smyth, and Shi 2014) programme from the subRead package to do the counting.

1. Gene annotations in GTF format

In addition to the BAM files, we also need to provide featureCounts with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF format can be found at UCSC website). featureCounts can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.

GTF/GFF files define genomic regions covered by different types of genomic features, e.g. genes, transcripts, exons, or UTRs. The necessary GTF is already in the directory Course_Materials/references. Instruction on where and how to download the GTF can be found in the Supplementary Materials.

Exercise 1

We can view how many of each feature type there are in the GTF by counting the entries in the 3rd column of the GTF:

cd ~/CourseMaterials/

tail -n +6 references/Mus_musculus.GRCm38.97.gtf | cut -f 3 | sort | uniq -c

N.B. The | symbol is known as the “pipe”, it “pipes” the output of the first command into the next command.

The above commands are:

  • tail -n +6 - start on the 6th line (skipping the first 5 lines which contain the GTF headers)
  • cut -f 3 - extract (cut out) the third column (it assumes the columns are separated by tabs; this can be changed)
  • sort - sort the lines alphabetically
  • uniq -c - collapse consecutive lines that are same and (-c) report how many there were.

Q. How many genes are there in the in the GTF? Is this roughly what you would expect?

2. Running featureCounts

When using a GTF/GFF file we need to tell featureCounts what feature type to use to count reads, and what attribute type to summarise the results at. For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level.

Running featureCounts -h gives the full help page. At the top is the generalised usage:

Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

This means that at minimum we need to specify an annotation file -a, and output file -o and then follow the command with a series of input files.

There are many other options we could specify, including the feature to count over -t and the attribute to summarise to -g.

Exercise 2

Run code below to count reads in the bam file MCL1.DL.sorted.bam against the GTF for the mouse GRCm38 genome assembly.

  mkdir counts

  featureCounts \
      -t exon \
      -g gene_id \
      --primary \
      -a references/Mus_musculus.GRCm38.97.gtf \
      -o counts/MCL1.DL.featureCounts \
      bam/MCL1.DL.sorted.bam
  • -t exon - the feature type to count reads against, in this case exons
  • -g gene_id - the attribute type to summarise counts by, in this case the gene ID
  • --primary - only count primary alignments
  • -a - the gene annotation reference file
  • -o - the name for the output files

Q. Check the help page again, did we need to specify -t and -g?

3. The output files

This should have generated two output files. A summary statistics table (MCL1.DG.featureCounts.summary) and a full table of counts (MCL1.DG.featureCounts) for each feature (gene in this case).

To view the summary table: cat counts/MCL1.DL.featureCounts.summary

The summary table reports the numbers of unassigned reads and the reasons why they are not assigned (e.g. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. See subread documentation (‘Program output’ section).

To view the first few lines of the main counts output: head counts/MCL1.DL.featureCounts

The full results table begins with a line containing the command used to generate the counts. It then has a table of 7 columns:

  1. The gene identifier; this will vary depending on the GTF file used, in our case this is an Ensembl gene id
  2. Chromosome
  3. Start position for each exon in the gene
  4. End position for each exon in the gene
  5. Transcription strand for each exon in the gene
  6. The total length of the gene in nucleotides
  7. The number of reads assigned to the gene.

Note that featureCounts outputs a row for every gene in the GTF, even the ones with no reads assigned, and the row order is determined by the order in the GTF. This means that if featureCounts is used on multiple samples with same GTF file, the separate files can be combined easily as the rows always refer to the same gene.

4. Running featureCounts on multiple samples

We can also run featureCounts on all of our BAM together in one command. As explained by the usage in the help page, we can specify multiple input files (bam files in our case) one by one after the command.

If we had many bam files to include it would be very time consuming to type out each file name and would result in a very long and difficult to read command. Instead, there is a quick way to refer to many files at once on the command line using “wild cards”.

Wildcards in file names

Suppose we had a directory called myData containing:

myFile_A01.txt
myFile_B02.txt
myFile_C03.txt
myFile_D04.txt
myFile_E15.txt

We could refer to all of these files by replacing the part of the file name that changes with an asterix:

myData/myFile_*.txt

The asterix means essentially means ‘replace this with anything’. So myFile_*.txt will be expanded to a list of all the files that could match.

In the metrics directory we have an RNAseq metrics file for each bam file called MCL1.DG.rna_metrics.txt, MCL1.DG.rna_metrics.txt, MCL1.DG.rna_metrics.txtMCL1.LE.rna_metrics.txt.

Try head -n 1 metrics/MCL1.*.rna_metrics.txt to look at the first line of each RNAseq metrics file in the metrics directory.

Exercise 4

Rerun featureCounts on all the bam files at once.

  1. To save time we will use versions of our BAM files that only include reads aligned to chromosome 15. You can find these in small_bams directory.
  2. There is also a gtf that only contains chr 15 in the reference directory, use this instead of the full gtf.
  3. output the results to a new file called counts/GSE60450_Lactation.chr15.featureCounts
  4. Specify the input (bam) files using the * wildcard.

The output should be a combined a summary file and the counts table with the initial columns as before, but now there should be one results/counts column for each bam file.

5. Library preparation strandedness and featureCounts

Some mRNAseq library preparation protocols return reads on the same genomic strand as the transcript was read from, some return reads on the opposite strand, and others return reads from both strands. When working with data from a stranded protocol, specifying strandness will avoid counting reads that are aligned to the wrong strand - these may be due to overlapping genes or due to sequencing/mapping errors.

featureCounts has a parameter that allows us to perform strand-specific read counting by specifying the “strandness” of the library prep method used. The default is to assume an unstranded protocol.

Exercise 5

Rerun featureCounts on bam/MCL1.DL.sorted.bam, but this time specify “reversely stranded”. Use the help to find out which option you need to set to do this: featureCounts -h

Output the results to a new file called counts/MCL1.DL.reverse.featureCounts.

Q. Compare the summary of the read count assignments to our intial results where we used the default settings (unstrandned). Which type of protocol do you think was used to generate this library?

6. Supplementary: extracting additional attribute information

featureCounts now includes the option “–extraAttributes” that enables us to extract additional information from the attributes column for each gene and include it in the output counts table.

Exercise 6

Rerun featureCounts on bam/MCL1.DL.sorted.bam, but this time also extract the “gene_biotype” from the GTF file and output to a new file called counts/MCL1.DL.gene_biotype.featureCounts.

Q. What proporion of the genes are protein coding? (Hint: To count the bioyptes, see the section above where we counted the feature types in the GTF file)

Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics (Oxford, England) 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.