Introduction and data import
For the purpose of this workshop, we are going to be working with a small part of the mouse reference genome (chromosome 1) to demonstrate how to do read counting using Subread
.
The raw reads, in fastq files, have been aligned using Bowtie2. The bam files containing the aligned reads can be found in the RNAseq/data directory
under the Course_Materials
.
Counting
Once our reads have been aligned against the genome, we need to summarise the information across genes or exons. The alignment process produces a set of BAM files, where each file contains the read alignments for each library. In the BAM file, there is a chromosomal location for every read that mapped uniquely. We can determine if the region each read is aligned to corresponds to a particular gene or exon and then summarise across the entire BAM file to get total read counts for each gene or exon.
We will use featureCounts
programme from the subRead package to do the counting. In addition to the BAM files, we also need to provide featureCounts
with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF format can be found at UCSC website). featureCounts
can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.
GTF/GFF files define genomic regions covered by different types of genomic features, e.g. genes, transcripts, exons, or UTRs. When using a GTF/GFF file we need to tell featureCounts
what feature type to use to count reads, and what attribute type to summarise the results at. For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level. Lets have a quick look at the top of a GTF file so we can see what data it contains and what feature type and attribute type mean:
cd RNASeq/data
head mmu.mm10.gtf
The code below uses featureCounts
to count reads in a bam file against a GTF for the GRCh38 genome assembly.
featureCounts \
--primary \
-C \
-t exon \
-g gene_id \
-a mmu.mm10.gtf \
-o MCL1.DJ.featureCounts \
MCL1.DJ.bam
--primary
- only count primary alignment
-C
- do not count reads where the pairs are mapped to different chromosomes
-t exon
- the feature type to count reads against, in this case exons
-g gene_id
- the attribute type to summarise counts by, in this case the gene ID
featureCounts
has many additional options that can be used to alter the ways in which it does the counting.
featureCounts --help
Running featureCounts generates two output file. A summary statistics table (MCL1.DJ.featureCounts.summary
) and a full table of counts (MCL1.DJ.featureCounts
) for each feature (gene in this case). Let take a look at each file.
cat MCL1.DJ.featureCounts.summary
The summary table reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. See subread documentation (‘Program output’ section).
head MCL1.DJ.featureCounts
The full results table begins with a line containing the command used to generate the counts. It then has a table of 7 columns. The first column is the gene identifier, this will vary depending on the GTF file used, in our case this is a UCSC gene id. The second to fifth columns describe the genes location, and the sixth column is the length of the gene. The final column contains the number of reads assigned to the gene. Note that featureCounts
outputs a row for every gene in the GTF, even the ones with no reads assigned, and the row order is determined by the order in the GTF. This means that if featureCounts is used on mutliple samples with same GTF file, the separate files can be combined easily as the rows always refer to the same gene.
Challenge
- Redo the counting over the exons, rather than the genes. Use
featureCounts --help
to find the option you need to use. Make sure featureCounts outputs the results to a new file.
- Redo the counting over genes, allowing for multimapping reads. Compare the results to our intial counts.
Notes
- If you are sequencing your own data, the sequencing facility will almost always provide fastq files.
- For publicly available sequence data from GEO/SRA, the files are usually in the Sequence Read Archive (SRA) format. Prior to read alignment, these files need to be converted into the FASTQ format using the fastq-dump utility from the SRA Toolkit. See http: //www.ncbi.nlm.nih.gov/books/NBK158900 for how to download and use the SRA Toolkit.
LS0tCnRpdGxlOiAiUk5BLXNlcSBhbmFseXNpcyBpbiBSIgphdXRob3I6ICJTdGVwaGFuZSBCYWxsZXJlYXUsIE1hcmsgRHVubmluZywgT3NjYXIgUnVlZGEsIEFzaGxleSBTYXdsZSIKZGF0ZTogJ2ByIGZvcm1hdChTeXMudGltZSgpLCAiTGFzdCBtb2RpZmllZDogJWQgJWIgJVkiKWAnCm91dHB1dDoKICBodG1sX25vdGVib29rOgogICAgdG9jOiB5ZXMKICAgIHRvY19mbG9hdDogeWVzCiAgaHRtbF9kb2N1bWVudDoKICAgIHRvYzogeWVzCiAgICB0b2NfZmxvYXQ6IHllcwptaW51dGVzOiAzMDAKbGF5b3V0OiBwYWdlCnN1YnRpdGxlOiBDb3VudGluZwpiaWJsaW9ncmFwaHk6IHJlZi5iaWIKLS0tCgoKIyMgSW50cm9kdWN0aW9uIGFuZCBkYXRhIGltcG9ydAoKRm9yIHRoZSBwdXJwb3NlIG9mIHRoaXMgd29ya3Nob3AsIHdlIGFyZSBnb2luZyB0byBiZSB3b3JraW5nIHdpdGggYSBzbWFsbCBwYXJ0IG9mIHRoZSBtb3VzZSByZWZlcmVuY2UgZ2Vub21lIChjaHJvbW9zb21lIDEpIHRvIGRlbW9uc3RyYXRlIGhvdyB0byBkbyByZWFkIGNvdW50aW5nIHVzaW5nIGBTdWJyZWFkYC4KClRoZSByYXcgcmVhZHMsIGluIGZhc3RxIGZpbGVzLCBoYXZlIGJlZW4gYWxpZ25lZCB1c2luZyBCb3d0aWUyLiBUaGUgYmFtIGZpbGVzIGNvbnRhaW5pbmcgdGhlIGFsaWduZWQgcmVhZHMgY2FuIGJlIGZvdW5kIGluIHRoZSAqKmBSTkFzZXEvZGF0YSBkaXJlY3RvcnlgKiogdW5kZXIgdGhlICoqYENvdXJzZV9NYXRlcmlhbHNgKiouCgojIyBDb3VudGluZwoKT25jZSBvdXIgcmVhZHMgaGF2ZSBiZWVuIGFsaWduZWQgYWdhaW5zdCB0aGUgZ2Vub21lLCB3ZSBuZWVkIHRvIHN1bW1hcmlzZSB0aGUgaW5mb3JtYXRpb24gYWNyb3NzIGdlbmVzIG9yIGV4b25zLiBUaGUgYWxpZ25tZW50IHByb2Nlc3MgcHJvZHVjZXMgYSBzZXQgb2YgQkFNIGZpbGVzLCB3aGVyZSBlYWNoIGZpbGUgY29udGFpbnMgdGhlIHJlYWQgYWxpZ25tZW50cyBmb3IgZWFjaCBsaWJyYXJ5LiBJbiB0aGUgQkFNIGZpbGUsIHRoZXJlIGlzIGEgY2hyb21vc29tYWwgbG9jYXRpb24gZm9yIGV2ZXJ5IHJlYWQgdGhhdCBtYXBwZWQgdW5pcXVlbHkuIFdlIGNhbiBkZXRlcm1pbmUgaWYgdGhlIHJlZ2lvbiBlYWNoIHJlYWQgaXMgYWxpZ25lZCB0byBjb3JyZXNwb25kcyB0byBhIHBhcnRpY3VsYXIgZ2VuZSBvciBleG9uIGFuZCB0aGVuIHN1bW1hcmlzZSBhY3Jvc3MgdGhlIGVudGlyZSBCQU0gZmlsZSB0byBnZXQgdG90YWwgcmVhZCBjb3VudHMgZm9yIGVhY2ggZ2VuZSBvciBleG9uLiAKCldlIHdpbGwgdXNlICoqYGZlYXR1cmVDb3VudHNgKiogcHJvZ3JhbW1lIGZyb20gdGhlIFtzdWJSZWFkIHBhY2thZ2VdKGh0dHA6Ly9zdWJyZWFkLnNvdXJjZWZvcmdlLm5ldC8pIHRvIGRvIHRoZSBjb3VudGluZy4gSW4gYWRkaXRpb24gdG8gdGhlIEJBTSBmaWxlcywgd2UgYWxzbyBuZWVkIHRvIHByb3ZpZGUgKipgZmVhdHVyZUNvdW50c2AqKiB3aXRoIGFuIGFubm90YXRpb24gZmlsZS4gVXN1YWxseSB0aGlzIHdpbGwgYmUgYSBHVEYvR0ZGIGZpbGUgY29ycmVzcG9uZGluZyB0byB0aGUgZ2Vub21lIGFzc2VtYmx5IHVzZWQgKGEgZGVzY3JpcHRpb24gb2YgdGhlIEdURiAgZm9ybWF0ICBjYW4gIGJlICBmb3VuZCAgYXQgIFtVQ1NDICB3ZWJzaXRlXShodHRwOi8vZ2Vub21lLnVjc2MuZWR1L0ZBUS9GQVFmb3JtYXQuaHRtbCNmb3JtYXQ0KSkuICoqYGZlYXR1cmVDb3VudHNgKiogY2FuIGFsc28gdXNlIGEgc2ltcGxlciBhbm5vdGF0aW9uIGZvcm1hdCBjYWxsZWQgU0FGLCB0aGlzIGlzIHBhcnRpY3VsYXJseSB1c2VmdWwgZm9yIGRlZmluaW5nIGN1c3RvbS9ub3ZlbCBmZWF0dXJlcyB0aGF0IHlvdSB3aXNoIHRvIGNvdW50IGFnYWluc3QuCgpHVEYvR0ZGIGZpbGVzIGRlZmluZSBnZW5vbWljIHJlZ2lvbnMgY292ZXJlZCBieSBkaWZmZXJlbnQgdHlwZXMgb2YgZ2Vub21pYyBmZWF0dXJlcywgZS5nLiBnZW5lcywgdHJhbnNjcmlwdHMsIGV4b25zLCBvciBVVFJzLiBXaGVuIHVzaW5nIGEgR1RGL0dGRiBmaWxlIHdlIG5lZWQgdG8gdGVsbCAqKmBmZWF0dXJlQ291bnRzYCoqIHdoYXQgZmVhdHVyZSB0eXBlIHRvIHVzZSB0byBjb3VudCByZWFkcywgYW5kIHdoYXQgYXR0cmlidXRlIHR5cGUgdG8gc3VtbWFyaXNlIHRoZSByZXN1bHRzIGF0LiBGb3IgUk5Bc2VxIHdlIG1vc3QgY29tbW9ubHkgd2lzaCB0byBjb3VudCByZWFkcyBhbGlnbmluZyB0byBleG9ucywgYW5kIHRoZW4gdG8gc3VtbWFyaXNlIGF0IHRoZSBnZW5lIGxldmVsLiAKTGV0cyBoYXZlIGEgcXVpY2sgbG9vayBhdCB0aGUgdG9wIG9mIGEgR1RGIGZpbGUgc28gd2UgY2FuIHNlZSB3aGF0IGRhdGEgaXQgY29udGFpbnMgYW5kIHdoYXQgKipmZWF0dXJlIHR5cGUqKiBhbmQgKiphdHRyaWJ1dGUgdHlwZSoqIG1lYW46CgpgYGAKY2QgUk5BU2VxL2RhdGEKICAKaGVhZCBtbXUubW0xMC5ndGYKYGBgCgpUaGUgY29kZSBiZWxvdyB1c2VzICoqYGZlYXR1cmVDb3VudHNgKiogdG8gY291bnQgcmVhZHMgaW4gYSBiYW0gZmlsZSBhZ2FpbnN0IGEgR1RGIGZvciB0aGUgR1JDaDM4IGdlbm9tZSBhc3NlbWJseS4KCmBgYAogIGZlYXR1cmVDb3VudHMgXAogICAgICAtLXByaW1hcnkgXAogICAgICAtQyBcCiAgICAgIC10IGV4b24gXAogICAgICAtZyBnZW5lX2lkIFwKICAgICAgLWEgbW11Lm1tMTAuZ3RmIFwKICAgICAgLW8gTUNMMS5ESi5mZWF0dXJlQ291bnRzIFwKICAgICAgTUNMMS5ESi5iYW0KYGBgCiogKipgLS1wcmltYXJ5YCoqIC0gb25seSBjb3VudCBwcmltYXJ5IGFsaWdubWVudAoqICoqYC1DYCoqIC0gZG8gbm90IGNvdW50IHJlYWRzIHdoZXJlIHRoZSBwYWlycyBhcmUgbWFwcGVkIHRvIGRpZmZlcmVudCBjaHJvbW9zb21lcwoqICoqYC10IGV4b25gKiogLSB0aGUgKipmZWF0dXJlKiogdHlwZSB0byBjb3VudCByZWFkcyBhZ2FpbnN0LCBpbiB0aGlzIGNhc2UgZXhvbnMKKiAqKmAtZyBnZW5lX2lkYCoqIC0gdGhlICoqYXR0cmlidXRlKiogdHlwZSB0byBzdW1tYXJpc2UgY291bnRzIGJ5LCBpbiB0aGlzIGNhc2UgdGhlIGdlbmUgSUQKCioqYGZlYXR1cmVDb3VudHNgKiogaGFzIG1hbnkgYWRkaXRpb25hbCBvcHRpb25zIHRoYXQgY2FuIGJlIHVzZWQgdG8gYWx0ZXIgdGhlIHdheXMgaW4gd2hpY2ggaXQgZG9lcyB0aGUgY291bnRpbmcuCgpgYGAKZmVhdHVyZUNvdW50cyAtLWhlbHAKYGBgCgpSdW5uaW5nIGZlYXR1cmVDb3VudHMgZ2VuZXJhdGVzIHR3byBvdXRwdXQgZmlsZS4gQSBzdW1tYXJ5IHN0YXRpc3RpY3MgdGFibGUgKCoqYE1DTDEuREouZmVhdHVyZUNvdW50cy5zdW1tYXJ5YCoqKSBhbmQgYSBmdWxsIHRhYmxlIG9mIGNvdW50cyAoKipgTUNMMS5ESi5mZWF0dXJlQ291bnRzYCoqKSBmb3IgZWFjaCBmZWF0dXJlIChnZW5lIGluIHRoaXMgY2FzZSkuIExldCB0YWtlIGEgbG9vayBhdCBlYWNoIGZpbGUuCgpgYGAKY2F0IE1DTDEuREouZmVhdHVyZUNvdW50cy5zdW1tYXJ5CmBgYAoKVGhlIHN1bW1hcnkgdGFibGUgcmVwb3J0cyB0aGUgbnVtYmVycyBvZiB1bmFzc2lnbmVkIHJlYWRzIGFuZCB0aGUgcmVhc29ucyB3aHkgdGhleSBhcmUgbm90IGFzc2lnbmVkIChlZy4gYW1iaWd1aXR5LCBtdWx0aS1tYXBwaW5nLCBzZWNvbmRhcnkgYWxpZ25tZW50LCBtYXBwaW5nIHF1YWxpdHksIGZyYWdtZW50IGxlbmd0aCwgY2hpbWVyYSwgcmVhZCBkdXBsaWNhdGUsIG5vbi1qdW5jdGlvbiBhbmQgc28gb24pLCBpbiBhZGRpdGlvbiB0byB0aGUgbnVtYmVyIG9mIHN1Y2Nlc3NmdWxseSBhc3NpZ25lZCByZWFkcyBmb3IgZWFjaCBsaWJyYXJ5LiBTZWUgW3N1YnJlYWQgZG9jdW1lbnRhdGlvbl0oaHR0cDovL2Jpb2luZi53ZWhpLmVkdS5hdS9zdWJyZWFkLXBhY2thZ2UvU3VicmVhZFVzZXJzR3VpZGUucGRmKSAoJ1Byb2dyYW0gb3V0cHV0JyBzZWN0aW9uKS4KCmBgYApoZWFkIE1DTDEuREouZmVhdHVyZUNvdW50cwpgYGAKClRoZSBmdWxsIHJlc3VsdHMgdGFibGUgYmVnaW5zIHdpdGggYSBsaW5lIGNvbnRhaW5pbmcgdGhlIGNvbW1hbmQgdXNlZCB0byBnZW5lcmF0ZSB0aGUgY291bnRzLiBJdCB0aGVuIGhhcyBhIHRhYmxlIG9mIDcgY29sdW1ucy4gVGhlIGZpcnN0IGNvbHVtbiBpcyB0aGUgZ2VuZSBpZGVudGlmaWVyLCB0aGlzIHdpbGwgdmFyeSBkZXBlbmRpbmcgb24gdGhlIEdURiBmaWxlIHVzZWQsIGluIG91ciBjYXNlIHRoaXMgaXMgYSBVQ1NDIGdlbmUgaWQuIFRoZSBzZWNvbmQgdG8gZmlmdGggY29sdW1ucyBkZXNjcmliZSB0aGUgZ2VuZXMgbG9jYXRpb24sIGFuZCB0aGUgc2l4dGggY29sdW1uIGlzIHRoZSBsZW5ndGggb2YgdGhlIGdlbmUuIFRoZSBmaW5hbCBjb2x1bW4gY29udGFpbnMgdGhlIG51bWJlciBvZiByZWFkcyBhc3NpZ25lZCB0byB0aGUgZ2VuZS4gTm90ZSB0aGF0ICoqYGZlYXR1cmVDb3VudHNgKiogb3V0cHV0cyBhIHJvdyBmb3IgZXZlcnkgZ2VuZSBpbiB0aGUgR1RGLCBldmVuIHRoZSBvbmVzIHdpdGggbm8gcmVhZHMgYXNzaWduZWQsIGFuZCB0aGUgcm93IG9yZGVyIGlzIGRldGVybWluZWQgYnkgdGhlIG9yZGVyIGluIHRoZSBHVEYuIFRoaXMgbWVhbnMgdGhhdCBpZiBmZWF0dXJlQ291bnRzIGlzIHVzZWQgb24gbXV0bGlwbGUgc2FtcGxlcyB3aXRoIHNhbWUgR1RGIGZpbGUsIHRoZSBzZXBhcmF0ZSBmaWxlcyBjYW4gYmUgY29tYmluZWQgZWFzaWx5IGFzIHRoZSByb3dzIGFsd2F5cyByZWZlciB0byB0aGUgc2FtZSBnZW5lLgoKCj4gIyMgQ2hhbGxlbmdlIHsuY2hhbGxlbmdlfQo+Cj4gMS4gUmVkbyB0aGUgY291bnRpbmcgb3ZlciB0aGUgZXhvbnMsIHJhdGhlciB0aGFuIHRoZSBnZW5lcy4gVXNlIGBmZWF0dXJlQ291bnRzIC0taGVscGAgdG8gZmluZCB0aGUgb3B0aW9uIHlvdSBuZWVkIHRvIHVzZS4gTWFrZSBzdXJlIGZlYXR1cmVDb3VudHMgb3V0cHV0cyB0aGUgcmVzdWx0cyB0byBhIG5ldyBmaWxlLgo+IDEuIFJlZG8gdGhlIGNvdW50aW5nIG92ZXIgZ2VuZXMsIGFsbG93aW5nIGZvciBtdWx0aW1hcHBpbmcgcmVhZHMuIENvbXBhcmUgdGhlIHJlc3VsdHMgdG8gb3VyIGludGlhbCBjb3VudHMuCj4KCk5vdGVzCgoqIElmIHlvdSBhcmUgc2VxdWVuY2luZyB5b3VyIG93biBkYXRhLCB0aGUgc2VxdWVuY2luZyBmYWNpbGl0eSB3aWxsIGFsbW9zdCBhbHdheXMgcHJvdmlkZSBmYXN0cSBmaWxlcy4gIAoqIEZvciBwdWJsaWNseSBhdmFpbGFibGUgc2VxdWVuY2UgZGF0YSBmcm9tIEdFTy9TUkEsIHRoZSBmaWxlcyBhcmUgdXN1YWxseSBpbiB0aGUgU2VxdWVuY2UgUmVhZCBBcmNoaXZlCihTUkEpIGZvcm1hdC4gUHJpb3IgdG8gcmVhZCBhbGlnbm1lbnQsIHRoZXNlIGZpbGVzIG5lZWQgdG8gYmUgY29udmVydGVkIGludG8gdGhlCkZBU1RRIGZvcm1hdCB1c2luZyB0aGUgZmFzdHEtZHVtcCB1dGlsaXR5IGZyb20gdGhlIFNSQSBUb29sa2l0LiBTZWUgaHR0cDoKLy93d3cubmNiaS5ubG0ubmloLmdvdi9ib29rcy9OQksxNTg5MDAgZm9yIGhvdyB0byBkb3dubG9hZCBhbmQgdXNlIHRoZQpTUkEgVG9vbGtpdC4gIAoK