We can view how many of each feature type there are in the GTF by counting the entries in the 3rd column of the GTF:
cd ~/CourseMaterials/ tail -n +6 references/Mus_musculus.GRCm38.97.gtf | cut -f 3 | sort | uniq -c
N.B. The
|
symbol is known as the “pipe”, it “pipes” the output of the first command into the next command.The above commands are:
tail -n +6
- start on the 6th line (skipping the first 5 lines which contain the GTF headers)cut -f 3
- extract (cut out) the third column (it assumes the columns are separated by tabs; this can be changed)sort
- sort the lines alphabeticallyuniq -c
- collapse consecutive lines that are same and (-c
) report how many there were.Q. How many genes are there in the in the GTF? Is this roughly what you would expect?
tail -n +6 references/Mus_musculus.GRCm38.97.gtf | cut -f 3 | sort | uniq -c
524729 CDS
65 Selenocysteine
839112 exon
94894 five_prime_utr
55573 gene
59734 start_codon
55473 stop_codon
86668 three_prime_utr
142333 transcript
There are 55,573 genes in the GTF.
Run code below to count reads in the bam file
MCL1.DL.sorted.bam
against the GTF for the mouse GRCm38 genome assembly.mkdir counts featureCounts \ -t exon \ -g gene_id \ --primary \ -a references/Mus_musculus.GRCm38.97.gtf \ -o counts/MCL1.DL.featureCounts \ bam/MCL1.DL.sorted.bam
-t exon
- the feature type to count reads against, in this case exons-g gene_id
- the attribute type to summarise counts by, in this case the gene ID--primary
- only count primary alignments-a
- the gene annotation reference file-o
- the name for the output filesQ. Check the help page again, did we need to specify
-t
and-g
?
If we check the help page using featureCounts -h
, we find the entries for these two options:
-t <string> Specify feature type(s) in a GTF annotation. If multiple
types are provided, they should be separated by ',' with
no space in between. 'exon' by default. Rows in the
annotation with a matched feature will be extracted and
used for read mapping.
-g <string> Specify attribute type in GTF annotation. 'gene_id' by
default. Meta-features used for read counting will be
extracted from annotation using the provided value.
For -t
the default is “exon” and for -g
the default is gene_id
. These are the most commonly used settings, so if we don’t specify them explicitly featureCounts
assumes that this is what we want. We can omit them, but you may wish to include them in your scripts for clarity.
Rerun featureCounts on all the bam files at once.
- To save time we will use versions of our BAM files that only include reads aligned to chromosome 15. You can find these in
small_bams
directory.- There is also a gtf that only contains chr 15 in the reference directory, use this instead of the full gtf.
- output the results to a new file called
counts/GSE60450_Lactation.chr15.featureCounts
- Specify the inpute (bam) files to count using the "*" wildcard.
featureCounts \
--primary \
-a references/Mus_musculus.GRCm38.97.chr15.gtf \
-o counts/GSE60450_Lactation.chr15.featureCounts \
small_bams/MCL1.*.15.sm.bam
Creates two files with the results for all samples: counts/GSE60450_Lactation.chr14.featureCounts
, counts/GSE60450_Lactation.chr14.featureCounts.summary
Rerun
featureCounts
onbam/MCL1.DL.sorted.bam
, but this time specify “reversely stranded” (use the help to see how to do this) and output the results to a new file calledcounts/MCL1.DL.reverse.featureCounts
.Q. Compare the summary of the read count assignments to our intial results using the default. Which type of protocol do you think was used to generate this library?
View the help page featureCounts -h
:
-s <int or string> Perform strand-specific read counting. A single integer
value (applied to all input files) or a string of comma-
separated values (applied to each corresponding input
file) should be provided. Possible values include:
0 (unstranded), 1 (stranded) and 2 (reversely stranded).
Default value is 0 (ie. unstranded read counting carried
out for all input files).
To specify the a reversely stranded protocol we need to add -s 2
:
featureCounts \
--primary \
-s 2 \
-a references/Mus_musculus.GRCm38.97.gtf \
-o counts/MCL1.DL.reverse.featureCounts \
bam/MCL1.DL.sorted.bam
Creates two files with the results: counts/MCL1.DL.reverse.featureCounts
, counts/MCL1.DL.reverse.featureCounts.summary
If we look in the summaries, we can compare the number of assigned reads and the number of reads unassigned as they are not aligned to a valid feature (exon).
cat counts/MCL1.DL.featureCounts.summary
Status bam/MCL1.DL.sorted.bam
Assigned 19950300
...
...
Unassigned_NoFeatures 1655868
...
cat counts/MCL1.DL.reverse.featureCounts.summary
Status bam/MCL1.DL.sorted.bam
Assigned 10755074
...
...
Unassigned_NoFeatures 11781687
...
From this we can see that requiring the read to be on the opposite strand to the transcript for it to be assigned to an exon causes us to lose about 50% of the counts. From this we can infer that the original protocol was unstranded. In practice, it is better to have this information ahead of time by knowing what protocol was used to generate your data.
Rerun
featureCounts
onbam/MCL1.DL.sorted.bam
, but this time also extract the “gene_biotype” from the GTF file and output to a new file calledcounts/MCL1.DL.gene_biotype.featureCounts
.Q. What proporion of the genes are protein coding? (Hint: To count the bioyptes, see the section above where we counted the feature types in the GTF file)
featureCounts \
--primary \
--extraAttributes "gene_biotype" \
-a references/Mus_musculus.GRCm38.97.gtf \
-o counts/MCL1.DL.gene_biotype.featureCounts \
bam/MCL1.DL.sorted.bam
Creates two files with the results: counts/MCL1.DL.gene_biotype.featureCounts
, counts/MCL1.DL.gene_biotype.featureCounts.summary
To biotype is now in the 7th column:
head -n 3 counts/MCL1.DL.gene_biotype.featureCounts
# Program:featureCounts v2.0.1; Command:"featureCounts" "--primary" "--extraAttributes" "gene_biotype" "-a" "references/Mus_musculus.GRCm38.97.gtf" "-o" "counts/MCL1.DL.gene_biotype.featureCounts" "bam/MCL1.DL.sorted.bam"
Geneid Chr Start End Strand Length gene_biotype bam/MCL1.DL.sorted.bam
ENSMUSG00000102693 1 3073253 3074322 + 1070 TEC 0
To count the biotypes:
tail -n +3 counts/MCL1.DL.gene_biotype.featureCounts | cut -f 7| sort | uniq -c
tail -n +3
- start on the 3rd line, skipping the first 2 lines which contain the command used and column namescut -f 7
- extract (cut out) the 7th columnsort
- sort the lines alphabeticallyuniq -c
- collapse consecutive lines that are same and report how many there were. 13 IG_C_gene
1 IG_C_pseudogene
19 IG_D_gene
3 IG_D_pseudogene
...
21900 protein_coding
...
From this we can see that we have 21,900 protein coding genes, with most of the rest being either non-coding RNAs or pseudogenes.