RNA-seq analysis in R

Exercise 1

We can view how many of each feature type there are in the GTF by counting the entries in the 3rd column of the GTF:
cd ~/CourseMaterials/

tail -n +6 references/Mus_musculus.GRCm38.97.gtf | cut -f 3 | sort | uniq -c
N.B. The | symbol is known as the “pipe”, it “pipes” the output of the first command into the next command.

The above commands are:

tail -n +6 - start on the 6th line (skipping the first 5 lines which contain the GTF headers)

cut -f 3 - extract (cut out) the third column (it assumes the columns are separated by tabs; this can be changed)

sort - sort the lines alphabetically

uniq -c - collapse consecutive lines that are same and (-c) report how many there were.

Q. How many genes are there in the in the GTF? Is this roughly what you would expect?

tail -n +6 references/Mus_musculus.GRCm38.97.gtf | cut -f 3 | sort | uniq -c

    524729 CDS  
        65 Selenocysteine  
    839112 exon  
     94894 five_prime_utr  
     55573 gene  
     59734 start_codon  
     55473 stop_codon  
     86668 three_prime_utr  
    142333 transcript

There are 55,573 genes in the GTF.

Exercise 2

Run code below to count reads in the bam file MCL1.DL.sorted.bam against the GTF for the mouse GRCm38 genome assembly.
  mkdir counts

  featureCounts \
      -t exon \
      -g gene_id \
      --primary \
      -a references/Mus_musculus.GRCm38.97.gtf \
      -o counts/MCL1.DL.featureCounts \
      bam/MCL1.DL.sorted.bam
-t exon - the feature type to count reads against, in this case exons

-g gene_id - the attribute type to summarise counts by, in this case the gene ID

--primary - only count primary alignments

-a - the gene annotation reference file

-o - the name for the output files

Q. Check the help page again, did we need to specify -t and -g?

If we check the help page using featureCounts -h, we find the entries for these two options:

  -t <string>         Specify feature type(s) in a GTF annotation. If multiple
                      types are provided, they should be separated by ',' with
                      no space in between. 'exon' by default. Rows in the
                      annotation with a matched feature will be extracted and
                      used for read mapping. 

  -g <string>         Specify attribute type in GTF annotation. 'gene_id' by 
                      default. Meta-features used for read counting will be 
                      extracted from annotation using the provided value.

For -t the default is “exon” and for -g the default is gene_id. These are the most commonly used settings, so if we don’t specify them explicitly featureCounts assumes that this is what we want. We can omit them, but you may wish to include them in your scripts for clarity.

3. The output files

Exercise 4

Rerun featureCounts on all the bam files at once.

To save time we will use versions of our BAM files that only include reads aligned to chromosome 15. You can find these in small_bams directory.

There is also a gtf that only contains chr 15 in the reference directory, use this instead of the full gtf.

output the results to a new file called counts/GSE60450_Lactation.chr15.featureCounts

Specify the inpute (bam) files to count using the "*" wildcard.

featureCounts \
    --primary \
    -a references/Mus_musculus.GRCm38.97.chr15.gtf \
    -o counts/GSE60450_Lactation.chr15.featureCounts \
    small_bams/MCL1.*.15.sm.bam

Creates two files with the results for all samples: counts/GSE60450_Lactation.chr14.featureCounts, counts/GSE60450_Lactation.chr14.featureCounts.summary

Exercise 5

Rerun featureCounts on bam/MCL1.DL.sorted.bam, but this time specify “reversely stranded” (use the help to see how to do this) and output the results to a new file called counts/MCL1.DL.reverse.featureCounts.

Q. Compare the summary of the read count assignments to our intial results using the default. Which type of protocol do you think was used to generate this library?

View the help page featureCounts -h:

  -s <int or string>  Perform strand-specific read counting. A single integer
                      value (applied to all input files) or a string of comma-
                      separated values (applied to each corresponding input
                      file) should be provided. Possible values include:
                      0 (unstranded), 1 (stranded) and 2 (reversely stranded).
                      Default value is 0 (ie. unstranded read counting carried
                      out for all input files).

To specify the a reversely stranded protocol we need to add -s 2:

featureCounts \
  --primary \
  -s 2 \
  -a references/Mus_musculus.GRCm38.97.gtf \
  -o counts/MCL1.DL.reverse.featureCounts \
  bam/MCL1.DL.sorted.bam

Creates two files with the results: counts/MCL1.DL.reverse.featureCounts, counts/MCL1.DL.reverse.featureCounts.summary

If we look in the summaries, we can compare the number of assigned reads and the number of reads unassigned as they are not aligned to a valid feature (exon).

cat counts/MCL1.DL.featureCounts.summary

  Status    bam/MCL1.DL.sorted.bam
  Assigned    19950300
  ...
  ...
  Unassigned_NoFeatures   1655868
  ...

cat counts/MCL1.DL.reverse.featureCounts.summary

  Status    bam/MCL1.DL.sorted.bam
  Assigned    10755074
  ...
  ...
  Unassigned_NoFeatures   11781687
  ...

From this we can see that requiring the read to be on the opposite strand to the transcript for it to be assigned to an exon causes us to lose about 50% of the counts. From this we can infer that the original protocol was unstranded. In practice, it is better to have this information ahead of time by knowing what protocol was used to generate your data.

Exercise 6

Rerun featureCounts on bam/MCL1.DL.sorted.bam, but this time also extract the “gene_biotype” from the GTF file and output to a new file called counts/MCL1.DL.gene_biotype.featureCounts.

Q. What proporion of the genes are protein coding? (Hint: To count the bioyptes, see the section above where we counted the feature types in the GTF file)

featureCounts \
  --primary \
  --extraAttributes "gene_biotype" \
  -a references/Mus_musculus.GRCm38.97.gtf \
  -o counts/MCL1.DL.gene_biotype.featureCounts \
  bam/MCL1.DL.sorted.bam

Creates two files with the results: counts/MCL1.DL.gene_biotype.featureCounts, counts/MCL1.DL.gene_biotype.featureCounts.summary

To biotype is now in the 7th column:

head -n 3 counts/MCL1.DL.gene_biotype.featureCounts

  # Program:featureCounts v2.0.1; Command:"featureCounts" "--primary" "--extraAttributes" "gene_biotype" "-a" "references/Mus_musculus.GRCm38.97.gtf" "-o" "counts/MCL1.DL.gene_biotype.featureCounts" "bam/MCL1.DL.sorted.bam"
  Geneid    Chr   Start   End   Strand    Length    gene_biotype    bam/MCL1.DL.sorted.bam
  ENSMUSG00000102693    1   3073253   3074322   +   1070    TEC   0

To count the biotypes:

tail -n +3 counts/MCL1.DL.gene_biotype.featureCounts | cut -f 7| sort | uniq -c

tail -n +3 - start on the 3rd line, skipping the first 2 lines which contain the command used and column names
cut -f 7 - extract (cut out) the 7th column
sort - sort the lines alphabetically
uniq -c - collapse consecutive lines that are same and report how many there were.

  13 IG_C_gene
  1 IG_C_pseudogene
  19 IG_D_gene
  3 IG_D_pseudogene
  ...
  21900 protein_coding
  ...

From this we can see that we have 21,900 protein coding genes, with most of the rest being either non-coding RNAs or pseudogenes.