
FASTQ alignment metrics (RNA-Seq)?


  • FASTQ alignment metrics (RNA-Seq)?


    How do people judge the quality of a FASTQ (short read) alignment? In particular I'm interested in evaluating RNA-Seq alignments, typically (but not exclusively) from ILLUMINA instruments.

    What comes to mind is:
    * Fraction of reads mapped
    * Fraction of reads mapped uniquely
    * Fraction of 'good' pairs (right orientation, right distance)

    and for RNA-Seq specifically
    * Fraction of reads mapping within a gene

    Anything based on read mapping quality?

    What other metrics can we think of?
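
As a rough illustration of the metrics listed above, here is a minimal sketch that derives them from SAM text lines using only the FLAG and MAPQ columns. The function name `alignment_metrics` and the `unique_mapq` cut-off are assumptions made for this example: what MAPQ value means "uniquely mapped" differs between aligners, so the threshold needs checking against the aligner's documentation. The FLAG bit values are the ones defined in the SAM specification.

```python
from collections import Counter

# SAM FLAG bits as defined in the SAM specification
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def alignment_metrics(sam_lines, unique_mapq=10):
    """Summarise primary alignments from an iterable of SAM text lines.

    unique_mapq is an assumed cut-off; MAPQ semantics are aligner-specific.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read only once, via its primary record
        counts["total"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
        if flag & UNMAPPED:
            continue
        counts["mapped"] += 1
        if mapq >= unique_mapq:
            counts["unique"] += 1
        if flag & PAIRED and flag & PROPER_PAIR:
            counts["proper"] += 1
    total = counts["total"] or 1    # avoid division by zero on empty input
    paired = counts["paired"] or 1
    return {
        "total": counts["total"],
        "frac_mapped": counts["mapped"] / total,
        "frac_unique": counts["unique"] / total,
        "frac_properly_paired": counts["proper"] / paired,
    }


# Made-up records: one proper pair, one unmapped pair, one multi-mapped read
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "r1\t147\tchr1\t300\t60\t4M\t=\t100\t-204\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t65\tchr1\t500\t0\t4M\tchr2\t100\t0\tACGT\tIIII",
    "r3\t321\tchr3\t500\t0\t4M\tchr2\t100\t0\tACGT\tIIII",
]
metrics = alignment_metrics(example)
```

The fraction of reads mapping within a gene needs an annotation file as well, so it is left out of this sketch.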
    Homepage: Dan Bolser
    MetaBase the database of biological databases.

  • #2
    hi Dan,

    Have a look at "samtools flagstat"

    The output will look something like this, and I think it contains all the info you requested.

    7276199 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    7276199 + 0 mapped (100.00%:-nan%)
    7276199 + 0 paired in sequencing
    3787000 + 0 read1
    3489199 + 0 read2
    6195536 + 0 properly paired (85.15%:-nan%)
    6795026 + 0 with itself and mate mapped
    481173 + 0 singletons (6.61%:-nan%)
    480036 + 0 with mate mapped to a different chr
    480036 + 0 with mate mapped to a different chr (mapQ>=5)
    good luck
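
If you want those numbers as fractions in a script, the flagstat text can be parsed directly. This is a minimal sketch that assumes the `N + M description` line layout shown above. The helper names `parse_flagstat` and `first_with_prefix` are made up for this example, and newer samtools versions print extra lines, which the parser simply stores alongside the others.

```python
import re

LINE_RE = re.compile(r"^\s*(\d+) \+ (\d+) (.+?)\s*$")


def parse_flagstat(text):
    """Map each flagstat description to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # ignore anything that is not an 'N + M description' line
        passed, description = int(m.group(1)), m.group(3)
        # Drop a trailing percentage such as '(85.15%:-nan%)' but keep '(mapQ>=5)'
        description = re.sub(r"\s*\([^()]*%[^()]*\)$", "", description)
        counts[description] = passed
    return counts


def first_with_prefix(counts, prefix):
    """Return the count of the first description starting with prefix."""
    return next(v for k, v in counts.items() if k.startswith(prefix))


# The flagstat output quoted in the post above
example = """\
7276199 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
7276199 + 0 mapped (100.00%:-nan%)
7276199 + 0 paired in sequencing
3787000 + 0 read1
3489199 + 0 read2
6195536 + 0 properly paired (85.15%:-nan%)
6795026 + 0 with itself and mate mapped
481173 + 0 singletons (6.61%:-nan%)
480036 + 0 with mate mapped to a different chr
480036 + 0 with mate mapped to a different chr (mapQ>=5)
"""

counts = parse_flagstat(example)
total = first_with_prefix(counts, "in total")
paired = first_with_prefix(counts, "paired in sequencing")
frac_mapped = first_with_prefix(counts, "mapped") / total
frac_proper = first_with_prefix(counts, "properly paired") / paired
frac_singletons = first_with_prefix(counts, "singletons") / paired
```

Note that flagstat does not report uniquely mapped reads or reads within genes, so those two metrics need another tool.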


    • #3
      Also take a look at RSeQC:

      Most aligners also report alignment statistics themselves, e.g. BBMap, TopHat and probably STAR as well.


      • #4
        FastQC may also be of general use: http://www.bioinformatics.babraham.a...ojects/fastqc/


        • #5
          Originally posted by maxsalm
          I agree it's useful, but it's not what I want here.


          • #6
            How about proportion of duplicate fragments? This will depend on whether you've done single- or paired-end reads, though, since with single RNA-seq reads you do expect a certain amount of duplication by chance (with paired reads it's a much smaller chance).
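
To make the single- versus paired-end difference concrete, here is a small sketch of a duplicate-fraction calculation. It assumes a fragment is identified by a simple key: (chromosome, start, strand) for single-end reads, with the mate start added for pairs. That key and the function name `duplicate_fraction` are assumptions for illustration only; dedicated duplicate-marking tools use more careful definitions.

```python
from collections import Counter


def duplicate_fraction(fragment_keys):
    """Fraction of fragments whose key was already seen at least once.

    fragment_keys: iterable of hashable keys, one per mapped fragment.
    """
    counts = Counter(fragment_keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every fragment beyond the first with a given key counts as a duplicate
    return (total - len(counts)) / total


# Made-up fragments: three share the same start, one does not
single_end = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
]
# The same fragments with the mate start added to the key
paired_end = [
    ("chr1", 100, "+", 300),
    ("chr1", 100, "+", 320),
    ("chr1", 100, "+", 300),
    ("chr1", 250, "-", 90),
]

single_dup = duplicate_fraction(single_end)
paired_dup = duplicate_fraction(paired_end)
```

With these made-up keys the single-end estimate is 0.5 and the paired-end estimate is 0.25, because the mate position separates two fragments that merely share a start.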


            • #7
              I mostly work with single-end reads, but for alignment quality I look primarily at
              1) pct of reads mapped
              2) pct of reads uniquely mapped

              It sounds like you are also asking about post-alignment QC in general, so I would add
              3) read duplication (i.e. how many reads align to the identical location) - most locations should have only one or a few reads.
              4) read biotype distribution (most reads should map to protein-coding regions)
              5) cumulative pct measures - I sort genes by count or FPKM and plot the number of genes against the cumulative percentage of reads. That shows whether you are sinking a lot of reads into a few very abundant transcripts, in which case you might need more depth to see the less abundant ones.
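
The cumulative measure in point 5 can be sketched as follows. This assumes you already have per-gene read counts in a dict; the function name `cumulative_read_share` and the toy counts are made up for the example.

```python
def cumulative_read_share(gene_counts):
    """Return [(n_genes, cumulative_pct)], most abundant genes first.

    gene_counts: dict mapping gene id to read count (or FPKM).
    """
    total = sum(gene_counts.values())
    if total == 0:
        return []
    curve, running = [], 0
    ranked = sorted(gene_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        curve.append((rank, 100.0 * running / total))
    return curve


# Toy counts: one gene soaks up most of the reads
toy_counts = {"geneA": 700, "geneB": 200, "geneC": 90, "geneD": 10}
curve = cumulative_read_share(toy_counts)
```

Plotting the first element of each tuple against the second gives the curve described above. A curve that climbs steeply over the first few genes means a few transcripts dominate the library.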