Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Percent aligned reads for RNA-seq

    Hello all,

    I was curious what sort of estimate do people use when looking at RNA-seq experiments' alignment rates. I normally use tophat for such alignments.

    Bowtie gives an unambiguous value for percent of aligned reads, while tophat does not offer any estimates. One can use percentage from bowtie log in "logs" folder - I believe that spliced reads should add maybe 10% on top of that. But still, it would be nice to have an exact metric.

    Problem of course is that there are some reads that are aligned to more than one place, so just getting the number of reads in your final BAM file and dividing it by the number of reads in FASTQ file would not necessarily produce a meaningful results.

    So, what do you do then? I have some ideas and in-house scripts, but I'm interested about what others do in this situations.

  • #2
    I do use Tophat.

    As it should be obvious, alignment percentages varies a lot depending on the sequenced organism and what reference you can get hold of. A recent project with Arabidopsis is getting 98% to 99% of the reads mapping using Tophat. Of course Arabidopsis is "simple" and highly characterized. A slightly older project with whale samples mapped against human reference (I couldn't find a better reference) had 8-9% mapping. Apple came out around 64%-66%. So choose your target number ... I've got all of them covered. :-)


    My percentages *are* from the accepted_hits.bam file compared to the original fastq files. Of course you need to count up only the unique reads but this is easy enough to do via a script (mine is a one-liner perl script).
    Last edited by westerman; 08-08-2013, 09:45 AM. Reason: "come" to "came"

    Comment


    • #3
      Yes that's a very good elaboration. I meant more specific case of well known genomes (I normally work with mouse and some human samples), so I use percentage as a primitive but effective identifier of library quality.

      A problem with taking unique reads only is in the fact that in RNA-seq, not all reads are actually unique - some strongly expressed genes are expected to generate quite a bit of exactly same reads, especially at substantial sequencing depths.

      One can, of course, calculate (unique aligned)/(unique all) reads, but that's still not quite what I'm looking for. I guess I want to know the number of those reads that just did not align anywhere at all. What would be a good way to estimate that number?

      Comment


      • #4
        You can write the unmapped reads to a file and count them if you are using bowtie.

        http://seqanswers.com/forums/showthread.php?t=31242 (post #3).
        http://seqanswers.com/forums/showthread.php?t=32639 (post #6)

        Comment


        • #5
          Yes, that's really not a problem with bowtie. First of all the percent of aligned reads is reported in the log (correctly, as far as I can tell), and second, bowtie by default writes all non-aligned reads into the BAM file with an appropriate binary flag (4) in the second column. So simple "samtools flagstat file.bam" command should also produce correct alignment statistics.

          Tophat, however, skips the un-aligned reads when writing accepted_hits.bam. I'm not sure if there's an option to write them to the BAM file as well, I've checked the manual and I don't see it so far.

          Comment


          • #6
            I did a little work on one set of RNA-seq data here's what I found, and there's some related comments.

            Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

            Comment


            • #7
              Ok, so I was not attentive enough :/

              There's a file unmapped_left.fq.z that has all of the un-mapped reads in a compressed form.

              something like

              Code:
              gunzip -c unmapped_left.fq.z | wc -l | awk '{print $1/4}'
              should give you the number of all un-aligned reads.

              Thanks everybody!

              Comment


              • #8
                Originally posted by apredeus View Post
                A problem with taking unique reads only is in the fact that in RNA-seq, not all reads are actually unique - some strongly expressed genes are expected to generate quite a bit of exactly same reads, especially at substantial sequencing depths.
                Actually by "unique" I meant counting the reads by their name not by their sequence. Any given read is only counted once no matter how many times it is in the accepted_hits.bam file. I agree that counting by sequence would give poor information especially at high sequencing depths.

                Anyway, we don't do human samples here nor do we have many mouse samples coming through thus I can't say what a good percentage mapping would be for RNA-seq experiments. Certainly I would expect above 90%. Maybe close to the 98% I get for Arabidopsis.

                Comment


                • #9
                  Originally posted by westerman View Post
                  Actually by "unique" I meant counting the reads by their name not by their sequence. Any given read is only counted once no matter how many times it is in the accepted_hits.bam file. I agree that counting by sequence would give poor information especially at high sequencing depths.
                  I see, I misunderstood what you meant. Thank you for the clarification.

                  Meanwhile, I am still somewhat confused. As an example, I have a small RNA-Seq experiment, FASTQ file for which contained 357706 reads.

                  After the alignment, I've obtained accepted_hits.bam file, that has 135627 entries, of them 103735 unique (by their name, as mentioned above).

                  The unmapped_left.fq.z file has 29358 entries in it (all unique by name).

                  Thus I have about 224613 reads unaccounted for. I assume this is due to "prefilter_multihits" option, that prevents reads that map in too many places from being considered (e.g. repeats etc). To confirm this, my prep_reads.log file states:

                  224603 out of 357706 reads have been filtered out
                  (219406 filtered out due to <skipped>/left_multimapped.fq)
                  The number is indeed very close to the difference that I've just calculated.

                  So, overall, it seems that, while I do have only ~8-9% of reads completely un-alignable to mouse genome, only about a third (~100k) of them are actually useful for differential gene expression, etc.

                  Correct me if I am wrong here.

                  Thanks!

                  Comment


                  • #10

                    So, overall, it seems that, while I do have only ~8-9% of reads completely un-alignable to mouse genome, only about a third (~100k) of them are actually useful for differential gene expression, etc.

                    Correct me if I am wrong here.

                    Thanks!
                    I do believe that you are correct. And such statistics should raise a red flag. In particular I wonder how many of your reads map to rRNA. Do you check for this?

                    Comment


                    • #11
                      Yes I realize that's not a good library That's just one of the experiments while testing new RNA-seq methods. We do check how many of the reads align to rRNA - we use Picard tools for this if I remember correctly.

                      Comment


                      • #12
                        Easiest way...install the latest version of Tophat. They finally got around to having Tophat spit out stats similar to the way bowtie2 does: number of mapped reads and number mapping multiple times.

                        Comment


                        • #13
                          Originally posted by chadn737 View Post
                          Easiest way...install the latest version of Tophat. They finally got around to having Tophat spit out stats similar to the way bowtie2 does: number of mapped reads and number mapping multiple times.
                          I'm running them on a cluster and Tophat 2.0.* just was not cooperating when I tried it... producing all kinds of random errors etc. Plus collaborators use 1.4.1 (or 1.4.1.1 to be more exact) so I'll stick with it for now...

                          But thank you for the advice. It's good to know that this issue was addressed.

                          Comment


                          • #14
                            Originally posted by apredeus View Post
                            I'm running them on a cluster and Tophat 2.0.* just was not cooperating when I tried it... producing all kinds of random errors etc. Plus collaborators use 1.4.1 (or 1.4.1.1 to be more exact) so I'll stick with it for now...

                            But thank you for the advice. It's good to know that this issue was addressed.
                            If you are using 1.4.1 then the mapping scores should all follow this formula:

                            255 = unique mapping
                            3 = maps to 2 locations in the target
                            2 = maps to 3 locations
                            1 = maps to 4-9 locations
                            0 = maps to 10 or more locations.

                            You can use this info then to quickly count reads for each category.

                            Comment


                            • #15
                              Originally posted by chadn737 View Post
                              If you are using 1.4.1 then the mapping scores should all follow this formula:

                              255 = unique mapping
                              3 = maps to 2 locations in the target
                              2 = maps to 3 locations
                              1 = maps to 4-9 locations
                              0 = maps to 10 or more locations.

                              You can use this info then to quickly count reads for each category.
                              That's what I've read but upon looking on the actual BAM outputs, this is not the case (anymore, probably).

                              The flags for single-end experiment are limited to 0,16,256 and 272 (which I'd imagine would be + and - reads aligned "as is" and + and - reads aligned with a junction).

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Recent Advances in Sequencing Analysis Tools
                                by seqadmin


                                The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                                05-06-2024, 07:48 AM
                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 05-07-2024, 06:57 AM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-06-2024, 07:17 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-02-2024, 08:06 AM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-30-2024, 12:17 PM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X