why low mapping rates for RNAseq?

  • #16
    Hello,

    Originally posted by NGSfan View Post
    According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.
    I would like to say: "Don't underestimate splice junctions!"

    Like Michael.James.Clark, I'm very surprised by the figure of 3% (could you give us a link to this article?).

    For example, say you sequence a transcript of 1500nt and obtain 555 reads of 25nt (so an average coverage of 9.25X). If your mRNA contains 10 exons, around 25% of the reads will fall on splice junctions (so you'll only map 75% of the reads, without considering sequencing errors and artefacts).
    To obtain 3% with 25nt reads, your initial transcript should contain only two exons, so the number of unmapped reads is highly correlated with the number of exons per transcript and the read length!

    Cheers,

    Jean-marc



    • #17
      Hi guys, the paper is:

      "Mapping and quantifying mammalian transcriptomes by RNA-Seq"

      http://www.nature.com/nmeth/journal/...meth.1226.html

      "Splice-crossing reads, such as are shown for Myf6 (Fig. 1b), were identified by mapping otherwise unassigned sequence reads to a library of all known splice events in all University of California Santa Cruz genome database (UCSC) Mouse July 2007 (mm9) gene model splices. When we summed over the entire dataset, including all otherwise unmappable reads, splice-spanning reads comprised ≈3% (Supplementary Table 1), which is consistent with splice frequency in gene models across the genome."



      • #18
        Originally posted by Michael.James.Clark View Post
        In some cases that's true. For mouse? Not as big a problem.

        My advice is to generate a reference genome with all possible splice variants and align to it.

        Otherwise you will see a massive drop in coverage at all exon-exon junctions as I said earlier, which will result in a large number of "unmappable reads" when you're only robust against two mismatches.

        We've done it this way in our lab with human and it has worked.

        Wow, this is interesting. What percentage of reads did you recover when you did this? How long are your reads?

        One thing is for sure: as the reads get longer (50, 75, 100, etc.), the more chance they have of crossing a splice junction. So the recovery effect will be stronger for longer reads.

        When I use the Mortazavi dataset (25bp) and compare the genome reference vs. UCSC Known Genes transcripts (no introns), I see a very small increase in reads recovered (~2%), much like their paper states. Of course, when switching over to a transcript reference, I am also losing reads that fall in places not covered by the set of annotated transcripts.

        Using a reference of transcripts with all splice variants instead of the whole genome has its caveats as well - you will miss novel junctions that are not documented. But of course, this will be a very small number of reads.

        You could run a newer program called TopHat that will handle splice junctions, but it only captures ~80% of them.

        The mystery to me is that jmaury's argument makes sense - one would expect more splice junction reads, so this is quite odd - isn't it?
        Last edited by NGSfan; 04-24-2009, 02:14 AM.



        • #19
          Hi Jean-marc,

          Originally posted by jmaury View Post
          Hello,

          For example, if you sequence a transcript of 1500nt, and obtain 555 reads of 25nt (so an average coverage of 9,25X). If your mRNA contains 10 exons, you will have around 25% of the reads that fall on splice junctions (So you'll only map 75% of the reads, without consider sequencing errors and artefacts).

          Jean-marc
          I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1475 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.

          But your point about the dependence on transcript length, number of exons per transcript and read length is well made. Performing a similar calculation on the 1372 protein-coding transcripts on human chr22 (Ensembl-53) gives:

          Read-length 25 => ~9% cross splice junctions
          36 => ~13%
          50 => ~18%

          However, this calculation assumes (a) uniform coverage across all transcripts, and (b) uniform expression of all transcripts. Both of these are clearly gross simplifications! A figure of 3% might sound low, but if the sample contains a number of highly-expressed transcripts with long/few exons, it becomes more reasonable.
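          Under those same assumptions, the per-transcript calculation can be sketched in Python (a minimal illustration; the function name is ours, and overlapping junction windows are handled for exons shorter than the read):

```python
def junction_fraction(exon_lengths, k):
    """Fraction of k-mers from a spliced transcript that cross at
    least one exon-exon junction, assuming uniform coverage."""
    total_len = sum(exon_lengths)
    n_kmers = total_len - k + 1           # possible read start positions
    crossing = set()
    junction = 0
    for exon in exon_lengths[:-1]:
        junction += exon                  # junction sits after this exon
        # a k-mer starting in [junction-k+1, junction-1] crosses it
        for start in range(max(0, junction - k + 1), junction):
            if start + k <= total_len:
                crossing.add(start)
    return len(crossing) / n_kmers

# The 1500nt / 10-exon / 25nt example: 9 * 24 = 216 crossing 25mers
print(round(junction_fraction([150] * 10, 25) * 100))  # → 15 (%)
```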

          Kevin



          • #20
            Originally posted by klh View Post
            Hi Jean-marc,



            I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1475 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.

            But your point about dependence on transcript length, number of exons per transcript and read length is well made. Performing a similar calculation on the 1372 protein-coding transcript on human chr22 (Ensembl-53) gives:

            Read-length 25 => ~9% cross splice junctions
            36 => ~13%
            50 => ~18%

            However, this calculation assumes (a) uniform coverage across all transcripts, and (b) uniform expression of all transcripts. Both of these are clearly gross simplifications! A figure of 3% might sound low, but if the sample contains a number of highly-expressed transcripts with long/few exons, it becomes more reasonable.

            Kevin
            Hi Kevin,

            Thanks for the nice explanation. I think your points on the assumptions make it much easier to understand why one would observe 3% instead of the expected higher frequency.



            • #21
              Hi Kevin,

              Originally posted by klh View Post
              Hi Jean-marc,
              I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1475 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.
              Be careful, the number of reads which cross splice boundaries is 9 * 24 * 2 (*2 because reads before AND after the splice site cross the junction), so that gives ~29% of reads crossing splice junctions.

              But according to the Mortazavi et al. paper, they mapped the RNA-Seq reads against the mouse transcriptome, so reads that come from an unknown splice junction won't fall into the category of reads that cross a splice boundary... However, the number of unknown splice junctions in the mouse genome may be low!

              JM



              • #22
                fraction of splice-crossing reads is a function of read length

                Hi,

                It's nice to see a spirited discussion of what the fraction of reads crossing splice junctions should be (and to be quoted in that debate). I'll just restate what is, in retrospect, obvious, based on looking at a couple of years' worth of data.

                The fraction of reads crossing known or novel junctions is a function of both the read length and the distribution of exon lengths. For mammalian genomes like human and mouse (I claim no knowledge of the grapevine genome), less than 3% of 25mers end up crossing a known splice junction; this number in the same samples & genomes goes up to 20% by the time you use 75mers. This number does not change dramatically in a well-annotated genome when you include novel splices!

                It's worth pointing out that since RNA-seq is a numbers game, many people, such as our lab, used raw reads rather than the Solexa quality-filtered reads (and we can discuss whether Illumina's pre-1.0 quality scores or filters made any sense at all); this strategy is sometimes referred to as "quality-filtering by mapping".

                One particularly interesting aspect of Illumina's reads is that a fraction of the reads accumulate most of their errors at the 3' end [in fact we took 32mers and truncated them to 25mers just to avoid the errors accumulating in bases 26-32]. Given the error rates with the old reagents, it's not surprising at all that 40% of the reads would have more than 2 errors in 25bp.
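                That 40% figure is consistent with a simple binomial model of independent per-base errors (the ~9% per-base error rate below is a back-calculated assumption for illustration, not a number from the thread):

```python
from math import comb

def frac_within_mismatches(read_len, per_base_err, max_mm=2):
    """P(a read has <= max_mm errors), treating sequencing errors
    as independent per-base events (a simplifying assumption)."""
    return sum(comb(read_len, k)
               * per_base_err ** k
               * (1 - per_base_err) ** (read_len - k)
               for k in range(max_mm + 1))

# With a hypothetical ~9% per-base error rate, roughly 40% of
# 25mers carry more than 2 errors and fail a 2-mismatch aligner:
print(round(1 - frac_within_mismatches(25, 0.09), 2))  # → 0.39
```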

                Nowadays, 2x75 reads on a good run with the new reagents are of far better quality than those 25mers.

                Ali



                • #23
                  Confused

                  Originally posted by alim View Post
                  Hi,

                  The fraction of reads crossing known or novel junctions is a function of both the read length and the distribution of exon lengths. For mammalian genomes like human and mouse (I claim no knowledge of the grapevine genome), less than 3% of 25mers end up crossing a known splice junction; this number in the same samples & genomes goes up to 20% by the time you use 75mers. This number does not change dramatically in a well-annotated genome when you include novel splices!

                  It's worth pointing out that since RNA-seq is a numbers game, many people, such as our lab, used raw reads rather than the solexa quality-filtered reads (and we can discuss whether Illumina's pre-1.0 quality scores or filters made any sense at all); this strategy is referred to sometimes as "quality-filtering by mapping".

                  One particularly interesting aspect of Illumina's reads is that a fraction of the reads accumulate most of the errors in the 3' end of the reads [in fact we took 32mers and truncated them into 25mers just to avoid the errors accumulating in bases 26-32]. Given the error rates with the old reagents, it's not surprising at all that 40% of the reads would have more than 2 errors in 25 bp.
                  Ali

                  Great to have you here! Thanks for the explanation on filtering. That clears things up a lot.

                  Part of the paper can be confusing. The RNA-seq was carried out on 3 types of tissues (brain, liver, muscle), yet the Methods state that total RNA was pooled from tissues. That statement is misleading.

                  It'd be great if you could clear these up for me:
                  -The totals in your Supplementary Table 1 don't add up to 140 million reads.
                  -What I understand from the paper is that all the "unmappable" reads that map to neither the genome nor the splice junctions are potentially novel transcripts, is that right?
                  (Just to clarify from the author, although others have stated that in a previous post.)
                  -Regarding the percentage of reads at splice junctions... don't you think that splice-spanning reads should be divided by the total reads that map to the genome instead of the entire dataset?

                  By now, everyone knows that 25bp reads are too short to map everything uniquely. With reads only 25bp long, I would imagine all the reads will align somewhere in the genome.

                  Thanks in advance,
                  Melissa

                  Oops... I realized I made a big mistake when I said "Maybe some reads that are overlapping/spanning exon splice junctions are lost after mapping." I forgot that RNA-seq data is not supposed to contain introns (unless intron retention occurs... which is less likely to happen than in plants).



                  • #24
                    Hi Melissa,

                    Let me see if I can answer your questions.

                    The reference to "pooled tissues" means that the RNA samples (coming from a third party) were not from a single mouse brain but from many mouse brains pooled (the same goes for the liver sample and the muscle sample), not that we mixed brain+liver+muscle in most of the analysis or in any of the sequencing. Note that these RNA samples are from the inbred mouse strain that was sequenced, so we do not expect to have many transcripts affected by variation/SNPs/etc.

                    - The total raw reads, including the unmapped ones, add up to 216M reads, but since only 65% of the reads mapped on average, we used 140M reads in the analysis.

                    - Remember that we map the reads to the mouse reference genome in addition to known splice junctions. Assuming that novel transcripts are like existing transcripts, then most of their reads should map to the mouse reference genome. While there are undoubtedly thousands of junctions that we are missing, these are unlikely to be from highly expressed transcripts (say the top 10 expressed genes in muscle or liver), which are the best characterized transcripts & contribute disproportionately to the read counts.

                    - A better explanation for the 40% that aren't mapping is that they are likely to have the same distribution of unique/splice/multi reads as the other 60%, but with more than 2 mismatches... most of them would map to the same places as the other reads, which means primarily on known transcripts!

                    - The numbers in Supplemental Table 1 are just for illustration purposes. ERANGE does all of its calculations using mapped (unique + spliced + multi) reads.

                    - Actually we see a surprising amount of intron retention in some genes! Clearly, there is still much work to be done in getting gene models to be comprehensive, even in genomes as extensively studied as human and mouse (and C. elegans & Drosophila too).

                    I hope this helps!

                    Ali



                    • #25
                      Thanks, Ali, for the detailed explanation. Longer reads will definitely reduce unmappable reads, as shown here. Longer read lengths increase unique reads in RNA-seq.



                      • #26
                        Originally posted by Melissa View Post
                        Thanks, Ali, for the detailed explanation. Longer reads will definitely reduce unmappable reads, as shown here. Longer read lengths increase unique reads in RNA-seq.
                        That's not necessarily true. Consider 100bp paired-end reads. You could have reads that span more than one exon junction (even three or four if the exons are small). This is a significant bioinformatic challenge, especially when we wish to find novel transcripts and new exons.

                        In the latter case, you cannot map to all known or predicted exons (or all possible junctions), since you might be missing new exons (especially if the sample has a high mutation rate, i.e. cancer, which could create gene fusions). So really you have to map back to the full DNA reference, tolerating that a read can span many junctions. My cursory look has not found anything that can handle millions, if not billions, of these reads (we want to sequence thousands of human samples at good coverage) while searching for such junctions.

                        Finally, the mapping above is done with very low sensitivity. If you allow only 2 mismatches, you are requiring that the reads you map have a <2% error rate (for 100bp reads), which is unrealistic and creates the majority of "unmapped" reads (you were not sensitive enough).

                        On the other hand, this means that people like us have a job!



                        • #27
                          Originally posted by nilshomer View Post
                          That's not necessarily true. Consider 100bp paired end reads. You could have reads that have more than one exon junction (even three or four if the exons are small). This is a significant bioinformatic challenge, especially when we wish to find novel transcripts and new exons.
                          This should not be a problem (for mapping) as long as your aligner is capable of finding the best match within the read. AFAIK there is only one method that claims to do this, but at least AB is starting to go in this direction with the RNA-seq pipeline (by identifying the best match from either end of the read). And of course you could split your 2x100bp reads into 8x25bp and use ISAS to map those reads really fast.



                          • #28
                            Originally posted by Chipper View Post
                            This should not be a problem (for mapping) as long as your aligner is capable of finding the best match within the read. AFAIK there is only one method that claims to do this, but at least AB is starting to go in this direction with the RNA-seq pipeline (by identifying the best match from either end of the read). And of course you could split your 2x100bp reads into 8x25bp and use ISAS to map those reads really fast.
                            1. Please explain to me how the "best" match would translate directly (without further reconstruction) into the current isoform/transcript. I would like to know how the "best match" works with a read covering 3 exons.

                            2. You cannot split the reads since you do not know where the splice junctions exist in the read. If you did split, you could use any aligner since they are all fast (with a few exceptions).



                            • #29
                              Melissa's statement was that longer reads increase the possibility of getting a unique (exonic) sequence, and the OP's question was about the fraction of unmappable reads. So a read spanning 3 exons would then be correctly aligned to the middle exon. Of course, as you say, the problem lies in defining the isoforms. Re 2): I did not say you have to be able to align all parts of the reads, but rather that splitting increases the likelihood of finding the alignment of a read compared to having only one 25mer from the same transcript.



                              • #30
                                Longer reads will definitely help the mapping process for any seq experiment.

                                It may not necessarily be correct to map to the middle exon. For example, suppose the exons are 40bp, 20bp, and 20bp respectively. Then in the majority of cases (for reads spanning the exons), the first exon contributes the most sequence to the read. If you want to find the "best fit" exon, it would be the first.
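                                This 40/20/20 example can be checked with a short sketch (the overlap bookkeeping and read placement are purely illustrative):

```python
def best_fit_exon(exon_lengths, read_start, read_len):
    """Index of the exon contributing the most bases to a read
    placed at read_start on the spliced transcript."""
    contrib, exon_start = [], 0
    for length in exon_lengths:
        exon_end = exon_start + length
        overlap = max(0, min(exon_end, read_start + read_len)
                         - max(exon_start, read_start))
        contrib.append(overlap)
        exon_start = exon_end
    return contrib.index(max(contrib))

# 40+20+20bp exons; a 60bp read starting at position 10 overlaps
# them by 30, 20 and 10 bases, so the first exon is the "best fit":
print(best_fit_exon([40, 20, 20], 10, 60))  # → 0
```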

                                As long as it is admitted that the ultimate goal is either to reconstruct the isoforms, for which the above is an intermediate step, or to measure gene expression, where we simply care about the abundance of transcripts from a given gene and so do not need to find the isoforms, then I would tend to agree with any intermediate approach.

                                It's a difficult problem, and I am still unsure of the best path of attack.

