  • why low mapping rates for RNAseq?

    Hi everyone!

    I must say, I'm very happy to find a community where we can discuss this new technology.

    I have searched the forum, but could not turn up a thread that discusses the issue of unmappable RNAseq reads.

    According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. There is some mention that perhaps this unmappable sequence is from antisense transcripts or artefacts of the RNA processing.

    Interestingly, J. Shendure mentions being able to achieve 95% mapping with genomic DNA.

    Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully? Are the majority of these unmappable reads just full of sequencing errors? Could there be contamination? And what is meant by artefacts in making a sequencing library from RNA? What would these artefacts look like to make them unmappable?

    Thanks for any thoughts.

  • #2
    Hi,
    I think you must be more precise. What is the length of the reads you want to place, and how many errors do you allow for each read? Are reads that are placed in multiple copies considered placed or not?

    Actually, I was wondering about unplaced reads during last weekend (for the happiness of my girlfriend). In particular, I think that if we exclude reads that have low quality, the unplaced reads hide some non-trivial information...


    • #3
      Hi francesco!

      thanks for joining in on the discussion.

      For example, I have examined some published data with 25 bp reads and aligned them to the mouse genome, allowing at most 2 mismatches. About 60% of the reads are alignable under those parameters. I looked at the remaining unmapped reads for possible contamination, but only 2% mapped to human or E. coli, for example.

      Perhaps I'll take another look at the unmapped reads and allow for 3 or 4 mismatches and see how many more reads I can recover for mapping. But I'm sure there will be many still unmapped, and I wonder if this is just because there are more errors than advertised by Illumina, or what else this could be.

      By non-trivial, do you mean some functional sequences? Any guesses what these other sequences could be? I'm very curious.

      In my opinion, it seems really odd to spend several thousand dollars on an expensive experiment and only get 50% to 60% of the data, no?
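
The "at most 2 mismatches" criterion discussed above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the reference and reads are made up, the function names (`hamming`, `find_hits`, `mapping_rate`) are hypothetical, only the forward strand is scanned, and gaps are not allowed. A real short-read aligner uses an index and is far faster.

```python
from __future__ import annotations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read: str, reference: str, max_mismatches: int = 2) -> list[int]:
    """Return every 0-based reference offset where the read aligns
    without gaps and with at most max_mismatches substitutions."""
    k = len(read)
    return [
        i
        for i in range(len(reference) - k + 1)
        if hamming(read, reference[i:i + k]) <= max_mismatches
    ]


def mapping_rate(reads: list[str], reference: str, max_mismatches: int = 2) -> float:
    """Fraction of reads with at least one hit (forward strand only)."""
    if not reads:
        return 0.0
    mapped = sum(1 for r in reads if find_hits(r, reference, max_mismatches))
    return mapped / len(reads)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"  # made-up toy sequence
    reads = [
        "TAGCCGATAG",  # exact copy of reference[8:18]
        "TAGCTGATAG",  # same window with one substitution
        "TAGCTGTTCC",  # same window with four substitutions
    ]
    # Compare the mapping rate under a strict and a relaxed threshold.
    for mm in (2, 4):
        print(mm, mapping_rate(reads, reference, mm))
```

Relaxing the threshold recovers more reads, but a read that hits exactly one offset at 2 mismatches may hit several at 4, so `find_hits` returning more than one offset is worth tracking separately from "mapped at all".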


      • #4
        Originally posted by NGSfan View Post
        Hi francesco!
        In my opinion, it seems really odd to spend several thousand dollars on an expensive experiment and only get 50% to 60% of the data, no?
        Well, this depends a lot on which reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the genome of the grapevine. When we sequence the reference plant that we used to obtain the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we are able to align only 60% of the reads. This is not strange, because the two organisms are different.

        You are using 25-base reads, which means they are really old (Illumina can now produce 75-base paired-end reads) and were probably analysed with the old pipeline. You can find one interesting data set at http://tinyurl.com/68aeq3 and in the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".

        As for the non-trivial information in unaligned reads, there are many possibilities, such as repeated regions with a lot of errors (compared to the reference sequence) or totally new inserts.

        Originally posted by NGSfan View Post
        In my opinion, it seems really odd to spend several thousand dollars on an expensive experiment and only get 50% to 60% of the data, no?
        No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads means more than 2 gigabases of data, and all this data is obtained at a fraction of the cost of the methods available only 1 year ago.
        The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for a totally different kind of data.

        Obviously this is only the opinion of a PhD student four months in...


        • #5
          Unmappable reads sounds new to me. I'm particularly interested in difficulties in de novo transcriptome assembly.

          Can you define these unmappable reads more precisely? Are you referring to the raw data or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than the nature of the sequences. For example, low-quality reads due to the platform's temperature problems and artifacts created at the edges of the flow cell.

          Most RNA-seq data usually contain 30-40% rRNA. After filtering low quality, contaminating reads and polyA tails, 50-60% of reads sounds REALLY good to begin with. So, the answer is NO to whether it's a waste to get only 50-60% reads. High redundancy is another reason why some reads are not useful at all.

          I'm not sure of the reason why some reads cannot map to the reference genome. Well, they just don't. Maybe some reads that overlap/span exon splice junctions are lost during mapping. RNA processing and other regulatory mechanisms sound like a good explanation.

          Cheers,
          Melissa


          • #6
            Originally posted by NGSfan View Post
            Hi everyone!

            I must say, I'm very happy to find a community where we can discuss this new technology.

            I have searched the forum, but could not turn up a thread that discusses the issue of unmappable RNAseq reads.

            According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. There is some mention that perhaps this unmappable sequence is from antisense transcripts or artefacts of the RNA processing.

            Interestingly, J. Shendure mentions being able to achieve 95% mapping with genomic DNA.

            Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully? Are the majority of these unmappable reads just full of sequencing errors? Could there be contamination? And what is meant by artefacts in making a sequencing library from RNA? What would these artefacts look like to make them unmappable?

            Thanks for any thoughts.
            What is your reference genome? Whole genome? Or just transcriptome? And if the latter, what's your definition of transcriptome? In the distant past, I used Refseq as a reference genome, but I now think that's too limiting.

            50-60% being "unmappable" sounds strangely high to me.

            In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

            For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.

            In that experiment, only an additional 9.5% of the data aligned to Refseq. That means that only 20.5% of my data was "unmappable" for whatever reason. This was also using an older version of our aligner (BFAST), so it's possible more of them would be aligned if I were to re-run the data.

            Suffice it to say, in that experiment my major issue was the rRNA contamination. As Melissa pointed out, a poly-A pulldown can do a good job alleviating this problem, and from what I've read it can enrich your sample for mRNA such that you only end up with 30-40% rRNA contamination after a single round of poly-A purification.

            That said, if your reference genome doesn't include ribosomal RNA, it's possible the "unmapped" reads are mostly ribosomal RNA. Do you know if your reference contains rRNA?
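
One way to check where reads are going is to screen them against several references in a fixed order and tabulate the result, which is the bookkeeping behind the numbers above (100 - 70 - 9.5 = 20.5% left unmapped). The Python sketch below is only an illustration under stated assumptions: the sequences are made up, the function names are hypothetical, and exact substring matching stands in for a real aligner.

```python
from __future__ import annotations

from collections import Counter


def classify_reads(reads: list[str], references: list[tuple[str, str]]) -> Counter:
    """Assign each read to the first reference (in the order given) that
    contains it as an exact substring; otherwise count it as 'unmapped'."""
    counts: Counter = Counter()
    for read in reads:
        label = next(
            (name for name, seq in references if read in seq),
            "unmapped",
        )
        counts[label] += 1
    return counts


def as_percentages(counts: Counter) -> dict[str, float]:
    """Convert raw counts into percentages of all reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Made-up toy references; rRNA is screened first so that rRNA reads
    # are not silently lumped in with the unmapped fraction.
    references = [
        ("rRNA", "GGGTTTCCCAAA"),
        ("RefSeq", "ACGTACGTAGCT"),
    ]
    reads = ["GGGTTT", "TTTCCC", "ACGTAC", "TTTTTT"]
    print(as_percentages(classify_reads(reads, references)))
```

If the reference used for the main alignment lacks rRNA sequences, the "rRNA" category in a screen like this shows how much of the "unmapped" fraction it accounts for.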
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]


            • #7
              Hi,

              As mentioned by Melissa, exon/exon junctions will significantly reduce the number of mappable reads! So the number of mapped reads will be determined by the characteristics of your genome. In particular, the number of exons per gene has to be taken into account!
              Recently I've worked on RNA-Seq from grapevine genome and we've mapped around 80% of the initial reads, but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
              You'll find more information here: http://seqanswers.com/forums/showthread.php?t=1015

              Hope this helps,

              Cheers,
              Jean-Marc


              • #8
                Originally posted by jmaury View Post
                Hi,
                Recently I've worked on RNA-Seq from grapevine genome and we've mapped around 80% of the initial reads, but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
                We have run similar experiments on grapevine (I work at IGA in Udine) and obtained the same results. Actually, the number of reads that span the exon/intron junctions is really low.


                • #9
                  Originally posted by francesco.vezzi View Post
                  Well, this depends a lot on which reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the genome of the grapevine. When we sequence the reference plant that we used to obtain the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we are able to align only 60% of the reads. This is not strange, because the two organisms are different.
                  Yes, it definitely seems that one can get higher rates of mapped reads when one aligns reads generated from genomic DNA.

                  But for RNA-seq, looking at the literature (the mouse RNA-seq studies by Mortazavi, Pan, Sultan, etc.), the trend is that 50-60% of all the generated reads can be mapped when using the entire genome as the reference sequence.

                  Originally posted by francesco.vezzi View Post
                  You are using 25-base reads, which means they are really old (Illumina can now produce 75-base paired-end reads) and were probably analysed with the old pipeline. You can find one interesting data set at http://tinyurl.com/68aeq3 and in the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".
                  For sure longer reads will help, especially for genome assembly and re-sequencing. But the problem for RNA-seq appears to be different.

                  Originally posted by francesco.vezzi View Post
                  As for the non-trivial information in unaligned reads, there are many possibilities, such as repeated regions with a lot of errors (compared to the reference sequence) or totally new inserts.
                  Actually the 50-60% mapped reads I mentioned includes reads that are mapped to multiple locations on the genome (ie. repeat regions, paralogs, etc). If you count just the reads that *uniquely* map to just one location in the reference genome, then the percentage drops to 44%.


                  Originally posted by francesco.vezzi View Post
                  No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads means more than 2 gigabases of data, and all this data is obtained at a fraction of the cost of the methods available only 1 year ago.
                  The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for a totally different kind of data.

                  Very true, compared to Sanger sequencing this is much more cost effective. But if people want to use RNA-seq for de-novo profiling or - taken a step further - quantitative gene expression measurement, then they should be aware that according to current results from published work, 40-50% of your data is not usable, that's 40-50% of a ~$6300 sequencing run. This is fine for labs with lots of money to burn, but smaller labs will have to consider this more carefully before they treat this as something routine.

                  It would be nice to know and figure out how to recover these reads, and take more advantage of the data.
                  Last edited by NGSfan; 04-23-2009, 05:33 AM.


                  • #10
                    Originally posted by Melissa View Post
                    Can you define these unmappable reads more precisely? Are you referring to the raw data or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than the nature of the sequences. For example, low-quality reads due to the platform's temperature problems and artifacts created at the edges of the flow cell.
                    I am referring to the primary sequence data - all the raw reads generated from a RNA-seq experiment.

                    That is interesting what you mention about a low temp problem and flow cell edges - I did not know about these issues. It would be interesting to know what fraction of reads are unmappable because of those issues.

                    Originally posted by Melissa View Post
                    Most RNA-seq data usually contain 30-40% rRNA. After filtering low quality, contaminating reads and polyA tails, 50-60% of reads sounds REALLY good to begin with. So, the answer is NO to whether it's a waste to get only 50-60% reads. High redundancy is another reason why some reads are not useful at all.
                    But if you are aligning to the entire genome, the rRNA reads should align as well, no?

                    Originally posted by Melissa View Post
                    I'm not sure of the reason why some reads cannot map to the reference genome. Well, they just don't. Maybe some reads that overlap/span exon splice junctions are lost during mapping. RNA processing and other regulatory mechanisms sound like a good explanation.
                    According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.


                    • #11
                      Originally posted by NGSfan View Post
                      It would be nice to know and figure out how to recover these reads, and take more advantage of the data.
                      It looks like a philosophical problem... what to do with something that at first sight has no importance?

                      Where did you get the data that started this discussion? I want to align them with our tool and see what happens...


                      • #12
                        Originally posted by Michael.James.Clark View Post
                        50-60% being "unmappable" sounds strangely high to me.

                        In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

                        For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.
                        Originally posted by Michael.James.Clark View Post
                        Do you know if your reference contains rRNA?
                        Yes I'm using the entire genome, so I would assume rRNA should be mapping as well, no?

                        Yes, 50-60% mapped is really odd (talking about just simply mapping reads to anything, 2 mismatches max, not talking about those uniquely mapping, which is only 44%), especially when considering that the whole genome is used as a reference, so things like poly-A, repeat regions, rRNA, should still be mapping.


                        • #13
                          Originally posted by NGSfan View Post
                          Yes I'm using the entire genome, so I would assume rRNA should be mapping as well, no?
                          Yes, it should. Of course, you can check.

                          Yes, 50-60% mapped is really odd (talking about just simply mapping reads to anything, 2 mismatches max, not talking about those uniquely mapping, which is only 44%), especially when considering that the whole genome is used as a reference, so things like poly-A, repeat regions, rRNA, should still be mapping.
                          Have you considered using something other than whole genome as a reference? Since you're using mouse, it should be easy to do.

                          Generate an adequate transcriptome reference and align to that. In our lab, we've used both Refseq and UCSC known genes as a reference successfully, although I feel we will do a better job the more permissive we become.

                          I've definitely seen in my own data even coverage across exon-exon junctions, and used exon coverage to identify splice variants.

                          For discovering novel transcripts, whole genome is fine. But for looking at expression, splice variants, et cetera, I think we're better off using a transcriptome reference.

                          Originally posted by NGSfan View Post
                          According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.
                          Can you reference the source for some of these estimates so we can evaluate that claim? I'd just like to take a look at the papers myself. Thanks.

                          Also, what alignment algorithm are you using? You say you are only robust against two mismatches. That's pretty low. Consider that means if you have a SNP and a sequencing error, you've filled your quota. This also means since you're using whole genome with 25-base reads that as soon as you're within 23 bases of an exon junction, you won't align any reads. Considering the size of a lot of exons, you'll just completely miss quite a few of them.

                          Again, I think aligning to a better reference will help you out on that front.
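
The junction argument above can be made concrete with a toy Python sketch. Everything here is an assumption for illustration: the exons, the poly-T "intron" and the helper names are made up, and the restricted alphabets are chosen only so that the outcome is easy to check by eye. It shows a 25-base read that straddles an exon-exon junction failing against the genome, where the intron separates the two halves, yet matching a spliced transcript reference.

```python
def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def aligns(read: str, reference: str, max_mismatches: int = 2) -> bool:
    """True if the read matches some ungapped window of the reference
    with at most max_mismatches substitutions (forward strand only)."""
    k = len(read)
    return any(
        mismatches(read, reference[i:i + k]) <= max_mismatches
        for i in range(len(reference) - k + 1)
    )


# Made-up toy gene: two 30-base exons separated by a 30-base intron.
exon1 = "ACCAACACCA" * 3
intron = "T" * 30  # artificial poly-T intron, purely for clarity
exon2 = "GGAGAGGAAG" * 3

genome = exon1 + intron + exon2   # what a whole-genome reference contains
transcript = exon1 + exon2        # what a spliced (mRNA) reference contains

# A 25-base read taken from the mRNA: 12 bases of exon1, 13 bases of exon2.
read = transcript[18:43]

if __name__ == "__main__":
    print("junction read vs transcript:", aligns(read, transcript))
    print("junction read vs genome:    ", aligns(read, genome))
    print("exon-internal read vs genome:", aligns(exon1[:25], genome))
```

An ungapped aligner has no way to skip the intron, so the only junction reads it can place on the genome are those whose overhang into the second exon is short enough to be absorbed by the mismatch allowance.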
                          Last edited by Michael.James.Clark; 04-23-2009, 09:21 AM.


                          • #14
                            Originally posted by Michael.James.Clark View Post
                            Generate an adequate transcriptome reference and align to that.
                            We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (this allows you to place, for example, half of a read in one exon and the other half in the following exon), you will probably discover that some exon junctions are not totally correct, and maybe you can determine new exons.

                            So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

                            Francesco


                            • #15
                              Originally posted by francesco.vezzi View Post
                              We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (this allows you to place, for example, half of a read in one exon and the other half in the following exon), you will probably discover that some exon junctions are not totally correct, and maybe you can determine new exons.

                              So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

                              Francesco
                              In some cases that's true. For mouse? Not as big a problem.

                              My advice is to generate a reference genome with all possible splice variants and align to it.

                              Otherwise you will see a massive drop in coverage at all exon-exon junctions as I said earlier, which will result in a large number of "unmappable reads" when you're only robust against two mismatches.

                              We've done it this way in our lab with human and it has worked.
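
A common way to build a reference that covers splice junctions is a junction library: for every ordered pair of exons in a gene, join the last (read length - 1) bases of the upstream exon to the first (read length - 1) bases of the downstream exon, so that any read crossing that junction fits inside the joined sequence. The Python sketch below illustrates that idea only; it is not necessarily the procedure used in the lab mentioned above, and the exon sequences and the `junction_library` name are made up.

```python
from __future__ import annotations

from itertools import combinations


def junction_library(exons: list[str], read_len: int) -> dict[tuple[int, int], str]:
    """Build one junction sequence per ordered exon pair (i < j).

    Each sequence joins the last read_len - 1 bases of exon i to the
    first read_len - 1 bases of exon j. Exons shorter than the flank
    contribute their full length.
    """
    if read_len < 2:
        raise ValueError("read_len must be at least 2 to span a junction")
    flank = read_len - 1
    library = {}
    for (i, upstream), (j, downstream) in combinations(enumerate(exons), 2):
        library[(i, j)] = upstream[-flank:] + downstream[:flank]
    return library


if __name__ == "__main__":
    # Made-up exons of a single hypothetical gene, in genomic order.
    exons = ["AAAAACCCCC", "GGGGGTTTTT", "ACACACACAC"]
    for pair, seq in junction_library(exons, read_len=4).items():
        print(pair, seq)
```

Using all ordered pairs rather than only adjacent exons also covers exon-skipping isoforms, at the cost of a library that grows quadratically with the number of exons per gene. It still only contains junctions between annotated exons, which is the limitation raised earlier in the thread.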