  • Duplicate reads ("same start" reads) in 454 FLX/Titanium shotgun runs

    Hi all,

    I have been performing an in-depth quality analysis of some of our 454 whole-genome shotgun runs for a fungal species (~35-70 Mb genome) and a plant species (~1 Gb genome), from both FLX and Titanium runs. In both datasets, between 15 and 35% of the reads in each individual run are duplicate reads, i.e. the first 100 nt or more are exactly the same and they start at exactly the same nucleotide. Even though both genomes are repetitive (to some extent), this is far more than expected by chance alone. Our hypothesis at the moment is that these duplicates are a result of the emulsion PCR step, but we think the percentage is really on the high side! Between runs from the same library there are not so many duplicates, so it is not a library issue. Furthermore, we observe roughly the same numbers for paired-end libraries, which supports our hypothesis of this being an emPCR problem.
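For anyone wanting to reproduce this kind of count, here is a minimal sketch of the duplicate definition used above (identical first 100 nt). The function name and the `prefix_len` parameter are made up for illustration, and an identical prefix is only a proxy for "same start position": without mapping, true repeats in the genome will be counted as duplicates too.

```python
from collections import Counter


def duplicate_fraction(reads, prefix_len=100):
    """Fraction of reads whose first `prefix_len` bases were already seen.

    Reads shorter than `prefix_len` are skipped (an assumption of this sketch).
    """
    prefixes = Counter(
        read[:prefix_len].upper() for read in reads if len(read) >= prefix_len
    )
    total = sum(prefixes.values())
    if total == 0:
        return 0.0
    # Every read beyond the first one in a prefix group counts as a duplicate.
    return (total - len(prefixes)) / total
```

Running it per run and then on the pooled runs of one library would show whether the duplicates are run-specific, as described above.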

    Does anyone here have any experience with such analyses, and if so, do you find similar numbers?
    Last edited by [c]oma; 03-27-2009, 09:34 AM.

  • #2
    I can only tell you that the emulsion PCR has a huuuuge quantification bias.

    When we were running amplicon sequencing, we normalized the amplicon DNA amounts before the emulsion PCR, yet after the emulsion PCR there were sometimes 100-fold differences in coverage between certain amplicons.

    So yeah, it could be that certain sequences in the emPCR are preferentially amplified. Are those sequences the short sequences?

    How does the experiment work? Is there any amplification prior to the emulsion PCR?


    • #3
      Some of the duplicated reads are indeed short sequences, but a good majority of them fall inside the normal sequence length distribution. There is no PCR step involved prior to the emPCR as far as I know, since it's a genomic shotgun library.

      As for amplicon sequencing bias, a colleague of mine pointed me to this publication (granted, it's about Illumina data, but some of it might also apply to 454). Maybe it can help you understand your data better.

      I realize now that I underestimated the contribution of repeats to this phenomenon, so I am currently looking into that. Nonetheless any insight into this is appreciated!


      • #4
        [c]oma,

        We did have a researcher report discovering this in their data as well. I never had a chance to follow up on any other samples, so I can't say how common this problem is in our hands. Our immediate thought was duplicates generated during the library amplification, since that is the most logical explanation. As in your case, though, there was evidence that it was not occurring at this step (different duplicates observed in multiple runs of the same library). The only explanation we could think of is that during the emulsification step some micelles (micro reactors in 454 speak) were created that contained a single DNA molecule but multiple beads. This was only a hypothesis; we never did anything to test it. Maybe when I get some free time (yeah, right) I'll look at some of our other 454 runs to see how many duplicates may exist.


        • #5
          I think I have noticed the same thing...

          I am looking at a public 454 GS20 dataset from the paper "Sampling the Arabidopsis Transcriptome with massively parallel pyrosequencing" (Weber et al., Plant Physiology, May 2007). This was actually an early 'RNA-Seq' experiment, not a genome sequencing project. I have never worked with the equipment, though, so I'm no expert here.

          In any event, there seems to be an unexpected number of reads that are duplicates (multiple reads with exactly the same read start position and read length). Often you can see the exact same read two or three times, and in one extreme case (the extremely highly expressed Rubisco gene), there are about 4000 duplicates of one short 72 bp read.

          In some cases, I suppose, the duplicates could be a result of the end of a transcript, i.e. any fragment starting x bp before an end of transcription will have the same length. But a lot of these reads occur in the middle of known gene models. Maybe occasionally they are short non-coding RNAs. But there are so many that it seems like it must be a technical bias...


          • #6
            I am quite familiar with that data set! As you pointed out this is an RNA sequencing project so that is an added complication. The cDNA was generated using the Clontech SMART PCR protocol which is supposed to generate full length cDNAs but this could introduce a bias. Also, if you look at some of the supplementary figures for the paper you will see that there is a bias for reads starting at the 5' or 3' end of the predicted cDNA. This is to be expected. Unlike genomic DNA where the fragmentation should produce random start points, cDNA will always have the fixed end points to start sequencing from.

            An important point to understand about the chemistry of the 454 sequencer is that if two reads start at exactly the same point and there are no missed or extra incorporations, then they will end at exactly the same point. The 454 runs a fixed number of sequencing cycles, so the read length is going to be fixed for a given sequence. The GS20 (which was used for this study) ran 42 cycles with the base order TACG. If the bases in the sample are randomly distributed you should see on average 2.5 bases incorporated per cycle, or an average read length of 105 nt. If the bases happen to be a repetitive stretch in the exact same order as the flow cycle, you would get 4 bases incorporated per cycle for a read length of 168 nt. If the base order of the read were adverse to the flow order, you can see that you would end up with a length shorter than the expected ~100 nt. This may explain the 4000 72 nt RuBisCO reads. This library was prepared from leaf and was not normalized. There was a toooooooon of RuBisCO mRNA present. In fact only 10 transcripts (RuBisCO and chlorophyll subunits) accounted for >50% of all reads.
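To make the fixed-cycle argument concrete, here is a rough sketch of an idealized flow simulation. It assumes error-free incorporation, a TACG flow order and 42 cycles as stated above, and it ignores key sequences, adapters and quality trimming; the function names are hypothetical. On each flow the polymerase extends through every consecutive template base that matches the flowed nucleotide.

```python
import random


def flow_read_length(template, flow_order="TACG", cycles=42):
    """Number of template bases read after `cycles` passes of `flow_order`."""
    pos = 0
    for flow in range(cycles * len(flow_order)):
        base = flow_order[flow % len(flow_order)]
        # A flow incorporates the whole homopolymer run of the matching base.
        while pos < len(template) and template[pos] == base:
            pos += 1
    return pos


def mean_random_read_length(n_templates=200, template_len=400, seed=0):
    """Average simulated read length over uniformly random templates."""
    rng = random.Random(seed)
    lengths = [
        flow_read_length("".join(rng.choice("ACGT") for _ in range(template_len)))
        for _ in range(n_templates)
    ]
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    print(flow_read_length("TACG" * 100))  # template in phase with the flow order
    print(flow_read_length("GCAT" * 100))  # template adverse to the flow order
    print(mean_random_read_length())       # uniformly random templates
```

A template in phase with the flow order gives the 168 nt maximum, an adverse one gives far less, and two fragments of different insert length that share the same start give the same read as long as both are longer than the read. Note that this idealized uniform-base model tends to come out slightly above the 2.5 bases per cycle quoted above, so treat both figures as rules of thumb.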


            • #7
              thanks!

              Hi kmcarr,

              Thanks for the response. One thing I don't quite understand is when you say "if two reads start at exactly the same point and there are no missed or extra incorporations then they will end at exactly the same point." For this (RNA-Seq) data set, my understanding is that they did a nebulization step to randomly shear the cDNA. Is it possible that there could be differing length fragments at the same start position for this reason? Or should all the fragments be much larger than 100 or so nucleotides? I didn't see any explicit mention of a size selection step.

              Also, I have been looking at the read length distribution and there is one peak at around 70 nt or so, and one peak at around 100 nt. I normalized by transcript so that I only count a random 3 reads per transcript. This way genes like rubisco won't have an unfair contribution. What do you think would cause this bimodal distribution? It's important that I understand this because we are trying to model these distributions in an analysis method we are working on.
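The per-transcript normalization described here could look roughly like the sketch below. The input format (pairs of read ID and transcript ID) and the function name are assumptions made for illustration, not the actual pipeline.

```python
import random
from collections import defaultdict


def subsample_per_transcript(read_to_transcript, max_reads=3, seed=0):
    """Keep at most `max_reads` randomly chosen reads per transcript.

    `read_to_transcript` is an iterable of (read_id, transcript_id) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_transcript = defaultdict(list)
    for read_id, transcript_id in read_to_transcript:
        by_transcript[transcript_id].append(read_id)
    return {
        tx: ids if len(ids) <= max_reads else rng.sample(ids, max_reads)
        for tx, ids in by_transcript.items()
    }
```

The read length histogram is then built only from the kept reads, so that a transcript with thousands of identical reads contributes no more than three.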

              Brian


              • #8
                Yes, the cDNA is nebulized and the average fragment size should be 500-800 bp. (There is an issue that nebulization won't break dsDNA less than ~700 bp so that is another complication when dealing with cDNA; the shearing is not as "random" as it is with genomic DNA.) There is a size selection to remove fragments < 300 bp so the vast majority of the library sequences should be much longer than the expected 100 nt (for GS 20) sequence length.

                If I have 10 fragments all originating from the 5' end of 10 copies of the same cDNA but all of varying lengths between 500-800 nt, and the 454 adapters are ligated in the same orientation, then I should get exactly the same sequence from all of them. The sequence will start at the 5' end and will stop when the machine has completed its 42 cycles, regardless of how long the inserted fragments are. This is what I mean by "if two reads start.....".

                I would have to look at the data to be sure but I think that the bimodal distribution is an artifact of the cDNA preparation method and the read trimming process. Reads originating from either the 5' or 3' ends of cDNAs would include the SMART kit adapters which would then be clipped off by our trimming pipeline. The most commonly trimmed size from the 5' end is ~30nt. Reads not including these adapters would be closer to their full read length (~100nt) after trimming.


                • #9
                  thanks, that's very helpful information. I will go back and check if the short pile of reads correlates with the ends of known gene models...


                  • #10
                    [c]oma,
                    We have found the same problem with our runs and found similar numbers. We think there is an inherent problem related to the emPCR. We asked the provider's technical support, and after many inquiries they told us the normal range is around 18%. By the way, we have not received any advice on how to reduce it.


                    • #11
                      Thank you all for your replies. It is reassuring to see others are seeing the same things we are, so it doesn't seem to be something we are doing wrong. But I still don't really like it...


                      • #12
                        I'll chime in to say that I've heard (through a colleague, who heard from someone else, etc.) that this is indeed an artifact of the emulsion PCR, where either (like kmcarr's explanation) droplets contained multiple beads but one piece of DNA, or DNA escapes from droplets during the PCR and colonizes empty beads ... in any case, same read start and stop, and base calls.

                        Note that newbler (gsAssembler) and gsMapper account for this by default; I don't know if they collapse identical reads and then treat them as one read, or if they collapse them to one, but add to the base qualities because of the technical replication, but in any case the code is "aware" of this issue. Doesn't help if you're not exclusively using the 454 pipeline, though. I've used CD-HIT to cluster near 100% identical reads with 1 or 2 overhanging bases ...


                        • #13
                          Originally posted by jnfass
                          Note that newbler (gsAssembler) and gsMapper account for this by default; I don't know if they collapse identical reads and then treat them as one read, or if they collapse them to one, but add to the base qualities because of the technical replication, but in any case the code is "aware" of this issue.
                          Can anyone provide insight on how exactly newbler and gsMapper "account for" stacking reads? Or know where this is documented? This could be crucial for certain sequencing designs/applications...


                          • #14
                            stacking by chance alone?

                            Thought I'd bring this topic back up again to see if anyone can offer some additional advice. We are seeing this stacking effect in our shotgun library (reads with the same start). However, we have a dominant organism (~75% of the sample), which leads to an extremely high read depth in some regions (>700X). Couldn't we get reads starting in the same position by chance alone with such a high depth? Naively, let's look at a 500 bp region with 1000x coverage. Say one new read starts every 5 bp in the region, meaning that there are 100 total read starts. 1000x coverage / 100 read starts = 10x coverage per read start by chance alone. How can this be differentiated from the duplicate read effect generated by emPCR? By read length, or by identity over the whole read? Could we implement a rule along the lines that if reads start in the same position but differ in length by more than 1% (incorporation errors could change the length of duplicate reads), then they are not duplicates?
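One way to put a number on "by chance alone" is a simple Poisson model. The sketch below assumes read starts are uniformly random and independent, and that reads come from both strands; the read length you plug in and the function name are assumptions. With coverage C and mean read length L, the expected number of reads per (position, strand) start is λ = C / (2L), and the expected fraction of reads that share a start with an earlier read is 1 - (1 - e^(-λ)) / λ.

```python
import math


def expected_chance_duplicate_fraction(coverage, read_length, strands=2):
    """Expected fraction of same-start reads under uniform random sampling."""
    lam = coverage / (read_length * strands)  # mean reads per start site
    if lam == 0:
        return 0.0
    # Occupied start sites per site is 1 - exp(-lam); reads per site is lam.
    occupied = -math.expm1(-lam)
    return 1.0 - occupied / lam


if __name__ == "__main__":
    for cov in (20, 100, 700):
        print(cov, round(expected_chance_duplicate_fraction(cov, 400), 3))
```

Under these assumptions, chance duplicates are rare at ordinary shotgun depth but become a substantial fraction of reads at several hundred fold coverage, so an observed duplicate rate only points to emPCR when it clearly exceeds this baseline for the local depth.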

                            There's also an interesting twist in some cases. In one instance, a bunch of reads start in the same location with a homopolymer run (say TTTT). Some reads have 3 T's, some have 2 T's, some have 4 T's. Should we interpret this as being sequencing error alone?
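Since over- and under-calling of homopolymer lengths is a well-known 454 error mode, one option is to compare same-start reads after collapsing each homopolymer run to a single base. This is a swapped-in heuristic (homopolymer compression), not something the 454 software is known to do, and the function names and the `min_len` threshold are hypothetical.

```python
from itertools import groupby


def hp_compress(seq):
    """Collapse every homopolymer run to one base, e.g. TTTTACGG -> TACG."""
    return "".join(base for base, _run in groupby(seq.upper()))


def same_modulo_homopolymers(read_a, read_b, min_len=20):
    """True if two reads agree over their shared homopolymer-compressed prefix."""
    a, b = hp_compress(read_a), hp_compress(read_b)
    shared = min(len(a), len(b))
    return shared >= min_len and a[:shared] == b[:shared]
```

Reads that differ only in how many T's were called would then be grouped together, while reads that differ by a substitution or a real indel stay apart. It cannot tell you whether the length differences are errors or real variation; it only keeps them from splitting a duplicate stack.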


                            • #15
                              Might be your library prep. See the Turner et al. paper.
