  • ian Adams
    Junior Member
    • Oct 2008
    • 6

    454 homopolymer errors or????

    I have some 454 data produced from cDNA that was randomly amplified using proofreading Taq, sheared and then sequenced. Once the amplification tags were removed, we assembled it using CLC Genomics Workbench. Looking at the protein translations of two particular genes, we see very good homology to genes from related organisms, but in two places there appear to be frameshifts in the DNA sequence that introduce stop codons and lead to incomplete proteins. Correcting the frameshifts would mean CCC to CCCC in one case and CCCC to CCC in the other, and would completely restore the protein homology. Both regions have over 200-fold coverage supporting the frameshift, with only one read giving the "corrected", unshifted sequence. I have read that the 454 does have problems with homopolymers. What I am wondering is whether this is biological (a mutation leading to an unusual terminated protein; I think this is unlikely, as these proteins are essential), a sequencing error (introduced by the 454 / emPCR) or a sample prep artifact (PCR error, although I was using proofreading Taq). Does anyone have any experience of this sort of thing, and any comments on the "introduced by the sequencing / emPCR" option?
    We are going to Sanger sequence across these regions and see what we get but that will take time.

    thx

    Ian
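To make the effect described above concrete, here is a minimal Python sketch of how losing a single C from a CCCC run shifts the reading frame and exposes a premature stop codon. The two sequences are hypothetical toy examples, not the poster's data, and `translate` is an illustrative helper rather than part of any assembler or of CLC.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate frame 1 up to (not including) the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[start:start + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequences, NOT the poster's data.
reference = "ATGGCCCCACTGATCGGAAAA"   # contains a CCCC run
one_c_short = "ATGGCCCACTGATCGGAAAA"  # same sequence with CCCC read as CCC

print(translate(reference))    # full-length toy peptide
print(translate(one_c_short))  # truncated: the shifted frame hits TGA
```

In this toy case the reference translates to seven residues, while the shortened version stops after three because the shifted frame reads TGA. The same comparison on the real contigs would show whether the "corrected" sequence restores the full-length protein.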
  • timread
    Member
    • Oct 2008
    • 14

    #2
    Ian -

    IMHO, when you have a ratio of 200 reads to 1, it is very unlikely that you are seeing homopolymer miscalls. Those typically occur when the base-calling software has difficulty setting the threshold between, say, 3C and 4C for a particular extension, and in that situation you will see both calls well represented among the reads. I'd go for a PCR problem or, more interestingly, a real frameshift. Verify the result with a Sanger read.

    tim
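The 200:1 argument can be sketched numerically. The snippet below assumes, purely for illustration, that an ambiguous 3C/4C threshold would make each read independently land on the "wrong" call with some probability `p`. Real 454 errors need not be independent across reads, and the values of `p` are made up. Under that assumption, the chance of 200 or more of 201 reads agreeing on the miscall is a binomial tail.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative, assumed per-read miscall probabilities (not measured values).
for p in (0.5, 0.8, 0.9):
    print(p, binomial_tail(201, 200, p))
```

Even with a per-read miscall probability as high as 0.9, the tail probability stays very small, which is why a 200:1 split points away from a borderline threshold call.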

    • glacerda
      Member
      • Aug 2008
      • 27

      #3
      I have compared a segment of 100K that we sequenced by Sanger (at > 20X) and by 454 (> 50X).

      Even at 50X, 454 made indel errors (all of them in homopolymers longer than 5 bp).

      In your case, you have a 4 bp homopolymer, in which an indel error would be very, very rare. I also think this is a PCR issue or a real frameshift.

      • vasvale
        Member
        • Mar 2008
        • 29

        #4
        Below is a reference for the PCR error. Can somebody explain why homopolymer errors are seen with Roche and not with other platforms (as I have heard)?

        Clarke LA, Rebelo CS, Gonçalves J, Boavida MG, Jordan P.
        PCR amplification introduces errors into mononucleotide and dinucleotide repeat
        sequences.
        Mol Pathol. 2001 Oct;54(5):351-3.
        PMID: 11577179 [PubMed - indexed for MEDLINE]

        • bioinfosm
          Senior Member
          • Jan 2008
          • 483

          #5
          As I understand..
          454 does not use reversible end-blockers (terminators) on its nucleotides, so multiple bases can be incorporated in a single flow, and the total signal intensity is used to determine how many bases were incorporated.

          Other platforms use blockers to make sure only one nucleotide is added at a time, which makes it a simple yes/no question with less chance for error.
          --
          bioinfosm

          • Alex Clop
            Member
            • Sep 2008
            • 17

            #6
            454 sequences by pyrosequencing. As defined by the Nature Reviews Genetics glossary, pyrosequencing is "a DNA sequencing technique that relies on detection of pyrophosphate release on nucleotide incorporation rather than chain termination with dideoxynucleotides."

            This chemistry i) adds one known nucleotide to the sequencing reaction at a time, ii) then removes the unincorporated nucleotides and iii) then adds another known nucleotide. It does this for the 4 nucleotides sequentially, and then the cycle is repeated. Each time a nucleotide is incorporated into the elongating chain, a luminescent reaction (luciferase) occurs and is detected by a CCD camera. This is what permits the basecalling and the sequence "reading".

            Regarding your question, the intensity of the luciferase activity ("I") is proportional to the number of identical nucleotides incorporated ("N") into the synthesized chain. However, if "N" is too high (I don't think there is a well-defined threshold), "I" will reach saturation, which leads to wrong "quantification" of the number of contiguous A, T, C, or G that have been incorporated.

            It is not simple to explain but it may help.
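The flow cycle described above can be sketched in a few lines of Python. This is an idealized model, not the instrument's software: it assumes a repeating flow order of T, A, C, G (the `flow_order` default is an assumption here), a noise-free signal exactly equal to the run length, and no saturation. `ideal_flowgram` is a hypothetical helper name.

```python
def ideal_flowgram(synthesized, flow_order="TACG"):
    """Return (base, run_length) per flow for an idealized pyrosequencing run.

    `synthesized` is the strand being built, 5' to 3'. Each flow offers one
    nucleotide; every consecutive matching base is incorporated in that flow,
    so a homopolymer yields a single flow with signal equal to its length.
    """
    unknown = set(synthesized) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    flows = []
    pos = 0
    while pos < len(synthesized):
        for base in flow_order:  # one full cycle of the four nucleotides
            run = 0
            while pos < len(synthesized) and synthesized[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
    return flows


print(ideal_flowgram("TTACCCG"))
```

Here the CCC run produces one C flow with an ideal signal of 3. Telling CCC from CCCC therefore depends entirely on distinguishing a signal near 3 from a signal near 4.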

            • timread
              Member
              • Oct 2008
              • 14

              #7
              The thresholds are calculated for each run by the analysis software based on the signal distribution. If you look at the 454BaseCallerThresholds.txt file in the Analysis (D_) directory you can see some stats for the run. e.g.

              distributionPeaks
              {
              signalPeak = 0, 0.12, Found;
              signalPeak = 1, 0.98, Found;
              signalPeak = 2, 2.00, Found;
              signalPeak = 3, 2.98, Found;
              signalPeak = 4, 3.92, Found;
              signalPeak = 5, 4.84, Found;
              signalPeak = 6, 5.68, Found;
              }

              thresholdsUsed
              {
              threshold = 0, 1, 0.68;
              threshold = 1, 2, 1.58;
              threshold = 2, 3, 2.52;
              threshold = 3, 4, 3.48;
              threshold = 4, 5, 4.44;
              threshold = 5, 6, 5.32;

              interpolationAmount = 0.94
              }


              For a good description of 454 errors see this manuscript ..
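As a rough illustration of how such thresholds turn a flow signal into a homopolymer length, here is a minimal Python sketch using the `thresholdsUsed` values quoted above. It assumes plain thresholding, which is a simplification and not necessarily what the Roche base caller does. `call_homopolymer_length` is a hypothetical helper, and signals above the last listed threshold are simply reported as 6.

```python
from bisect import bisect_right

# Upper boundaries copied from the thresholdsUsed block quoted above:
# a signal below 0.68 is called 0, between 0.68 and 1.58 is called 1, etc.
THRESHOLDS = [0.68, 1.58, 2.52, 3.48, 4.44, 5.32]


def call_homopolymer_length(signal, thresholds=THRESHOLDS):
    """Call the homopolymer length as the number of thresholds at or below the signal."""
    return bisect_right(thresholds, signal)


# Two nearly identical signals on either side of the 3/4 boundary (3.48)
for signal in (2.98, 3.45, 3.50, 3.92):
    print(signal, call_homopolymer_length(signal))
```

Signals of 3.45 and 3.50 differ by very little but are called as 3 and 4 bases respectively, which is the kind of borderline case where both calls would be expected to show up among the reads.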

              • Layla
                Member
                • Sep 2008
                • 58

                #8
                Dear 454 data analysers...
                We have data generated on the 454 from a NimbleGen capture array experiment. I have read that 454 has issues with AT-rich genomes and homopolymers. Having looked at the HCDiffs file, I was going to focus on variances with quality scores > 40, but I then read a paper stating that 64% of errors were assigned a Newbler QS of 64! Does anyone know how to calculate the error rate from 454 data using the files created by gsMapper, or whether this error rate can be seen in any of the generated files? Ideally I would like to see how many of the variances seen are "real".

                New to sequence data analysis!
                Thank you
                Layla
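For context on the quality scores mentioned above: if the scores are Phred-scaled (an assumption here; this thread does not confirm how Newbler scales its QS), a score Q corresponds to an error probability of 10^(-Q/10). The short Python sketch below shows what Q40 and Q64 would nominally mean under that assumption. The list of scores is made up for illustration and does not come from any gsMapper output.

```python
def phred_to_error_prob(q):
    """Nominal error probability for a Phred-scaled quality score q."""
    return 10 ** (-q / 10)


def expected_false_calls(quality_scores):
    """Expected number of wrong calls if the scores were perfectly calibrated."""
    return sum(phred_to_error_prob(q) for q in quality_scores)


print(phred_to_error_prob(40))  # nominal error rate at Q40
print(phred_to_error_prob(64))  # nominal error rate at Q64

# Hypothetical scores for a handful of variants (illustrative only)
print(expected_false_calls([20, 30, 40, 64]))
```

If a large share of real errors carry the top score, as the presentation mentioned above reportedly found, the scores are not well calibrated for those errors, and a simple Q > 40 cutoff would not remove them.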

                • timread
                  Member
                  • Oct 2008
                  • 14

                  #9
                  Originally posted by Layla
                  Dear 454 data analysers...
                  ... but I then read in a paper that stated 64% of errors were assigned a Newbler QS of 64!

                  Layla
                  Layla - can you give the reference for this statement? Thanks in advance,

                  Tim

                  • Layla
                    Member
                    • Sep 2008
                    • 58

                    #10
                    Hi Tim,

                    It was a presentation I came across on the web by Stephan Trong on microbial genomes, which I viewed by downloading it as a PDF. I should say that this error figure came from next-gen sequencing of microbial genomes.

                    Layla
