  • kentk
    Member
    • Dec 2011
    • 17

    Overrepresented kmers at the start of reads

    I recently discovered FastQC and ran it on one of our datasets that has been difficult to assemble. How should I interpret this part of the FastQC output?



    Any ideas?
  • pbluescript
    Senior Member
    • Nov 2009
    • 224

    #2
    Is this RNA-Seq? If so, this looks like it could be the result of random hexamer priming. Does the nucleotide distribution look off at the beginning too?

    Hansen, K. D., S. E. Brenner, et al. (2010). "Biases in Illumina transcriptome sequencing caused by random hexamer priming." Nucleic Acids Research 38(12): e131.


    • kentk
      Member
      • Dec 2011
      • 17

      #3
      Originally posted by pbluescript View Post
      Is this RNA-Seq?
      It's a bacterial genome run prepared using Nextera. And yes, the %A, %T, %C, %G graph shows the same pattern as the k-mer graph.
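For anyone who wants to look at the per-position nucleotide distribution outside of FastQC, here is a minimal sketch in plain Python. It assumes the reads are already loaded as strings; the function name `base_composition` and the toy reads are invented for illustration and are not part of FastQC.

```python
from collections import Counter

def base_composition(reads, n_positions=15):
    """Return, for each of the first n_positions, the fraction of A/C/G/T."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())  # includes N or other symbols, if any
        fractions.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return fractions

# Toy reads, invented for illustration only
toy_reads = ["GTCTAAGG", "GTCAACGT", "GACTTTGA", "GTCTACCA"]
profile = base_composition(toy_reads, n_positions=3)
```

A strongly skewed profile over the first few positions that flattens out further into the read would be consistent with the start-of-read bias discussed in this thread.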


      • pbluescript
        Senior Member
        • Nov 2009
        • 224

        #4
        I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.


        • pmiguel
          Senior Member
          • Aug 2008
          • 2328

          #5
          I agree. Probably reflects a sequence bias for the transposase used by Nextera. It will have its own agenda -- and it may not correspond perfectly with yours. But is it good enough? Assemble and see...

          --
          Phillip


          • mattanswers
            Member
            • Oct 2009
            • 65

            #6
            Looking at the positions of the sequences, I would check whether CAGCACCAGCA or CAGCACCACC is part of your primers.
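One way to run that check is a simple substring search in both orientations. This is only a sketch: the primer sequences below are hypothetical placeholders (not real Nextera oligos), and `find_in_primers` is a name made up for this example. Substitute the actual primer and adapter sequences from your kit documentation.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_in_primers(kmer, primers):
    """Return names of primers containing kmer or its reverse complement."""
    kmer = kmer.upper()
    rc = revcomp(kmer)
    return [name for name, seq in primers.items()
            if kmer in seq.upper() or rc in seq.upper()]

# Hypothetical primer sequences for illustration, NOT real Nextera oligos
primers = {
    "primer_fwd": "TTGACAGCACCAGCATTG",
    "primer_rev": "AAGGTGGTGCTGCC",
}
hits_a = find_in_primers("CAGCACCAGCA", primers)
hits_b = find_in_primers("CAGCACCACC", primers)
```

An empty result for both sequences against the real primers would suggest the enrichment does not come from primer or adapter read-through.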


            • mxr1895
              Junior Member
              • Feb 2012
              • 6

              #7
              Originally posted by pbluescript View Post
              I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.
              Hi, what were you using your reads for?
              I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp.
              I'm wondering if I should just trim them?

              • pbluescript
                Senior Member
                • Nov 2009
                • 224

                #8
                Originally posted by mxr1895 View Post
                Hi, what were you using your reads for?
                I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp.
                I'm wondering if I should just trim them?
                I wouldn't bother trimming them. You could always take a sample of your reads and map them trimmed and untrimmed to see which works better. Whenever I did this, I never saw big differences.
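For the trimmed-versus-untrimmed comparison, a minimal sketch of the trimming step in plain Python is shown below. It assumes strict four-line FASTQ records (no wrapped sequence lines); the function name `trim_fastq_records` is invented here, and the default of 13 bases simply mirrors the 13 bp mentioned in the quoted post. Dedicated trimming tools do the same job more robustly.

```python
def trim_fastq_records(lines, n_trim=13):
    """Yield (header, seq, plus, qual) with the first n_trim bases removed.

    Assumes strict 4-line FASTQ records; an incomplete trailing record
    is silently dropped by zip().
    """
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        seq = seq.rstrip("\n")
        qual = qual.rstrip("\n")
        # Trim sequence and quality string together so they stay in sync
        yield (header.rstrip("\n"), seq[n_trim:], plus.rstrip("\n"), qual[n_trim:])

# Toy record, invented for illustration only
toy = [
    "@read1\n", "ACGTACGTACGTACGTAA\n", "+\n", "IIIIIIIIIIIIIIIIII\n",
]
trimmed = list(trim_fastq_records(toy, n_trim=13))
```

Running the mapper once on a sample of the original reads and once on the trimmed sample, then comparing mapping rates, is the comparison described above.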


                • roliwilhelm
                  Member
                  • Jun 2012
                  • 38

                  #9
                  New Evidence of Strangeness re: a consistent k-mer bias for various Nextera preps

                  Hello All,

                  Well, I've been pursuing a question similar to the one in the initial post and have found a variety of perspectives on the matter, but none really does the problem justice. It appears to be a far-reaching phenomenon that shows up across a variety of samples from a variety of users. I was able to find four different postings on the subject, and EVERY single FastQC graph they show has an identical, or nearly identical, pattern. I summarized all of the information in a blog post. I will be forwarding it to Illumina for their response. BUT, please comment if you think I'm missing something obvious. In short, I find the pattern too consistent for just transposon bias. I would expect more variability in such an effect, and I would not expect it to be equally prominent in four out of four publicly reported cases.

                  Thanks!
                  Last edited by roliwilhelm; 05-02-2014, 07:10 PM.


                  • dpryan
                    Devon Ryan
                    • Jul 2011
                    • 3478

                    #10
                    Yeah, the random hexamer priming effect is almost always identical, regardless of who makes the library. This is unsurprising, since the library prep components are identical.


                    • roliwilhelm
                      Member
                      • Jun 2012
                      • 38

                      #11
                      I didn't think that the Nextera kits used random hexamers for amplification? I assumed that the tagmentation step inserted the sequence needed for annealing. Am I incorrect? Here's the best description of the process I could find.

                      You do make a good point, since all of the recurring sequences are hexamers.

                      Still, how would the hexamers which are initiating strand amplification end up included in the read during extension? Why would that occur more frequently and predictably at the start of the read?

                      Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more.
                      Last edited by roliwilhelm; 05-02-2014, 11:36 PM.


                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        Originally posted by roliwilhelm View Post
                        Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more.
                        See posts #261 and 263: http://seqanswers.com/forums/showthr...t=4846&page=14


                        • roliwilhelm
                          Member
                          • Jun 2012
                          • 38

                          #13
                          Thanks for your comment GenoMax, I would give you a penny if we had any left up here in Canada.

                          Perhaps I wasn't completely clear, but I'm not using multiple displacement amplification of my DNA, nor do I believe that there are any random hexamer priming steps in the Nextera library prep that I used. The information you linked to is related to those forms of sequencing prep.

                          But, I am in doubt about my understanding of the Nextera process, especially since the repeats appear to be random hexamers!

                          (Also: I couldn't find any examples of this on the FastQC help page, even though there was some suggestion there would be)


                          • kmcarr
                            Senior Member
                            • May 2008
                            • 1181

                            #14
                            Have you had a look at this paper "Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition", Adey et al. Genome Biology 2010, 11:R119? I would draw your attention to Supplementary Figure 1. The authors show a consistent base composition bias in the region surrounding the transposition site. This composition is found in both E. coli and H. sapiens gDNA. Despite the bias in locations of transposase activity the authors did not detect any bias in genome coverage in E. coli, H. sapiens or D. melanogaster compared to physical fragmentation (sonication) or endonuclease cleavage.

                            I don't really follow your argument that consistency of the base composition suggests that the effect is not due to the transposase. Such may be true in the case of the other fragmentation methods (and the authors of the above paper suggest this), as they include post-fragmentation steps such as end repair and A-tailing, which may introduce their own biases. After fragmentation, the Nextera protocol includes only a PCR amplification, which primes off the inserted transposon. An argument could be made that the PCR amplification of the fragmented DNA could contribute to a composition bias downstream of the fragmentation site, but it cannot explain the composition bias upstream of the site, as that chunk of DNA is long gone by the time PCR happens.


                            • nucacidhunter
                              Jafar Jabbari
                              • Jan 2013
                              • 1250

                              #15
                              I would like to make a distinction between the 5’ bias observed in TruSeq RNA libraries and that in transposon-based Nextera libraries. During first strand synthesis, random hexamers with higher GC content are more likely to pair with their complementary bases for long enough to prime cDNA synthesis, and therefore there is a tendency toward higher GC in the first six nucleotides at the 5’ end. I have seen this trend with the EpiGnome kit used for library prep from bisulfite-converted DNA, which uses random hexamers to prime complementary strand synthesis. Mapping reads from a non-converted library prepared with that kit also reveals more mismatches at the initial 1-4 nucleotides, indicating that full complementarity along the template is not required for synthesis to proceed and that the two 3’ end nucleotides of the hexamer provide enough contact for polymerase activity.

                              Tn5 transposase, and by extension the Nextera transposase, uses a cut-and-paste mechanism to integrate its recognition sequence into DNA. During transposition, a 9-base single-stranded gap is left in the fragments, which results in duplication of the termini. This gap is filled during the initial 3 min incubation at 72°C before PCR cycling. If all the fragments in a library are sequenced to saturation (deeper sequencing or limited template use), the duplicated region could be recognised, and I think that Moleculo uses this to stitch short read fragments back together to form longer synthetic reads. The unbalanced 5’ region observed in FastQC graphs extends 9 bases in Nextera library reads, and end duplication in combination with insertion site bias might explain this observation.
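The terminal duplication described above can be illustrated with a toy model. This is a deliberately simplified sketch, not a simulation of the real enzyme: it ignores strands and the inserted adapter sequence, assumes cut sites are more than 9 bases apart, and uses an invented sequence and an invented function name, `tagment`.

```python
def tagment(genome, cut_sites, dup=9):
    """Toy model of tagmentation followed by gap fill.

    Each fragment starts at a cut site and runs `dup` bases past the next
    cut site, so neighbouring fragments share `dup` bases at their ends.
    """
    bounds = [0] + sorted(cut_sites) + [len(genome)]
    fragments = []
    for i in range(len(bounds) - 1):
        start = bounds[i]
        end = min(bounds[i + 1] + dup, len(genome))
        fragments.append(genome[start:end])
    return fragments

# Invented 33-base sequence, for illustration only
genome = "ACGTTGCAAGGCTTAACCGGATCGATTACGCAT"
frags = tagment(genome, cut_sites=[10, 20])

# The last 9 bases of one fragment match the first 9 of the next
shared_first = frags[0][-9:] == frags[1][:9]
shared_second = frags[1][-9:] == frags[2][:9]
```

In this model, the shared 9-base ends are what would allow neighbouring fragments from the same molecule to be recognised and joined, as suggested above.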
