Overrepresented kmers at the start of reads


  • Overrepresented kmers at the start of reads

    I recently discovered FastQC and ran it on one of our datasets that is proving difficult to assemble. I was wondering how to interpret this part of the FastQC output.



    Any ideas?

  • #2
    Is this RNA-Seq? If so, this looks like it could be the result of random hexamer priming. Does the nucleotide distribution look off at the beginning too?

    Hansen, K. D., S. E. Brenner, et al. (2010). "Biases in Illumina transcriptome sequencing caused by random hexamer priming." Nucleic Acids Research 38(12): e131.



    • #3
      Originally posted by pbluescript
      Is this RNA-Seq?
      It's a bacterial genome run prepared using Nextera. And yes, the %A, %T, %C, %G graph also looks like the kmer graph.
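
One way to check whether base composition is skewed at the read start, outside of FastQC, is to tally bases per position directly. The following is a minimal sketch, not FastQC's own implementation; the function names and the assumption of an uncompressed four-line-per-record FASTQ file are illustrative.

```python
from collections import Counter


def read_fastq_sequences(path):
    """Yield sequence lines from an uncompressed, 4-line-per-record FASTQ file."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield line.strip()


def per_position_composition(reads, n_positions=15):
    """Return one {base: fraction} dict per position for the first n_positions."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    result = []
    for position_counts in counts:
        total = sum(position_counts.values())
        result.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return result


# Toy reads (made up) standing in for real data.
toy_reads = ["ACGT", "ACGA", "TCGA", "ACTA"]
composition = per_position_composition(toy_reads, n_positions=4)
for position, fractions in enumerate(composition, start=1):
    print(position, fractions)
```

For a real run, `per_position_composition(read_fastq_sequences("reads.fastq"))` would give the same kind of table, where `reads.fastq` is a placeholder path. A flat profile after the first several positions with a skew at the start would match what the thread describes.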



      • #4
        I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.



        • #5
          I agree. Probably reflects a sequence bias for the transposase used by Nextera. It will have its own agenda -- and it may not correspond perfectly with yours. But is it good enough? Assemble and see...

          --
          Phillip



          • #6
            Looking at the positions of the sequences, I would check whether the sequences CAGCACCAGCA or CAGCACCACC are part of your primers.
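
A check like this can be scripted by searching each primer or adapter for the k-mer and its reverse complement. This is a minimal sketch; the primer sequences below are invented placeholders, not real Nextera oligos, and should be replaced with the sequences from your own prep.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def find_in_primers(query, primers):
    """Return names of primers containing query or its reverse complement."""
    query = query.upper()
    query_rc = reverse_complement(query)
    hits = []
    for name, primer in primers.items():
        primer = primer.upper()
        if query in primer or query_rc in primer:
            hits.append(name)
    return hits


# Hypothetical primers for illustration only.
example_primers = {
    "primer_A": "TTTCAGCACCAGCATTT",
    "primer_B": "AAAGGTGGTGCTGAAA",  # carries the reverse complement of CAGCACCACC
    "primer_C": "ACGTACGTACGT",
}

for kmer in ("CAGCACCAGCA", "CAGCACCACC"):
    print(kmer, find_in_primers(kmer, example_primers))
```

Checking the reverse complement as well matters because the sequence may appear on either strand of the primer.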



            • #7
              Originally posted by pbluescript
              I have seen Nextera libraries show a very similar bias. My guess is that this is just an artifact of the library prep. In the past, I would trim off these regions before mapping, but then I found that it didn't make a big difference, so I just left them there.
              Hi, what were you using your reads for?
              I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp.
              I'm wondering if I should just trim them?


              • #8
                Originally posted by mxr1895
                Hi, what were you using your reads for?
                I have the same issue with 80 multiplexed Nextera libraries run on a HiSeq. Their QC graphs all look the same for the first 13bp.
                I'm wondering if I should just trim them?
                I wouldn't bother trimming them. You could always take a sample of your reads and map them trimmed and untrimmed to see which works better. Whenever I did this, I never saw big differences.
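
The sample-and-compare test suggested above could be set up along these lines. This is a sketch under assumptions: the helper names are made up, the 13-base trim length is taken from the earlier post about the first 13bp, and the mapping step itself is left to whichever aligner you already use.

```python
import random


def parse_fastq(lines):
    """Group the lines of a 4-line-per-record FASTQ into (header, seq, plus, qual) tuples."""
    lines = [line.rstrip("\n") for line in lines]
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines) - 3, 4)]


def sample_records(records, k, seed=0):
    """Draw up to k records without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def trim_record(record, n_bases=13):
    """Remove the first n_bases from both the sequence and the quality string."""
    header, seq, plus, qual = record
    return (header, seq[n_bases:], plus, qual[n_bases:])


# Toy input (made up) standing in for lines read from a FASTQ file.
toy_lines = [
    "@r1", "ACGTACGTACGTACGTAC", "+", "IIIIIIIIIIIIIIIIII",
    "@r2", "TTTTGGGGCCCCAAAATT", "+", "IIIIIIIIIIIIIIIIII",
]
records = parse_fastq(toy_lines)
subset = sample_records(records, k=1)
trimmed = [trim_record(r) for r in subset]
print(subset)
print(trimmed)
```

Writing `subset` and `trimmed` to two FASTQ files and mapping both gives a direct trimmed versus untrimmed comparison on identical reads.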



                • #9
                  New Evidence of Strangeness re: a consistent k-mer bias for various Nextera preps

                  Hello All,

                  Well, I've actively pursued a question similar to the one in the initial post and have found a variety of perspectives on the matter, but none really does the problem justice. It appears to be a far-reaching phenomenon that shows up across a variety of samples from a variety of users. I was able to find four different postings on the subject, and EVERY single FastQC graph they show has an identical, or nearly identical, pattern. I summarized all of the information in a blog post. I will be forwarding it to Illumina for their response. BUT, please comment if you think I'm missing something obvious. In short, I find the pattern too consistent for just transposon bias. I would expect more variability in such an effect, so that it would be less prominent than it is in four out of four publicly reported cases.

                  Thanks!
                  Last edited by roliwilhelm; 05-02-2014, 07:10 PM.



                  • #10
                    Yeah, the random hexamer priming effect is almost always identical, regardless of who makes the library. This is unsurprising since the library prep. components are identical.



                    • #11
                      I didn't think that the Nextera kits used random hexamers for amplification? I assumed that the tagmentation step inserted the sequence needed for annealing. Am I incorrect? Here's the best description of the process I could find.

                      You do make a good point, since all of the recurring sequences are hexamers.

                      Still, how would the hexamers which are initiating strand amplification end up included in the read during extension? Why would that occur more frequently and predictably at the start of the read?

                      Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more.
                      Last edited by roliwilhelm; 05-02-2014, 11:36 PM.



                      • #12
                        Originally posted by roliwilhelm
                        Obviously these answers aren't completely relevant to the technical concerns of processing the data for assembly, but I would like to know more.
                        See posts #261 and 263: http://seqanswers.com/forums/showthr...t=4846&page=14



                        • #13
                          Thanks for your comment GenoMax, I would give you a penny if we had any left up here in Canada.

                          Perhaps I wasn't completely clear, but I'm not using multiple displacement amplification of my DNA, nor do I believe that there are any random hexamer priming steps in the Nextera library prep that I used. The information you linked to is related to those forms of sequencing prep.

                          But, I am in doubt about my understanding of the Nextera process, especially since the repeats appear to be random hexamers!

                          (Also: I couldn't find any examples of this on the FastQC help page, even though there was some suggestion there would be)



                          • #14
                            Have you had a look at this paper "Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition", Adey et al. Genome Biology 2010, 11:R119? I would draw your attention to Supplementary Figure 1. The authors show a consistent base composition bias in the region surrounding the transposition site. This composition is found in both E. coli and H. sapiens gDNA. Despite the bias in locations of transposase activity the authors did not detect any bias in genome coverage in E. coli, H. sapiens or D. melanogaster compared to physical fragmentation (sonication) or endonuclease cleavage.

                            I don't really follow your argument that consistency of the base composition suggests that the effect is not due to the transposase. That may be true in the case of the other fragmentation methods (and the authors of the above paper suggest this), as they include post-fragmentation steps such as end repair and A-tailing, which may introduce their own biases. After fragmentation, the Nextera protocol includes only a PCR amplification, which primes off the inserted transposon. An argument could be made that the PCR amplification of the fragmented DNA could contribute to a composition bias downstream of the fragmentation site, but it cannot explain the composition bias upstream of the site, as that chunk of DNA is long gone by the time PCR happens.



                            • #15
                              I would like to draw a distinction between the 5’ bias observed in TruSeq RNA libraries and that in transposon-based Nextera libraries. During first-strand synthesis, random hexamers with higher GC content are more likely to stay paired with their complementary bases long enough to prime cDNA synthesis, and therefore there is a tendency toward higher GC in the first six 5’ nucleotides. I have seen this trend with the EpiGnome kit, used for library prep from bisulfite-converted DNA, which uses random hexamers to prime complementary strand synthesis. Mapping reads from non-converted libraries prepared with that kit also reveals more mismatches at the first 1-4 nucleotides, indicating that full complementarity along the template is not required for synthesis to proceed and that the two nucleotides at the 3’ end of the hexamer provide enough contact for polymerase activity.

                              Tn5 transposase, and by extension the Nextera transposase, uses a cut-and-paste mechanism to integrate its recognition sequence into DNA. During transposition, a 9-base single-stranded gap is left in the fragments, which results in duplication of the termini. This gap is filled during the initial 3 min incubation at 72°C before PCR cycling. If all the fragments in a library are sequenced to saturation (deeper sequencing or limited template use), the duplicated region could be recognised, and I think that Moleculo uses this to stitch short read fragments back together to form longer synthetic reads. The unbalanced 5’ region observed in FastQC graphs extends 9 bases in Nextera library reads, and end duplication in combination with insertion site bias might explain this observation.
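
The end duplication described here can be illustrated with a toy model. This is a deliberately simplified sketch, assuming a single strand, clean insertion sites, and complete gap fill; the `tagment` function and the example sequence are made up for illustration and do not model adapters or insertion-site preference.

```python
GAP = 9  # length of the single-stranded gap described above


def tagment(genome, insertion_sites, gap=GAP):
    """Split genome at insertion sites so that, after gap fill,
    neighbouring fragments share `gap` bases at their junction."""
    sites = sorted(insertion_sites)
    starts = [0] + sites
    ends = [site + gap for site in sites] + [len(genome)]
    return [genome[start:end] for start, end in zip(starts, ends)]


# Made-up 32-base sequence with one insertion at position 10.
genome = "ACGTTGCAAGCTTAGGCTAACCGGTTAACGTA"
fragments = tagment(genome, [10])

left, right = fragments
print(left[-GAP:])   # last 9 bases of the left fragment
print(right[:GAP])   # first 9 bases of the right fragment
```

In this model the two printed strings are the same 9 bases, which is the duplicated region that would let neighbouring fragments be recognised when a library is sequenced to saturation.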
