
  • Help with Nextera WGS data


    I am wondering if anyone here can provide me an answer to my question.

    I am working on a couple of WGS datasets; the libraries were prepared using the Illumina Nextera and TruSeq WG amplification kits. In the initial post-sequencing QC, the Nextera samples showed an unusual nucleotide distribution across the first 14 bp (5' end), unlike the TruSeq samples (please see the attachments).

    We checked the data for adapter/primer contamination against the Illumina Nextera / Epicentre Nextera sequences using various tools (fastx, cross_match allowing 2 mismatches), expecting some of them to match these first 14 bp or more. But none, or at most a few thousand, of the reads matched, indicating that these first (5') 14 bp are not adapter/primer products.

    I am wondering if any user here has seen something similar with Nextera kits, or if anyone could give me a clue as to what these 5' 14 bp could be...

    Thanks in advance,
    Last edited by aparna; 06-04-2012, 01:11 PM. Reason: no attachements
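
    For readers who want to reproduce the kind of screen described above (adapter search allowing 2 mismatches), here is a minimal sketch. It is not what fastx or cross_match do internally; it only counts substitutions (no indels), and the adapter and reads are made-up placeholders, not real Nextera sequences.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter(read: str, adapter: str, max_mismatches: int = 2) -> int:
    """Return the first offset where `adapter` matches `read` with at most
    `max_mismatches` substitutions, or -1 if there is no such offset."""
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        if hamming(window, adapter) <= max_mismatches:
            return start
    return -1


# Hypothetical adapter and reads, for illustration only.
ADAPTER = "ACGTACGTTGCA"
reads = [
    "ACGTACGTTGCAGGGTTTAAACCC",  # exact adapter at the 5' end
    "ACGAACGTTGCTGGGTTTAAACCC",  # adapter with 2 substitutions
    "CCCTAACCCTAACCTTGACGATCG",  # no adapter
]
hits = [find_adapter(r, ADAPTER) for r in reads]
print(hits)
```

    Restricting the scan to the first 14 to 20 bases of each read (for example `find_adapter(read[:20], ADAPTER)`) would test specifically whether the odd 5' segment is adapter-derived.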

  • #2
    Disclaimer: I haven't used Nextera (but my graduate work involved recombinase specificity in the human genome).

    I would bet that this is insertion site bias. Tn5's footprint is this doesn't surprise me at all (in fact I would be amazed if Epicentre evolved it out of their production enzyme...).

    Check out the figure in this has a consensus insertional bias for Tn5...(I'd love to see you run something like MEME on the starts of all your reads...)
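
    Short of running MEME, a quick way to look at the read starts is to tabulate the base composition at each of the first few positions. This is only a sketch: the reads below are toy strings, and in practice you would feed it the sequence lines from a FASTQ file.

```python
from collections import Counter


def position_composition(reads, n_positions=14):
    """Per-position base fractions over the first `n_positions` of each read.
    Reads shorter than `n_positions` contribute only the positions they cover."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGTN"})
    return table


# Toy reads, for illustration only.
toy_reads = ["CCCTAACCGT", "CCCTAAGGTA", "GCCTAATTAC", "CCATAACGTT"]
for pos, freqs in enumerate(position_composition(toy_reads, n_positions=6), start=1):
    print(pos, " ".join(f"{b}={freqs[b]:.2f}" for b in "ACGT"))
```

    Positions where one base is far from its genome-wide frequency are the ones an insertion-site preference would be expected to affect.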


    • #3
      Hi aparna,

      Yes, the Nextera kits use an engineered Tn5 that has a target site preference. In addition to the paper ECO mentioned you might look at the Supplementary Figure 1 for Adey et al 2010:

      where they show the nucleotide distribution for the first several sites sequenced with those kits and compare it to sonication.

      Whether this bias is a bane or a boon really depends on your application. For de novo assembly it may be troublesome. As I mentioned in another post, the group I work in has found (this is not yet published! treat as anecdote!) that overdigesting with Tn5 and eliminating the small fragments, e.g. <350 nt, helps to reduce the target site bias in the resulting library. If you are doing this kind of assembly, you might want to investigate some of the newer assemblers designed for data with uneven coverage, such as whole genome amplification data. IDBA-UD, diginorm, and one of the newer EULER releases come to mind; there is probably also something from the group at UMD/Johns Hopkins that would work. We developed our own pipeline, called A5, for these assemblies. The paper is in press.

      On the other hand if you want to do SNP profiling the bias might be helpful in the same way that people use RAD-seq to focus sequencing reads to assay polymorphisms in a subset of the genome.


      • #4
        We have seen the Nextera bias as well. Our data has seemed to assemble ok though. At any rate, the effect is nothing like whole genome amplification - orders of magnitude different.

        For a 51 SE library in a reference-guided assembly we got an N50 of 14 kb and 94% coverage of a reference genome (almost all of the missing 6% appeared to be mobile DNA, presumably strain polymorphisms). This is a 44% GC bacterium with a 2 Mb genome.

        I haven't compared it directly to TruSeq and would be very interested if someone has. But in my experience it shouldn't totally break de novo assembly.


        • #5
          Hi Koadman and ECO,
          Thank you so much for your valuable insights and for the attachments.
          I have not used MEME yet, but it looks like those 5' 14 bp are indeed insertion site (IS) bias.
          I see the first 14 bp in the paired end sequencing data as CCCTAACCCTAACC or GGGTTAGGGTTAGG.

          We are comparing the Nextera and TruSeq WG amplification methods. As part of this, I am also interested in variant calling to see the differences and take it from there. I originally mapped this data to the human reference hg19 using bwa with default settings. The difference in mapping and mate pairing is quite noticeable.

          Nextera Untrimmed:

          1,214,797,540 in total
          91017689 duplicates
          915682948 mapped (75.38%)
          843158634 properly paired (69.41%)
          17398062 singletons (1.43%)

          Nextera Trimmed (using bwa aln -B 14):

          1,214,797,540 in total
          90178338 duplicates
          900457327 mapped (74.12%)
          735216938 properly paired (60.52%)
          21143784 singletons (1.74%)

          With trimming we were expecting good mapping, comparable to the TruSeq data, which had about 94% mapped reads with 92% pairing, but no. I am wondering what went wrong. Do you suggest anything else?
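
          The percentages above can be rechecked directly from the counts, and it is worth computing the duplicate rate against both the total and the mapped reads, since the two denominators give noticeably different figures. A small sketch (the function and key names are my own, not samtools output fields):

```python
def summarize(total, duplicates, mapped, properly_paired, singletons):
    """Turn flagstat-style read counts into percentages."""
    def pct(x, denom):
        return round(100.0 * x / denom, 2)

    return {
        "mapped_pct": pct(mapped, total),
        "properly_paired_pct": pct(properly_paired, total),
        "singletons_pct": pct(singletons, total),
        "duplicates_pct_of_total": pct(duplicates, total),
        "duplicates_pct_of_mapped": pct(duplicates, mapped),
    }


# Counts copied from the two runs listed above.
untrimmed = summarize(1_214_797_540, 91_017_689, 915_682_948, 843_158_634, 17_398_062)
trimmed = summarize(1_214_797_540, 90_178_338, 900_457_327, 735_216_938, 21_143_784)

for name, stats in (("untrimmed", untrimmed), ("trimmed", trimmed)):
    print(name, stats)

# Drop in properly paired reads after trimming, in percentage points.
print(round(untrimmed["properly_paired_pct"] - trimmed["properly_paired_pct"], 2))
```

          Trimming costs roughly 9 percentage points of properly paired reads while the mapped fraction barely moves, so the pairing step looks like the place to dig.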


          • #6
            Have you filtered out Nextera adapter sequences from the reads with something like tagdust or scythe for 3' contamination? If not, what does your insert size distribution look like? Do you have a Bioanalyzer trace? What method did you use for size selection?

            In our early attempts at Nextera, where we relied on AMPure XP beads for size selection, we would see high rates of adapter contamination. We now do a broad-swath gel cut of 320-600 nt for all Nextera libraries, and the adapter contamination rates are much lower, usually 1% or less.

            Illumina has finally shared their Nextera adapter sequences so you could try filtering those reads and see whether your mapping rate goes up.

            The apparent duplicate rate of 10% is also a bit worrisome, although with Nextera libraries this number can also be influenced by transposition bias and not just PCR cycles. If the tagmentation is heavily biased, two read pairs that are not PCR duplicates will be much more likely to start at the same positions.
            Last edited by koadman; 06-05-2012, 07:31 PM.


            • #7
              Hi Koadman,

              I had around 3 million reads that were 3' contaminants, and that's about it. As mentioned in my initial post, we looked for Illumina/Epicentre adapters/primers in our data and found a few hundred of them.
              I need to ask in the lab about the size selection and traces.
              After mapping, the median insert size was about 200 bp, compared to 400 bp for TruSeq.

              This data is really puzzling to me. We will figure it out and post an update here if possible.


              • #8
                Oh, sorry, somehow I missed or didn't understand between your first two posts that the read mapping results in the 2nd post had been adapter filtered. Thanks for clarifying. 3 million reads out of 1.2 billion is not bad for adapter.

                As for the pairing and insert distribution issue, thanks for telling us the median insert size, but what does the entire distribution look like? If you did not do a gel cut during library prep, there might be a long tail to this distribution. I am not sure exactly what threshold bwa uses to decide whether a pairing is "proper" or not, so I wonder if many of your reads are mapping just barely too far apart for bwa to call them proper.
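
                One way to look at the whole distribution rather than just the median is to pull the template lengths (the TLEN column of the SAM records) and summarize them. This is a rough sketch: the `lower`/`upper` window is a placeholder you choose yourself, not the rule bwa applies, and the sizes below are invented.

```python
import statistics


def insert_size_summary(sizes, lower, upper):
    """Summarize absolute template lengths and report the fraction that falls
    outside a user-chosen [lower, upper] window. Zero lengths are skipped."""
    sizes = sorted(abs(s) for s in sizes if s != 0)
    q = statistics.quantiles(sizes, n=100, method="inclusive")  # 99 cut points
    outside = sum(1 for s in sizes if s < lower or s > upper)
    return {
        "n": len(sizes),
        "median": statistics.median(sizes),
        "p05": q[4],
        "p95": q[94],
        "fraction_outside": outside / len(sizes),
    }


# Invented template lengths with a long right tail, for illustration only.
toy_sizes = [150, 180, 200, 200, 210, 220, 250, 400, 900, -1500]
summary = insert_size_summary(toy_sizes, lower=100, upper=500)
print(summary)
```

                A 95th percentile far above the median would point to the long tail mentioned above.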

                As for the first 14 bp, this is indeed puzzling. I notice that the two sequences you're observing are reverse complements of each other (give or take a one-base offset), and that they also contain a 6 nt direct tandem repeat. I wonder if this might be some kind of PCR artefact, but I really don't have much of a clue. Does the remaining portion of those reads contain the expected target sequence (human?)? If so, I wonder how the mapping looks with those sequences trimmed?
                Last edited by koadman; 06-06-2012, 09:54 PM. Reason: oops mistakenly read bowtie as read mapper instead of bwa
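
                The reverse-complement and tandem-repeat observations about the two 14-mers are easy to check in a few lines. This is a small sketch with my own helper names. On the two sequences from post #5 it suggests they are built from the same 6 nt repeat unit on opposite strands, but offset by one base rather than being exact reverse complements.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an A/C/G/T string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def smallest_period(seq: str) -> int:
    """Smallest p such that seq[i] == seq[i + p] for every valid i."""
    for p in range(1, len(seq)):
        if all(seq[i] == seq[i + p] for i in range(len(seq) - p)):
            return p
    return len(seq)


def same_repeat_unit(a: str, b: str) -> bool:
    """True if a and revcomp(b) share the same repeat unit, up to rotation."""
    rb = revcomp(b)
    pa, pb = smallest_period(a), smallest_period(rb)
    return pa == pb and rb[:pb] in (a[:pa] * 2)


s1, s2 = "CCCTAACCCTAACC", "GGGTTAGGGTTAGG"
print(revcomp(s1) == s2)                         # exact reverse complements?
print(revcomp(s1)[:-1] == s2[1:])                # reverse complements after a 1-base shift?
print(smallest_period(s1), smallest_period(s2))  # repeat period of each
print(same_repeat_unit(s1, s2))                  # same unit on opposite strands?
```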


                • #9

                  I'm doing RNA-seq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with and without trimming this portion. I would be interested to know what you finally did.



                  • #10
                    Originally posted by mariruilo
                    I'm doing RNA-seq with Nextera and observed exactly the same pattern in the first 15 bp, with overrepresented sequences. I'm doing de novo assembly, and thought about running the assembly both with and without trimming this portion. I would be interested to know what you finally did.

                    This is a known observation in case of RNAseq experiments. You can see this thread (there are possibly others) for additional information:


                    • #11
                      Thank you so much, GenoMax! I'm a newbie to RNA-seq and all that information has been really helpful...


                      • #12
                        Does anyone happen to have a motif file for the insertion site described in the reference above?

                        I'm trying to determine why there are some holes in coverage for a Nextera library mapped to our reference sequences, and thought it might be useful to search for the abundance of the transposase insertion site motifs.
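
                        Without a motif file, a rough first pass is to express the motif as an IUPAC consensus string and count its occurrences per window along the reference, then compare those counts with the coverage holes. This is a sketch only: the consensus used below is a made-up placeholder, not the actual Tn5 preference, and only the forward strand is scanned.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC consensus into a regex; the lookahead lets
    overlapping occurrences all be reported."""
    body = "".join(IUPAC[ch] for ch in motif.upper())
    return re.compile(f"(?=({body}))")


def motif_hits_per_window(seq: str, motif: str, window: int):
    """Count forward-strand motif start positions in consecutive windows."""
    pattern = motif_to_regex(motif)
    n_windows = (len(seq) + window - 1) // window
    counts = [0] * n_windows
    for m in pattern.finditer(seq.upper()):
        counts[m.start() // window] += 1
    return counts


# Placeholder motif and toy sequence, for illustration only.
PLACEHOLDER_MOTIF = "AGNCW"
toy_seq = "AGTCAAAAAAAGACTTTTTT"
print(motif_hits_per_window(toy_seq, PLACEHOLDER_MOTIF, window=10))
```

                        To cover both strands, the same scan could be repeated on the reverse complement of the reference.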

