  • dave1
    Junior Member
    • Jul 2014
    • 2

    Illumina Nextera Paired-End Sequence Content Bias: Require Trimming for De Novo Assembly?

    I'm working on a bacterial data set that I was having difficulty assembling.

    Illumina, 300 bp paired-end reads, Nextera library prep.

    The FastQC per-base sequence content chart (attached) shows high sequence content bias in the first 15-20 positions. Initially, I thought it was adapter contamination and tried a variety of trimming tools (Trimmomatic, others) to remove what I thought were adapters. I then found a blog post (https://www.instapaper.com/read/496731324) suggesting that this is a library effect of Nextera kits.

    After running the data through Trimmomatic, I used only the paired output (ignoring the unpaired output files for the time being) and then manually trimmed off the first 20 positions from the subset of data that showed the sequence bias. I was finally able to get a reasonable assembly.

    Questions:
    1) Does the sequence bias in the first 20 bases point to a problem with the library prep? Or is this typical of Nextera and nothing to worry about?

    2) For de novo assembly, is it necessary to trim off the first ~20 bases? Is there a recommended tool or process, rather than just arbitrarily clipping the first 20 bases?

    3) I noticed Trimmomatic separates the output into reads that are still paired and reads that are not. For de novo assembly, is there any reason NOT to include the unpaired data?

    Thanks in advance
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Nextera has highly nonuniform first ~20bp, but it's neither adapter sequence nor errors; just a fragmentation site bias. You don't need to trim it. If you did trim it, though, the only way would be to trim the first X bases.
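For anyone who still wants to try it, trimming "the first X bases" can be sketched in a few lines of Python. This is only an illustration: the function names `parse_fastq` and `head_trim` are made up for this example, the parser assumes plain four-line FASTQ records, and in practice an existing trimmer (for example Trimmomatic's HEADCROP step) does the same job.

```python
from typing import Iterable, Iterator, Tuple

# A FASTQ record as (header, sequence, quality string).
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (assumes 4 lines per record)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def head_trim(record: FastqRecord, n: int) -> FastqRecord:
    """Drop the first n bases and the matching n quality characters."""
    header, seq, qual = record
    return header, seq[n:], qual[n:]


# Hypothetical toy record, not real data.
example = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIHHHH\n"]
trimmed = [head_trim(rec, 3) for rec in parse_fastq(example)]
```

Sequence and quality strings must be cut by the same amount, and both mates of a pair should be treated consistently so the files stay in sync.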

    For assembly, if you use a pair-aware assembler and have sufficient data, it's best to assemble from paired reads. Some assemblers allow you to specify both paired and unpaired reads in the same assembly, in which case you could use both. But if the assembler only allows you to give it paired OR unpaired reads, it's probably best to give it the paired reads only, rather than mixing all the reads together, which would require you running the data as unpaired. There is no strict answer that will be correct for all assemblers, as they make use of pairing data differently, or possibly not at all.
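The paired/unpaired distinction can be made concrete with a small sketch. It assumes the reads that survived trimming have been loaded into two dictionaries keyed by read name (with any /1 or /2 suffix already stripped); `split_paired_and_orphans` is a hypothetical helper for illustration, not part of Trimmomatic or any assembler.

```python
from typing import Dict, List, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def split_paired_and_orphans(
    r1_reads: Dict[str, FastqRecord],
    r2_reads: Dict[str, FastqRecord],
) -> Tuple[List[Tuple[FastqRecord, FastqRecord]], List[FastqRecord]]:
    """Separate surviving reads into intact pairs and orphaned single reads."""
    # A pair is intact only if both mates survived trimming.
    paired = [(rec, r2_reads[name]) for name, rec in r1_reads.items() if name in r2_reads]
    # Everything else lost its mate and can only be used as unpaired input.
    orphans = [rec for name, rec in r1_reads.items() if name not in r2_reads]
    orphans += [rec for name, rec in r2_reads.items() if name not in r1_reads]
    return paired, orphans
```

The `paired` list is what a pair-aware assembler would take as its paired input; the `orphans` list would only be added if the assembler accepts a separate unpaired library.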


    • dave1
      Junior Member
      • Jul 2014
      • 2

      #3
      Thanks for your help Brian.

      Your feedback that it isn't necessary to trim the first 15-20 bases due to fragmentation site bias led me to revisit my QC results.

      Another Question: Would you be willing to comment on the quality of the reverse read? Would you consider this a good run? ok run? Do you typically see the large quality range in the first few bases of the reverse read? The lab is tuning its protocols. Does this point to anything that might need to get changed?

      Adding this in case it helps others in the future.

      Working with Illumina Nextera-prepped, paired-end 300 bp reads.

      I have typically been taking a quick glance at the FastQC results. If the results looked good, I didn't bother with trimming/filtering the data before de-novo assembly. (Was relying on the assembler to leverage quality score information)

      However, when I tried to assemble the data, the assemblies (from a variety of assemblers) were all terrible (thousands of small contigs). Mapping results looked fine.

      I was able to get a good assembly after running the data through trimmomatic first. As Brian suggested, it is not necessary to trim off the first 15-20 bases due to fragmentation site bias...

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        I have never worked with 2x300bp data; so far, we only go up to 2x250. So I'm not sure how typical the quality is of the last bases on read 2, but it certainly looks like it should be trimmed. And overall the quality variability for read 2 seems higher than it should be, but I don't work on the wet-lab side, so I'm not sure what it might indicate.

        If you have plenty of data, you might experiment with throwing away reads with average quality below some threshold (or specifically, pairs in which either read is below the threshold), and see if that improves your assembly.
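A minimal sketch of that pair-level filter, assuming Phred+33 quality strings and using the plain arithmetic mean of the Phred scores as the "average quality". Both are assumptions made for this example (real tools may average differently), and `mean_phred` and `filter_pairs` are illustrative names.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)
ReadPair = Tuple[FastqRecord, FastqRecord]


def mean_phred(qual: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_pairs(pairs: Iterable[ReadPair], min_mean_q: float) -> Iterator[ReadPair]:
    """Keep a pair only if BOTH mates reach the mean-quality threshold."""
    for r1, r2 in pairs:
        if mean_phred(r1[2]) >= min_mean_q and mean_phred(r2[2]) >= min_mean_q:
            yield r1, r2
```

Discarding the whole pair when either mate fails keeps the output strictly paired, at the cost of losing some good reads; the threshold itself would need to be tuned against the resulting assemblies.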


        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          Since FastQC groups positions into larger intervals when plotting long reads, it is difficult to see what may be going on with R2. You could turn off the interval grouping on the command line and see if the tail end of R2 truly requires major trimming or throwing away the reads.

          If this is a bacterial genome I would suggest trying SPAdes, if you have not already done so.
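For reference, the FastQC option that disables the grouping of positions is `--nogroup`. The sketch below only builds the command line; the helper name `fastqc_nogroup_command` and the file and directory names in the comment are placeholders, and FastQC expects the output directory to exist already.

```python
from typing import List


def fastqc_nogroup_command(fastq_paths: List[str], outdir: str) -> List[str]:
    """Build a FastQC command that reports every read position individually."""
    # --nogroup disables the binning of positions for reads longer than 50 bp;
    # --outdir sets where the report files are written.
    return ["fastqc", "--nogroup", "--outdir", outdir, *fastq_paths]


# Example (placeholder names); pass the list to subprocess.run(cmd, check=True)
# on a machine where fastqc is installed:
# cmd = fastqc_nogroup_command(["sample_R2.fastq.gz"], "fastqc_nogroup")
```

With 300 bp reads the ungrouped plots are wide, but the per-position view makes it easier to judge where the tail of R2 actually starts to degrade.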


          • avo
            Member
            • Sep 2013
            • 14

            #6
            In my experience, the FastQC quality plots look similar to what we see with TruSeq libraries.
            However, I always trim for adapters and quality.
            Especially with Nextera, the bead size selection and 2x300bp reads, you might end up with some adapter sequences in your read data.

            Do you do the trimming on the MiSeq directly or separately afterwards? To get a feel for the adapter contamination, I would recommend turning off the adapter trimming function on the MiSeq.

            Concerning the first 20 bp, I agree with Brian; it looks the same for the Nextera libraries we have sequenced so far.
