Merging 16S reads with FLASH - parameters?

  • Merging 16S reads with FLASH - parameters?

    I have 300 bp paired-end Illumina reads generated on the MiSeq using Illumina's V3V4 16S protocol. The amplicon size is 460 bp.

    As the first step in my analysis, I'm using FLASH to merge these reads. I'm using the following command line:

    FLASH --min-overlap=20 --max-overlap=140 --read-len=300 --fragment-len=460 --fragment-len-stddev=1 --output-directory=MERGED --output-prefix=MERGED 612A-plate-1-H04_S88_L001_R1_001.fastq 612A-plate-1-H04_S88_L001_R2_001.fastq

    After FLASH completes, it gives the following warning:

    [FLASH] WARNING: An unexpectedly high proportion of combined pairs (62.47%) overlapped by more than 140 bp, the --max-overlap (-M) parameter. Consider increasing this parameter. (As-is, FLASH is penalizing overlaps longer than 140 bp when considering them for possible combining!)

    Since the theoretical max overlap should be 140 bp (2 x 300 bp reads minus the 460 bp amplicon), that's what I set the max-overlap parameter to. How is it possible that so many reads overlap significantly more than 140 bp? Running a few iterations of this, I have found that I have to set 'max-overlap' to 159 to eliminate this warning.

    Just trying to understand how this parameter actually works. Maybe my amplicon is a little smaller than expected?

    EDIT: I just realized that I'm using both the 'read-len'/'fragment-len'/'fragment-len-stddev' parameters together with 'max-overlap' above, so the first three are ignored. If I use them without 'max-overlap', the calculated max-overlap is 152. I used 'max-overlap' to determine that 159 eliminates the warning.
    Last edited by cheezemeister; 05-19-2015, 01:45 PM.
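
The overlap arithmetic in the question can be checked with a small sketch. It assumes two untrimmed reads of equal length sequenced from opposite ends of a fragment, with no read-through; the function names are hypothetical, and this is plain geometry, not FLASH's internal calculation of max-overlap.

```python
def expected_overlap(read_len: int, fragment_len: int) -> int:
    """Overlap in bp between two equal-length reads from opposite fragment ends.

    Assumes untrimmed reads and a fragment between 1x and 2x the read length.
    """
    if not read_len <= fragment_len <= 2 * read_len:
        raise ValueError("fragment must be between 1x and 2x the read length")
    return 2 * read_len - fragment_len


def fragment_len_for_overlap(read_len: int, overlap: int) -> int:
    """Inverse relation: fragment length implied by an observed overlap."""
    if not 0 <= overlap <= read_len:
        raise ValueError("overlap must be between 0 and the read length")
    return 2 * read_len - overlap


print(expected_overlap(300, 460))          # 140, the theoretical value in the post
print(fragment_len_for_overlap(300, 159))  # 441
```

Under these assumptions, an overlap of 159 bp would correspond to a fragment of about 441 bp, which fits the guess that the amplicon is a little smaller than expected. It is only one possible explanation.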

  • #2
    Have you scanned this data (with a trimming program) to see how much adapter dimers or read-through it has? Did FastQC indicate this as a possibility?

    • #3
      Originally posted by GenoMax:
      Have you scanned this data (with a trimming program) to see how much adapter dimers or read-through it has? Did FastQC indicate this as a possibility?
      Haven't done that, however adapters are trimmed at source by the MiSeq. I haven't quality-trimmed the data yet since everything I've read says that merging first is the preferred method.

      Not sure why I would have read-through on a 460 bp amplicon using a 300 bp read.

      I can run FastQC and see.

      • #4
        Originally posted by cheezemeister:
        Haven't done that, however adapters are trimmed at source by the MiSeq. I haven't quality-trimmed the data yet since everything I've read says that merging first is the preferred method.

        Not sure why I would have read-through on a 460 bp amplicon using a 300 bp read.

        I can run FastQC and see.
        I wasn't asking about quality trimming. You certainly want to merge first and then trim (if needed, for quality). Since we don't use onboard MiSeq analysis, I tend to forget that adapters may have already been trimmed. In that case, though, you probably no longer have uniform 300 bp reads: trimmed reads could be short and will overlap more than you expect them to. FastQC will tell you about the size spread.

        Give BBMerge a try as well (from BBMap).
        Last edited by GenoMax; 05-19-2015, 03:06 PM.
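
If FastQC is not at hand, the read-length spread can also be tallied directly. This is a minimal sketch, assuming an uncompressed FASTQ with strict 4-line records; the function names are hypothetical and the toy records exist only for illustration.

```python
from collections import Counter
from typing import Iterable


def read_length_counts(fastq_lines: Iterable[str]) -> Counter:
    """Count reads per sequence length, assuming 4-line FASTQ records."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            counts[len(line.strip())] += 1
    return counts


def fraction_at_least(counts: Counter, min_len: int) -> float:
    """Fraction of reads whose length is >= min_len."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for length, n in counts.items() if length >= min_len) / total


# Toy in-memory records; for a real file, pass an open file handle instead:
#   with open("sample_R1.fastq") as fh: counts = read_length_counts(fh)
toy = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "IIII", "@r3", "AC", "+", "II"]
counts = read_length_counts(toy)
print(sorted(counts.items()))
print(fraction_at_least(counts, 4))
```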

        • #5
          Just selecting a representative file, FastQC reports my sequence length as 35-300 bp, though 70% are 300 bp and pretty much 100% are >280 bp.

          Since setting max-overlap to 159 eliminates the warning, and increasing it beyond that does not increase the % merged, that seems to jibe with essentially 100% of reads being 280 bp or longer.

          I'll also try BBMerge. Do you happen to know if BBMerge can do batch processing (I've got several thousand samples of data) and output the %merge in a table?

          • #6
            Originally posted by cheezemeister:
            Just selecting a representative file, FastQC reports my sequence length as 35-300 bp, though 70% are 300 bp and pretty much 100% are >280 bp.
            To clarify, was the only trimming done adapter-trimming by the machine? There should not really be anything in the 280-299 bp range if trimming was done correctly and the library was made correctly. Adapter-trimming is not necessary prior to merging; the position of adapters (if any) is obvious based on the overlap, and a good read-merger will trim them if present. I suggest you turn adapter-trimming off in this case unless you first generate an insert-size histogram and specifically note adapter sequence. If ~30% are getting trimmed to between 280 and 299 bp (when it should be 0%), perhaps the algorithm being used is a greedy one that matches even 1 bp. The end result will be inferior merging, as the overlap region is unnecessarily reduced.
            I'll also try BBMerge. Do you happen to know if BBMerge can do batch processing (I've got several thousand samples of data) and output the %merge in a table?
            BBMerge does not have a batch mode; you'd have to script that. It does print the percent merged for each dataset, though, which can be parsed from stderr.
            Last edited by Brian Bushnell; 05-19-2015, 04:36 PM.
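
One way to script the batch run described above is sketched below. It makes several assumptions: `bbmerge.sh` is on the PATH, files follow the `_R1_`/`_R2_` naming seen in the original command, and the summary on stderr contains a line of the form `Joined: <count> <percent>%`. That line format is assumed here and should be checked against the output of the installed version. All function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path
from typing import Dict, Optional

# Assumed shape of the summary line on stderr: "Joined: <count> <percent>%"
JOINED_RE = re.compile(r"^Joined:\s+(\d+)\s+([\d.]+)%", re.MULTILINE)


def r2_for(r1: Path) -> Path:
    """Derive the R2 filename from an R1 filename (Illumina-style naming)."""
    return r1.with_name(r1.name.replace("_R1_", "_R2_"))


def parse_percent_joined(stderr_text: str) -> Optional[float]:
    """Return the percent of pairs merged, or None if no matching line is found."""
    match = JOINED_RE.search(stderr_text)
    return float(match.group(2)) if match else None


def merge_directory(fastq_dir: str, out_dir: str) -> Dict[str, Optional[float]]:
    """Run bbmerge.sh on every R1/R2 pair in fastq_dir; map R1 name to % merged."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results: Dict[str, Optional[float]] = {}
    for r1 in sorted(Path(fastq_dir).glob("*_R1_*.fastq")):
        r2 = r2_for(r1)
        if not r2.exists():
            continue  # skip samples without a mate file
        merged = out / r1.name.replace("_R1_", "_merged_")
        proc = subprocess.run(
            ["bbmerge.sh", f"in1={r1}", f"in2={r2}", f"out={merged}"],
            capture_output=True,
            text=True,
        )
        results[r1.name] = parse_percent_joined(proc.stderr)
    return results
```

Calling `merge_directory("fastq", "MERGED")` (both directory names are placeholders) would return a dict that can be written out as a two-column table of sample and percent merged.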

            • #7
              For future FLASH use, note the following:

              --read-len (-r) has no effect when --max-overlap (-M) is also specified!

              --fragment-len-stddev (-s) has no effect when --max-overlap (-M) is also specified!
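
A small guard can keep the two parameter styles from being mixed when commands are generated by a script. This is a sketch only: the helper `build_flash_args` is hypothetical, the flags are the ones used in the original command, and the executable name `flash` is an assumption (the thread's command calls it `FLASH`).

```python
from typing import List, Optional


def build_flash_args(
    r1: str,
    r2: str,
    *,
    max_overlap: Optional[int] = None,
    read_len: Optional[int] = None,
    fragment_len: Optional[int] = None,
    fragment_len_stddev: Optional[float] = None,
    min_overlap: int = 20,
    executable: str = "flash",  # assumption: name of the binary on PATH
) -> List[str]:
    """Build a FLASH argument list using either --max-overlap or the
    read/fragment length parameters, never both."""
    length_params = (read_len, fragment_len, fragment_len_stddev)
    if max_overlap is not None and any(p is not None for p in length_params):
        raise ValueError(
            "give either max_overlap or read_len/fragment_len/fragment_len_stddev"
        )
    args = [executable, f"--min-overlap={min_overlap}"]
    if max_overlap is not None:
        args.append(f"--max-overlap={max_overlap}")
    else:
        flags = ("--read-len", "--fragment-len", "--fragment-len-stddev")
        for flag, value in zip(flags, length_params):
            if value is not None:
                args.append(f"{flag}={value}")
    return args + [r1, r2]


print(" ".join(build_flash_args("R1.fastq", "R2.fastq", max_overlap=159)))
```

The resulting list can be passed to `subprocess.run`; mixing the two styles raises an error so that parameters are not silently ignored.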
