
Introducing BBMerge: A paired-end read merger

  • Originally posted by peerah View Post
    Hi Brian! I have a question: I am working on a fungal ITS metagenomic amplicon library with a pretty wide variation in sizes (200-500 bp). We are doing 2x300, and my second reads are a little lower in quality than the first. Are there any BBMerge settings I should modify in order to get the most out of the data? I'm pretty new to the field, so please let me know if you need more information! Thank you.
    Hi Peerah,

    We are having the same problem in our lab with 2x300 MiSeq runs (very poor Read 2 >Q30 scores) and I was wondering if Brian's recommendation improved the number of paired sequences you obtained from that run.

    Cheers

    Comment


    • Originally posted by mdavrandi View Post
      Hi Peerah,

      We are having the same problem in our lab with 2x300 MiSeq runs (very poor Read 2 >Q30 scores) and I was wondering if Brian's recommendation improved the number of paired sequences you obtained from that run.

      Cheers
      In case you missed it, this earlier post has the explanation for the poor read 2 scores.

      Comment


      • Confusion regarding read merging

        Dear Brian, or anybody else who could help me,

        I used the following command for BBMerge:
        bbmerge.sh in=reads.fq out=merged.fq pfilter=1

        I got these stats:
        Pairs: 2545201
        Joined: 1491688 58.61%
        Ambiguous: 439613 17.27%
        No Solution: 613393 24.10%
        Too Short: 0 0.00%
        Avg Insert: 322.6

        My questions:
        1. What happens to the bases during read merging if there is a mismatch outside of the 12 bases this command considers? As I understand it, the minimum number of overlapping bases required for merging is 12. In other words, could you please explain exactly how the merge happens between two paired-end reads when I use the above command for a perfect overlap?

        2. Could you please explain, what do "Ambiguous" and "No solution" mean?

        Thank you so much,
        Ashu

        Comment


        • Hi Ashu,

          "Ambiguous" means there are multiple possible overlaps. For example, if read 1 and read 2 both end with "ACACACACACACACACACACAC", there are lots of possible overlap frames, none of which is particularly better than another. So, that would be ambiguous.

          "No solution" means there is no overlap satisfying BBMerge's fairly strict criteria for the number of matching and mismatching bases in the best possible overlap frame.

          If there is no frame in which the length, entropy (this determines the minimum necessary length), number of matching bases, and number of mismatching bases satisfy the cutoffs, the pair will not be merged and it will be declared "No solution". If there are multiple frames satisfying those cutoffs, and the second-best frame is sufficiently close to the best frame that it's really hard to tell which one is correct, the pair will not be merged and it will be declared "Ambiguous".

          The pair will only be merged if there seems to be an unambiguously good solution.

          "minoverlap=12" means that reads will never be merged if the best overlap is shorter than 12 bp. pfilter=1 will prevent reads from merging if there are any mismatches (I don't particularly recommend this, but it might be useful in some situations...). pfilter means probability filter, and considers the base qualities, so a read with a mismatch on a Q2 base might pass while an otherwise identical read with a mismatch in a Q40 base might fail. BBMerge will still look for all possible overlaps, and if, say, you have a 30bp overlap with 1 mismatch and a 20bp overlap with 0 mismatches, that would still be declared ambiguous.
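          As a sketch (flag names per bbmerge.sh's built-in help; filenames are placeholders), these strictness knobs can be adjusted like this:

          ```shell
          # Require a longer minimum overlap, but allow mismatches
          # (i.e., without pfilter=1); unmerged pairs go to outu
          bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq minoverlap=20

          # Or tighten all of the cutoffs at once with a preset
          bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq strict=t
          ```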

          Incidentally! The BBMerge paper was accepted by PLOS ONE and will be published soon, so you can read all the algorithmic details there =) But I don't actually know the date it will be published, so feel free to ask me more questions in the meantime if I have not sufficiently clarified things.

          Comment


          • Thank you Brian for your reply. I have to merge paired-end reads from a MiSeq run (I quality-trimmed them at Q30). The overlap is around 100 bp according to the experimentalist. What options would you recommend to merge these reads? Once I have the merged reads, I will use dedupe to get all unique merged reads and run further analysis on them.

            Ashu


            Comment


            • I quality trimmed them at Q30
              That is overly strict. What type of dataset is this and do you have a reference genome available?

              Comment


              • As GenoMax says, trimming to Q30 is not beneficial before merging reads. BBMerge has some internal quality-trimming options, so it can try to merge, then quality-trim if it is unsuccessful, then try to merge again, etc. That can slightly increase the merge rate. But typically I just use the whole untrimmed reads as input. The longer the input reads are, the less likely it is for BBMerge to make an accidental incorrect merge, and it does take quality scores into account, so I do not recommend quality-trimming prior to BBMerge. Adapter-trimming is fine though.
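                 If you do want BBMerge to handle the trimming internally, that merge-then-trim-then-retry behavior can be enabled along these lines (flags per bbmerge.sh's built-in help; filenames are placeholders):

                 ```shell
                 # Try to merge the untrimmed reads first; on failure, quality-trim
                 # the right (3') ends and retry at progressively higher thresholds
                 bbmerge.sh in1=R1.fq in2=R2.fq out=merged.fq outu1=unmerged_R1.fq outu2=unmerged_R2.fq qtrim2=r trimq=10,15,20
                 ```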

                Comment


                • Hello Brian,

                  Originally posted by Brian Bushnell View Post
                  Adapter-trimming is fine though.
                  Do you recommend adapter trimming prior to using bbmerge? I thought that if I provide the adapter sequences to bbmerge, it could find the pairs which completely overlap more easily.

                  fin swimmer

                  Comment


                  • Merge pairs before normalisation?

                    Hello, I'm building a pipeline for metagenomics.

                    I follow the BBTools user guide and do:
                    - normalisation with bbnorm
                    - error correction with tadpole
                    - merging (with extension) with bbmerge

                    I want to increase the merge rate to get a better assembly.
                    I suspect that many reads which could be merged are thrown away during the normalisation.

                    Wouldn't it be better to do the merging (without extension) first, then take primarily the merged reads, normalise, error-correct, and merge with extension?

                    What is the best way of normalising paired-end reads and merged pairs or singletons in bbnorm?
                    For now I do two rounds of bbnorm and supply the other reads via the `extra` parameter; is there a better way to do this?
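                    For reference, the three steps above might be sketched like this (filenames and the normalisation target are placeholders; tool names and flags are from BBTools):

                    ```shell
                    # 1) Normalise coverage
                    bbnorm.sh in=reads.fq out=normalised.fq target=40 min=2

                    # 2) Error-correct with Tadpole
                    tadpole.sh in=normalised.fq out=corrected.fq mode=correct

                    # 3) Merge pairs, extending read ends to rescue non-overlapping pairs
                    bbmerge-auto.sh in=corrected.fq out=merged.fq outu=unmerged.fq extend2=50 k=62 rem
                    ```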

                    Comment


                    • Hi,

                      I have shotgun data: paired-end reads, 100 bp on each end. I want to run MetaPhlAn2 next to get the general taxonomy profile.

                      So I am considering merging them before MetaPhlAn2. However, I do not know whether I need to run bbmap first for quality control, or run bbmerge first to merge the sequences. Any suggestions?

                      Thanks in advance

                      Comment


                      • @chloe - It's normally simplest and most effective to do QC first on the raw data, then anything else (such as merging) later.

                        @silask - the way you are doing it is currently the most effective way. It's a little annoying to have to run BBNorm twice, but that's the only way to process both paired and unpaired reads.

                        Comment


                        • Hi, Brian,

                          Thanks for the reply. However, I have tried the QC. I used
                          bbduk.sh in=R1.fastq.gz out=filter_R1.fq maq=30
                          bbduk.sh in=R2.fastq.gz out=filter_R2.fq maq=30
                          (no reads in R1/R2 are removed)

                          bbduk.sh in=R1.fastq.gz out=clean_R1.fq trimq=30
                          bbduk.sh in=R2.fastq.gz out=clean_R2.fq trimq=30
                          (it will trim 50% of reverse reads, but no forward reads)

                          bbduk.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=R1_001.fq out2=R2.fq outm=fail.fq bhist=hist_base.txt qhist=hist_q.txt aqhist=hist_aq.txt bqhist=hist_bq.txt ecco=t
                          (Also no reads will be trimmed)

                          But when I run BBMerge, only 32.268% of the reads can be joined.

                          Do you have any suggestions?

                          Thanks in advance.

                          Comment


                          • @chloe1005: It is possible that only 32% of your reads have inserts of a size that the reads can merge.

                            `trimq=30` is too severe a bar for trimming. If you have a reference genome then not doing any trimming for quality works fine. If you are doing any de novo work then you may want to trim at Q20 or Q25.
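                            A gentler quality-trim along those lines might look like this (flags per bbduk.sh's built-in help; filenames are placeholders):

                            ```shell
                            # Trim both read ends to Q20, keeping the pairing intact
                            bbduk.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=trimmed_R1.fq out2=trimmed_R2.fq qtrim=rl trimq=20
                            ```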

                            Comment


                            • Hi,
                              I am still confused about the difference between quality trimming and quality filtering. What is the difference between them?
                              May I also ask how to get the reference genome? I saw it mentioned in the first posts of this thread.
                              Looking forward to the answer.

                              Comment


                              • Hi Brian, somehow the t=x flag doesn't reduce the number of nodes in use. Any suggestions as to what is going wrong, or can I somehow include Java flags?
                                Bests,
                                Ulrike

                                Comment
