No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    To expand upon that, BBMap has the options for internally quality-trimming (and untrimming) reads, but I only use them in special cases. For example, BBMap is used a lot for filtering out contamination. To ensure high specificity, I only want to remove reads that map to contaminant sequences with high identity; but I don't want the identity calculation to include low-quality bases. So, I map with "qtrim=rl trimq=15 untrim" or similar, which will trim before mapping but then output the untrimmed read.

    For normal cases such as variant calling or coverage calculation, don't quality-trim unless you have really low quality data. I would not expect that to be the case for a 100bp library.


    • #17
      Aah, I see. Thanks to both of you. Would these flags work with BBDuk then?

      Data quality is decent. I will map to a de novo metagenome assembly generated from the same data. I am kind of hopeful to be able to do a little variant calling after metagenome binning and mapping. Given this situation, would you still rather not quality trim?



      • #18
        Those flags work in BBDuk as well, except for "untrim", which is unique to BBMap.

        A good variant caller will take the quality into account when weighing possible variations. Quality-trimming before mapping gives the mapper less information, and thus a higher probability of ambiguously-mapped or incorrectly-mapped reads (though adapter-trimming before mapping is always good). Even if the trimmed bases are low quality - say, 50 bases at Q13 - that's still an expected 45 matches and 5 mismatches, which can greatly increase mapping confidence in a repetitive area, and thus improve the ability to call the variations from the high-quality part of the read.

        On the other hand, quality-trimming before assembly is often a good idea, though the exact threshold depends on the assembler and dataset.


        • #19
          Is there a way of efficiently using the BBMap tools to cut variable length and variable sequence primer pairs from amplicon sequencing data either pre-alignment (fastq) or post-alignment (BAM)? I am thinking in regards to RainDance and Fluidigm data where there are hundreds of amplicons, many of which overlap, each with different forward/reverse primers.


          • #20
            BBTools has is a pair of programs for cutting (and outputting to a file) the sequence between primer pairs, but not for cutting out primers themselves. Is that what you are after?

            The usage is like this:

            Assume your set of primers for one end is AAAAAA and AAAAAT, and for the other end is GGGGGG and GGGCCC:

   in=amplicons.fq out=primer1.sam literal=AAAAAA,AAAAAT
   in=amplicons.fq out=primer2.sam literal=GGGGGG,GGGCCC

            That will find the optimal alignment for the optimal primer, and output one line per amplicon (twice). Then run this:

   in=amplicons.fq out=middle.fq sam1=primer1.sam sam2=primer2.sam

            "middle.fq" will contain the sequence between the primers, one per amplicon, using the best alignment. I designed this to cut V4 regions from full-length 16s.

            Notes: stands for MultiStateAligner, not for Multiple Sequence Alignment. All sequences can be in any orientation (forward or reverse complement).

            Please let me know if that doesn't address your question; I may have misunderstood it.


            • #21
              I think @duane.hassane is asking to remove primers from amplicons with a common core sequence but with variable (length and/or sequence?) primers on the ends of that core sequence for each group.


              • #22
                Thanks. To be more clear, I am after removing the primers themselves (either from the fastq or BAM) and preserving the intervening sequence only. In this use case, there are hundreds of different primer pairs with known mapping locations (because they are used for targeted enrichment of different exons). Thus, for each primer pair, the intervening sequence (target being enriched) is different.


                • #23
                  Have you tried to remove these primer sequences with bbduk?


                  • #24
                    Yes, bbduk with restrictright and/or restrictleft set to constrain the search to the ends where primers are located. This is actually extremely fast and efficient even with hundreds of primers. Unfortunately, we have encountered some specificity issues with this approach (over- and under-trimming) leading to small gaps in coverage ... with no one set of parameters being optimal for all amplicons.


                    • #25
                      Hmm, I don't currently have a better solution than BBDuk or the msa/cutprimers pair. You might get a better result if you first bin the amplicons by which primers they use, though, so that then you can trim them only using the correct primers using aggressive settings, rather than with all possible primers. You can try using Seal for that. For example:

             in=amplicons.fq ref=primer1.fa pattern=out%.fq k=(whatever)

                      That will produce one file per sequence in the primer file, with all the reads that best match it (meaning, share the most kmers). For that purpose you should only use the left primers or right primers as reference, not both at the same time. Or, ideally, the reference would look like this:

                      Primer pairs: {(AAAT, GGCC), (TTTT, GGGG)}

                      Reference file:


                      ...etc, where each primer pair is concatenated together with a single N between the sequences.
                      Last edited by Brian Bushnell; 06-11-2015, 11:17 AM.


                      • #26
                        Ok. Will take a look at this approach. Thank you!


                        • #27
                          Just a comment, not sure if this is the place:
                          It would be helpful for downstream analysis if the refstat files that can be output using bbsplit produced 0s (zeros) for references that have no reads mapped to them instead of omitting data for these references. I was working on generating a table for visualization when I noticed missing data.


                          • #28
                            Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false). That's not documented in BBSplit's shellscript, just BBMap's; I'll add it to the documentation.


                            • #29
                              Wonderful, thanks!


                              • #30
                                I have been using bbmap (and other bbtools) for some of my analyses, and so far they are a great set of tools! Unfortunately I have come across a problem, and some preliminary searching hasn't turned up an answer.

                                I am trying to build bbmap into a galaxy workflow, using it to remove reads that map to a reference of a contaminant (E.coli in this case). The problem is that I just want the unmapped reads in fastq format, but because galaxy names everything with the .dat extension, I cannot tell bbmap to output in fastq format and it defaults to sam. Is there another way to specify the output format?

                                I'm sure I could convert the sam back to fastq, but the picard tools sam to fastq tool is throwing an error saying that there are unpaired reads in the sam. I know I can probably do this another way, but it seems silly to use another tool when bbmap can already output in the format I want!

                                I'm also confused as to why there would be unpaired reads, since the bbmap documentation for "outu" says:
                                Write only unmapped reads to this file. Does not include unmapped paired reads with a mapped mate.
                                Sounds to me like all the reads should be paired?

                                Thank you in advance, and thanks for the great tools!


                                Latest Articles


                                • seqadmin
                                  Advanced Tools Transforming the Field of Cytogenomics
                                  by seqadmin

                                  At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
                                  Yesterday, 06:26 AM
                                • seqadmin
                                  How RNA-Seq is Transforming Cancer Studies
                                  by seqadmin

                                  Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
                                  09-07-2023, 11:15 PM
                                • seqadmin
                                  Methods for Investigating the Transcriptome
                                  by seqadmin

                                  Ribonucleic acid (RNA) represents a range of diverse molecules that play a crucial role in many cellular processes. From serving as a protein template to regulating genes, the complex processes involving RNA make it a focal point of study for many scientists. This article will spotlight various methods scientists have developed to investigate different RNA subtypes and the broader transcriptome.

                                  Whole Transcriptome RNA-seq
                                  Whole transcriptome sequencing...
                                  08-31-2023, 11:07 AM





                                Topics Statistics Last Post
                                Started by seqadmin, Today, 06:57 AM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 07:53 AM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 09-25-2023, 07:42 AM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 09-22-2023, 09:05 AM
                                0 responses
                                Last Post seqadmin