Unconfigured Ad

**Brian Bushnell** · 05-19-2015, 09:29 AM

To expand upon that, BBMap has the options for internally quality-trimming (and untrimming) reads, but I only use them in special cases. For example, BBMap is used a lot for filtering out contamination. To ensure high specificity, I only want to remove reads that map to contaminant sequences with high identity; but I don't want the identity calculation to include low-quality bases. So, I map with "qtrim=rl trimq=15 untrim" or similar, which will trim before mapping but then output the untrimmed read.

For normal cases such as variant calling or coverage calculation, don't quality-trim unless you have really low quality data. I would not expect that to be the case for a 100bp library.

**fdts** · 05-20-2015, 12:27 AM

Aah, I see. Thanks to both of you. Would these flags work with BBDuk then?

Data quality is decent. I will map to a de novo metagenome assembly generated from the same data. I am kind of hopeful to be able to do a little variant calling after metagenome binning and mapping. Given this situation, would you still rather not quality trim?

Thanks

**Brian Bushnell** · 05-20-2015, 09:27 AM

Those flags work in BBDuk as well, except for "untrim", which is unique to BBMap.

A good variant caller will take the quality into account when weighing possible variations. Quality-trimming before mapping gives the mapper less information, and thus a higher probability of ambiguously-mapped or incorrectly-mapped reads (though adapter-trimming before mapping is always good). Even if the trimmed bases are low quality - say, 50 bases at Q13 - that's still an expected 45 matches and 5 mismatches, which can greatly increase mapping confidence in a repetitive area, and thus improve the ability to call the variations from the high-quality part of the read.

On the other hand, quality-trimming before assembly is often a good idea, though the exact threshold depends on the assembler and dataset.

**duane.hassane** · 06-11-2015, 04:58 AM

Is there a way of efficiently using the BBMap tools to cut variable length and variable sequence primer pairs from amplicon sequencing data either pre-alignment (fastq) or post-alignment (BAM)? I am thinking in regards to RainDance and Fluidigm data where there are hundreds of amplicons, many of which overlap, each with different forward/reverse primers.

**Brian Bushnell** · 06-11-2015, 08:47 AM

BBTools has is a pair of programs for cutting (and outputting to a file) the sequence between primer pairs, but not for cutting out primers themselves. Is that what you are after?

The usage is like this:

Assume your set of primers for one end is AAAAAA and AAAAAT, and for the other end is GGGGGG and GGGCCC:

msa.sh in=amplicons.fq out=primer1.sam literal=AAAAAA,AAAAAT
msa.sh in=amplicons.fq out=primer2.sam literal=GGGGGG,GGGCCC

That will find the optimal alignment for the optimal primer, and output one line per amplicon (twice). Then run this:

cutprimers.sh in=amplicons.fq out=middle.fq sam1=primer1.sam sam2=primer2.sam

"middle.fq" will contain the sequence between the primers, one per amplicon, using the best alignment. I designed this to cut V4 regions from full-length 16s.

Notes: msa.sh stands for MultiStateAligner, not for Multiple Sequence Alignment. All sequences can be in any orientation (forward or reverse complement).

Please let me know if that doesn't address your question; I may have misunderstood it.

**GenoMax** · 06-11-2015, 09:24 AM

I think @duane.hassane is asking to remove primers from amplicons with a common core sequence but with variable (length and/or sequence?) primers on the ends of that core sequence for each group.

**duane.hassane** · 06-11-2015, 09:44 AM

Thanks. To be more clear, I am after removing the primers themselves (either from the fastq or BAM) and preserving the intervening sequence only. In this use case, there are hundreds of different primer pairs with known mapping locations (because they are used for targeted enrichment of different exons). Thus, for each primer pair, the intervening sequence (target being enriched) is different.

**luc** · 06-11-2015, 09:58 AM

Have you tried to remove these primer sequences with bbduk?

**duane.hassane** · 06-11-2015, 10:18 AM

Yes, bbduk with restrictright and/or restrictleft set to constrain the search to the ends where primers are located. This is actually extremely fast and efficient even with hundreds of primers. Unfortunately, we have encountered some specificity issues with this approach (over- and under-trimming) leading to small gaps in coverage ... with no one set of parameters being optimal for all amplicons.

**Brian Bushnell** · 06-11-2015, 10:41 AM

Hmm, I don't currently have a better solution than BBDuk or the msa/cutprimers pair. You might get a better result if you first bin the amplicons by which primers they use, though, so that then you can trim them only using the correct primers using aggressive settings, rather than with all possible primers. You can try using Seal for that. For example:

seal.sh in=amplicons.fq ref=primer1.fa pattern=out%.fq k=(whatever)

That will produce one file per sequence in the primer file, with all the reads that best match it (meaning, share the most kmers). For that purpose you should only use the left primers or right primers as reference, not both at the same time. Or, ideally, the reference would look like this:

Primer pairs: {(AAAT, GGCC), (TTTT, GGGG)}

Reference file:

>1
AAATNGGCC
>2
TTTTNGGGG

...etc, where each primer pair is concatenated together with a single N between the sequences.

**duane.hassane** · 06-11-2015, 10:47 AM

Ok. Will take a look at this approach. Thank you!

**fdts** · 07-01-2015, 05:50 AM

Just a comment, not sure if this is the place:
It would be helpful for downstream analysis if the refstat files that can be output using bbsplit produced 0s (zeros) for references that have no reads mapped to them instead of omitting data for these references. I was working on generating a table for visualization when I noticed missing data.

**Brian Bushnell** · 07-01-2015, 09:38 AM

Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false). That's not documented in BBSplit's shellscript, just BBMap's; I'll add it to the documentation.

**fdts** · 07-01-2015, 11:39 PM

Wonderful, thanks!

**Thowell** · 07-11-2015, 06:35 PM

I have been using bbmap (and other bbtools) for some of my analyses, and so far they are a great set of tools! Unfortunately I have come across a problem, and some preliminary searching hasn't turned up an answer.

I am trying to build bbmap into a galaxy workflow, using it to remove reads that map to a reference of a contaminant (E.coli in this case). The problem is that I just want the unmapped reads in fastq format, but because galaxy names everything with the .dat extension, I cannot tell bbmap to output in fastq format and it defaults to sam. Is there another way to specify the output format?

I'm sure I could convert the sam back to fastq, but the picard tools sam to fastq tool is throwing an error saying that there are unpaired reads in the sam. I know I can probably do this another way, but it seems silly to use another tool when bbmap can already output in the format I want!

I'm also confused as to why there would be unpaired reads, since the bbmap documentation for "outu" says:

Write only unmapped reads to this file. Does not include unmapped paired reads with a mapped mate.

Sounds to me like all the reads should be paired?

Thank you in advance, and thanks for the great tools!

Topics	Statistics	Last Post
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, Today, 05:37 AM	0 responses 5 views 0 reactions	Last Post by SEQadmin2 Today, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 49 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 109 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News