Header Leaderboard Ad


BBMap (aligner for DNA/RNAseq) is now open-source and available for download.



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • santiagorevale
    Hi Brian,

    I ran filterbyname.sh to extract some FastQ reads from a bigger FastQ file and I noticed that on the quality line of the output, the # (hashtag symbol) gets replaced by an ! (exclamation mark symbol). I checked the documentation and the forums about this behaviour but couldn't find anything.

    Is this expected? If so, why?

    Mine were latest FastQ files produced by BaseSpace from an Illumina MiSeq instrument.

    Thanks in advance.


    Leave a comment:

  • Esteban Gongora
    Hi Brian Bushnell, thank you very much for developing BBMap. It is such a useful tool!

    I am trying to map sequencing reads to a metagenome I had previously assembled and I am getting a very strange output. After running the alignment, I also ran pileup.sh to obtain coverage stats and I got the following output:

    pileup.sh in=mapped.sam out=stats.txt
    Reads: 18093134
    Mapped reads: 3652730
    Mapped bases: 331276393
    Ref scaffolds: 134515
    Ref bases: 43282856
    Percent mapped: 20.188
    Percent proper pairs: 12.341
    Average coverage: 7.654
    Average coverage with deletions: 7.689
    Standard deviation: 9.676
    Percent scaffolds with any coverage: 99.51
    Percent of reference bases covered: 99.34
    If I interpreted this summary correctly, while most of my contigs/scaffolds are being covered, only ~20% of my input reads are being mapped to said contigs. However, I do no understand what happened with the other ~80% since those were the same reads I used to assemble the contigs. Does this mean that my assembler (metaSPAdes) was not able to assemble that ~80% into any kind of contigs? And if that is the case, do you have any idea why this happens? Sorry if this last question is beyond the scope of this forum.

    Thanks in advance with any help you might be able to provide,


    Leave a comment:

  • KenC
    Brian Bushnell Thanks for providing this very useful toolset.

    I have a few questions wrt the use of BBSplit, and I guess it's OK to put them in a single post...

    [Edited: well, there has been no reply for a while and I figured out some of the answers myself, and I add them below just in case someone else run into the same problems...]

    1. Is there any recommendation in terms of computation resource allocation and speeding up a run? For human+mouse genome (a total of 5.6GB .fasta files) and decompressed fastq files at about 50GB each (paired-end), I found that nearly 200GB memory was needed when using 1 thread, and it can take several days or even up to a week to run. Is this normal/expected?

    [Edited: in one of my later applications where I only needed to map one of the read-pairs (decompressed at around 50G), I managed to increase the number of threads to 16, with 10G memory per thread, setting java heap size around 50G, and BBSplit finished running overnight.]

    2. From the user guide I read that the `maxindel` parameter should be set to a large number for RNA-seq (e.g. 200k for human according to the guide). Another example in the guide for BBMap says:
    To map vertebrate RNA-seq reads to a genome:
    bbmap.sh in=reads.fq out=mapped.sam maxindel=200k ambig=random intronlen=20 xstag=us
    I could not find additional information on properly different parameter settings for e.g. whole exome sequencing (WES) vs RNA-seq data. Is the above a good default for human+mouse RNA-seq while running BBSplit? (except for that intronlen is probably irrelevant for BBSplit, and xstag may be application-specific)

    3. I have this problem on re-using genome index as follows:

    In this post, someone asked:
    After running bbsplit once using the syntax: ref=ref1.fa,ref2.fa, is it possible to re-use that index on subsequent runs using the path= parameter?
    And the reply was:
    Yes, it is. Also, with BBSplit, I think it will try to regenerate the index every time as long as "ref=" is specified, even if it already exists, so only do that once.
    However, I have some trouble figuring out how exactly to reuse pre-built genome index when running BBSplit. Below is a toy example.

    Let's say I first build an index for two reference genomes, x.fa and y.fa as follows:
    bbsplit.sh build=1 ref=x.fa,y.fa path=./test
    Next I would like to reuse the pre-built index for splitting many fastq files; for one pair of files, I tried, e.g.:
    bbsplit.sh in1=r1.fq in2=r2.fq basename=o%_#.fq build=1 path=./test
    However, when I run it, it seems that the index was re-built; contents under ./test/ref/genome/1 and ./test/ref/index/1 were removed and then re-generated from scratch. Indeed, there is this message below in the log.
    NOTE: Deleting contents of ./test/ref/genome/1 because reference is specified and overwrite=true
    NOTE: Deleting contents of ./test/ref/index/1 because reference is specified and overwrite=true
    Writing reference.
    Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_1939486580590361.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
    Set genScaffoldInfo=true
    Writing chunk 1
    Writing chunk 2
    Even if I do not specify `path=./test` in both commands and just leave it to the default (i.e. the current dir), the above still happens. I also tried to specify `overwrite=f` or `ow=f`, but that did not seem to have an effect either.

    Besides, I also have a related question, which is whether I can pre-build indices separately for two genomes x.fa and y.fa, saved in different paths or with different build numbers, and then later reuse them for aligning with BBSplit (potentially with something like `build=1,2` or `path=<first_path>,<second_path>`).

    [Edited: I figured this out. I first build the index with `bbsplit.sh ref=x.fa,y.fa`, which generated the index in a folder named `ref`, then in the subsequent BBSplit run for each sample organized in its own sample-specific dir, I first link the `ref` folder to the sample dir, then from the sample dir run with something like `bbsplit.sh ref=x.fa,y.fa in1=r1.fq in2=r2.fq basename=o%_#.fq`, now BBSplit can find the index without any issue.]

    I am using BBMap version 39.01, the latest version to date.

    I appreciate your kind help in advance.
    Last edited by KenC; 11-14-2022, 12:43 AM.

    Leave a comment:

  • jmw86069
    Is there anything else I need to do to test some Java changes to debug where the issue is happening?
    For example, I would like to turn on verboseS=t and have it spit out debug info. I think the pairlen (MAX_PAIR_DIST) is just not being applied from what I can tell from verbose=t output.

    Leave a comment:

  • jmw86069
    Thank you for the response Brian Bushnell !
    We are using the "pairedonly=t" flag, however the reads are still returned with PROPER_PAIR in the SAM file (sam flags 83 and 163).
    It appears the MAPQ is penalized, probably due to the repetitive sequence in read2, however the reads are still shown as properly paired.

    Maybe you'll spot something in the arguments below?
    I've tried all combinations and all SAM alignments have the tag PROPERLY_PAIRED, and are included in the mapped output.
    You can see I also added killbadpairs=t and pairedonly=t, but maybe something else has gone wrong.

    bbmap.sh \
       in=testA_1.subset.fq.gz in2=testA_2.subset.fq.gz \
       ref=/reference_genomes/mm39/mm39canonical.fa \
       path=/reference_genomes/mm39/bbmap_mm39canonical/ \
       out=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bam \
       outu=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.unmapped.bam \
       mappedonly=t \
       trimreaddescriptions=t \
       t=3 \
       maxindel=20 \
       strictmaxindel=t \
       inserttag=t \
       matelen=33000 \
       pairlen=35000 \
       intronlen=999999999 \
       ambig=random \
       minid=0.76 \
       killbadpairs=t \
       pairedonly=t \
       covstats=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_covstats.txt \
       covhist=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_covhist.txt \
       ihist=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_inserthist.txt \
       statsfile=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_alignmentReport.txt \
       2>&1 | tee -a output_testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped_bbmap.txt
    The SAM output is shown below, and all variations of options above appear to have no effect on the SAM flags and tags:
    QNAME                                                                FLAG  RNAME  POS       MAPQ  CIGAR                  RNEXT  PNEXT     TLEN     SEQ                                                  QUAL                                                 FLAGS
    NS500489:195:HLKT3BGXX:1:12109:18917:8143                            83    chr5   11805066  2     3X4=1X2=1X4=2X2=3X29=  =      11375902  -429215  ACATGCCAAAATAACAACCCACTTTTAAGTGTATAATTCATTACACAATGC  //AE//E//E/E/E///EE///EEEEEEEEEEEEEEEEEEEEEEEEAAAAA  XT:A:R  NM:i:10  AM:i:2    X8:Z:429215
    NS500489:195:HLKT3BGXX:1:12109:18917:8143                            163   chr5   11375902  3     51=                    =      11805066  429215   GACAGAGACAGAGAGAGGAACATAGACAGAGACAGAGAGAGGAACATAGAC  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/E/E/E/EE/E  XT:A:R  NM:i:0   AM:i:3    X8:Z:429215
    NS500489:195:HLKT3BGXX:1:11102:2182:13318                            99    chr17  20345318  42    51=                    =      20345469  202      ACTGGTGGCCTTATAACTTGTGTAATAAGAAATGTTTGAGTTTAACTATAA  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  NM:i:0  AM:i:42  X8:Z:202
    NS500489:195:HLKT3BGXX:1:11102:2182:13318                            147   chr17  20345469  42    51=                    =      20345318  -202     AGCTAAGAAATAGTGAGAGCTGTGAAAATTATTCTTTCCCAGGAAGGGTAG  AEEEEAEEEEEEEEEEEEEEEEAEEEEEEEEEEE6EEEEEEE/EEEAAAAA  NM:i:0  AM:i:42  X8:Z:202
    The end goal is not to include that alignment in the "mapped" output file. I can pass it through awk but that's like adding a speed bump to a racetrack, really awk-ward. Yeah, I said it.

    Leave a comment:

  • Brian Bushnell
    Oh, that flag might be a little misleading... it's the maximum distance for reads to be considered as 'proper pairs'. Beyond that the mapq is penalized, alternative sites are prioritized, and the 'proper pair' flag bit is not set. If you want to ban pairs from mapping at all beyond that distance you need to use either the 'killbadpairs' (which will mark one of the two as unmapped) or 'pairedonly' (which will mark them both as unmapped) flag.

    Leave a comment:

  • jmw86069
    Hi Brian Bushnell thanks again for the BBTools suite, it's really amazing! Lots of configurable settings, fantastic speed, logic, use of kmers, it's got it all. I'm replacing various bits of our pipelines with BBTools equivalents, and so far really loving the improvements. Also enjoyed reading through the various forum posts and responses, impressive support for the tool over the years. I digress.

    I cannot figure out how to get bbmap.sh to filter paired read alignments greater than 32000 distance, which is what I understood the pairlen=32000 argument is supposed to do. It just doesn't filter out alignments. For example, one alignment shows BAM TLEN=429215 for read1 (and -429215 for read2), and there is even the optional tag X8:Z:429215. All consistent. And even with or without read lengths in the distance calculation it should be well over the cutoff (I understood pairlen was distance between read ends pointing toward each other).

    Am I using the wrong argument? pairlen=32000 is default, even setting it to pairlen=40000 or pairlen=20000 it doesn't appear to do anything, or not what I expected anyway.

    Thanks in advance for the help!


    Leave a comment:

  • Brian Bushnell
    Originally posted by Thias View Post
    I think you found a bug ...
    When you look at the given line 625 in the file ./current/shared/TrimRead.java, it evaluates the CIGAR string and asserts that it ends with a match (M or =). This should be the case because the trimReadWithMatchFast() function is only invoked for 100% match or unmapped reads:

    char c=sl.cigar.charAt(sl.cigar.length()-1);
    assert(c=='M' || c=='=') : c+"; "+sl.cigar+"\n"+sl;
    But in your case, you have the hard-trimmed right part, so c is a H and the assertion fails, even though the function is rightfully invoked, since your remaining 22 bases are all matching.
    Yes, that is exactly correct; good analysis. I need to add support for H symbols, which I did not expect to encounter in that function.

    Leave a comment:

  • Thias
    I think you found a bug ...
    When you look at the given line 625 in the file ./current/shared/TrimRead.java, it evaluates the CIGAR string and asserts that it ends with a match (M or =). This should be the case because the trimReadWithMatchFast() function is only invoked for 100% match or unmapped reads:

    char c=sl.cigar.charAt(sl.cigar.length()-1);
    assert(c=='M' || c=='=') : c+"; "+sl.cigar+"\n"+sl;
    But in your case, you have the hard-trimmed right part, so c is a H and the assertion fails, even though the function is rightfully invoked, since your remaining 22 bases are all matching.

    Leave a comment:

  • CharlesARoy
    Hi Brian Bushnell

    I recently installed BBMap_38.98.tar.gz from SourceForge on a machine running Ubuntu 20.04.4 LTS.

    Per the usage guide, I checked my current version of Java with "java -Xmx90m -version" which returned the following:
    openjdk version "11.0.15" 2022-04-19
    OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1)
    OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1, mixed mode, sharing)

    I tried running reformat on a sam file with a command like:
    reformat.sh in=in.sam out=out.sam forcetrimleft=10

    And got the following error:
    java -ea -Xms300m -cp /data/usr/croy/tools/bbmap/current/ jgi.ReformatReads in=in.sam out=out.sam forcetrimleft=10
    Executing jgi.ReformatReads [in=in.sam, out=out.sam, forcetrimleft=10]

    Input is being processed as unpaired
    Waiting on header to be read from a sam file.
    Exception in thread "main" java.lang.AssertionError: H; 22M123H
    FS10001085:123:BRL95608-3321:1:1101:7680:1000 2195 chr3 50608025 18 22M123H = 50607939 -98 AGTGGTTTTCACTGACAGCGTG F,:FFFFFFFF::FFF:F,F,F NM:i:0 MD:Z:22 MC:Z:93M21D52M AS:i:22 XS:i:0 SA:Z:chr3,50608053,-,17S128M,60,1;
    at shared.TrimRead.trimReadWithMatchFast(TrimRead.java:625)
    at shared.TrimRead.trimReadWithMatch(TrimRead.java:684)
    at shared.TrimRead.trimByAmount(TrimRead.java:279)
    at shared.TrimRead.trimToPosition(TrimRead.java:240)
    at jgi.ReformatReads.process(ReformatReads.java:919)
    at jgi.ReformatReads.main(ReformatReads.java:53)

    The sam file was aligned to hg38 using a recent version of BWA MEM (v0.7.17-r1188). The CIGAR string seems to follow the SAM specs as far as I can tell, so I'm not sure where things are going wrong.

    Please let me know if you need any more information.



    Leave a comment:

  • Omri.huji
    Is bbmap missing mappings of bbmapskimmer ?

    Hello Brian,

    I will be glad to get your help with the next issue.
    I mapped a large DB of reads(over 100K) against the human genome, and found different results between bbmap's and bbmapskimmer's mappings.

    I used bbmap to map reads against hg19, then I was interested finding all the mappings over the genome- so I used bbmapskimmer.
    Except this concept I do not know any main differences between bbmap and bbmapskimmer codes (am I right?).

    And still, bbmap mapped 98,425 reads and bbmapskimmer mapped 107,708. While 96,987 reads are shared.
    Means bbmap mapped 1,438 reads that wasn't mapped by bbmapskimmer, which did mapped another 10,721 that wasn't mapped by bbmap- some of them with perfect minid score of 1.0.
    Is there any possible reason for that?

    I attached SAM files, with the relevant mappings for reads with maximum minid score in each algorithem (best minid for 1,438 reads of bbmap is 0.96, and for bbmapskimmer is 1.0).

    - I used the next reference genome: hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz

    - All mappings were done with next parameters: minid=0.9 ambig=all
    Attached Files

    Leave a comment:

  • sghignone
    filter reads based on GC% content

    Hello Brian and everyone,
    I'm wondering whether it could be possible (and with which tool) to perform a read filter based on its GC% content.
    For example, working with a metagenome with fungal and bacterial DNA fragments, how can I retain only paired-end reads with a GC > 50% and isolate bacterial reads?
    Thanks for the attention

    Leave a comment:

  • ellalalalala
    Error running BBMap


    I already used BBMap successfully in the past to map RNA-seq data to a multi-FASTA transcriptome. Now I was trying to map to the human genome (downloaded from human genome browser of NCBI). I used the following command:

    bbmap.sh ref=$mapping_ref build="$BBmap_ref_ID" in=$BBduked_reads1 in2=$BBduked_reads2 \
    outu=unmapped_"$name1"_"$date".sam outm=mapped_"$name1"_"$date".sam \
    maxindel=200k mdtag=true sam=1.4 subfilter=3 pairedonly=true

    Unfortunately I´m running into an error:

    Retaining first best site only for ambiguous mappings.
    Writing reference.
    Executing dna.FastaToChromArrays2 [/scratch/hpc-prf-ptma2/ptma2001/bbmap/GRCh38_genome_16032022.fna, 6, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

    Set genScaffoldInfo=true
    Writing chunk 1
    Writing chunk 2
    Exception in thread "main" java.lang.NumberFormatException: For input string: "ӌ�jHN�H� �¬���l��,ΚK��/S��ͩm���T
    [email protected]�
    �m��$֧����@�H,���xI%_ Uz�?� S�B�2[*i�_����)i�NEր�ٜ�?Z�i£�� "
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at dna.Data.setGenome2(Data.java:1016)
    at dna.Data.setGenome(Data.java:769)
    at align2.BBMap.loadIndex(BBMap.java:316)
    at align2.BBMap.main(BBMap.java:32)

    Any idea what causes the problem and how I could fix it?

    Thanks a lot in advance and have a great day


    Leave a comment:

  • Ahmer
    BBmap Error

    I am getting this error how to solve it.

    java -ea -Xmx4393m -Xms4393m -cp /home/szweda/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in=PY53_contigs.fasta out=PY53_bbmapped.sam bamscript=bs.sh
    Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in=PY53_contigs.fasta, out=PY53_bbmapped.sam, bamscript=bs.sh]
    Version 38.95

    Retaining first best site only for ambiguous mappings.
    Found samtools 1.10
    Exception in thread "main" java.lang.RuntimeException: Can't find file ref/genome/1/summary.txt
    at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:933)
    at fileIO.ReadWrite.getInputStream(ReadWrite.java:898)
    at fileIO.TextFile.open(TextFile.java:280)
    at fileIO.TextFile.<init>(TextFile.java:123)
    at dna.Data.setGenome2(Data.java:823)
    at dna.Data.setGenome(Data.java:769)
    at align2.BBMap.loadIndex(BBMap.java:316)
    at align2.BBMap.main(BBMap.java:32)

    Leave a comment:

  • gtwa-bio
    Entropy filtering

    I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step with bbduk after our other preprocessing steps. I was hoping I could ask you for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.
    • In some of our samples, we have an overrepresentation of G homopolymers. We’re confident that these are technical artifacts from the NextSeq sequencing protocol. I know we can filter these out with an entropy threshold of 0.1. However, I’ve seen in some metagenomic studies they filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising this entropy threshold in our case, and if so to what new value?
    • On the more technical side of how bbduk implements entropy filtering…
      • How does the sliding window traverse a read? Is it one base at a time, or does it move by the whole window size?
      • If a read has a region of low complexity sequences at the beginning/end, are only these sections filtered or is the entire read removed?
      • How might a read with an internal region of low complexity be treated?

    Leave a comment:

Latest Articles


  • seqadmin
    A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
    by seqadmin

    ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

    01-24-2023, 01:19 PM
  • seqadmin
    Introduction to Single-Cell Sequencing
    by seqadmin
    Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

    The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
    01-09-2023, 03:10 PM