Header Leaderboard Ad


BBMap (aligner for DNA/RNAseq) is now open-source and available for download.



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Thank you for the response Brian Bushnell !
    We are using the "pairedonly=t" flag, however the reads are still returned with PROPER_PAIR in the SAM file (sam flags 83 and 163).
    It appears the MAPQ is penalized, probably due to the repetitive sequence in read2, however the reads are still shown as properly paired.

    Maybe you'll spot something in the arguments below?
    I've tried all combinations and all SAM alignments have the tag PROPERLY_PAIRED, and are included in the mapped output.
    You can see I also added killbadpairs=t and pairedonly=t, but maybe something else has gone wrong.

    bbmap.sh \
       in=testA_1.subset.fq.gz in2=testA_2.subset.fq.gz \
       ref=/reference_genomes/mm39/mm39canonical.fa \
       path=/reference_genomes/mm39/bbmap_mm39canonical/ \
       out=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bam \
       outu=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.unmapped.bam \
       mappedonly=t \
       trimreaddescriptions=t \
       t=3 \
       maxindel=20 \
       strictmaxindel=t \
       inserttag=t \
       matelen=33000 \
       pairlen=35000 \
       intronlen=999999999 \
       ambig=random \
       minid=0.76 \
       killbadpairs=t \
       pairedonly=t \
       covstats=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_covstats.txt \
       covhist=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_covhist.txt \
       ihist=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_inserthist.txt \
       statsfile=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_alignmentReport.txt \
       2>&1 | tee -a output_testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped_bbmap.txt
    The SAM output is shown below, and all variations of options above appear to have no effect on the SAM flags and tags:
    QNAME                                                                FLAG  RNAME  POS       MAPQ  CIGAR                  RNEXT  PNEXT     TLEN     SEQ                                                  QUAL                                                 FLAGS
    NS500489:195:HLKT3BGXX:1:12109:18917:8143                            83    chr5   11805066  2     3X4=1X2=1X4=2X2=3X29=  =      11375902  -429215  ACATGCCAAAATAACAACCCACTTTTAAGTGTATAATTCATTACACAATGC  //AE//E//E/E/E///EE///EEEEEEEEEEEEEEEEEEEEEEEEAAAAA  XT:A:R  NM:i:10  AM:i:2    X8:Z:429215
    NS500489:195:HLKT3BGXX:1:12109:18917:8143                            163   chr5   11375902  3     51=                    =      11805066  429215   GACAGAGACAGAGAGAGGAACATAGACAGAGACAGAGAGAGGAACATAGAC  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/E/E/E/EE/E  XT:A:R  NM:i:0   AM:i:3    X8:Z:429215
    NS500489:195:HLKT3BGXX:1:11102:2182:13318                            99    chr17  20345318  42    51=                    =      20345469  202      ACTGGTGGCCTTATAACTTGTGTAATAAGAAATGTTTGAGTTTAACTATAA  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  NM:i:0  AM:i:42  X8:Z:202
    NS500489:195:HLKT3BGXX:1:11102:2182:13318                            147   chr17  20345469  42    51=                    =      20345318  -202     AGCTAAGAAATAGTGAGAGCTGTGAAAATTATTCTTTCCCAGGAAGGGTAG  AEEEEAEEEEEEEEEEEEEEEEAEEEEEEEEEEE6EEEEEEE/EEEAAAAA  NM:i:0  AM:i:42  X8:Z:202
    The end goal is not to include that alignment in the "mapped" output file. I can pass it through awk but that's like adding a speed bump to a racetrack, really awk-ward. Yeah, I said it.


    • Is there anything else I need to do to test some Java changes to debug where the issue is happening?
      For example, I would like to turn on verboseS=t and have it spit out debug info. I think the pairlen (MAX_PAIR_DIST) is just not being applied from what I can tell from verbose=t output.


      • Brian Bushnell Thanks for providing this very useful toolset.

        I have a few questions wrt the use of BBSplit, and I guess it's OK to put them in a single post...

        [Edited: well, there has been no reply for a while and I figured out some of the answers myself, and I add them below just in case someone else run into the same problems...]

        1. Is there any recommendation in terms of computation resource allocation and speeding up a run? For human+mouse genome (a total of 5.6GB .fasta files) and decompressed fastq files at about 50GB each (paired-end), I found that nearly 200GB memory was needed when using 1 thread, and it can take several days or even up to a week to run. Is this normal/expected?

        [Edited: in one of my later applications where I only needed to map one of the read-pairs (decompressed at around 50G), I managed to increase the number of threads to 16, with 10G memory per thread, setting java heap size around 50G, and BBSplit finished running overnight.]

        2. From the user guide I read that the `maxindel` parameter should be set to a large number for RNA-seq (e.g. 200k for human according to the guide). Another example in the guide for BBMap says:
        To map vertebrate RNA-seq reads to a genome:
        bbmap.sh in=reads.fq out=mapped.sam maxindel=200k ambig=random intronlen=20 xstag=us
        I could not find additional information on properly different parameter settings for e.g. whole exome sequencing (WES) vs RNA-seq data. Is the above a good default for human+mouse RNA-seq while running BBSplit? (except for that intronlen is probably irrelevant for BBSplit, and xstag may be application-specific)

        3. I have this problem on re-using genome index as follows:

        In this post, someone asked:
        After running bbsplit once using the syntax: ref=ref1.fa,ref2.fa, is it possible to re-use that index on subsequent runs using the path= parameter?
        And the reply was:
        Yes, it is. Also, with BBSplit, I think it will try to regenerate the index every time as long as "ref=" is specified, even if it already exists, so only do that once.
        However, I have some trouble figuring out how exactly to reuse pre-built genome index when running BBSplit. Below is a toy example.

        Let's say I first build an index for two reference genomes, x.fa and y.fa as follows:
        bbsplit.sh build=1 ref=x.fa,y.fa path=./test
        Next I would like to reuse the pre-built index for splitting many fastq files; for one pair of files, I tried, e.g.:
        bbsplit.sh in1=r1.fq in2=r2.fq basename=o%_#.fq build=1 path=./test
        However, when I run it, it seems that the index was re-built; contents under ./test/ref/genome/1 and ./test/ref/index/1 were removed and then re-generated from scratch. Indeed, there is this message below in the log.
        NOTE: Deleting contents of ./test/ref/genome/1 because reference is specified and overwrite=true
        NOTE: Deleting contents of ./test/ref/index/1 because reference is specified and overwrite=true
        Writing reference.
        Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_1939486580590361.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
        Set genScaffoldInfo=true
        Writing chunk 1
        Writing chunk 2
        Even if I do not specify `path=./test` in both commands and just leave it to the default (i.e. the current dir), the above still happens. I also tried to specify `overwrite=f` or `ow=f`, but that did not seem to have an effect either.

        Besides, I also have a related question, which is whether I can pre-build indices separately for two genomes x.fa and y.fa, saved in different paths or with different build numbers, and then later reuse them for aligning with BBSplit (potentially with something like `build=1,2` or `path=<first_path>,<second_path>`).

        [Edited: I figured this out. I first build the index with `bbsplit.sh ref=x.fa,y.fa`, which generated the index in a folder named `ref`, then in the subsequent BBSplit run for each sample organized in its own sample-specific dir, I first link the `ref` folder to the sample dir, then from the sample dir run with something like `bbsplit.sh ref=x.fa,y.fa in1=r1.fq in2=r2.fq basename=o%_#.fq`, now BBSplit can find the index without any issue.]

        I am using BBMap version 39.01, the latest version to date.

        I appreciate your kind help in advance.
        Last edited by KenC; 11-14-2022, 12:43 AM.


        • Hi Brian Bushnell, thank you very much for developing BBMap. It is such a useful tool!

          I am trying to map sequencing reads to a metagenome I had previously assembled and I am getting a very strange output. After running the alignment, I also ran pileup.sh to obtain coverage stats and I got the following output:

          pileup.sh in=mapped.sam out=stats.txt
          Reads: 18093134
          Mapped reads: 3652730
          Mapped bases: 331276393
          Ref scaffolds: 134515
          Ref bases: 43282856
          Percent mapped: 20.188
          Percent proper pairs: 12.341
          Average coverage: 7.654
          Average coverage with deletions: 7.689
          Standard deviation: 9.676
          Percent scaffolds with any coverage: 99.51
          Percent of reference bases covered: 99.34
          If I interpreted this summary correctly, while most of my contigs/scaffolds are being covered, only ~20% of my input reads are being mapped to said contigs. However, I do no understand what happened with the other ~80% since those were the same reads I used to assemble the contigs. Does this mean that my assembler (metaSPAdes) was not able to assemble that ~80% into any kind of contigs? And if that is the case, do you have any idea why this happens? Sorry if this last question is beyond the scope of this forum.

          Thanks in advance with any help you might be able to provide,



          • Hi Brian,

            I ran filterbyname.sh to extract some FastQ reads from a bigger FastQ file and I noticed that on the quality line of the output, the # (hashtag symbol) gets replaced by an ! (exclamation mark symbol). I checked the documentation and the forums about this behaviour but couldn't find anything.

            Is this expected? If so, why?

            Mine were latest FastQ files produced by BaseSpace from an Illumina MiSeq instrument.

            Thanks in advance.