Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    luc,

    That's unfortunate. If you download and unzip a 64-bit version of java 7, then path directly to the java executable rather than relying on the default java of your environment, it should work. Otherwise, the limit is something like 1600MB (specified by the flag -Xmx1600m) if I recall correctly, which is plenty for small references like microbes and fungi.

    -Brian

    Comment


    • #17
      Has anybody experience with running bbmap on clusters?

      Hello everybody,

      I would like to know, if anyone has (good) experience with running bbmap alignments on clusters, which require to submit jobs via a scheduler?

      Because I can successfully start/finish an alignment on the university's cluster when started directly from the terminal (which is prohibited, but who cares on Sundays at 3am?), but the exactly same command written to Portable Batch System (PBS) job script and submitted via qsub to Torque/Maui scheduler fails:

      Code:
      Exception in thread "Thread-5" java.lang.NullPointerException
           at align2.Block.<init>(Block.java:33)
           at align2.Block.read(Block.java:156)
           at align2.IndexMaker4$BlockMaker.makeArrays(IndexMaker4.java:139)
           at align2.IndexMaker4$BlockMaker.run(IndexMaker4.java:127)
      Generated Index:    3.816 seconds.
      Exception in thread "main" java.lang.NullPointerException
           at align2.BBIndex.analyzeIndex(BBIndex.java:116)
           at align2.BBMap.loadIndex(BBMap.java:389)
           at align2.BBMap.main(BBMap.java:30)
      I guess bbmap recognizes that there are more cores available on the system, than I am allowing for the job via the job script and tries to use them nevertheless?

      Since I would like to extent my alignment times beyond weekend nights , I would be happy for suggestions.

      Comment


      • #18
        Originally posted by Thias View Post
        I guess bbmap recognizes that there are more cores available on the system, than I am allowing for the job via the job script and tries to use them nevertheless?
        Are you explicitly indicating number of threads to use for the bbmap command by using the "threads=x" option (BTW: this option is set to "auto" by default so unless you specify a number bbmap will try to use all cores)?

        Comment


        • #19
          Yes, I set the threads=8 option and also a memory limit of 32g for bbmap.

          However I realised that I initially wrote the job script in a way that the scheduler could assign random distributed cores to the job. When I limited that only to cores of one particular node, then everything runs smoothly. So no problem with bbmap, only with me using the batch system correctly .
          Last edited by Thias; 03-25-2014, 07:12 AM.

          Comment


          • #20
            problems with bbmap switches

            Hi Brian,
            Very decent mapper that you wrote, and supergreat that it is finally
            available .

            I was playing around with bbmap, sooo many cool features and an impressive
            speed! But I could not figure out a couple of things:
            1) How to point to directories with path=
            I would like to be able to use a set of indices that I created previously
            and that are stored in a specific 'databases' path mounted on all nodes of
            our cluster. But I did not get this to work setting path= during the
            mapping process because it changes where bbmap searches for the reads. My
            second trial was to call bbmap in this folder of the references and set
            path to the folder of the reads, but then I always got an error that the
            read file was not found. The only thing that worked for me is calling bbmap
            from a folder which also includes the /ref folder, but this means copying
            both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
            bbmap similarly
            2) using outm
            I am mapping shotgun metagenomic illumina paired end reads to references
            that are gene databases. I was expecting to get different output for out=
            and outm= but the files produced are identical. I would expect to see some
            read pairs where only one of the reads maps to the gene database and the
            other not, and as far as I understood out= gives me the mapping pairs and
            outm= gives me the mapping pairs and the single mapping reads with their
            pairs.
            3) how can I get sorted unmapped pairs written to a file?
            While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
            4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.

            Keep up the good work
            Harald

            Comment


            • #21
              Originally posted by HGV View Post
              Hi Brian,
              Very decent mapper that you wrote, and supergreat that it is finally
              available.

              I was playing around with bbmap, sooo many cool features and an impressive
              speed!
              Thanks! Glad you like it.

              But I could not figure out a couple of things:
              1) How to point to directories with path=
              I would like to be able to use a set of indices that I created previously
              and that are stored in a specific 'databases' path mounted on all nodes of
              our cluster. But I did not get this to work setting path= during the
              mapping process because it changes where bbmap searches for the reads. My
              second trial was to call bbmap in this folder of the references and set
              path to the folder of the reads, but then I always got an error that the
              read file was not found. The only thing that worked for me is calling bbmap
              from a folder which also includes the /ref folder, but this means copying
              both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
              bbmap similarly
              BBMap can use absolute paths for all of those, but the syntax is a bit different. Let's say you have 2 builds of the human genome, HG18 and HG19, at /fastas/hg18.fa and /fastas/hg19.fa, reads at /data/reads.fq, and you want to write output to /output/. There are 3 ways you could handle this.

              Version 1: Don't write an index to disk. This usually makes the startup slower, but is more I/O efficient. It's the simplest and best if you are only mapping to a reference once (like when evaluating multiple assemblies).
              bbmap.sh in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk
              bbmap.sh in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk


              Version 2: Write each index to its own directory.
              bbmap.sh ref=/fastas/hg18.fa path=/human/hg18/
              bbmap.sh ref=/fastas/hg19.fa path=/human/hg19/

              bbmap.sh in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam
              bbmap.sh in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam


              Version 3: Write indices to the same directory, but specify a build number.
              bbmap.sh ref=/fastas/hg18.fa path=/human/ build=18
              bbmap.sh ref=/fastas/hg19.fa path=/human/ build=19

              bbmap.sh in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam
              bbmap.sh in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam


              If you don't specify a build number, the default is always 1.

              2) using outm
              I am mapping shotgun metagenomic illumina paired end reads to references
              that are gene databases. I was expecting to get different output for out=
              and outm= but the files produced are identical. I would expect to see some
              read pairs where only one of the reads maps to the gene database and the
              other not, and as far as I understood out= gives me the mapping pairs and
              outm= gives me the mapping pairs and the single mapping reads with their
              pairs.
              For paired reads, outm captures all reads that are mapped OR have a mate that is mapped. You can change this behavior with the flag "po" which stands for "pairedonly". The default is "po=f". If you set "po=t" then reads will be unmapped unless they are paired. So, if only one read maps, or if they both map but are not in a valid pairing configuration, they will go to "outu" instead of "outm". "out" will get both of them no matter what.
              3) how can I get sorted unmapped pairs written to a file?
              While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
              Are you sure? It works for me... though it is not mentioned in the documentation. I'll clarify that.
              bbmap.sh in=reads.fq outu1=r1.fq outu2=r2.fq
              produces 2 files, r1.fq and r2.fq
              You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also - by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc). You can split them back into 2 files with reformat:
              reformat.sh in=reads.fq out1=read1.fq out2=read2.fq
              4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.
              Actually, I only track the average insert size, not the median, because that's the easiest, though I admit the median and SD would be more useful. I'll add that, since I'm tracking the information anyway (if you specify inserthistogram).

              As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it.

              Keep up the good work
              Harald
              I will. Thanks for the feedback!

              Comment


              • #22
                I have uploaded a new version of BBTools, v32.06. Some improvements:


                Shellscripts:

                Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.


                All programs:


                Added append flag; allows you to append to existing output files rather than overwriting.


                BBMap (bbmap.sh):

                Now reports insert size median and standard deviation (as long as you have the "ihist" flag set).

                Added 'qahist' flag, which outputs the observed quality of bases for each claimed quality value. Works for both indel and substitution error models.

                Added 'bhist' flag, which outputs the frequency of ACGTN by base position. Reformat and BBDuk.sh also support 'bhist' and 'qhist' flags.

                'pairlen' flag now works correctly (for capping the max allowed insert size).

                IUPAC reference codes are now converted to N prior to mapping.

                Statistics are now reported both by number of reads and number of bases.

                Added BBWrap (bbwrap.sh), a wrapper for BBMap that allows you to map multiple sets of reads while only loading the index once (important for huge indexes and small read sets). The mapped reads can be written all to the same output file, or to one output file per input file.

                Slightly increased accuracy.

                PacBio reads can now be mapped in fastq format.


                BBMerge (bbmerge.sh):


                Now supports dual output files for unmerged reads, 'outu1' and 'outu2'.

                Improved accuracy.


                BBDuk (bbduk.sh):

                Much faster when used without any reference (e.g. for quality trimming).


                RandomReads:


                Much better simulated PacBio reads.


                Thanks to everyone at seqanswers who has given me feedback!

                Comment


                • #23
                  Originally posted by Brian Bushnell View Post

                  Shellscripts:

                  Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.
                  Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

                  Thanks!

                  Comment


                  • #24
                    Hi Brian,

                    I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.

                    Originally posted by Brian Bushnell View Post
                    I have uploaded a new version of

                    RandomReads:


                    Much better simulated PacBio reads.
                    !

                    Comment


                    • #25
                      Originally posted by GenoMax View Post
                      Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

                      Thanks!
                      For clusters, I recommend setting the -Xmx flag to indicate the amount of memory you requested and 'threads' (or 't') to the number of slots you have requested. We (JGI) run on clusters too, but most of the nodes are exclusive-only, so all hardware resources are available to the user.

                      The new memory-detection system should work on Linux/UGE clusters (it works here). The shellscript calculates the system's free virtual memory, free physical memory, and 'ulimit -v'. Then it will invoke Java with the minimum of those 3 values, and thus, hopefully, will never fail initialization, and never use more RAM than is allowed. Although here, at least, it's impossible to use more RAM than allowed because our cluster will detect that and kill the process.

                      Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.

                      Threads is a little more difficult. For now, unless you have exclusive access to a node, I recommend that you specify the number of threads allowed (e.g. t=4 if you reserved 4 slots). There is an environment variable "NSLOTS" that apparently gets set by UGE, which I will parse in my next release, and default to using the minimum of the system's reported logical cores and NSLOTS. But it appears to not always be correct, and furthermore, hyperthreading makes the matter more confusing (BBMap does not benefit much from hyperthreading, but many of the other BBTools do, like BBNorm).

                      Also - there are basically 4 types of programs in BBTools, resource-wise:

                      1) Super-heavyweight:
                      Requests all available threads and memory by default, but can be capped with the '-Xmx' and 't' flags. That's because all are both multithreaded and require an amount of memory dependent on the input, which the shellscript can't calculate. This includes bbmap.sh, bbsplit.sh, bbwrap.sh, mapPacBio.sh, mapPacBio8k.sh, bbnorm.sh, ecc.sh, khist.sh, bbduk.sh, countkmersexact.sh, and dedupe.sh.

                      2) Heavyweight:
                      Requests all available RAM but only one primary thread. Pipeline-multithreading (input thread, processing thread, output thread) cannot be disabled, so the 't' flag has no effect unless you are doing multithreaded compression using pigz. Includes pileup.sh, randomreads.sh, bbcountunique.sh, and bbmask.sh.

                      3) Midweight:
                      Requests all available threads by default, but can be capped with the 't' flag; needs little memory, so the -Xmx flag is hard-coded at an appropriate value that should work for all inputs (ranging from 100m to 400m) and not affected by available memory, though you can still override it. This includes bbmerge.sh.

                      4) Lightweight:
                      Needs minimal memory and only one primary thread, so again the -Xmx flag is hard-coded to a low value (at most 200m) rather than dependent on autodetection. Again, 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes reformat.sh, stats.sh, statswrapper.sh, readlength.sh, bbest.sh, bbsplitpairs.sh, gradesam.sh, translate6frames.sh, and samtoroc.sh.

                      Note that if you have pigz installed, you can accelerate all BBTools performance on gzipped input or output using the "unpigz" and "pigz" flags (which can be set to 't' or 'f'). pigz allows multithreaded .gz compression and decompression, and is both faster and more efficient (even with only a single thread) than Java or gzip. So with pigz installed, even a mostly singlethreaded program like reformat.sh can (and by default, will) use all all available threads if you write gzipped output:

                      reformat.sh in=read1.fq.gz in2=read2.fq.gz out=interleaved.fasta.gz zl=8

                      ..will compress the output to gzip compression level 8. If pigz is installed, it will use pigz for both compression and decompression instead of Java, resulting in decompression using around 1.5 cores per input file and compression using all allowed cores. You can prevent this with the 't=1' flag (to cap compression at 1 thread) or 'unpigz=f' and 'pigz=f' (to disable decompression/compression in pigz processes).

                      Due to some bugs in UGE (which can kill processes that fork in certain circumstances), pigz is by default set to false in high-memory processes and true in low-memory processes.

                      If you have not tried pigz, I highly recommend it! It's great and will become increasingly important as core counts increase. It's a drop-in replacement for gzip - same command line, 100% inter-compatible, but multithreaded and more efficient per thread. The only downside is increased memory usage, but it's not bad - around 12 MB per thread. Compression (for genomic data) is within ~0.1% of gzip at the same compression level. As I mentioned, decompression uses at most 1.5 cores in my tests, while compression can use all cores.
                      Last edited by Brian Bushnell; 04-17-2014, 08:46 AM.

                      Comment


                      • #26
                        Originally posted by luc View Post
                        Hi Brian,

                        I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.
                        Luc,

                        Sorry about that; randomreads is currently not documented. I will try to remedy that tomorrow, and let you know when it's done.

                        Comment


                        • #27
                          Thanks a lot Brian,

                          BBMap does have a lot of different tools.

                          Comment


                          • #28
                            Originally posted by Brian Bushnell View Post
                            Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.
                            With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.

                            BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).

                            Comment


                            • #29
                              Originally posted by GenoMax View Post
                              With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.
                              Hmm, thanks for testing. Sounds like LSF is setting the ulimit too low, probably for 1 slot no matter how many slots you reserve, I'm guessing.

                              BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).
                              Thanks for noticing that; I've uploaded a new version 32.07 which fixes it. They work fine for me but some versions of bash can't process DOS encoding.

                              By the way - I forgot to mention it, but I added the phiX reference and truseq adapters to the /resources/ directory, for use with BBDuk.

                              Comment


                              • #30
                                Originally posted by luc View Post
                                Hi Brian,

                                I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.
                                Luc,

                                I released a new version (32.14) which now has RandomReads documented (run randomreads.sh with no arguments).

                                By default it will use an Illumina error model, which assigns qualities in a pattern that looks roughly like the average quality histogram of Illumina reads (peaking at about position 20, and sloping down toward the head and tail). Then some bases will by randomly changed based on the quality score. To generate Illumina-like reads from e.coli:
                                randomreads.sh ref=ecoli.fa reads=100000 length=150 maxq=40 midq=30 minq=10 out=synth.fq
                                It can also generate paired reads if desired.

                                To generate PacBio-like reads:
                                randomreads.sh ref=ecoli.fa reads=100000 minlength=200 maxlength=4000 pacbio pbminrate=0.13 pbmaxrate=0.17 out=synth.fq

                                That will add substitutions and single-base-pair indels with per-base probability ranging from 0.13 to 0.17.

                                There are also a lot of other options for adding specific numbers or lengths of insertions, deletions, substitutions, and Ns, in the help menu.

                                Comment

                                Working...
                                X