Header Leaderboard Ad


BBMap (aligner for DNA/RNAseq) is now open-source and available for download.



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • BBmap on Ion data

    I tried to run BBmap to call variants on four BAM files (2 parents, 2 children) generated from exome sequencing on an Ion Proton machine (TMAP software).

    The bbmap command was:
    /opt/bbmap/callvariants.sh in=IonXpress_009_rawlib.bam,IonXpress_010_rawlib.bam,IonXpress_011_rawlib.bam,IonXpress_012_rawlib.bam ref=./hg19/hg19.fasta out=vars.vcf multisample ploidy=2 prefilter 1> err.txt 2> out.txt &
    I got an error as following (I'm including also the last lines of output):

    Finished second pass.
    Writing output.
    Merging [(IonXpress_009_rawlib, individual_IonXpress_009_rawlib.vcf.gz), (IonXpress_010_rawlib, individual_IonXpress_010_rawlib.vcf.gz), (IonXpress_011_rawlib, individual_IonXpress_011_rawlib.vcf.gz), (IonXpress_012_rawlib, individual_IonXpress_012_rawlib.vcf.gz)]
    Exception in thread "main" java.lang.NumberFormatException: For input string: "162,590"
    	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
    	at java.lang.Float.parseFloat(Float.java:451)
    	at var2.MergeSamples.processHeader(MergeSamples.java:233)
    	at var2.MergeSamples.processRow(MergeSamples.java:177)
    	at var2.MergeSamples.mergeFiles(MergeSamples.java:151)
    	at var2.MergeSamples.mergeSamples(MergeSamples.java:117)
    	at var2.CallVariants2.process(CallVariants2.java:380)
    	at var2.CallVariants2.main(CallVariants2.java:52)
    	at var2.CallVariants.main(CallVariants.java:48)
    I unzipped the vcf files and used grep to find "162,590". I got a match at line 9 of the vcf for one individual:
    The other individuals also had similar formatting:

    This is the full header for barcode 009:
    ##INFO=<ID=SN,Number=1,Type=Integer,Description="Scaffold Number">
    ##INFO=<ID=R1P,Number=1,Type=Integer,Description="Read1 Plus Count">
    ##INFO=<ID=R1M,Number=1,Type=Integer,Description="Read1 Minus Count">
    ##INFO=<ID=R2P,Number=1,Type=Integer,Description="Read2 Plus Count">
    ##INFO=<ID=R2M,Number=1,Type=Integer,Description="Read2 Minus Count">
    ##INFO=<ID=PPC,Number=1,Type=Integer,Description="Paired Count">
    ##INFO=<ID=LS,Number=1,Type=Integer,Description="Length Sum">
    ##INFO=<ID=MQS,Number=1,Type=Integer,Description="MAPQ Sum">
    ##INFO=<ID=MQM,Number=1,Type=Integer,Description="MAPQ Max">
    ##INFO=<ID=BQS,Number=1,Type=Integer,Description="Base Quality Sum">
    ##INFO=<ID=BQM,Number=1,Type=Integer,Description="Base Quality Max">
    ##INFO=<ID=EDS,Number=1,Type=Integer,Description="End Distance Sum">
    ##INFO=<ID=EDM,Number=1,Type=Integer,Description="End Distance Max">
    ##INFO=<ID=IDS,Number=1,Type=Integer,Description="Identity Sum">
    ##INFO=<ID=IDM,Number=1,Type=Integer,Description="Identity Max">
    ##INFO=<ID=MCOV,Number=1,Type=Integer,Description="Minus Coverage">
    ##INFO=<ID=CED,Number=1,Type=Integer,Description="Contig End Distance">
    ##INFO=<ID=HMP,Number=1,Type=Integer,Description="Homopolymer Count">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Fraction">
    ##INFO=<ID=RAF,Number=1,Type=Float,Description="Revised Allele Fraction">
    ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Ref+, Ref-, Alt+, Alt-">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
    ##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele Fraction">
    ##FORMAT=<ID=PF,Number=1,Type=String,Description="Pass Filter">
    Anything I should have done differently to not cause the error?
    Thank you!


    • Oh, that's really strange. It's due to your locale settings, I guess. In the US, decimals are printed like "##readLengthAvg=162.590" rather than "##readLengthAvg=162,590". For some reason Java is using commas as the delimiter when printing the floating point numbers, but not when trying to read floating point numbers, which is extremely irritating.

      This will take a little research to fix (other than you changing your computer's locale to US or something like that); I'll get back to you.


      • @r.rosati: See if this helps. https://docs.oracle.com/cd/E23824_01...033/glmha.html


        • Originally posted by GenoMax View Post
          Thank you... Indeed I ran it on a ssh session from a computer that had pt_BR as LC_NUMERIC.

          I've read Zhou's comment on indels and I thought I'd first run the callvariants script on a test exome from CEPH 1347-02 DNA.
          Should I realign the reads (I'm reading it's advised for non-BBmap aligned reads?) i.e.
          callvariants.sh in=IonXpress_014_rawlib.bam ref=./hg19/hg19.fasta out=CEPH_BBmap_variants_TMAP_BAM_realigned.vcf ploidy=2 realign threads=16 1> err.txt 2> out.txt &


          • Yes, I do recommend "realign" for non-BBMap-aligned reads.


            • BBSplit / BBMap error

              I'm using BBSplit to remove mouse reads in human samples. My command is the following :
              bbsplit.sh ambiguous=best ambiguous2=toss path=$8 maxindel=900000 refstats=stat_alignment_$samplename.txt threads=4 qtrim=f untrim=f in1=$8/trim_R1_$samplename.fq in2=$8/trim_R2_$samplename.fq ref=$2/$1.fa,$2/Grcm38.fa basename=decontamin_$samplename.%_#.fq outu1=unmapped1_$samplename.fq outu2=unmapped2_$samplename.fq

              Which give me the following message where the parameters used are not what I specified :

              java -Djava.library.path=/home/audrey/bbmap/jni/ -ea -Xmx48113m -cp /home/audrey/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 ambiguous=best ambiguous2=toss path=. maxindel=900000 refstats=stat_alignment_L6.txt threads=4 qtrim=f untrim=f in1=./trim_R1_L6.fq in2=./trim_R2_L6.fq ref=/home/audrey/reference_genomes/GATK/hg38.fa,/home/audrey/reference_genomes/GATK/Grcm38.fa basename=decontamin_L6.%_#.fq outu1=unmapped1_L6.fq outu2=unmapped2_L6.fq
              Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ambiguous=best, ambiguous2=toss, path=., maxindel=900000, refstats=stat_alignment_L6.txt, threads=4, qtrim=f, untrim=f, in1=./trim_R1_L6.fq, in2=./trim_R2_L6.fq, ref=/home/audrey/reference_genomes/GATK/hg38.fa,/home/audrey/reference_genomes/GATK/Grcm38.fa, basename=decontamin_L6.%_#.fq, outu1=unmapped1_L6.fq, outu2=unmapped2_L6.fq]

              Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ambiguous=best, ambiguous2=toss, path=., maxindel=900000, refstats=stat_alignment_L6.txt, threads=4, qtrim=f, untrim=f, in1=./trim_R1_L6.fq, in2=./trim_R2_L6.fq, basename=decontamin_L6.%_#.fq, outu1=unmapped1_L6.fq, outu2=unmapped2_L6.fq, ref_hg38=/home/audrey/reference_genomes/GATK/hg38.fa, ref_Grcm38=/home/audrey/reference_genomes/GATK/Grcm38.fa]
              Creating merged reference file ref/genome/1/merged_ref_6706777893850.fa.gz
              Ref merge time: 112.084 seconds.
              Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ambiguous=best, ambiguous2=toss, maxindel=900000, refstats=stat_alignment_L6.txt, threads=4, qtrim=f, untrim=f, in1=./trim_R1_L6.fq, in2=./trim_R2_L6.fq, outu1=unmapped1_L6.fq, outu2=unmapped2_L6.fq, ref=ref/genome/1/merged_ref_6706777893850.fa.gz, out_hg38=decontamin_L6.hg38_#.fq, out_Grcm38=decontamin_L6.Grcm38_#.fq]

              BBMap version 36.92
              Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
              Reference set statistics will be written to stat_alignment_L6.txt
              Set threads to 4
              Retaining first best site only for ambiguous mappings.
              NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
              Writing reference.
              Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_6706777893850.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

              Set genScaffoldInfo=true
              Writing chunk 1
              Writing chunk 2
              Writing chunk 3
              Writing chunk 4
              Writing chunk 5
              Writing chunk 6
              Writing chunk 7
              Writing chunk 8
              Writing chunk 9
              Writing chunk 10
              Writing chunk 11
              Writing chunk 12
              Writing chunk 13
              Set genome to 1

              Loaded Reference: 0.005 seconds.
              Loading index for chunk 1-13, build 1
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr1-3_index_k13_c2_b1.block
              Indexing threads started for block 0-3
              Indexing threads finished for block 0-3
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr4-7_index_k13_c2_b1.block
              Indexing threads started for block 4-7
              Indexing threads finished for block 4-7
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr8-11_index_k13_c2_b1.block
              Indexing threads started for block 8-11
              Indexing threads finished for block 8-11
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr12-13_index_k13_c2_b1.block
              Indexing threads started for block 12-13
              Indexing threads finished for block 12-13
              Generated Index: 571.883 seconds.
              Analyzed Index: 10.807 seconds.
              Reads that map to multiple references will be considered unmapped.
              Started output stream: 6.057 seconds.
              Creating ref-set statistics table: 0.012 seconds.
              Cleared Memory: 4.121 seconds.
              Processing reads in paired-ended mode.
              Started read stream.
              Started 4 mapping threads.

              Could you please explain me how to set and force the parameters?
              Many thanks in advance.


              • Which exact parameters are you referring to?


                • Thank you GenoMax for your answer.
                  I let minratio at the default value. So it should be 0.65 as said in bbsplit help. here it's 0.56.
                  And for the parameters : maxindel, qtrim and untrim, they are both set as the default then as I asked in the same command line. Will it be problematic?
                  Many thanks.


                  • I see. @Brian hopefully will be along later today with an official answer but my guess is that it probably has to do with ambigous2=toss option that you have selected.


                    • And for the parameters : maxindel, qtrim and untrim, they are both set as the default then as I asked in the same command line. Will it be problematic?
                      Later parameters override earlier parameters, so that's not a problem.

                      As for 0.65 versus 0.56, that looks like either a typo, or else I changed the default at some point but forgot to update the help info; I'll fix that. Thanks! In the meantime you can always manually add "minratio=0.65" to override it.

                      Edit: To clarify, 0.56 is the intended behavior, and the description is in error.
                      Last edited by Brian Bushnell; 02-13-2017, 10:18 AM.


                      • Thank you very much GenoMax and Brian for your helpful answers.


                        • Converting a coverage file into a csv format for mmgenome

                          Hello there,

                          Is there a way I could convert the mapping files that bbmap produced into csv coverage file that can be used for the mmgenome software (link: http://madsalbertsen.github.io/mmgen...ad_data.html)?



                          • BBMap can directly output coverage using the "covstats" or "basecov" flags, or it can be generated from sam files with pileup.sh. Can you post a few lines from one of these mmgenome csv files so I can see what the format looks like?


                            • Dear Brian,

                              Thanks for your reply.

                              This is from one of the mapping files (.csv) for mmgenome.

                              [email protected]:~/Downloads$ head C13.11.14.csv
                              "Name","Consensus length","Total read count","Single reads","Reads in pairs","Average coverage","Reference sequence","Reference length","Reference common name","Reference Latin name"
                              "1 mapping","5621","103","19","84","1.443","1","8264","",""
                              "2 mapping","370","6","2","4","0.625","2","1027","",""
                              "3 mapping","1665","198","52","146","13.536","3","1665","",""
                              "4 mapping","94","1","1","0","0.01","4","9056","",""
                              "5 mapping","3042","95","19","76","3.205","5","3343","",""
                              "6 mapping","901","8","2","6","9.663E-3","6","98207","",""
                              "7 mapping","5788","147","11","136","2.612","7","6480","",""
                              "8 mapping","14602","381","47","334","2.782","8","15790","",""
                              "9 mapping","1403","1056","282","774","85.135","9","1403","",""
                              [email protected]:~/Downloads$


                              • Hello,

                                I am new to this forum so I apologize if I posted in the wrong place.

                                Currently I am trying to assemble a single cell genome from some Illumina Hiseq 2x250 bp reads.
                                One of my first steps is to do normalization of the data. Previously I was using bbnorm.sh because it's fast and very good. However, with this data I cannot use bbnorm.sh on our server due to memory issues. (I have around 100 million PE reads).
                                So I tried to use bbnorm on a cluster where I allocated 16 cores and 700GB of RAM.

                                I run the software using this command line:
                                bbnorm.sh in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2

                                The software runs fine, but the process will be killed by the scheduler because of high CPU usage.

                                =>> PBS: job killed: ncpus 59.52 exceeded limit 16 (sum)
                                I tried to increase the number of threads requested to 30, lower the number of threads in the command line to 10, but still I have the same problem.

                                The cluster has installed BBMap version 36.92. It looks like for some reason the number of threads which I define is not taken into account, even if during the run it tells me that the number of threads was set to 12.

                                threads: 12
                                k: 31
                                deterministic: true
                                toss error reads: false
                                passes: 1
                                bits per cell: 16
                                cells: 145.37B
                                hashes: 3
                                base min quality: 5
                                kmer min prob: 0.5

                                target depth: 240
                                min depth: 2
                                max depth: 300
                                min good kmers: 15
                                depth percentile: 64.8
                                ignore dupe kmers: true
                                fix spikes: false
                                Any suggestions? I really cannot ask for the entire node just for me.
                                I understand that due to other components like pigz, it can use more than 12 cores, but here it looks like uses 60 cores, or all available cores.

                                Thank you,


                                Latest Articles


                                • seqadmin
                                  Improved Targeted Sequencing: A Comprehensive Guide to Amplicon Sequencing
                                  by seqadmin

                                  Amplicon sequencing is a targeted approach that allows researchers to investigate specific regions of the genome. This technique is routinely used in applications such as variant identification, clinical research, and infectious disease surveillance. The amplicon sequencing process begins by designing primers that flank the regions of interest. The DNA sequences are then amplified through PCR (typically multiplex PCR) to produce amplicons complementary to the targets. RNA targets...
                                  Yesterday, 01:49 PM
                                • seqadmin
                                  Targeted Sequencing: Choosing Between Hybridization Capture and Amplicon Sequencing
                                  by seqadmin

                                  Targeted sequencing is an effective way to sequence and analyze specific genomic regions of interest. This method enables researchers to focus their efforts on their desired targets, as opposed to other methods like whole genome sequencing that involve the sequencing of total DNA. Utilizing targeted sequencing is an attractive option for many researchers because it is often faster, more cost-effective, and only generates applicable data. While there are many approaches...
                                  03-10-2023, 05:31 AM
                                • seqadmin
                                  Expert Advice on Automating Your Library Preparations
                                  by seqadmin

                                  Using automation to prepare sequencing libraries isn’t a new concept, and most researchers are aware that there are numerous benefits to automating this process. However, many labs are still hesitant to switch to automation and often believe that it’s not suitable for their lab. To combat these concerns, we’ll cover some of the key advantages, review the most important considerations, and get real-world advice from automation experts to remove any lingering anxieties....
                                  02-21-2023, 02:14 PM





                                Topics Statistics Last Post
                                Started by seqadmin, 03-17-2023, 12:32 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 03-15-2023, 12:42 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 03-09-2023, 10:17 AM
                                0 responses
                                1 like
                                Last Post seqadmin  
                                Started by seqadmin, 03-03-2023, 12:03 PM
                                0 responses
                                Last Post seqadmin