Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Hi Ruben,

    When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.

    Comment


    • #32
      Originally posted by Brian Bushnell View Post
      Hi Ruben,

      When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
      Perfect! Thanks a lot Brian.

      Comment


      • #33
        Hi, Brian:

        I have run bbsplit to number of samples for binning the rRNA reads. Most of them (total of 172 samples) run fine but some of them gave me strange java errors. I rerun couple of times and every time different sample had this error, so I know its not the fastq file error.

        Any suggestions?

        Thanks

        Jack
        Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
        at fileIO.ReadWrite.readObject(ReadWrite.java:788)
        at fileIO.ReadWrite.read(ReadWrite.java:1154)
        at dna.ChromosomeArray.read(ChromosomeArray.java:65)
        at align2.ChromLoadThread.loadAll(ChromLoadThread.java:50)
        at dna.Data.loadChromosomes(Data.java:272)
        at align2.BBMap.loadIndex(BBMap.java:354)
        at align2.BBMap.main(BBMap.java:33)
        at align2.BBSplitter.main(BBSplitter.java:49)
        Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
        at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
        at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
        at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
        at fileIO.ReadWrite.readObject(ReadWrite.java:783)
        ... 7 more

        Comment


        • #34
          Hi Jack,

          This means that you were running multiple different indexing processes in the same directory at the same time. Unless you use a different directory for each process, or specify a different index location with "path=", or specify a different build number, the indexes can overwrite each other leading to corrupt zip files (which, fortunately, normally get detected, as in this case).

          If you want to do all of these mapping operations to the same references, just index once, wait for it to finish, and then run all the mapping operations without specifying "ref=". E.g.

          Code:
          bbsplit.sh ref=ecoli.fa,salmonella.fa
          
          (wait for finish)
          
          bbsplit.sh a.fq basename=out_a_%.fq
          bbsplit.sh b.fq basename=out_b_%.fq
          bbsplit.sh c.fq basename=out_c_%.fq
          ...etc
          If each one needs different references, then either run them serially, or use a different directory/build each time.

          Comment


          • #35
            Hello Brian,

            Your tool is great and on first sight it seems to work perfectly, but after running several batches of samples I've noticed some quirky behaviour.

            To investigate this, I've run the BBSplit tool several times on the same sample (and deleting the index every time) and it doesn't give me the same results every time. I've pasted the summarised results below and, as you can see, they differ with every run. This certainly isn't what I expected. Mapping the same reference sequences to a sample multiple times should always give me the same results, so something is wrong here.

            The numbers are the the number of reads mapped to a certain reference sequence.
            Code:
            		run1	run2	run3	run4	run5	run6	run7	run8	run9	run10	
            ref_seq_1	4	8	8	8	4	8	8	8	4	8
            ref_seq_2	4	4	4	4	4	4	4	4	4	4
            ref_seq_3	8	4	12	40	16	4	4	4	16	4
            ref_seq_4	24	24	8	4	8	4	24	24	8	4
            ref_seq_5	4	8	12	4	4	40	16	12	4	12
            ref_seq_6	12	8	8	8	4	16	4	32	4	4
            ref_seq_7	4	4	4	4	56	4	56	4	56	4
            ref_seq_8	16	4	4	4	16	4	16	4	16	4
            ref_seq_9	4	4	4	4	4	4	4	4	4	4
            ref_seq_10	20	20	20	20	20	20	20	20	20	20
            Hopefully you're able to tell me what causes this and (even better) solve it easily.

            Kind regards,

            Wouter Lokhorst
            Last edited by GenoMax; 10-02-2017, 06:05 AM.

            Comment


            • #36
              @Wouter: @Brian would be along with an official answer but this may be because most NGS aligners are non-determinstic by default i.e. they may not produce exactly the same answer for each run.

              bbmap.sh provides a "determinisitc" flag that you can set to "true" to get exactly identical results each time. I don't see bbsplit listed as having that flag but it may support it since it is based on bbmap. @Brian will confirm.

              You would also need to consider how similar your references are and how you are choosing to handle multi-mappers (reads which map across genomes). Are you leaving the ambiguous2= option default. It may help to post the entire bbsplit command you are using for this analysis.

              Comment


              • #37
                @GenoMax: I did not consider the non-deterministic behaviour of mappers, however BBMap should still be deterministic in my case, since I have single-end reads. See copied text from the BBMap help section below. I could still try to explicitly call this though.

                Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded.
                My references are very similar in some cases (e.g. different strains of the same species) and I'm leaving the ambiguous parameters untouched.

                Commands:
                bash ~/bbmap/bbsplit.sh ref={}
                # ref= a folder containing my reference sequences
                bash ~/bbmap/bbsplit.sh in={} basename={}/out_%.fq
                # in= sample, basename= output folder and output filename

                Comment


                • #38
                  You could try changing the ambiguous2= parameter and see how that changes the result. "toss/all" may generate results that look more similar. With very similar genomes you are going to have trouble assigning reads reproducibly. How long are these reads?

                  Comment


                  • #39
                    Neither deterministic=t nor ambiguous2=all helped change the results. Unless you or Brian know a way of solving this, I'm sorry to say that I'll try using another tool (or attempt writing something of my own).

                    Reads are 100bp.

                    Comment


                    • #40
                      That's odd, if you are not intentionally producing random assignments for sequences. Can you please post your full command line? Also, are the reads paired or unpaired?

                      Comment


                      • #41
                        @Brian: Reads are single end 100 bp. Commands used are in post #37

                        Comment


                        • #42
                          The numbers of mapped reads do not add up to the same record in each case... it seems like the program is crashing and not processing all the reads, or else the input is different. The output is deterministic for single-ended reads so I can't see any other explanation. Does the program print an error message?

                          Comment


                          • #43
                            No error messages are produced, but you might be right about it crashing. (It also crashes when I put ./ in front of the input file.) Maybe you can see what goes wrong from the output text, so here it is:
                            Code:
                            BBMap version 37.53
                            Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
                            Retaining first best site only for ambiguous mappings.
                            NOTE:   Ignoring reference file because it already appears to have been processed.
                            NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
                            Set genome to 1
                            
                            Loaded Reference:       1.506 seconds.
                            Loading index for chunk 1-1, build 1
                            Generated Index:        1.482 seconds.
                            Analyzed Index:         3.715 seconds.
                            Cleared Memory:         6.933 seconds.
                            Processing reads in single-ended mode.
                            Started read stream.
                            Started 48 mapping threads.
                            Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
                            
                               ------------------   Results   ------------------
                            
                            Genome:                 1
                            Key Length:             13
                            Max Indel:              20
                            Minimum Score Ratio:    0.56
                            Mapping Mode:           normal
                            Reads Used:             40428   (3919820 bases)
                            
                            Mapping:                13.209 seconds.
                            Reads/sec:              3060.74
                            kBases/sec:             296.76
                            
                            
                            Read 1 data:            pct reads       num reads       pct bases          num bases
                            
                            mapped:                   4.5464%            1838         4.3524%             170607
                            unambiguous:              1.2120%             490         1.2217%              47890
                            ambiguous:                3.3343%            1348         3.1307%             122717
                            low-Q discards:           0.0000%               0         0.0000%                  0
                            
                            perfect best site:        0.7173%             290         0.6776%              26561
                            semiperfect site:         0.7297%             295         0.6905%              27066
                            
                            Match Rate:                   NA               NA        91.4514%             156445
                            Error Rate:              42.4134%            1543         8.4773%              14502
                            Sub Rate:                42.1935%            1535         7.7799%              13309
                            Del Rate:                 2.2265%              81         0.2701%                462
                            Ins Rate:                 4.1506%             151         0.4273%                731
                            N Rate:                   0.2474%               9         0.0713%                122
                            
                            Total time:             27.078 seconds.

                            Comment


                            • #44
                              Do you only have 40428 reads in the file? 48 threads seems like an overkill for this amount of reads. Can you use 4 and see if that helps? You can also delete the previous bbmap index files and force the program to create new ones using ref=.

                              Adding relative paths before the file names should not cause the program to crash. It never has for me.

                              Comment


                              • #45
                                40428 reads indeed. The number of threads shouldn't matter for the succes or failure of the process. However, I did run it on 4 threads for you, so here are the results:
                                Code:
                                BBMap version 37.53
                                Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
                                Set threads to 4
                                Retaining first best site only for ambiguous mappings.
                                NOTE:   Ignoring reference file because it already appears to have been processed.
                                NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
                                Set genome to 1
                                
                                Loaded Reference:       1.647 seconds.
                                Loading index for chunk 1-1, build 1
                                Generated Index:        1.542 seconds.
                                Analyzed Index:         3.989 seconds.
                                Cleared Memory:         6.978 seconds.
                                Processing reads in single-ended mode.
                                Started read stream.
                                Started 4 mapping threads.
                                Detecting finished threads: 0, 1, 2, 3
                                
                                   ------------------   Results   ------------------
                                
                                Genome:                 1
                                Key Length:             13
                                Max Indel:              20
                                Minimum Score Ratio:    0.56
                                Mapping Mode:           normal
                                Reads Used:             40428   (3919820 bases)
                                
                                Mapping:                10.603 seconds.
                                Reads/sec:              3812.77
                                kBases/sec:             369.68
                                
                                
                                Read 1 data:            pct reads       num reads       pct bases          num bases
                                
                                mapped:                   4.5464%            1838         4.3524%             170607
                                unambiguous:              1.2120%             490         1.2217%              47890
                                ambiguous:                3.3343%            1348         3.1307%             122717
                                low-Q discards:           0.0000%               0         0.0000%                  0
                                
                                perfect best site:        0.7173%             290         0.6776%              26561
                                semiperfect site:         0.7297%             295         0.6905%              27066
                                
                                Match Rate:                   NA               NA        91.4514%             156445
                                Error Rate:              42.4134%            1543         8.4773%              14502
                                Sub Rate:                42.1935%            1535         7.7799%              13309
                                Del Rate:                 2.2265%              81         0.2701%                462
                                Ins Rate:                 4.1506%             151         0.4273%                731
                                N Rate:                   0.2474%               9         0.0713%                122
                                
                                Total time:             25.038 seconds.
                                This also includes removal of the ref folder (which I do every time before testing the next run) and re-creation with the first command I've given you before.

                                Comment

                                Working...
                                X