Seqanswers Leaderboard Ad

**Brian Bushnell** · 03-05-2017, 07:57 PM

Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.

**RubenD** · 03-05-2017, 08:32 PM

Originally posted by Brian Bushnell View Post

Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.

Perfect! Thanks a lot Brian.

**xc611** · 07-16-2017, 05:55 AM

Hi, Brian:

I have run bbsplit to number of samples for binning the rRNA reads. Most of them (total of 172 samples) run fine but some of them gave me strange java errors. I rerun couple of times and every time different sample had this error, so I know its not the fastq file error.

Any suggestions?

Thanks

Jack
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
at fileIO.ReadWrite.readObject(ReadWrite.java:788)
at fileIO.ReadWrite.read(ReadWrite.java:1154)
at dna.ChromosomeArray.read(ChromosomeArray.java:65)
at align2.ChromLoadThread.loadAll(ChromLoadThread.java:50)
at dna.Data.loadChromosomes(Data.java:272)
at align2.BBMap.loadIndex(BBMap.java:354)
at align2.BBMap.main(BBMap.java:33)
at align2.BBSplitter.main(BBSplitter.java:49)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
at fileIO.ReadWrite.readObject(ReadWrite.java:783)
... 7 more

**Brian Bushnell** · 07-17-2017, 08:52 AM

Hi Jack,

This means that you were running multiple different indexing processes in the same directory at the same time. Unless you use a different directory for each process, or specify a different index location with "path=", or specify a different build number, the indexes can overwrite each other leading to corrupt zip files (which, fortunately, normally get detected, as in this case).

If you want to do all of these mapping operations to the same references, just index once, wait for it to finish, and then run all the mapping operations without specifying "ref=". E.g.

Code:

bbsplit.sh ref=ecoli.fa,salmonella.fa

(wait for finish)

bbsplit.sh a.fq basename=out_a_%.fq
bbsplit.sh b.fq basename=out_b_%.fq
bbsplit.sh c.fq basename=out_c_%.fq
...etc

If each one needs different references, then either run them serially, or use a different directory/build each time.

**wlokhorst** · 10-02-2017, 03:01 AM

Hello Brian,

Your tool is great and on first sight it seems to work perfectly, but after running several batches of samples I've noticed some quirky behaviour.

To investigate this, I've run the BBSplit tool several times on the same sample (and deleting the index every time) and it doesn't give me the same results every time. I've pasted the summarised results below and, as you can see, they differ with every run. This certainly isn't what I expected. Mapping the same reference sequences to a sample multiple times should always give me the same results, so something is wrong here.

The numbers are the the number of reads mapped to a certain reference sequence.

Code:

		run1	run2	run3	run4	run5	run6	run7	run8	run9	run10	
ref_seq_1	4	8	8	8	4	8	8	8	4	8
ref_seq_2	4	4	4	4	4	4	4	4	4	4
ref_seq_3	8	4	12	40	16	4	4	4	16	4
ref_seq_4	24	24	8	4	8	4	24	24	8	4
ref_seq_5	4	8	12	4	4	40	16	12	4	12
ref_seq_6	12	8	8	8	4	16	4	32	4	4
ref_seq_7	4	4	4	4	56	4	56	4	56	4
ref_seq_8	16	4	4	4	16	4	16	4	16	4
ref_seq_9	4	4	4	4	4	4	4	4	4	4
ref_seq_10	20	20	20	20	20	20	20	20	20	20

Hopefully you're able to tell me what causes this and (even better) solve it easily.

Kind regards,

Wouter Lokhorst

**GenoMax** · 10-02-2017, 04:03 AM

@Wouter: @Brian would be along with an official answer but this may be because most NGS aligners are non-determinstic by default i.e. they may not produce exactly the same answer for each run.

bbmap.sh provides a "determinisitc" flag that you can set to "true" to get exactly identical results each time. I don't see bbsplit listed as having that flag but it may support it since it is based on bbmap. @Brian will confirm.

You would also need to consider how similar your references are and how you are choosing to handle multi-mappers (reads which map across genomes). Are you leaving the ambiguous2= option default. It may help to post the entire bbsplit command you are using for this analysis.

**wlokhorst** · 10-02-2017, 04:22 AM

@GenoMax: I did not consider the non-deterministic behaviour of mappers, however BBMap should still be deterministic in my case, since I have single-end reads. See copied text from the BBMap help section below. I could still try to explicitly call this though.

Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded.

My references are very similar in some cases (e.g. different strains of the same species) and I'm leaving the ambiguous parameters untouched.

Commands:
bash ~/bbmap/bbsplit.sh ref={}
# ref= a folder containing my reference sequences
bash ~/bbmap/bbsplit.sh in={} basename={}/out_%.fq
# in= sample, basename= output folder and output filename

**GenoMax** · 10-02-2017, 06:00 AM

You could try changing the ambiguous2= parameter and see how that changes the result. "toss/all" may generate results that look more similar. With very similar genomes you are going to have trouble assigning reads reproducibly. How long are these reads?

**wlokhorst** · 10-02-2017, 08:34 AM

Neither deterministic=t nor ambiguous2=all helped change the results. Unless you or Brian know a way of solving this, I'm sorry to say that I'll try using another tool (or attempt writing something of my own).

Reads are 100bp.

**Brian Bushnell** · 10-02-2017, 03:26 PM

That's odd, if you are not intentionally producing random assignments for sequences. Can you please post your full command line? Also, are the reads paired or unpaired?

**GenoMax** · 10-02-2017, 06:06 PM

@Brian: Reads are single end 100 bp. Commands used are in post #37

**Brian Bushnell** · 10-04-2017, 09:24 AM

The numbers of mapped reads do not add up to the same record in each case... it seems like the program is crashing and not processing all the reads, or else the input is different. The output is deterministic for single-ended reads so I can't see any other explanation. Does the program print an error message?

**wlokhorst** · 10-05-2017, 12:31 AM

No error messages are produced, but you might be right about it crashing. (It also crashes when I put ./ in front of the input file.) Maybe you can see what goes wrong from the output text, so here it is:

Code:

BBMap version 37.53
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Retaining first best site only for ambiguous mappings.
NOTE:   Ignoring reference file because it already appears to have been processed.
NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:       1.506 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        1.482 seconds.
Analyzed Index:         3.715 seconds.
Cleared Memory:         6.933 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 48 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              20
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             40428   (3919820 bases)

Mapping:                13.209 seconds.
Reads/sec:              3060.74
kBases/sec:             296.76


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   4.5464%            1838         4.3524%             170607
unambiguous:              1.2120%             490         1.2217%              47890
ambiguous:                3.3343%            1348         3.1307%             122717
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        0.7173%             290         0.6776%              26561
semiperfect site:         0.7297%             295         0.6905%              27066

Match Rate:                   NA               NA        91.4514%             156445
Error Rate:              42.4134%            1543         8.4773%              14502
Sub Rate:                42.1935%            1535         7.7799%              13309
Del Rate:                 2.2265%              81         0.2701%                462
Ins Rate:                 4.1506%             151         0.4273%                731
N Rate:                   0.2474%               9         0.0713%                122

Total time:             27.078 seconds.

**GenoMax** · 10-05-2017, 03:09 AM

Do you only have 40428 reads in the file? 48 threads seems like an overkill for this amount of reads. Can you use 4 and see if that helps? You can also delete the previous bbmap index files and force the program to create new ones using ref=.

Adding relative paths before the file names should not cause the program to crash. It never has for me.

**wlokhorst** · 10-05-2017, 03:29 AM

40428 reads indeed. The number of threads shouldn't matter for the succes or failure of the process. However, I did run it on 4 threads for you, so here are the results:

Code:

BBMap version 37.53
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 4
Retaining first best site only for ambiguous mappings.
NOTE:   Ignoring reference file because it already appears to have been processed.
NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:       1.647 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        1.542 seconds.
Analyzed Index:         3.989 seconds.
Cleared Memory:         6.978 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 4 mapping threads.
Detecting finished threads: 0, 1, 2, 3

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              20
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             40428   (3919820 bases)

Mapping:                10.603 seconds.
Reads/sec:              3812.77
kBases/sec:             369.68


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   4.5464%            1838         4.3524%             170607
unambiguous:              1.2120%             490         1.2217%              47890
ambiguous:                3.3343%            1348         3.1307%             122717
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        0.7173%             290         0.6776%              26561
semiperfect site:         0.7297%             295         0.6905%              27066

Match Rate:                   NA               NA        91.4514%             156445
Error Rate:              42.4134%            1543         8.4773%              14502
Sub Rate:                42.1935%            1535         7.7799%              13309
Del Rate:                 2.2265%              81         0.2701%                462
Ins Rate:                 4.1506%             151         0.4273%                731
N Rate:                   0.2474%               9         0.0713%                122

Total time:             25.038 seconds.

This also includes removal of the ref folder (which I do every time before testing the next run) and re-creation with the first command I've given you before.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 51 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 56 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News