I'd like to introduce a new read-pairing tool, BBMerge, and present a comparison with some similar tools, FLASH, PEAR, COPE, and fastq-join. BBMerge has actually existed since Aug. 2012 but I only recently got around to benchmarking it.
First, what does BBMerge do? Illumina paired reads can be intentionally generated with an insert size (the length of the template molecule) that is shorter than the sum of the lengths of read 1 and read 2. For example, at JGI, the bulk of our data is 2x150bp fragment libraries designed to have an average insert size of 270bp. Because the reads overlap, the pair can be merged into a single, longer read. Single reads are easier to assemble, and longer reads give a better assembly; furthermore, the merging process can detect or correct errors where the reads overlap, yielding a lower error rate than the raw reads. But if the reads are merged incorrectly, yielding a single read with a different length than the actual insert size of the original molecule, errors will be added. So for this process to be useful, it is crucial that reads be merged conservatively enough that the advantages outweigh the disadvantages from incorrect merges.
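As a quick worked example (my own illustration, using the numbers above): with 2x150bp reads and a 270bp insert, the overlap is 150 + 150 - 270 = 30bp, and the merged read is 270bp long, spanning the full insert.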
On to the comparison. I generated 4 million synthetic 2x150bp read pairs from E. coli, with errors mimicking the Illumina quality profile. The insert size mean was 270bp and the standard deviation was 30bp, with the minimum and maximum insert sizes capped at 3 standard deviations from the mean. The exact command was:
randomreads.sh ref=ecoli_K12.fa reads=4000000 paired interleaved out=reads.fq.gz zl=6 len=150 mininsert=180 maxinsert=360 bell maxq=38 midq=28 minq=18 qvariance=16 ow nrate=0.05 maxns=2 maxnlen=1
Then I renamed these reads with names indicating their insert size, e.g. “insert=191” for a pair with insert size 191bp:
bbrename.sh in=reads.fq.gz int renamebyinsert out1=r1.fq out2=r2.fq
Next I merged the reads with each program, varying the stringency along one parameter.
The FLASH command varied the -x parameter from 0.01 (the minimum allowed) through 0.25 (the default):
/path/FLASH -M 150 -x 0.25 r1.fq r2.fq 1>FLASH.log
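The sweep itself was just this command repeated with different -x values, each run in its own directory to keep FLASH's default output names from colliding; a minimal sketch (the exact list of values here is illustrative, not necessarily the precise set I ran):

for x in 0.01 0.02 0.05 0.10 0.15 0.20 0.25; do
  mkdir -p flash_x$x
  ( cd flash_x$x && /path/FLASH -M 150 -x $x ../r1.fq ../r2.fq 1>FLASH.log )
done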
The BBMerge command was run at 5 different stringency settings, in order of decreasing stringency: "strict", "fast", default, "loose", and "vloose":
java -da -Xmx400m -cp /path/ jgi.BBMerge in1=r1.fq in2=r2.fq out=x.fq minoi=150
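The full sweep was just this command with one stringency keyword appended (the default run has no keyword; I'm assuming here that the keywords can be passed as bare flags, like the other BBTools flags used above):

java -da -Xmx400m -cp /path/ jgi.BBMerge in1=r1.fq in2=r2.fq out=x_default.fq minoi=150
for mode in strict fast loose vloose; do
  java -da -Xmx400m -cp /path/ jgi.BBMerge in1=r1.fq in2=r2.fq out=x_$mode.fq minoi=150 $mode
done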
The PEAR command varied the p-value from 0.0001 to 1 (note that only 5 values were allowed by the program):
pear -n 150 -p 1 -f r1.fq -r r2.fq -o outpear
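A sketch of that sweep (the five p-values listed are the set PEAR accepts, as far as I recall from its documentation; adjust to your version):

for p in 0.0001 0.001 0.01 0.05 1; do
  pear -n 150 -p $p -f r1.fq -r r2.fq -o outpear_p$p
done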
The COPE command varied the -c (minimum match ratio parameter) from 1.0 (no mismatches) to 0.75 (the default):
./cope -a /dev/shm/r1.fq -b /dev/shm/r2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/r1_unconnected.fq -3 /dev/shm/r2_unconnected.fq -m 0 -s 33 -c 1.0 -u 150 -N 0 1>cope.o 2>&1
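The -c sweep looked roughly like this (the exact list of -c values is illustrative):

for c in 1.0 0.95 0.90 0.85 0.80 0.75; do
  ./cope -a /dev/shm/r1.fq -b /dev/shm/r2.fq -o /dev/shm/cmerged_c$c.fq -2 /dev/shm/r1_unconnected_c$c.fq -3 /dev/shm/r2_unconnected_c$c.fq -m 0 -s 33 -c $c -u 150 -N 0 1>cope_c$c.o 2>&1
done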
For fastq-join (from ea-utils), I swept the -p parameter from 0 to 25 and the -m parameter from 4 to 12; only the -p sweep is shown, as it gave better results. Sample command:
fastq-join r1.fq r2.fq -o fqj -p 0
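The -p sweep, sketched (the value list is illustrative, not the exact set of points I plotted):

for p in 0 2 5 10 15 20 25; do
  fastq-join r1.fq r2.fq -o fqj_p$p -p $p
done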
All programs (except fastq-join, which does not support the option) were configured to disallow merged reads with insert sizes shorter than 150bp. This is because FLASH (which appears to be the most popular merger at present) does not seem to be capable of handling reads with insert sizes shorter than read length. If FLASH is forced to allow such overlaps (using the -O flag), the results it produces from such reads are 100% wrong, giving performance that seems bad to the point of being hard to believe, which would undermine this analysis. For that reason I also limited the generated insert sizes to a minimum of 180bp, which FLASH can handle. The output was graded by my grading script to determine the percent correct and incorrect:
grademerge.sh in=x.fq
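For those curious what "grading" means here: the merged reads are checked against the true insert size that bbrename encoded in each read's name. A rough stand-in for that check (my own sketch, not the actual grademerge.sh implementation) could be:

awk 'NR%4==1{n=split($0,a,"insert="); truth=(n>1 ? a[2]+0 : -1)} NR%4==2{if(length($0)==truth) correct++; else incorrect++} END{printf "correct=%d incorrect=%d\n", correct, incorrect}' x.fq

The correct and incorrect percentages are then taken relative to the total number of input pairs.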
Then the results were plotted in a ROC curve, attached. The X-axis is logarithmic in one figure and linear in the other.
FLASH achieves a slightly higher merging rate at its default sensitivity, but does so at the expense of an extremely high false-positive rate. BBMerge has a systematically lower false-positive rate than FLASH at any sensitivity. fastq-join yields the highest merging rate, at the cost of a false-positive rate second only to PEAR's. COPE is very competitive with FLASH and strictly exceeds it with "N=0"; unfortunately it is unstable at that setting. COPE is stable at "N=1" (the default) but is generally inferior to FLASH at that setting. PEAR appears to be inferior to all other options across the board.
Next, I tested the speed. Since BBMerge, FLASH, and PEAR are multithreaded, I tested with various numbers of threads to determine peak performance. The tests were run on a 4x8-core Sandy Bridge E node with no other processes running; it had hyperthreading enabled, for 32 physical cores and 64 hardware threads. Input came from a ramdisk and output went to a ramdisk, so I/O bandwidth was not limiting. The results attached to this post show what happens when input came from local disk and output went to a networked filesystem, which was I/O-limited; the image embedded below reflects the ramdisk run.
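For reference, the thread scaling test was simply repeated timed runs at different thread counts, along these lines (the thread list is illustrative, and I'm assuming BBMerge's t= flag for the thread count, which is how BBTools normally takes it):

for t in 1 2 4 8 16 20 24 32 48 64; do
  ( time bbmerge.sh in1=/dev/shm/r1.fq in2=/dev/shm/r2.fq out=/dev/shm/merged_t$t.fq t=$t ) 2>> speed_t$t.log
done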
FLASH has a slight advantage with a low number of threads, but BBMerge ultimately scales better to reach a higher peak of 235 megabasepairs/s (Mbp/s) versus FLASH’s 160 Mbp/s. BBMerge’s peak came at 20 threads and FLASH’s at 36 threads. 235 Mbp/s in this case equates to 499 megabytes per second and BBMerge was probably input-limited (by CPU usage of the input thread) at that point. PEAR was very slow so I did not run the full suite of speed testing on it, but it achieved 11.8 Mbp/s at 16 threads (on slightly different hardware). The single-threaded programs (COPE and fastq-join) beat PEAR up to around 16 threads, but never touched BBMerge or FLASH. Interestingly, PEAR is the only program that scales well with hyperthreading; I speculate that the program does a lot of floating-point operations.
In conclusion, for this dataset, BBMerge had substantially better specificity at similar sensitivity compared to all competitors. fastq-join had the highest merge rate, by a slight margin, but at the cost of a false-positive rate that I would consider unusable in a production environment. The per-thread speeds of BBMerge and FLASH are similar, with FLASH around 10% faster; but BBMerge appears to be more scalable, and as such reached roughly 1.5x FLASH's peak throughput. So, for this dataset, unless false positives are irrelevant, BBMerge wins.
It appears imprudent to use FLASH (or most other mergers) with the default sensitivity, due to the huge increase in false positives for virtually no gain in true positives. On the ROC curve, FLASH's rightmost point is the default “-x 0.25”, while the third from the right is “-x 0.15”, which gives a much better signal-to-noise ratio of 33.3dB and 71.1% merged compared to the default 28.8dB and 73.8% merged. If you desire the highest merge rate regardless of false-positives, fastq-join appears optimal. For libraries that have all N-containing reads removed, it may be that COPE is slightly better than FLASH, though I have not verified this. If false-positives are detrimental to your project, I recommend BBMerge on default sensitivity, which gave a 42.7dB SNR and 55.8% merge rate for this data. The merge rate can be increased using the "loose" or "vloose" flags, at the expense of more false-positives, but I believe that for most projects, minimizing noise is more important than maximizing signal.
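For context on the dB figures: my reading is that the reported SNR is the ratio of correctly to incorrectly merged reads expressed in decibels, i.e. 10*log10(correct/incorrect); I have not verified the exact formula grademerge.sh uses, so treat this conversion as an assumption. With placeholder counts:

awk -v correct=2000000 -v incorrect=1000 'BEGIN{printf "%.1f dB\n", 10*log(correct/incorrect)/log(10)}'

A 2000:1 ratio works out to about 33 dB, which is the ballpark of the numbers above.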
BBMerge is available with the BBTools package here:
On Linux systems you can simply run the shellscript, like this:
bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq
BBMerge supports interleaved reads, ASCII-33 or ASCII-64 quality encoding, gzip compression, fasta, scarf, and various other file formats. It also produces handy insert-size distribution histograms. Everything in the BBTools package is compatible with any OS that supports Java - Linux, Windows, Mac, even cell phones. The shellscripts may only work in Linux, but the actual programs will run anywhere. For more details, run the shellscript with no arguments, or post a question here.
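For example, to also write the insert-size histogram (I believe the flag is ihist=, but confirm against the no-argument help output):

bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq ihist=ihist.txt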
EDIT: Results updated March 25, 2015, on a poster: