Unconfigured Ad

**westerman** · 08-01-2014, 10:16 AM

I haven't used Tophat in a while but I thought that it could use any reference library -- genome or amplicon. The underlying program is bowtie2 which certainly handles odd-ball references.

**Brian Bushnell** · 08-01-2014, 10:27 AM

Jeff,

BBMerge has the capability to do this from custom headers generated by BBMap, but currently it is not easy; I will update it to make it simple (probably later today; I'll let you know). Normal fastq headers do NOT contain the relevant information, and sam files don't provide any guarantee of read ordering, so it's nontrivial.

**jstrohm** · 08-01-2014, 11:11 AM

Thanks guys! Worst case scenario I'll just have to throw out my reverse reads. Compared with read lengths three years ago, 250bp isn't awful anyway...

What are your thoughts on quality filtering, clustering and then BLASTing the forward and reverse reads separately. Then merging the taxonomic results?

**Brian Bushnell** · 08-01-2014, 12:42 PM

I've uploaded a new version of BBTools; you can merge them like this:

bbmap.sh -Xmx7g ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq nodisk po int rbm don

bbmerge.sh in=mapped.fq out=merged_by_mapping.fq int usemapping parsecustom

bbmerge.sh in=unmapped.fq out=merged_by_overlap.fq outu=unmerged.fq int

Then you can concatenate the merged_by_mapping.fq and merged_by_overlap.fq files to get a single file.

Note that these commands all assume you are using a single interleaved file. For two files, you can use the "in1", "in2", "out1", and "out2" flags.

**jstrohm** · 08-01-2014, 12:49 PM

Excellent! Thank you

**jstrohm** · 08-01-2014, 01:02 PM

How would you like me to cite BBtools?

**Brian Bushnell** · 08-01-2014, 03:02 PM

The website address is fine.

**jstrohm** · 09-10-2014, 03:44 PM

Hi Brian,

I finally got around to trying this bbmap code that you suggested.

I'm trying to merge fungal ITS1 sequences. My two Illumina read files each contain ~340,000 reads (173MB total size). My reference set is the current fasta release of the UNITE database with ~44,000 sequences (31 MB in size).

I'm running the code in a virtualized (Virtualbox) version of Ubuntu Linux on a PC with 8GB of RAM and 8 processors.

It's currently using 22% of CPU resources and 80MB of RAM.

It's been running for an hour now... Is there any way to speed things up? I've been using UPARSE for my other reads and all steps finish in a manner of seconds.

The output files are 5MB (mapped) and 3MB (unmapped) in size so far.

Here is the command that I used and the output that has been generated so far:
qiime@qiime-VirtualBox:~/bbmap$ ./bbmap.sh -Xmx7g ref=/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta in1=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R1.fastq in2=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R2.fastq outm1=mapped_R1.fq outu1=unmapped_R1.fq outm2=mapped_R2.fq outu2=unmapped_R2.fq nodisk po int rbm don
java -ea -Xmx7g -cp /home/qiime/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx7g ref=/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta in1=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R1.fastq in2=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R2.fastq outm1=mapped_R1.fq outu1=unmapped_R1.fq outm2=mapped_R2.fq outu2=unmapped_R2.fq nodisk po int rbm don
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, -Xmx7g, ref=/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta, in1=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R1.fastq, in2=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R2.fastq, outm1=mapped_R1.fq, outu1=unmapped_R1.fq, outm2=mapped_R2.fq, outu2=unmapped_R2.fq, nodisk, po, int, rbm, don]

BBMap version 33.21
Set INTERLEAVED to true
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Reset INTERLEAVED to false because paired input files were specified.
Set genome to 1

Loaded Reference: 1.142 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 7.765 seconds.
Analyzed Index: 6.103 seconds.
Started output stream: 2.810 seconds.
Started output stream: 0.001 seconds.
Cleared Memory: 0.166 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 1 mapping thread.

MANY THANKS!

**Brian Bushnell** · 09-10-2014, 04:08 PM

Jeff,

That's strange. BBMap is usually very fast with such a small input file, but if the reference is extremely repetitive, it can slow down a lot. You can speed it up by adding the flags "fast" and "t=8". "t=8" forces it to use 8 threads; that's not normally needed because the number of cores is autodetected, but it looks like the virtualization somehow interfered with that. "fast" decreases sensitivity slightly, and typically doubles the speed.

But if the speed is still a problem with those settings, you can merge by overlap with BBMerge first and then only map the unmerged reads, which should vastly reduce the number that need to be mapped; THEN merge the mapped reads by mapping information. BBMerge will only take a few seconds. The modified commands would be:

bbmerge.sh in=reads.fq out=merged_by_overlap.fq outu=unmerged_A.fq t=8

bbmap.sh -Xmx7g ref=reference.fa in=unmerged_A.fq outm=mapped.fq outu=unmapped.fq nodisk po int rbm don fast t=8

bbmerge.sh in=mapped.fq out=merged_by_mapping.fq outu=unmerged_B.fq int usemapping parsecustom int t=8

So ultimately your merged reads will be in the two 'merged' files and your unmerged reads will be in unmerged_B.fq (interleaved).

**jstrohm** · 09-10-2014, 04:44 PM

Thanks Brian, I tried it with 8 threads and it seems just as slow. Can I just run bbmerge.sh from the Windows command line instead to avoid any issues with virtualization?

**Brian Bushnell** · 09-10-2014, 07:32 PM

Jeff,

You can run BBMap and BBMerge in Windows, you just can't use the shellscript. So:

bbmerge.sh
becomes
java -ea -Xmx200m -cp path\to\bbmap\current\ jgi.BBMerge

and
bbmap.sh
becomes
java -ea -Xmx7g -cp path\to\bbmap\current\ align2.BBMap

**sphil** · 09-11-2014, 12:32 AM

Originally posted by jstrohm View Post

Thanks Brian, I tried it with 8 threads and it seems just as slow. Can I just run bbmerge.sh from the Windows command line instead to avoid any issues with virtualization?

That you can't use more cores might have something to do with the setup of your virtual machine. If you only hand through one core, it is obvious that you can't work with 8.... Check on that if you want to use your virtual machine instead Windows...

**jstrohm** · 09-11-2014, 06:15 AM

OK! It's running through windows now. It's using 99% of CPU resources and 1GB of RAM. It's still running after an hour though...

Previously, I tried merging these reads with UPARSE and essentially none of them overlap leading me to try this method. Is there any reason why that would slow things down? I didn't actually try the "fast" command yet. You said it slightly decreases sensitivity so I wanted to see if I could get away without it.

I may be forced to give up and just use the 250bp forward reads. Anyway... thanks a bunch for your efforts and assistance!

With regards to my VirtualBox, I used to run scripts that would auto-detect PC resources. The OS has been acting a little strange since I upgraded to the latest version of Ubuntu. It's probably time for a fresh install, but that would require some serious moving about of files...

**jstrohm** · 09-11-2014, 08:40 AM

Hang on! I made a rookie mistake... I have a quad core processor with 8 "logical processors". So I can't actually run 8 parallel jobs or threads. I dialed it back to t=4 and the job ran in 10 minutes!

When I selected t=8 I had to abort it after three hours.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, Yesterday, 11:08 AM	0 responses 7 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 53 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Non-overlapping read assembly

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News