Seqanswers Leaderboard Ad

**luc** · 02-23-2014, 04:17 PM

Dear Brian,

thanks a lot for posting your aligner package.
Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)?
Would you recommend any specific settings for the alignment of RNA seq reads?

**Brian Bushnell** · 02-24-2014, 10:09 AM

Originally posted by luc View Post

Dear Brian,

thanks a lot for posting your aligner package.
Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)?
Would you recommend any specific settings for the alignment of RNA seq reads?

Luc,

Unfortunately, the limit is still 6kb. I am working on increasing that but it may take a while. For PacBio reads, you should use the script mapPacBio8k.sh (which despite the name only goes up to 6k), and the PacBio reads should be in fasta format. Reads in fasta format that are longer than 6000bp will be automatically broken into 6000bp pieces and their names appended with "_1", "_2", and so forth. But the vast majority of the PacBio reads will be under 6kbp and thus won't be fragmented.

For any Illumina reads, including merged Miseq, use bbmap.sh. It handles reads up to 500bp. The only exception is if you want to map Illumina reads to PacBio reads (for error correction), in which case you should use mapPacBio.

bbmap.sh (which calls BBMap.class) has mutation penalties tuned for Illumina reads, while mapPacBio8k.sh (which calls BBMapPacBio.class) has mutation penalties tuned for PacBio reads and a longer maximum read length, and is slower. You can still use mapPacBio8k.sh for any really long reads, though, like when mapping one assembly to another.

For RNA-seq (using Illumina reads), I would recommend a command line like this:

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam maxindel=100000 xstag=firststrand intronlen=10 ambig=random

This will look for introns up to 100kbp long, and flag every gap longer than 10bp with cigar symbol 'N' (skipped) rather than 'D' (deletion). Ambiguously-mapped reads will go to a random top-scoring site, which is generally best for quantification. XS tags, needed by Cufflinks, will be produced according to the "first strand" library protocol (if you don't know what protocol is used, then set it to xstag=unstranded).

The 'maxindel' flag can be increased or decreased as needed. Many plants and fungi tend to have very short introns mostly under 400bp, while human average is much higher and has a few that are even longer than 100kbp. It should be set to above the size of the largest introns you expect, but higher values decrease accuracy, so I don't recommend setting it above 500000 or so. The default is 16000 which is fine for the plants and fungi I've worked with but if you are working with vertebrates then 200000 would probably be appropriate.

**luc** · 02-24-2014, 02:12 PM

Thanks a lot for the advice!
I will give it a try.

**Brian Bushnell** · 02-25-2014, 05:28 PM

Originally posted by luc View Post

Thanks a lot for the advice!
I will give it a try.

You're welcome. By the way, I added a java 6-compiled package to the Sourceforge page (and a newer version of the java 7-compiled package). These may also address some other issues; e.g., some shellscripts had Windows-style newlines (\n\r), and I changed them all to Unix-style newlines (\n). Some shellscript parsers choke on \n\r.

**emixaM** · 03-07-2014, 02:38 AM

I would like to try BBMap on human exomes, but I have a strange output, for the reference indexing (it goes only to chr7), therefore the rest of the computation fails:

$ /home/max/appli/bbmap/bbmap_31.32/bbmap.sh -Xmx40G build=1 ref=/home/max/reference/ucsc.hg19.fasta
This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
java -ea -Xmx40G -cp /home/max/appli/bbmap/bbmap_31.32/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 -Xmx40G build=1 ref=/home/max/reference/ucsc.hg19.fasta
Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, -Xmx40G, build=1, ref=/home/max/reference/ucsc.hg19.fasta]

BBMap version 31.32
Set OVERWRITE to true
Cigar strings enabled.
Retaining first best site only for ambiguous mappings.
No output file.
NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
NOTE: Deleting contents of ref/index/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays [/home/max/reference/ucsc.hg19.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing ref/genome/1/chr1.chrom.gz
Writing ref/genome/1/chr2.chrom.gz
Writing ref/genome/1/chr3.chrom.gz
Writing ref/genome/1/chr4.chrom.gz
Writing ref/genome/1/chr5.chrom.gz
Writing ref/genome/1/chr6.chrom.gz
Writing ref/genome/1/chr7.chrom.gz
Set genome to 1

Loaded Reference: 0,011 seconds.
Loading index for chrom 1-7, genome 1
No index available; generating from reference genome: /home/max/data/test/ref/index/1/chr1-3_index_k13_c2_b1.block
No index available; generating from reference genome: /home/max/data/test/ref/index/1/chr4-7_index_k13_c2_b1.block
Indexing threads started for block 4-7
Indexing threads started for block 0-3
Indexing threads finished for block 0-3
Indexing threads finished for block 4-7
Generated Index: 176,884 seconds.
Finished Writing: 4,670 seconds.
No reads to process; quitting.

Total time: 230,800 seconds.

And then the mapping:

$ /home/max/appli/bbmap/bbmap_31.32/bbmap.sh -Xmx40G build=1 in=test.bam out=test_bbmap.bam
This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
java -ea -Xmx40G -cp /home/max/appli/bbmap/bbmap_31.32/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 -Xmx40G build=1 in=test.bam out=test_bbmap.bam
Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, -Xmx40G, build=1, in=test.bam, out=test_bbmap.bam]

BBMap version 31.32
Set OVERWRITE to true
Cigar strings enabled.
Retaining first best site only for ambiguous mappings.
Set genome to 1

Loaded Reference: 4,994 seconds.
Loading index for chrom 1-7, genome 1
Generated Index: 10,028 seconds.
Analyzed Index: 9,399 seconds.
Could not find samtools.
Started output stream: 0,049 seconds.
Cleared Memory: 0,441 seconds.
Processing reads in single-ended mode.
Started read stream.
Exception in thread "Thread-45" java.lang.AssertionError: 57 = i
array=�BC��R��UR�t��l�� iqp.)�C�9\�ʣ�1�
�rX�[�%}hW�|t��NO_�n��w�݇��i�+}z~z�+��O��oηw�v��쨻Y*��%d� �� {@}, start=44, stop=141
at align2.Tools.parseInt(Tools.java:1097)
at stream.SamLine.<init>(SamLine.java:1242)
at stream.SamReadInputStream.toReadList(SamReadInputStream.java:130)
at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:102)
at stream.SamReadInputStream.hasMore(SamReadInputStream.java:58)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:717)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:709)
Started 32 mapping threads.

And nothing goes on.

**westerman** · 03-07-2014, 08:48 AM

Brian: Trying out BBMap and I am having problems indexing the reference. My species is Pig which has the following Fasta entry lines in the genome.fa file:

grep '>' ../References/GENOME_DIR/genome.fa
>10
>11
>12
>13
>14
>15
>16
>17
>18
>1
>2
>3
>4
>5
>6
>7
>8
>9
>MT
>X
>Y

However it appears that BBMap is only creating 6 chromosomes as follows:

Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=../References/GENOME_DIR/genome.fa]

BBMap version 31.32
Set OVERWRITE to true
Cigar strings enabled.
Retaining first best site only for ambiguous mappings.
No output file.
Writing reference.
Executing dna.FastaToChromArrays [../References/GENOME_DIR/genome.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing ref/genome/1/chr1.chrom.gz
Writing ref/genome/1/chr2.chrom.gz
Writing ref/genome/1/chr3.chrom.gz
Writing ref/genome/1/chr4.chrom.gz
Writing ref/genome/1/chr5.chrom.gz
Writing ref/genome/1/chr6.chrom.gz
Set genome to 1

Loaded Reference: 0.008 seconds.
Loading index for chrom 1-6, genome 1
No index available; generating from reference genome: /group/gcore/machine/sbs/run00289/131127_M00657_0033_000000000-A6EDC/hr00671_and-yanhua/006356_-945-Flag-1-946-/ref/index/1/chr1-3_index_k13_c2_b1.block
No index available; generating from reference genome: /group/gcore/machine/sbs/run00289/131127_M00657_0033_000000000-A6EDC/hr00671_and-yanhua/006356_-945-Flag-1-946-/ref/index/1/chr4-6_index_k13_c2_b1.block
Indexing threads started for block 0-3
Indexing threads started for block 4-6
Indexing threads finished for block 4-6
Indexing threads finished for block 0-3
Generated Index: 274.860 seconds.
Finished Writing: 29.271 seconds.
No reads to process; quitting.

Total time: 362.577 seconds.

Any ideas? Feel free to send me email westerman at purdue dot edu

**Brian Bushnell** · 03-07-2014, 09:15 AM

Westerman,

Sorry for the confusion. The files are called "chrom" for legacy reasons, and they don't correspond to chromosomes anymore. I should rename them "chunks". Originally, BBMap wrote one chromosome per chrom file, which is fine for complete reference genomes. But de-novo assemblies can have thousands or millions of scaffolds, so now I bundle them into chunks.

Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes.

**westerman** · 03-07-2014, 10:19 AM

Originally posted by Brian Bushnell View Post

Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes.

So they are. Different mapping and slightly more -- in terms of raw matches -- than tophat. I'll have to do more work in order to see if the matches are accurate but so far it is encouraging.

samtools idxstats BBMapped.bam | sort -n

* 0 0 174134
MT 16613 121832 3149
X 144288218 11619 559
Y 1637716 16 4
1 315321322 31627 2152
2 162569375 33762 1920
3 144787322 40457 3319
4 143465943 49722 2064
5 111506441 27714 1515
6 157765593 60316 2405
7 134764511 27906 2778
8 148491826 14523 1241
9 153670197 26155 1155
10 79102373 6924 601
11 87690581 13340 650
12 63588571 25561 1007
13 218635234 29541 1744
14 153851969 29737 2588
15 157681621 25974 2666
16 86898991 9360 334
17 69701581 23454 913
18 61220071 6550 332

Tophat accepted_hits.bam

MT 16613 85810 0
X 144288218 11197 0
Y 1637716 12 0
1 315321322 35442 0
2 162569375 24092 0
3 144787322 46519 0
4 143465943 43747 0
5 111506441 22186 0
6 157765593 45857 0
7 134764511 18867 0
8 148491826 13938 0
9 153670197 27813 0
10 79102373 4661 0
11 87690581 12493 0
12 63588571 19511 0
13 218635234 33190 0
14 153851969 19774 0
15 157681621 11172 0
16 86898991 14862 0
17 69701581 19776 0
18 61220071 5159 0

**Brian Bushnell** · 03-07-2014, 10:37 AM

By the way, BBMap's default mode is "ambig=best", meaning that any ambiguously-mapped reads will go to the first top-scoring location (ambiguous hits will be annotated "XT:A:R" rather than "XT:A:U", so you will know which ones are ambiguous). This is fine for variant-calling, but if you are doing RNA-seq expression quantification, it's usually better to set "ambig=random" or "ambig=all". Otherwise, if there are two almost-identical genes, all the reads will map to the first one.

Also, for vertebrate RNA-seq, I suggest these settings:
xstag=us
generate XS tags, needed by cufflinks, according to "unstranded" protocol, or if you don't know the protocol. Alternatives are "ss" (second strand) and "fs" (first strand). If you DO know the protocol, please set it as ss or fs.

intronlen=10
All deletions up to this length will be annotated as 'D' in the cigar string; longer ones will be annotated as 'N' (meaning skipped). This may or may not matter to downstream tools. For DNA mapping you don't need this flag.

maxindel=200000
The longest deletion (i.e., intron) that it will look for. May still find longer ones. Around 98% of human introns are shorter than 100kbp. If you do not set this, it will default to 16kbp, which is too short for vertebrate RNA-seq (though lots of plants and fungi have tiny introns of around 200bp).

I'll add this recommendation to the readme for the next release.

**westerman** · 03-07-2014, 11:35 AM

Thanks for the recommendations, especially the ambig= one. The help for bbmap.sh is ... confusing, at least at first glance.

One thing I think is working but giving me an improper warning message is the -Xmx parameter. From starting like so:

bbmap.sh -Xmx30G path=../BBMap_ref in1=$R1 in2=$R2 ...

I get:

This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
java -ea -Xmx30G ....

In other words a warning but Java itself is starting up properly with 30G.

**Brian Bushnell** · 03-07-2014, 11:45 AM

Hmm, that's interesting, thanks for noting it. It's because I first try to detect memory, then parse the -Xmx override. I will reverse the order

I'll try to improve the help information, so any concrete suggestions/complaints would be very helpful.

**westerman** · 03-07-2014, 12:44 PM

As for help, the major problem is the formatting on screens with 80 columns. Wrapping is hard to read plus there is too much information for a quick reference. I suggest trying to limit the help messages to 80 columns and then direct the user to a longer formatted README or extended help. I know that the current help says to look at the README.

Also parsing '-h' on the command line would be nice. While displaying help when no commands are given is good it is also useful to be able to use '-h'.

As an example of the first paragraph, at the moment the help includes:

Code:

[FONT="Courier New"]maxsites=5          Maximum number of total alignments to print per read.  Only relevant when secondary=t.
quickmatch=f        Generate cigar strings more quickly.  Must be true to generate secondary site cigar strings.
keepnames=f         Keep original names of paired reads, rather than ensuring both reads have the same name.[/FONT]

All useful information to be sure but these could shortened and the extended information put into a formatted README. E.g.

Code:

[FONT="Courier New"]maxsites=5          Max. number of total alignments to print
quickmatch=f        Generate cigar strings more quickly.
keepnames=f         Keep original names of paired reads[/FONT]

**luc** · 03-07-2014, 07:07 PM

Hi,

I do get the similar warnings and an exits on a system with more than 200 GB of RAM available (using the Java 6 version).

"This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool."

"Invalid maximum heap size: -Xmx30G
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine."

**Brian Bushnell** · 03-07-2014, 09:44 PM

Originally posted by luc View Post

Hi,

I do get the similar warnings and an exits on a system with more than 200 GB of RAM available (using the Java 6 version).

"This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool."

"Invalid maximum heap size: -Xmx30G
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine."

luc,

That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here.

Westerman,

Thanks for your feedback; I'll simplify the help menus for the next version.

-Brian

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 48 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News