Seqanswers Leaderboard Ad

**Brian Bushnell** · 01-25-2017, 10:06 AM

Originally posted by GenoMax View Post

@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.

That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.

**boulund** · 01-26-2017, 02:21 AM

Originally posted by Brian Bushnell View Post

That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.

Perfect, thank you!

**Mat29** · 02-07-2017, 12:15 AM

Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.

**YeastGuy** · 02-07-2017, 08:13 AM

Hi Brian,

I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
Do you have a suggestion how to constrain the alignment within each ORF?

Thank you!

**GenoMax** · 02-07-2017, 08:52 AM

@YeastGuy: Are those sequences in multi-fasta format?

**YeastGuy** · 02-07-2017, 01:18 PM

@GenoMax: Yes, they are.

**Brian Bushnell** · 02-07-2017, 01:19 PM

Originally posted by Mat29 View Post

Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.

Please ensure that your chromosome names do not have spaces in them. If they do, use the "trd" flag when mapping (trim read description), or for simplicity, just process the fasta initially like this:

reformat.sh in=ref.fa out=fixed.fa trd

...then use fixed.fa for mapping and everything else.

**Brian Bushnell** · 02-07-2017, 01:23 PM

Originally posted by YeastGuy View Post

Hi Brian,

I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
Do you have a suggestion how to constrain the alignment within each ORF?

Thank you!

How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.

**YeastGuy** · 02-07-2017, 01:39 PM

Originally posted by Brian Bushnell View Post

How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.

I tried to simply use the multi-fasta format with the hope that the sequences would be seen as seperate and not connected:
>lcl||1|YAL001C|TFC3|1|145
TAGTTACTATGGTCGTTAACGAAATAATATTTCATCCAGGGA
>lcl||2|YAL002W|VPS8|1|145
CTGGTCTGGACCCATTACTTTTTCTAGCTTGGGAAAATGTACAG

But after randomising the order withing the ref file the coverage changed.

**Brian Bushnell** · 02-07-2017, 01:46 PM

Oh, that file should be fine; BBMap sees the individual sequences as unconnected. It's still possible for read 1 to map to one csequence and read 2 to a different sequence, though. The coverage may be different for various reasons, typically due to ambiguity (two different reference sequences that are very similar). You can use "ambig=toss" or "ambig=all" to change the behavior of reads mapping to multiple possible locations. But bear in mind that BBMap is slightly nondeterministic when mapping paired reads.

**YeastGuy** · 02-07-2017, 02:00 PM

Originally posted by Brian Bushnell View Post

Oh, that file should be fine; BBMap sees the individual sequences as unconnected. It's still possible for read 1 to map to one csequence and read 2 to a different sequence, though. The coverage may be different for various reasons, typically due to ambiguity (two different reference sequences that are very similar). You can use "ambig=toss" or "ambig=all" to change the behavior of reads mapping to multiple possible locations. But bear in mind that BBMap is slightly nondeterministic when mapping paired reads.

Okay then, thank you for your quick help! I will try your suggestions.
Now that I know what the problem is, it should be possible to find a solution.

**GenoMax** · 02-07-2017, 03:20 PM

@Brian: What setting (if any) would prevent the paired reads from aligning to two references (maxindel=0 perhaps) in first place?

**shimingt** · 02-08-2017, 02:06 AM

Difficulties using bbmap to generate sorted, indexed bam file

Hello people,

I am trying to generate sorted, indexed bam file with the bbmap tool, but I have an error message.

tanshiming@S620100019205:~/Documents/ACDC-trial/shimingles$ bbmap.sh in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.sam -Xmx30g ref=reformatted-contigs.fasta bamscript=bs.sh; sh bs.sh
java -Djava.library.path=/home/tanshiming/tools/bbmap/jni/ -ea -Xmx30g -cp /home/tanshiming/tools/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.sam -Xmx30g ref=reformatted-contigs.fasta bamscript=bs.sh
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq, in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq, out=mapped.sam, -Xmx30g, ref=reformatted-contigs.fasta, bamscript=bs.sh]

BBMap version 36.02
Retaining first best site only for ambiguous mappings.
Writing reference.
Executing dna.FastaToChromArrays2 [reformatted-contigs.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Set genome to 1

Loaded Reference: 0.084 seconds.
Loading index for chunk 1-1, build 1
No index available; generating from reference genome: /home/tanshiming/Documents/ACDC-trial/shimingles/ref/index/1/chr1_index_k13_c7_b1.block
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 4.146 seconds.
Analyzed Index: 3.074 seconds.
Started output stream: 0.015 seconds.
Cleared Memory: 0.307 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 32 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 13403154 (2943817149 bases)

Mapping: 634.915 seconds.
Reads/sec: 21110.16
kBases/sec: 4636.56

Pairing data: pct reads num reads pct bases num bases

mated pairs: 82.0206% 5496676 85.8858% 2528321686
bad pairs: 1.6892% 113202 1.6858% 49625732
insert size avg: 339.43

Read 1 data: pct reads num reads pct bases num bases

mapped: 85.2928% 5715965 85.4106% 1312819237
unambiguous: 85.0524% 5699850 85.1949% 1309504251
ambiguous: 0.2405% 16115 0.2157% 3314986
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 56.5028% 3786577 57.4090% 882415075
semiperfect site: 57.7478% 3870016 58.6870% 902059233
rescued: 0.7551% 50604

Match Rate: NA NA 95.4737% 1291210481
Error Rate: 32.2729% 1844876 3.9447% 53348516
Sub Rate: 31.0425% 1774536 0.7285% 9852710
Del Rate: 2.5010% 142970 2.9285% 39606112
Ins Rate: 4.1940% 239750 0.2876% 3889694
N Rate: 2.6675% 152488 0.5816% 7866352

Read 2 data: pct reads num reads pct bases num bases

mapped: 85.4634% 5727397 85.6329% 1204639477
unambiguous: 85.1581% 5706936 85.3792% 1201071323
ambiguous: 0.3053% 20461 0.2536% 3568154
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 54.6069% 3659525 55.8135% 785155749
semiperfect site: 55.7207% 3734167 56.9902% 801709710
rescued: 1.0234% 68587

Match Rate: NA NA 95.6569% 1184979976
Error Rate: 34.8039% 1993488 3.7863% 46904205
Sub Rate: 33.7245% 1931661 0.7674% 9506079
Del Rate: 2.2309% 127782 2.7561% 34141479
Ins Rate: 3.7482% 214688 0.2629% 3256647
N Rate: 2.4683% 141381 0.5567% 6896775

Total time: 643.704 seconds.
bs.sh: 2: bs.sh: module: not found
bs.sh: 3: bs.sh: module: not found
Note: This script is designed to run with the amount of memory detected by BBMap.
If Samtools crashes, please ensure you are running on the same platform as BBMap,
or reduce Samtools' memory setting (the -m flag).
Note: Please ignore any warnings about 'EOF marker is absent'; this is a bug in samtools that occurs when using piped input.
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
-@, --threads INT
Set number of sorting and compression threads [1]
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
[E::hts_open_format] fail to open file 'mapped_sorted.bam'
samtools index: failed to open "mapped_sorted.bam": No such file or directory

Could someone please advice me on this?

Thank You.

**GenoMax** · 02-08-2017, 04:20 AM

Do you have samtools installed and available in your $PATH and which version?

If yes then you can create bam files directly in bbmap command and then sort/index them. Following example is for samtools (v.1.3.1).

Code:

bbmap.sh in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.bam -Xmx30g ref=reformatted-contigs.fasta

samtools sort -o sorted.bam mapped.bam
samtools index sorted.bam

**Brian Bushnell** · 02-08-2017, 10:13 AM

Actually, I think the problem in this case is the samtools version. Samtools changed the default command line flags going from 0.x to 1.x, which causes headaches, and I didn't realize this until recently. If you download the latest version of BBMap (36.92), it will look at the version of samtools installed and change the samtools command appropriately.

Alternately, you can just run these commands manually. For samtools 0.x:

Code:

samtools view -bSh1 mapped.sam.gz | samtools sort -m 1G -@ 4 - mapped_sorted
samtools index mapped_sorted.bam

For samtools 1.x:

Code:

samtools view -bSh1 mapped.sam | samtools sort -m 1G -@ 4 - -o mapped_sorted.bam
samtools index mapped_sorted.bam

The exact amount of memory specified (e.g. 1G) depends on your computer but should be fine in this case.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 24 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News