**Brian Bushnell** · 08-10-2016, 10:47 AM

Oh, that's kind of irritating. BBMap as currently structured has a maximum reference sequence length of 500Mbp. I designed it that way because I was unaware of any chromosomes longer than that, and I believed the reason to be that 500Mbp was above the maximum stable length of an individual chromosome... looks like I may have been wrong!

I'll have to think about how to resolve this; there's no simple setting for it. Thanks for bringing it to my attention.

**parulagwl** · 08-10-2016, 11:24 AM

Thank you Brian for the quick response.
We would really appreciate your thoughts/inputs on how to work around our issue.

**GenoMax** · 08-10-2016, 11:27 AM

Purely speculating: I don't know where the centromere is in this chromosome, but you could split it in a region with long stretches of Ns, keeping each piece smaller than 500 Mbp. That way, the chances of reads needing to map across the break would be small.
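A minimal Python sketch of that splitting idea, assuming the chromosome is already loaded as a string. The function name `split_at_n_runs` and the `min_n_run` threshold are hypothetical choices made for illustration and are not part of BBMap; for the case discussed here, `max_len` would be 500,000,000.

```python
import re

def split_at_n_runs(seq, max_len, min_n_run=10):
    """Split seq into pieces no longer than max_len, cutting only inside N runs.

    Returns a list of half-open (start, end) intervals that cover seq.
    Raises ValueError if a piece cannot be kept within max_len.
    """
    # The midpoint of every N run of at least min_n_run bases is a candidate cut.
    cuts = [(m.start() + m.end()) // 2
            for m in re.finditer(r"[Nn]{%d,}" % min_n_run, seq)]
    pieces = []
    start = 0
    while len(seq) - start > max_len:
        # Take the furthest candidate cut that keeps this piece within max_len.
        usable = [c for c in cuts if start < c <= start + max_len]
        if not usable:
            raise ValueError("no N run available to cut after position %d" % start)
        cut = max(usable)
        pieces.append((start, cut))
        start = cut
    pieces.append((start, len(seq)))
    return pieces
```

Each interval can then be written out as its own FASTA record. Coordinates of reads mapped to a later piece need the piece's start offset added back to recover positions on the original chromosome.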

**Thias** · 08-11-2016, 02:28 AM

Just because I sometimes stumble over that issue in tutorials (which don't seem to bother) and also saw it again in the recent question....

I was once taught (and lost points in a test for not knowing it) that using even k-mer sizes is frowned upon. The rationale is that only odd k-mer sizes ensure a k-mer can never be its own reverse complement in the de Bruijn graph. The ambiguity created by palindromic k-mers in the de Bruijn graph supposedly makes its resolution difficult.

So, to settle that question once and for all: does it really have an impact on mapping efficiency if I choose an even k-mer size or its neighboring odd one?

**Brian Bushnell** · 08-11-2016, 09:01 AM

No. The longer the kmer, the greater the speed (and memory consumption); even versus odd is not important.

Additionally, I don't see that even-length kmers cause problems in assembly, either. Genomic palindromes of kmer length or longer cause problems whether you are using an even or odd kmer length. These palindromes always have an even length. Say you have a genomic palindrome of length 22: using K=22, you will not (trivially) be able to resolve it. Nor will you with K=21. You will with K=23, and you will with K=24. It's not clear to me in this situation why K=23 would be preferable to K=24 with regard to palindromes, but K=24 can resolve longer repeats than K=23.

**HESmith** · 08-11-2016, 10:28 AM

Actually, an odd k-mer ensures that the strand orientation can be determined, since the central nucleotide cannot be its own complement (an even k-mer can be a perfect palindrome, identical in both orientations).

But the point about longer k-mers is spot-on.
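The palindrome property discussed above can be checked by brute force. The sketch below is illustrative only (the function names are made up here and come from no tool in this thread): it enumerates every k-mer of a given length and counts those that equal their own reverse complement.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Return the reverse complement of an uppercase ACGT string."""
    return kmer.translate(COMPLEMENT)[::-1]

def count_self_reverse_complement(k):
    """Count k-mers over ACGT that are identical to their reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count
```

For odd k the count is zero, because the middle base would have to be its own complement. For even k the first half of the k-mer can be chosen freely and fixes the second half, so `count_self_reverse_complement(4)` gives 16 out of 256 possible 4-mers.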

**Thias** · 08-12-2016, 01:37 AM

Thanks a lot for your answers! Your replies and examples were really helpful and gave me more insight.

**darthsequencer** · 08-14-2016, 12:14 PM

Hi, I have a couple of questions about the terminology used for retaining ambiguous sites in BBMap.

If "ambiguous=best", does this mean that if there are a bunch of reads all with the same score, only the first match will be retained? Or does it mean that of all the reads mapping above a score cutoff, the first one will be picked?

Along the same lines - for "ambiguous=all" does this mean that if say 5 locations all share the same highest score that they will be reported or does it mean that all locations above the score cutoff will be retained?

**Brian Bushnell** · 08-15-2016, 10:48 AM

"ambiguous=best" is a bit misleading, but it means the genomically first location with a maximum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if, for example, one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough for both to be reported. But if there were a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with identical scores to the best score will be reported.
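As a rough illustration of the two modes, here is a toy Python model. It is not BBMap's code: the function `select_sites`, the (position, score) representation, and the simple rule "keep sites scoring at least `score_ratio` times the best score" are all assumptions made for this sketch, and the 0.95 value is only an example, not a documented default.

```python
def select_sites(sites, mode="best", score_ratio=0.95):
    """Toy model of ambiguous-read handling; sites is a list of (position, score).

    mode="best": return the genomically first site among those with the top score.
    mode="all":  return every site scoring at least score_ratio * top score.
    """
    if not sites:
        return []
    top = max(score for _, score in sites)
    if mode == "best":
        # Tuples compare by position first, so min() picks the leftmost top site.
        return [min(site for site in sites if site[1] == top)]
    if mode == "all":
        return sorted(site for site in sites if site[1] >= score_ratio * top)
    raise ValueError("mode must be 'best' or 'all'")
```

With candidate sites `[(500, 98), (120, 100), (900, 100), (300, 80)]`, the "best" mode keeps only `(120, 100)`; the "all" mode with a ratio of 0.95 keeps the three sites scoring 98 or 100; and a ratio of 1.0 keeps only the two sites that score exactly 100.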

**lankage** · 08-26-2016, 08:58 AM

Hi, Brian

We recently increased our PacBio amplicon size from ~1100 bp to 3 kb. With the smaller amplicon size we were able to map reads to our allele reference sequence library of non-full-length allele sequences using "semiperfectmode" to allow for soft-clipping. I'm now looking to map ~3 kb read sequences obtained from gDNA sequencing to exon reference sequences of ~270 bases apiece, and I am not able to tune the settings to get any mapping results. Is there a way to tune mapPacBio.sh to get hits for regions within long reads to short exon sequences that perfectly match?

**susanklein** · 08-26-2016, 04:42 PM

Hi,

Couldn't you just do it the other way around: have the PacBio reads as the reference and map your short refs to them?

Although I don't understand why your refs are so short.

S.

**Brian Bushnell** · 08-26-2016, 05:01 PM

I agree with Susan. BBMap is a global aligner, and not really designed to map reads to substantially shorter reference sequences. But you could try with the flags "minid=0 local", which might work. Note that "semiperfectmode" will not allow a single mismatch or indel, so it's really only useful in special situations; "local" is more appropriate in this situation.
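For anyone scripting this, a small Python sketch that assembles the suggested command line. The `minid=0` and `local` flags are the ones named above; the file names are hypothetical placeholders, and the helper `build_mappacbio_command` exists only for this example. It assumes the usual BBMap-style `ref=`, `in=` and `out=` parameters and Python 3.8 or later for `shlex.join`.

```python
import shlex

def build_mappacbio_command(ref, reads, out):
    """Build an argument list for mapPacBio.sh using the flags suggested above."""
    return [
        "mapPacBio.sh",
        "ref=" + ref,    # reference FASTA (here: the short exon sequences)
        "in=" + reads,   # long reads to map
        "out=" + out,    # alignment output file
        "minid=0",       # do not filter alignments by identity
        "local",         # allow local alignments
    ]

if __name__ == "__main__":
    # Hypothetical file names; print the command rather than running it.
    cmd = build_mappacbio_command("exons.fa", "reads.fq", "mapped.sam")
    print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once the paths point at real files and mapPacBio.sh is on the PATH.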

**GenoMax** · 08-27-2016, 05:49 AM

@lankage: You don't have to align to the short amplicon regions. You could align to the genome (and find out if you have any non-specific amplification along the way).

**GenoMax** · 08-30-2016, 04:41 AM

@moistplus: If you were to use bbmap.sh to do the alignments then you would get that information in the alignment report along with the bam file (as long as you have samtools available in $PATH).

**Shini Sunagawa** · 09-22-2016, 02:51 AM

Hi Brian,

Since I've seen increased activity here again lately, I was wondering if you might have thought about the issue we discussed back in January (~post #300). It was about dedupe not writing out the identifiers of exactly matched and contained sequences.

As mentioned before, solving this would make this tool very competitive with existing ones, due to the immense speed-up.

Thanks for your consideration!

Best wishes,
Shini
