WhatsOEver,
If you have a cigar string like this:
20=1I20=1D20=25D20=20I20=
...I would calculate:
(20+20+20+20+20) matches / (20+1+20+root(1)+20+root(25)+20+20+20)
=
(100)/(20+1+20+1+20+5+20+20+20)
=
(100)/(127) = ~78.7%
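For concreteness, here's a quick Python sketch of that calculation (just an illustration of the formula as I described it above, not BBMap's actual code; the helper name is made up). It parses an extended cigar string, counts =, X, and I at full length, and discounts each deletion to the square root of its length:

```python
import re
from math import sqrt

def cigar_identity(cigar):
    """Identity from an extended CIGAR string: '=' bases in the numerator;
    '=', 'X', and 'I' counted at full length in the denominator, and each
    deletion ('D') discounted to the square root of its length."""
    matches = 0.0
    denom = 0.0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op == "=":
            matches += n
            denom += n
        elif op in "XI":        # mismatches and insertions count in full
            denom += n
        elif op == "D":         # deletions are down-weighted
            denom += sqrt(n)
        # N/S/H/P (and ambiguous M) are ignored in this sketch
    return matches / denom if denom else 0.0

print(cigar_identity("20=1I20=1D20=25D20=20I20="))  # -> ~0.787 (100/127)
```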
I know that's asymmetric - if there's a deletion, aligning the reference to the read would give a different identity than aligning the read to the reference - but there are a few reasons for that.

First, you can get deletions of unbounded length in cigar strings, but insertions can never be longer than the read length, so (by my logic) it is important to correct for deletions but not really for anything else.

Second, insertions occur INSTEAD of matches, while deletions occur IN ADDITION to matches. So if a 100bp read has a 1bp match and a 99bp insertion, does it make sense to call that 1/(1+root(99)) = about 10% identity? Not really. But with a 100bp deletion in the middle, you'd have 100/110 = about 90% identity, which is maybe a bit high, but it seems reasonable for an alignment with two nearby consecutive 50bp matches, which can't arise by chance.

Third, it seems to me that long deletions are more common than long insertions, though that could be discovery bias.

Lastly, decreasing the identity penalty for insertions would cause a particular problem. Say you have a chimeric 200bp read, 100bp from somewhere in chromosome 1 and 100bp from chromosome 2. You could map it in two places, with cigar strings of 100=100I and 100I100=, or 100=100X and 100X100=. My current formula gives 50% identity for either location, whether the chimeric bases are marked X or I. But if you take the root of the insertions, you get 100/(100+10) = 91% identity to two different sequences that have nothing in common, which does not make sense.
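To make that last (chimeric) point concrete, here's a tiny sketch (again, just my own illustration) comparing the two policies for the 100=100I alignment:

```python
from math import sqrt

# Chimeric read aligned as 100=100I: 100 matching bases, 100 inserted bases.
matches, insertion = 100, 100

full_penalty = matches / (matches + insertion)        # insertions counted in full
sqrt_penalty = matches / (matches + sqrt(insertion))  # insertions down-weighted

print(full_penalty)  # 0.50 -> sensibly low for a half-chimeric read
print(sqrt_penalty)  # ~0.91 -> misleadingly high for a sequence sharing only half the read
```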
It could be that calculating identity this way is not a good idea; I may put in a flag to remove the sqrt from deletions. But it seems to me that it gives a more useful answer for short-read alignment in recruitment studies, which is what I actually put the idtag in for.
By default, intronlen is disabled. However, if you set other TopHat-oriented flags like "xstag" and you don't explicitly set "intronlen", it will be automatically enabled and set to 10.
Originally posted by WhatsOEver:
And one last question out of interest, as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (comparative genomics of 200+ asco- and basidiomycetes), they are rarely longer than 100bp, and I never saw one longer than 3kb (though at the moment I'm not sure if that one was even mitochondrial).

Well, I said "for plants and fungi". I work with fungi, and the introns I've seen tend to be 300bp or less; some of the plants I've looked at also seem to have short introns, typically around 250bp. But I don't know whether there might be some out there that are different! The main reason is that very high settings of 'maxindel' (over 100kb) decrease speed and accuracy, while a moderate setting like 16000 doesn't really have a noticeable effect on either, yet it COULD let you discover something new. So I generally run with 16k as the default for BBMap, even on DNA. The PacBio version has a lower default because PacBio reads are very long and have a high error rate (which exacerbates the effect on speed and accuracy), and we don't do RNA-seq for PacBio right now.