Seqanswers Leaderboard Ad

**sdriscoll** · 11-14-2014, 12:06 PM

I think the answer to "which is best" is nontrivial and maybe there isn't a real answer at all. In STAR I remember reading the author created kind of a hard-coded logic where that aligner will prefer a zero-mismatch spliced alignment over a continuous alignment with mismatches. What STAR is trying to overcome there is the pseudogene problem where you have that exact situation of a read that can align spliced to the gene but continuous with a mismatch or two to the related pseudogene. I think it's probably best for you to keep your scoring system and then for people using your tools to just get a good understanding of it and use the tools wisely. Power users will always want to understand this little explanation of bbmap's decisions making process for spliced alignments since that is a particularly nasty part of RNA-seq (but obviously one of the greatest powers of RNA-seq).

Again using STAR as a reference (which is a RNA specific aligner), we are allowed to manipulate the penalties for different kinds of events. There are so-called 'gap open' and extension penalties but then there are separate penalties for introns. So in STAR I'm able to tell it NOT to penalize for introns at all leaving the alignment score alone aside from any mismatches. So by STAR logic the 50M20000N50M alignment with zero mismatches would have 100% identity. I think this is a very important aspect of RNA-seq alignment because we generally don't want to punish for finding introns.

Since BBMap provides that little switch (via the setting on maxindel and intronlen) would it be at all possible to incorporate a switch in the scoring mechanism that when a gapped alignment contains a space long enough to be switched to 'N' from 'D' the %identity calculation ignores that region in the alignment length or applies some standard penalty (like making it equal to a mismatch or two)? Otherwise I'm afraid the identity filtering logic as applied to RNA-seq is going to severely bias away from reporting spliced alignments since I don't want to allow continuous alignments with < 10% identity however a spliced alignment with perfect base for base matches could easily have < 10% identity and I'd want to keep those.

Am I understanding this correctly?

**Brian Bushnell** · 11-14-2014, 12:20 PM

To clarify, I was not planning on changing BBMap's default scoring system, just adding an additional score matrix (with, essentially, different penalties for substitutions and gap opening/extension). You can already switch between selecting the Illumina and PacBio score matrices at runtime with the "msa" flag; e.g. "msa=MultiStateAligner9PacBio"

The current choices are:
MultiStateAligner11ts (Designed for Illumina and generic use)
MultiStateAligner9PacBio (Designed for PacBio and Nanopore)
MultiStateAligner9Flat (Designed for producing scores close to % identity)

In my testing on RNA-seq data, BBMap does an excellent good job of deciding whether to place a read with a long splice and no mismatches, or with a few mismatches and no splice, even in the presence of intronless pseudogenes. And it would probably do an even better job with a new RNA-seq-specific score matrix. But your suggestion of changing the way mismatches are penalized based on whether an alignment is spliced seems interesting; I will have to investigate whether it is practical.

**susanklein** · 11-19-2014, 04:54 PM

Java error - threads error

Hi,

so far loving bbmap, but got this error today.

Here's my command line (I have 30 cores but 20 available atm, so set it to 18 because at first it tried to use 30, but still get the error below)

~/bin/bbmap/bbmap.sh threads=18 outputunmapped=f local=t maxindel=50 nodisk=f ref=plasmids.fna in=~/data/019/clip_fastq/1clip.fastq outu=1.fasta

errors:
BBMap version 33.90
Set threads to 18
Set DONT_OUTPUT_UNMAPPED_READS to true
Retaining first best site only for ambiguous mappings.
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference: 0.084 seconds.
Loading index for chunk 1-1, build 1
Generated Index: 1.347 seconds.
Analyzed Index: 4.701 seconds.
Started output stream: 0.154 seconds.
Cleared Memory: 0.316 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 18 mapping threads.
Exception in thread "Thread-8" java.lang.AssertionError: Read 8202, length 621, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:284)
Exception in thread "Thread-6" java.lang.AssertionError: Read 17996, length 604, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:284)
Exception in thread "Thread-11" java.lang.AssertionError: Read 28626, length 629, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:313)
Exception in thread "Thread-17" java.lang.AssertionError: Read 32533, length 601, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:284)
Exception in thread "Thread-7" java.lang.AssertionError: Read 33313, length 622, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:284)
Exception in thread "Thread-19" java.lang.AssertionError: Read 39168, length 620, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:284)
Exception in thread "Thread-20" java.lang.AssertionError: Read 40488, length 923, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:284)
Exception in thread "Thread-9" java.lang.AssertionError: Read 40793, length 695, exceeds the limit of 600
You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
at align2.AbstractMapThread.run(AbstractMapThread.java:313)
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17

**************************************************************************

Warning! 8 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************

------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 50
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 81634 (18891086 bases)

Mapping: 2.708 seconds.
Reads/sec: 15073.10
kBases/sec: 6976.19

Pairing data: pct reads num reads pct bases num bases

mated pairs: 0.0000% 0 0.0000% 0
bad pairs: 0.0000% 0 0.0000% 0
insert size avg: NaN

Read 1 data: pct reads num reads pct bases num bases

mapped: 0.1127% 46 0.1021% 9627
unambiguous: 0.1127% 46 0.1021% 9627
ambiguous: 0.0000% 0 0.0000% 0
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0000% 0 0.0000% 0
semiperfect site: 0.0000% 0 0.0000% 0
rescued: 0.0000% 0

Match Rate: NA NA 85.6567% 8283
Error Rate: 100.0000% 46 12.9162% 1249
Sub Rate: 100.0000% 46 12.1613% 1176
Del Rate: 34.7826% 16 0.4447% 43
Ins Rate: 28.2609% 13 0.3102% 30
N Rate: 34.7826% 16 1.4271% 138

Read 2 data: pct reads num reads pct bases num bases

mapped: 0.0759% 31 0.0610% 5768
unambiguous: 0.0759% 31 0.0610% 5768
ambiguous: 0.0000% 0 0.0000% 0
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0000% 0 0.0000% 0
semiperfect site: 0.0000% 0 0.0000% 0
rescued: 0.0000% 0

Match Rate: NA NA 86.4074% 5022
Error Rate: 100.0000% 31 12.2161% 710
Sub Rate: 100.0000% 31 11.0117% 640
Del Rate: 29.0323% 9 0.7571% 44
Ins Rate: 29.0323% 9 0.4474% 26
N Rate: 48.3871% 15 1.3765% 80
Exception in thread "main" java.lang.RuntimeException: BBMap terminated in an error state; the output may be corrupt.
at align2.BBMap.testSpeed(BBMap.java:469)
at align2.BBMap.main(BBMap.java:34)
all29c@aahl-02-mel:~/data/019/19.8$ ~/bin/bbmap/bbmap.sh threads=18 outputunmapped=f local=t maxindel=50 nodisk=f ref=plasmids.fna in=~/data/019/clip_fastq/1clip.fastq outu=1.fasta^C

**Brian Bushnell** · 11-19-2014, 05:01 PM

Susan,

BBMap only supports reads up to 600bp in normal mode. The PacBio version (mapPacBio.sh) supports reads up to 6kbp, and all of the syntax is identical. So if 6kbp is long enough, just use that version.

You can optionally shred long reads into shorter pieces with the flag 'maxlen', e.g. 'maxlen=600' will break all reads longer than 600bp into 600bp pieces and map them individually. So a 900bp read named "X" would be mapped as 600bp read "X_1" and 300bp read "X_2".

I plan to overcome the 6kbp limitation, but it's not trivial so it will take a while.

**susanklein** · 11-19-2014, 05:03 PM

Hi,

yes, I just realized that, thanks. I'm using 454 data so yes, I'll try the PacBio version.

Thanks very much,

S.

Originally posted by Brian Bushnell View Post

Susan,

BBMap only supports reads up to 600bp in normal mode. The PacBio version (mapPacBio.sh) supports reads up to 6kbp, and all of the syntax is identical. So if 6kbp is long enough, just use that version.

You can optionally shred long reads into shorter pieces with the flag 'maxlen', e.g. 'maxlen=600' will break all reads longer than 600bp into 600bp pieces and map them individually. So a 900bp read named "X" would be mapped as 600bp read "X_1" and 300bp read "X_2".

I plan to overcome the 6kbp limitation, but it's not trivial so it will take a while.

**susanklein** · 11-19-2014, 05:25 PM

Another very quick question.. How can I ouput the output stats, as printed to screen, to a file?

Thanks.

**Brian Bushnell** · 11-19-2014, 05:30 PM

You can do that with the redirect operator. BBTools print status info to standard err, not standard out, so you redirect it like this:

bbmap.sh arguments 2>log.txt

For reference, 1>x.txt 2>y.txt will capture both stdout and stderr to different files in case you use the 'out=stdout.fa' flag.

**susanklein** · 11-23-2014, 02:09 PM

ahhh. Thanks.. I was trying to just use '>'.

S.

**susanklein** · 11-24-2014, 12:30 AM

basecov problem

Hi,

I am trying to use the basecov option to output per base coverage so I can graph the results. However, the input file I am using is a pseudomolecule with genes separated by 150 Ns. When the output bam files is viewed, the coverage looks fine, but it does not match up with the basecov file and seems 'shifted' 3' by roughly 150bp.

Any ideas on what I am dong wrong?

Thanks,

S.

**Brian Bushnell** · 11-24-2014, 09:55 AM

Susan,

How big are these files? Would it be possible for you to send them to me? I am unaware of this ever happening and it sounds more like a bug than user error, but it's hard to know for sure without seeing the data...

**susanklein** · 11-24-2014, 02:43 PM

Brian, nevermind, my mistake, I have realised its just the coverage file includes the local alignment overlap into the 'Ns' whereas my bam viewer (Geneious) trims the overlap and doesn't show it as coverage. Problem solved. Thanks anyway.

S.

**Brian Bushnell** · 11-24-2014, 02:55 PM

OK, good to hear!

**sdriscoll** · 12-02-2014, 04:29 PM

Originally posted by Brian Bushnell View Post

Now, as for scoring - the affine score matrix is deeply embedded in BBMap. A 100bp read that maps perfectly with only matches has a 100% identity and BBMap score of 9970 (100 points per match, but only 70 points for the first match). If a 100bp read can map somewhere with 1 mismatch (99% identity, BBMap score of 9713: 9970+1*(-127-100-100+70)), that site will always be chosen over a site with 50bp match, 20k intron, 50bp match (4.7% identity, BBMap score of roughly 8841). But if the 100bp read can map somewhere with 5 mismatches (95% identity, BBMap score of 8685: 9970 +5*(-127-100-100+70)), the spliced alignment will be chosen, because 8841>8685. However they are very close so it would get a low mapping score.

Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.

**Brian Bushnell** · 12-02-2014, 04:55 PM

The scoring uses affine transforms, so consecutive cigar string symbols are scored differently according to how many there are. These are the scores:

Code:

POINTS_NOREF=0;
POINTS_NOCALL=0;
POINTS_MATCH=70;
POINTS_MATCH2=100;
POINTS_SUB=-127;
POINTS_SUB2=-51;
POINTS_SUB3=-25;
POINTS_INS=-395;
POINTS_INS2=-39;
POINTS_INS3=-23;
POINTS_INS4=-8;
POINTS_DEL=-472;
POINTS_DEL2=-33;
POINTS_DEL3=-9;
POINTS_DEL4=-1;
POINTS_DEL5=-1;

LIMIT_FOR_COST_2=1;
LIMIT_FOR_COST_3=5;
LIMIT_FOR_COST_4=20;
LIMIT_FOR_COST_5=80;

So, the first consecutive match is 70 points, and subsequent ones are 100 points. N is 0 points (whether in the reference or in the read). The first consecutive substitution is -127 points; subsequent ones are -51 points; and after 5 in a row, it drops to -25 points each.

The "LIMIT_FOR_COST_X" terms indicate how many consecutive symbols are needed before the cost drops. So, after 80 consecutive symbols (LIMIT_FOR_COST_5), the cost will drop to the lowest possible level. For deletions, this is POINTS_DEL5=-1; other types don't have a level-5 cost so just use the lowest available (e.g. POINTS_INS4 for insertions).

However, these costs are still not low enough to allow jumbo deletions. So I compress the deleted portions of the reference; every 128 bases are replaced by a single symbol, allowing effectively -1/128 point cost per deleted base, once the deletion is sufficiently long (though with present settings it costs -2/128).

Since it's pretty confusing to calculate the cost of a gap, you can currently do that with the main() function of one of the classes:

java -cp /path/to/current/ align2.MultiStateAligner11ts 20000

...will print out "-1129", which is the cost of a 20kbp gap. For a 100bp read, that would be added to the score from matching reads (2*70 + 98*100 = 9940) for a total of 8811.

Before I had said 8841, because I forgot about the additional 30-point penalty due to 2 bases being in the MATCH1 state, whereas in a perfect read, only the first base is in that state.

**sdriscoll** · 12-02-2014, 04:56 PM

Holy smokes. And thanks!

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 48 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News