Seqanswers Leaderboard Ad

**Brian Bushnell** · 12-02-2014, 05:05 PM

Originally posted by sdriscoll View Post

Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.

Oh, right - yes, the score ratio would correspond to 8841/9970 = 0.887, if 8841 was in fact the read's raw score.

**sdriscoll** · 12-04-2014, 12:14 PM

Hey Brian,

I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

To give you an idea of some handy filtering options have a look at the help text from my perl script.

Code:

usage: bbmap-filter.pl [options] <alignments.bam>

Parses alignments from bbmap and modifies them based on options.

Options:
  -p INT           Number of threads for processing (default: 2)
  -N INT           Maximum number of mismatches allowed (default: any)
  -d INT           Maximum number of deletions per alignment (default: any)
  -D INT           Maximum length of deletion allowed (default: any)
  -i INT           Maximum number of insertions per alignment (default: any)
  -I INT           Maximum length of insertion allowed (default: any)
  -e INT           Maximum number of INDELS total per alignment (default: any)
  -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
  -c               Set reads not passing filters to qc-failed (default: unaligned)
  -q INT           Minimum MAPQ (default: any)
  -a               Output alignments passing filter only (default: all)
  -h               Show this message and exit

This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.

**sdriscoll** · 12-04-2014, 01:05 PM

Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

*edit*

My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!

**Brian Bushnell** · 12-04-2014, 01:22 PM

Originally posted by sdriscoll View Post

Hey Brian,

I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

To give you an idea of some handy filtering options have a look at the help text from my perl script.

Code:

usage: bbmap-filter.pl [options] <alignments.bam>

Parses alignments from bbmap and modifies them based on options.

Options:
  -p INT           Number of threads for processing (default: 2)
  -N INT           Maximum number of mismatches allowed (default: any)
  -d INT           Maximum number of deletions per alignment (default: any)
  -D INT           Maximum length of deletion allowed (default: any)
  -i INT           Maximum number of insertions per alignment (default: any)
  -I INT           Maximum length of insertion allowed (default: any)
  -e INT           Maximum number of INDELS total per alignment (default: any)
  -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
  -c               Set reads not passing filters to qc-failed (default: unaligned)
  -q INT           Minimum MAPQ (default: any)
  -a               Output alignments passing filter only (default: all)
  -h               Show this message and exit

This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.

Those sound useful, and won't be very difficult, so I will go ahead and add them. Thanks for the suggestion!

Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

*edit*

My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!

OK, I will add that.

Incidentally, in my testing, JNI speeds up BBMap by around 25-30%, which is nice but not huge, probably because it is memory-bandwidth limited or too branchy. However, it speeds up Dedupe (when edits are allowed) and BBMerge by closer to 200%; both are compute-limited.

**sdriscoll** · 12-04-2014, 01:31 PM

Awesome. Yeah for all of the sam/bam utilities out there I haven't found one that could do elaborate filtering based on the details of the alignment. Probably because those tools can't guarantee that aligners are writing the alignment files with the necessary flags. Within the ecosystem of BBTools, however, you can control all...so that would be fantastic.

I'm testing JNI with bbmap now and will report back with the speedup.

Speedup: ~ 2%

This was an alignment that I'd feed into eXpress for gene expression quantification. Options used:

Code:

t=8 ambig=all minid=0.95 usejni=t saa=t secondary=t maxsites=20 sam=1.3 trd=t

**Brian Bushnell** · 12-04-2014, 05:09 PM

That's a little disappointing, but thanks for the report! The speedup should be greater the worse the reads match the genome, as the JNI portion only affects the affine-transform alignment, which is not used unless reads have more than 1 mismatch to the reference.

**Thias** · 12-05-2014, 09:01 AM

bbmap and multimapper MAPQ

Hello Brian and others,

Usually I run bbmap with the ambig=best parameter and am good to go, but for one of our datasets, it may prove favourable to give multi-mapping reads some special treatment. Therefore I now have to familiarize myself with the SAM format and find optimal settings for high quality mapping of spliced stranded reads.

I ran bbmap for an initial test with those parameters

Code:

ambig=all minid=0.9 padding=6 tipsearch=200 maxindel=200000 intronlen=5 secondary=T quickmatch=T sssr=0.97 local=T saa=F xstag=T xmtag=T nhtag=T

and am now messing around with the alignments. The current idea is to use bbmaps real MAPQ and modify it according to our additional criteria. Then I will sort the file based on the read ID and secondarily descending on the final MAPQ and neglect ID duplicates to resolve multi-mappers.

However when I check the original MAPQ, I only see qualities between 1-3 despite I have chosen local=T option. Shouldn't there be some higher MAPQ reads also within the multi-mappers? Or did I misunderstand the local option?

Code:

# All MAPQ
awk -F "\t" '!/^@/ {print $5}' $SAMFILE | sort | uniq
# Only multimapping MAPQ
grep -Fvw "NH:i:1" $SAMFILE | awk -F "\t" '!/^@/ {print $5}' | sort | uniq

Thanks a lot
Matthias

PS: Support for something like sdriscoll's bbmap-filter.pl becoming a part of bbmap suite.

**Brian Bushnell** · 12-05-2014, 10:43 AM

Hmm... all multi-mapping reads get their quality penalized, according to how many sites there are and how close the scores are. Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.

You can also use the flag "idtag", though, which will print each read's percent identity to the reference in a custom field. This value is unaffected by multimapping, so you could use that for unbiased filtering by mapping quality.

**Thias** · 12-07-2014, 08:13 AM

Originally posted by Brian Bushnell View Post

Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.

Thanks a lot for your kind offer, but your hint with the "idtag" is perfectly fine for our purpose, so no need to adjust your code. I just misunderstood the doings of the local=T flag in exactly the proposed way. Since some downstream tools may depend on the real MAPQ, on second thought it now anyway seems wiser to introduce an extra column for that custom score, resolve the multi-mappers according to it and kick that column out again for the final SAM file.

Thanks a lot!
Matthias

**Brian Bushnell** · 12-08-2014, 04:25 PM

sdriscoll and Matthias,

I've added the following flags:

subfilter=-1 Ban alignments with more than this many substitutions.
insfilter=-1 Ban alignments with more than this many insertions.
delfilter=-1 Ban alignments with more than this many deletions.
indelfilter=-1 Ban alignments with more than this many indels.
editfilter=-1 Ban alignments with more than this many edits.
inslenfilter=-1 Ban alignments with an insertion longer than this.
dellenfilter=-1 Ban alignments with a deletion longer than this.

(all of those have no effect if the value is negative)

penalizeambiguous=t (pambig) Penalize the mapping score of reads that multimap.

**Thias** · 12-09-2014, 03:10 AM

Hello Brian,

Kudos to you for doing all of that at no time. This will certainly make the bbmap suite even more useful than it already is.

Thanks again
Matthias

BTW: I solved my problem by using the idtag and the scoretag:

Code:

 grep -o 'YR:i:[0-9]*' $MULTIMAPSAMFILE | cut -f3- -d':'

I tweaked the scores a bit for our purpose and pasted them as first column to the sam. Then I sorted the file on the read ID and the scores and removed the duplicates.

sort -k2,2 -r -k1,1 $SCOREDMULTIMAPSAMFILE | awk -F "\t" '!x[$2]++' | cut -f2-

**tkc.yang** · 12-18-2014, 11:41 AM

Hi Brian,

For the mismatch filter, is there (or could there be) an option for number of mismatches accepted based on the length of the read? My reads have variable lengths and some are much longer than others. So, I'm wondering if setting an absolute number of mismatches (eg. 2) would be too stringent on the long reads...

Also, my long reads tend to drop off in quality towards the right end, so I want to trim the right end off long reads (eg. keep only the first 200bp) without affecting the shorter reads. Is there good a way to do this? I can't trim a set number of bp because that would trim the short reads; and for maxlen option, I could break long reads (from the left side?), but I don't want to align the remaining right ends of long reads.

Thanks!

**Brian Bushnell** · 12-18-2014, 11:57 AM

You can use "idfilter" to allow a variable number of edits depending on the read length. e.g. idfilter=0.98 would allow up to 4 edits in a 200bp read but only 2 in a 100bp read. For quality-trimming, you have a couple of options. You can use BBDuk to forcibly trim all reads to at most 200bp like this:

bbduk.sh in=reads.fq out=trimmed.fq forcetrimright=199

Alternately, you can use "qtrim=r trimq=10" for example to trim the right side of the read to Q10, so only low-quality bases will be removed (you can set that higher if you want). This flag works in BBDuk or BBMap. BBMap also allows you to trim before mapping, then restore the read to the original length afterward, using the "untrim" flag, for example:

bbmap.sh in=reads.fq out=mapped.sam qtrim=r trimq=10 untrim

This is useful when you only want to allow high-identity alignments, but you don't want to discard the low-quality bases at the end of the read.

**rkanwar** · 12-18-2014, 01:11 PM

Hello Brian,

I aligned fastq paired end reads using bbmap. I then tried to sort the SAM file using Picard and I am getting the following error:

Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Mate Alignment start (146342323) must be <= reference seque
nce length (90354753) on reference 16; Line 1211
Line: R0266248:292:C5C0TACXX:4:1101:4088:2519 161 16 33530580 44 98= = 146342323 112811852 ATAGAAAATTATTCA
GCTATATTCACTGCCTCACCACCTTTGTTTTTTTGTACACAAAAAATAACATTATCATTATTTGATTGCTCTCATGAAGCACT CCCFFFFFGHHHHJJJJJJJJIJJJJJJJJJJIJJJJFJJJJIIJJJIJIJIIJJJIJJIHHH
FFFFFFEEEEEEEFEFEDDDDDDDDDEDDDDDDDD NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1

It appears that for this read, each end is aligning to a different contig. But in the SAM record we have an '=' for the mate's contig name, which might be causing it. Is my reasoning correct, and how do I fix it ? Thanks a lot for your help.

**Brian Bushnell** · 12-18-2014, 01:17 PM

Sounds like a bug, and I agree, it seems as though the mate is mapped to a different (longer) contig. What version of BBMap are you using? Also, can you give me the complete command line and, if possible, the text of the second read? Thanks!

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 22 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News