Originally posted by sdriscoll
View Post
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Hey Brian,
I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.
So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.
To give you an idea of some handy filtering options have a look at the help text from my perl script.
Code:usage: bbmap-filter.pl [options] <alignments.bam> Parses alignments from bbmap and modifies them based on options. Options: -p INT Number of threads for processing (default: 2) -N INT Maximum number of mismatches allowed (default: any) -d INT Maximum number of deletions per alignment (default: any) -D INT Maximum length of deletion allowed (default: any) -i INT Maximum number of insertions per alignment (default: any) -I INT Maximum length of insertion allowed (default: any) -e INT Maximum number of INDELS total per alignment (default: any) -E INT Maximum total edits (INDELS+mismatches) per alignment (default: any) -c Set reads not passing filters to qc-failed (default: unaligned) -q INT Minimum MAPQ (default: any) -a Output alignments passing filter only (default: all) -h Show this message and exit
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Comment
-
Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.
*edit*
My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Comment
-
Originally posted by sdriscoll View PostHey Brian,
I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.
So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.
To give you an idea of some handy filtering options have a look at the help text from my perl script.
Code:usage: bbmap-filter.pl [options] <alignments.bam> Parses alignments from bbmap and modifies them based on options. Options: -p INT Number of threads for processing (default: 2) -N INT Maximum number of mismatches allowed (default: any) -d INT Maximum number of deletions per alignment (default: any) -D INT Maximum length of deletion allowed (default: any) -i INT Maximum number of insertions per alignment (default: any) -I INT Maximum length of insertion allowed (default: any) -e INT Maximum number of INDELS total per alignment (default: any) -E INT Maximum total edits (INDELS+mismatches) per alignment (default: any) -c Set reads not passing filters to qc-failed (default: unaligned) -q INT Minimum MAPQ (default: any) -a Output alignments passing filter only (default: all) -h Show this message and exit
Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.
*edit*
My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
Incidentally, in my testing, JNI speeds up BBMap by around 25-30%, which is nice but not huge, probably because it is memory-bandwidth limited or too branchy. However, it speeds up Dedupe (when edits are allowed) and BBMerge by closer to 200%; both are compute-limited.
Comment
-
Awesome. Yeah for all of the sam/bam utilities out there I haven't found one that could do elaborate filtering based on the details of the alignment. Probably because those tools can't guarantee that aligners are writing the alignment files with the necessary flags. Within the ecosystem of BBTools, however, you can control all...so that would be fantastic.
I'm testing JNI with bbmap now and will report back with the speedup.
Speedup: ~ 2%
This was an alignment that I'd feed into eXpress for gene expression quantification. Options used:
Code:t=8 ambig=all minid=0.95 usejni=t saa=t secondary=t maxsites=20 sam=1.3 trd=t
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Comment
-
bbmap and multimapper MAPQ
Hello Brian and others,
Usually I run bbmap with the ambig=best parameter and am good to go, but for one of our datasets, it may prove favourable to give multi-mapping reads some special treatment. Therefore I now have to familiarize myself with the SAM format and find optimal settings for high quality mapping of spliced stranded reads.
I ran bbmap for an initial test with those parameters
Code:ambig=all minid=0.9 padding=6 tipsearch=200 maxindel=200000 intronlen=5 secondary=T quickmatch=T sssr=0.97 local=T saa=F xstag=T xmtag=T nhtag=T
However when I check the original MAPQ, I only see qualities between 1-3 despite I have chosen local=T option. Shouldn't there be some higher MAPQ reads also within the multi-mappers? Or did I misunderstand the local option?
Code:# All MAPQ awk -F "\t" '!/^@/ {print $5}' $SAMFILE | sort | uniq # Only multimapping MAPQ grep -Fvw "NH:i:1" $SAMFILE | awk -F "\t" '!/^@/ {print $5}' | sort | uniq
Matthias
PS: Support for something like sdriscoll's bbmap-filter.pl becoming a part of bbmap suite.
Comment
-
Hmm... all multi-mapping reads get their quality penalized, according to how many sites there are and how close the scores are. Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
You can also use the flag "idtag", though, which will print each read's percent identity to the reference in a custom field. This value is unaffected by multimapping, so you could use that for unbiased filtering by mapping quality.
Comment
-
Originally posted by Brian Bushnell View PostBecause the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
Thanks a lot!
Matthias
Comment
-
sdriscoll and Matthias,
I've added the following flags:
subfilter=-1 Ban alignments with more than this many substitutions.
insfilter=-1 Ban alignments with more than this many insertions.
delfilter=-1 Ban alignments with more than this many deletions.
indelfilter=-1 Ban alignments with more than this many indels.
editfilter=-1 Ban alignments with more than this many edits.
inslenfilter=-1 Ban alignments with an insertion longer than this.
dellenfilter=-1 Ban alignments with a deletion longer than this.
(all of those have no effect if the value is negative)
penalizeambiguous=t (pambig) Penalize the mapping score of reads that multimap.
Comment
-
Hello Brian,
Kudos to you for doing all of that at no time. This will certainly make the bbmap suite even more useful than it already is.
Thanks again
Matthias
BTW: I solved my problem by using the idtag and the scoretag:
Code:grep -o 'YR:i:[0-9]*' $MULTIMAPSAMFILE | cut -f3- -d':'
sort -k2,2 -r -k1,1 $SCOREDMULTIMAPSAMFILE | awk -F "\t" '!x[$2]++' | cut -f2-
Comment
-
Hi Brian,
For the mismatch filter, is there (or could there be) an option for number of mismatches accepted based on the length of the read? My reads have variable lengths and some are much longer than others. So, I'm wondering if setting an absolute number of mismatches (eg. 2) would be too stringent on the long reads...
Also, my long reads tend to drop off in quality towards the right end, so I want to trim the right end off long reads (eg. keep only the first 200bp) without affecting the shorter reads. Is there good a way to do this? I can't trim a set number of bp because that would trim the short reads; and for maxlen option, I could break long reads (from the left side?), but I don't want to align the remaining right ends of long reads.
Thanks!
Comment
-
You can use "idfilter" to allow a variable number of edits depending on the read length. e.g. idfilter=0.98 would allow up to 4 edits in a 200bp read but only 2 in a 100bp read. For quality-trimming, you have a couple of options. You can use BBDuk to forcibly trim all reads to at most 200bp like this:
bbduk.sh in=reads.fq out=trimmed.fq forcetrimright=199
Alternately, you can use "qtrim=r trimq=10" for example to trim the right side of the read to Q10, so only low-quality bases will be removed (you can set that higher if you want). This flag works in BBDuk or BBMap. BBMap also allows you to trim before mapping, then restore the read to the original length afterward, using the "untrim" flag, for example:
bbmap.sh in=reads.fq out=mapped.sam qtrim=r trimq=10 untrim
This is useful when you only want to allow high-identity alignments, but you don't want to discard the low-quality bases at the end of the read.
Comment
-
Hello Brian,
I aligned fastq paired end reads using bbmap. I then tried to sort the SAM file using Picard and I am getting the following error:
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Mate Alignment start (146342323) must be <= reference seque
nce length (90354753) on reference 16; Line 1211
Line: R0266248:292:C5C0TACXX:4:1101:4088:2519 161 16 33530580 44 98= = 146342323 112811852 ATAGAAAATTATTCA
GCTATATTCACTGCCTCACCACCTTTGTTTTTTTGTACACAAAAAATAACATTATCATTATTTGATTGCTCTCATGAAGCACT CCCFFFFFGHHHHJJJJJJJJIJJJJJJJJJJIJJJJFJJJJIIJJJIJIJIIJJJIJJIHHH
FFFFFFEEEEEEEFEFEDDDDDDDDDEDDDDDDDD NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1
Comment
Latest Articles
Collapse
-
by seqadmin
Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...-
Channel: Articles
10-18-2024, 07:11 AM -
-
by seqadmin
Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.
Nobel Prize for MicroRNA Discovery
This week,...-
Channel: Articles
10-07-2024, 08:07 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks
by seqadmin
Started by seqadmin, Yesterday, 05:31 AM
|
0 responses
10 views
0 likes
|
Last Post
by seqadmin
Yesterday, 05:31 AM
|
||
Started by seqadmin, 10-24-2024, 06:58 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
10-24-2024, 06:58 AM
|
||
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types
by seqadmin
Started by seqadmin, 10-23-2024, 08:43 AM
|
0 responses
50 views
0 likes
|
Last Post
by seqadmin
10-23-2024, 08:43 AM
|
||
Started by seqadmin, 10-17-2024, 07:29 AM
|
0 responses
58 views
0 likes
|
Last Post
by seqadmin
10-17-2024, 07:29 AM
|
Comment