Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by sdriscoll View Post
    Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.
    Oh, right - yes, the score ratio would correspond to 8841/9970 = 0.887, if 8841 was in fact the read's raw score.

    Comment


    • Hey Brian,

      I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

      So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

      To give you an idea of some handy filtering options have a look at the help text from my perl script.

      Code:
      usage: bbmap-filter.pl [options] <alignments.bam>
      
      Parses alignments from bbmap and modifies them based on options.
      
      Options:
        -p INT           Number of threads for processing (default: 2)
        -N INT           Maximum number of mismatches allowed (default: any)
        -d INT           Maximum number of deletions per alignment (default: any)
        -D INT           Maximum length of deletion allowed (default: any)
        -i INT           Maximum number of insertions per alignment (default: any)
        -I INT           Maximum length of insertion allowed (default: any)
        -e INT           Maximum number of INDELS total per alignment (default: any)
        -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
        -c               Set reads not passing filters to qc-failed (default: unaligned)
        -q INT           Minimum MAPQ (default: any)
        -a               Output alignments passing filter only (default: all)
        -h               Show this message and exit
      This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

        *edit*

        My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
        Last edited by sdriscoll; 12-04-2014, 01:17 PM. Reason: figured it out
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • Originally posted by sdriscoll View Post
          Hey Brian,

          I don't know if this would be efficient to add to the tail end of bbmap's pipeline or if it would work better as a separate program but for me, at least, it would be very handy to be able to do some filtering based on various options. I've made perl script to do exactly what I need but maybe the filtering can happen very rapidly in the same way using idfilter=N does in bbmap.

          So because the scoring system and final identity values make it maybe impossible to filter the alignments in way I'd want for RNA-seq alignments to the genome I'm finding that what works better is to let bbmap run with near defaults and then filter afterwards. For example in my final RNA-seq alignment I would consider a spliced alignment with 2 mismatches to be equal to an unspliced alignment with 2 mismatches. With bbmap's scoring system those two conditions are very different with the spliced alignment sometimes having a final identity of less than 5% while less than 5% on a continuous alignment would likely include many mismatches.

          To give you an idea of some handy filtering options have a look at the help text from my perl script.

          Code:
          usage: bbmap-filter.pl [options] <alignments.bam>
          
          Parses alignments from bbmap and modifies them based on options.
          
          Options:
            -p INT           Number of threads for processing (default: 2)
            -N INT           Maximum number of mismatches allowed (default: any)
            -d INT           Maximum number of deletions per alignment (default: any)
            -D INT           Maximum length of deletion allowed (default: any)
            -i INT           Maximum number of insertions per alignment (default: any)
            -I INT           Maximum length of insertion allowed (default: any)
            -e INT           Maximum number of INDELS total per alignment (default: any)
            -E INT           Maximum total edits (INDELS+mismatches) per alignment (default: any)
            -c               Set reads not passing filters to qc-failed (default: unaligned)
            -q INT           Minimum MAPQ (default: any)
            -a               Output alignments passing filter only (default: all)
            -h               Show this message and exit
          This gives me some flexibility so I can filter out high substitution unspliced alignments and keep those low substitution spliced alignments. Also on occasion I like to filter alignments by only allowing a single edit (whether it is a single base indel or a single substitution). Anyways, if you're bored, it wouldn't be a bad addition to the suite of tools to have some kind of filter like this.
          Those sound useful, and won't be very difficult, so I will go ahead and add them. Thanks for the suggestion!

          Quick question - I wanted to test out the usejni option so I rolled into the ./jni folder to compile the c codes and ran into an issue right away. Compiling failed saying that 'jni.h' was not found.

          *edit*

          My system did not have JAVA_HOME set, you might add a note about that in the README.txt file to set that first before compiling. And maybe also to make sure to have the correct JDK installed. My Ubuntu system I've been using bbmap on did not have that installed. So for those non-java devs out there that are using your tools...this note's for them!
          OK, I will add that.

          Incidentally, in my testing, JNI speeds up BBMap by around 25-30%, which is nice but not huge, probably because it is memory-bandwidth limited or too branchy. However, it speeds up Dedupe (when edits are allowed) and BBMerge by closer to 200%; both are compute-limited.

          Comment


          • Awesome. Yeah for all of the sam/bam utilities out there I haven't found one that could do elaborate filtering based on the details of the alignment. Probably because those tools can't guarantee that aligners are writing the alignment files with the necessary flags. Within the ecosystem of BBTools, however, you can control all...so that would be fantastic.

            I'm testing JNI with bbmap now and will report back with the speedup.

            Speedup: ~ 2%

            This was an alignment that I'd feed into eXpress for gene expression quantification. Options used:

            Code:
            t=8 ambig=all minid=0.95 usejni=t saa=t secondary=t maxsites=20 sam=1.3 trd=t
            Last edited by sdriscoll; 12-04-2014, 01:36 PM. Reason: added speedup result
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • That's a little disappointing, but thanks for the report! The speedup should be greater the worse the reads match the genome, as the JNI portion only affects the affine-transform alignment, which is not used unless reads have more than 1 mismatch to the reference.

              Comment


              • bbmap and multimapper MAPQ

                Hello Brian and others,

                Usually I run bbmap with the ambig=best parameter and am good to go, but for one of our datasets, it may prove favourable to give multi-mapping reads some special treatment. Therefore I now have to familiarize myself with the SAM format and find optimal settings for high quality mapping of spliced stranded reads.

                I ran bbmap for an initial test with those parameters
                Code:
                ambig=all minid=0.9 padding=6 tipsearch=200 maxindel=200000 intronlen=5 secondary=T quickmatch=T sssr=0.97 local=T saa=F xstag=T xmtag=T nhtag=T
                and am now messing around with the alignments. The current idea is to use bbmaps real MAPQ and modify it according to our additional criteria. Then I will sort the file based on the read ID and secondarily descending on the final MAPQ and neglect ID duplicates to resolve multi-mappers.

                However when I check the original MAPQ, I only see qualities between 1-3 despite I have chosen local=T option. Shouldn't there be some higher MAPQ reads also within the multi-mappers? Or did I misunderstand the local option?

                Code:
                # All MAPQ
                awk -F "\t" '!/^@/ {print $5}' $SAMFILE | sort | uniq
                # Only multimapping MAPQ
                grep -Fvw "NH:i:1" $SAMFILE | awk -F "\t" '!/^@/ {print $5}' | sort | uniq
                Thanks a lot
                Matthias

                PS: Support for something like sdriscoll's bbmap-filter.pl becoming a part of bbmap suite.

                Comment


                • Hmm... all multi-mapping reads get their quality penalized, according to how many sites there are and how close the scores are. Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.

                  You can also use the flag "idtag", though, which will print each read's percent identity to the reference in a custom field. This value is unaffected by multimapping, so you could use that for unbiased filtering by mapping quality.

                  Comment


                  • Originally posted by Brian Bushnell View Post
                    Because the SAM spec indicates the mapq is supposed to indicate the probability that the location is correct, a read that maps to 2 places with identical scores will get a maximum mapq of 3. There is currently no way to disable this score penalty, but I will put in a flag to disable it.
                    Thanks a lot for your kind offer, but your hint with the "idtag" is perfectly fine for our purpose, so no need to adjust your code. I just misunderstood the doings of the local=T flag in exactly the proposed way. Since some downstream tools may depend on the real MAPQ, on second thought it now anyway seems wiser to introduce an extra column for that custom score, resolve the multi-mappers according to it and kick that column out again for the final SAM file.

                    Thanks a lot!
                    Matthias

                    Comment


                    • sdriscoll and Matthias,

                      I've added the following flags:

                      subfilter=-1 Ban alignments with more than this many substitutions.
                      insfilter=-1 Ban alignments with more than this many insertions.
                      delfilter=-1 Ban alignments with more than this many deletions.
                      indelfilter=-1 Ban alignments with more than this many indels.
                      editfilter=-1 Ban alignments with more than this many edits.
                      inslenfilter=-1 Ban alignments with an insertion longer than this.
                      dellenfilter=-1 Ban alignments with a deletion longer than this.

                      (all of those have no effect if the value is negative)

                      penalizeambiguous=t (pambig) Penalize the mapping score of reads that multimap.

                      Comment


                      • Hello Brian,

                        Kudos to you for doing all of that at no time. This will certainly make the bbmap suite even more useful than it already is.

                        Thanks again
                        Matthias

                        BTW: I solved my problem by using the idtag and the scoretag:

                        Code:
                         grep -o 'YR:i:[0-9]*' $MULTIMAPSAMFILE | cut -f3- -d':'
                        I tweaked the scores a bit for our purpose and pasted them as first column to the sam. Then I sorted the file on the read ID and the scores and removed the duplicates.
                        sort -k2,2 -r -k1,1 $SCOREDMULTIMAPSAMFILE | awk -F "\t" '!x[$2]++' | cut -f2-

                        Comment


                        • Hi Brian,

                          For the mismatch filter, is there (or could there be) an option for number of mismatches accepted based on the length of the read? My reads have variable lengths and some are much longer than others. So, I'm wondering if setting an absolute number of mismatches (eg. 2) would be too stringent on the long reads...

                          Also, my long reads tend to drop off in quality towards the right end, so I want to trim the right end off long reads (eg. keep only the first 200bp) without affecting the shorter reads. Is there good a way to do this? I can't trim a set number of bp because that would trim the short reads; and for maxlen option, I could break long reads (from the left side?), but I don't want to align the remaining right ends of long reads.

                          Thanks!

                          Comment


                          • You can use "idfilter" to allow a variable number of edits depending on the read length. e.g. idfilter=0.98 would allow up to 4 edits in a 200bp read but only 2 in a 100bp read. For quality-trimming, you have a couple of options. You can use BBDuk to forcibly trim all reads to at most 200bp like this:

                            bbduk.sh in=reads.fq out=trimmed.fq forcetrimright=199

                            Alternately, you can use "qtrim=r trimq=10" for example to trim the right side of the read to Q10, so only low-quality bases will be removed (you can set that higher if you want). This flag works in BBDuk or BBMap. BBMap also allows you to trim before mapping, then restore the read to the original length afterward, using the "untrim" flag, for example:

                            bbmap.sh in=reads.fq out=mapped.sam qtrim=r trimq=10 untrim

                            This is useful when you only want to allow high-identity alignments, but you don't want to discard the low-quality bases at the end of the read.

                            Comment


                            • Hello Brian,

                              I aligned fastq paired end reads using bbmap. I then tried to sort the SAM file using Picard and I am getting the following error:

                              Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Mate Alignment start (146342323) must be <= reference seque
                              nce length (90354753) on reference 16; Line 1211
                              Line: R0266248:292:C5C0TACXX:4:1101:4088:2519 161 16 33530580 44 98= = 146342323 112811852 ATAGAAAATTATTCA
                              GCTATATTCACTGCCTCACCACCTTTGTTTTTTTGTACACAAAAAATAACATTATCATTATTTGATTGCTCTCATGAAGCACT CCCFFFFFGHHHHJJJJJJJJIJJJJJJJJJJIJJJJFJJJJIIJJJIJIJIIJJJIJJIHHH
                              FFFFFFEEEEEEEFEFEDDDDDDDDDEDDDDDDDD NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1
                              It appears that for this read, each end is aligning to a different contig. But in the SAM record we have an '=' for the mate's contig name, which might be causing it. Is my reasoning correct, and how do I fix it ? Thanks a lot for your help.

                              Comment


                              • Sounds like a bug, and I agree, it seems as though the mate is mapped to a different (longer) contig. What version of BBMap are you using? Also, can you give me the complete command line and, if possible, the text of the second read? Thanks!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Choosing Between NGS and qPCR
                                  by seqadmin



                                  Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                  10-18-2024, 07:11 AM
                                • seqadmin
                                  Non-Coding RNA Research and Technologies
                                  by seqadmin




                                  Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                  Nobel Prize for MicroRNA Discovery
                                  This week,...
                                  10-07-2024, 08:07 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 05:31 AM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-24-2024, 06:58 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-23-2024, 08:43 AM
                                0 responses
                                48 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-17-2024, 07:29 AM
                                0 responses
                                58 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X