Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    BFAST published

    The paper describing the algorithm behind BFAST has now been published.

    Homer N, Merriman B, Nelson SF.
    BFAST: An alignment tool for large scale genome resequencing.
    PMID: TBD
    PLoS ONE. 2009 4(11): e7767.
    Background The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, result in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation. Methodology We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. Conclusions We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.


    Background
    The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, result in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.

    Methodology
    We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.

    Conclusions
    We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
  • ECO
    --Site Admin--
    • Oct 2007
    • 1360

    #2
    Congrats Nils.

    Comment

    • dawe
      Senior Member
      • Apr 2009
      • 258

      #3
      Congratulations!
      I've downloaded it and if it runs as stated in the paper I'm ready to use it especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
      SAM output is a great feature, I'm running almost aligner-independent now :-)

      Comment

      • lh3
        Senior Member
        • Feb 2008
        • 686

        #4
        I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.

        In SNP calling given short reads, alignment global to reads has some advantage. Suppose the 3rd base in a read is a true mutation. Local alignment will clip the read and as a result you see more reads having the reference allele than having the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.

        Comment

        • nilshomer
          Nils Homer
          • Nov 2008
          • 1283

          #5
          Originally posted by lh3 View Post
          I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.
          What panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. For false-mapping, with no errors I don't see a >0% false mapping rate in Figure 3.

          In SNP calling given short reads, alignment global to reads has some advantage. Suppose the 3rd base in a read is a true mutation. Local alignment will clip the read and as a result you see more reads having the reference allele than having the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
          We don't clip the reads. The "local alignment" we use is a global alignment to a window the candidate location. In other words, we enforce that the entire read must align to some subset of bases in the reference.

          Comment

          • nilshomer
            Nils Homer
            • Nov 2008
            • 1283

            #6
            Originally posted by dawe View Post
            Congratulations!
            I've downloaded it and if it runs as stated in the paper I'm ready to use it especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
            SAM output is a great feature, I'm running almost aligner-independent now :-)
            @dawe
            The current version allows the user to get rid of the memory problem at the cost of some speed. You use the "-d" option to split the index during index creation. This allows you to run BFAST in 4GB environments etc.

            @dawe and @lh3
            The assumption that the error rate is below 6% is not valid for ABI SOLiD data. With BFAST's ability to be very sensitive, we regularly observe runs with >10% (color/raw) error rates towards the ends of reads, even though after alignment the base error rate is well below 1% (kudos to 2-base encoding).

            Comment

            • lh3
              Senior Member
              • Feb 2008
              • 686

              #7
              Originally posted by nilshomer View Post
              What panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. For false-mapping, with no errors I don't see a >0% false mapping rate in Figure 3.
              Unless sensitivity is too bad, I do not care too much about false negatives. I am looking at false positives (2nd column). At 0-mismatch/indel, both bwa and bowtie have ~6% error rate while others have 0%.

              We don't clip the reads. The "local alignment" we use is a global alignment to a window the candidate location. In other words, we enforce that the entire read must align to some subset of bases in the reference.
              This is reassuring. Thanks for the explanation.

              Comment

              • NGSfan
                Senior Member
                • Apr 2009
                • 181

                #8
                Congratulations Nils Homer!

                I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

                My only question in regards to BFAST performance, is if you have recently compared it to SSAHA2 ? I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

                Thanks for releasing your software. I will be trying it now.

                Comment

                • Michael.James.Clark
                  Senior Member
                  • Apr 2009
                  • 207

                  #9
                  Congratulations, Nils!

                  Originally posted by lh3 View Post
                  I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.
                  It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero-two nucleotide errors.

                  What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.

                  Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it.
                  Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                  Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                  Projects: U87MG whole genome sequence [Website] [Paper]

                  Comment

                  • lh3
                    Senior Member
                    • Feb 2008
                    • 686

                    #10
                    Originally posted by Michael.James.Clark View Post
                    It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero-two nucleotide errors.
                    Both bowtie and bwa are able to find all the occurrences of perfect matches and they always know if a perfect alignment is exact repeat or not. They should not produce false positives in theory. In practice, I evaluated them and they do not, at least not as high as 6%. In addition, I am curious why in 3A aligners do not do equally well for perfect hits. Again, maq and soap guarantee to find all perfect hits.

                    What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.
                    I completely agree that Bfast is a great tool for hard reads. The performance of other aligners look reasonable, which implies bfast is very capable. I just think bowtie and bwa should be posed in the right way to avoid confusion.
                    Last edited by lh3; 11-20-2009, 05:02 AM.

                    Comment

                    • nilshomer
                      Nils Homer
                      • Nov 2008
                      • 1283

                      #11
                      Originally posted by NGSfan View Post
                      Congratulations Nils Homer!

                      I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

                      My only question in regards to BFAST performance, is if you have recently compared it to SSAHA2 ? I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

                      Thanks for releasing your software. I will be trying it now.
                      I have not tested SSAHA2 lately, although I think SSAHA2 is built for longer reads if I am correct. Also note that since the release of the paper, BFAST's speed has improved by an order of magnitude (submitted as another paper).

                      Comment

                      • nilshomer
                        Nils Homer
                        • Nov 2008
                        • 1283

                        #12
                        Originally posted by Michael.James.Clark View Post
                        Congratulations, Nils!



                        It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero-two nucleotide errors.

                        What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.

                        Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it.
                        At zero errors and no variants, Heng Li is saying that there should be zero false positives, although there could be false-negatives.

                        Comment

                        • HTS
                          Member
                          • Nov 2009
                          • 24

                          #13
                          Performing true local alignment with BFAST?

                          Hi Nils,

                          It is nice to read your paper, and I hope that my question is not completely out of the context (admittedly I haven't got time to dig into the details of your implementation yet). Basically I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper-limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first). An example is that my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene with their own header, and for a given read, I want to find out how many transcripts this read is mapped to. I can do this straightforwardly with BLAT (by parsing the PSL file since I have gene/transcript information in the header) but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information as opposed to BLAT). If yes, probably I can finally stop using the old favorite BLAT as well. Please feel free to give me any suggestions/comments (also include all people who read this). Thanks a lot!

                          -- Leo

                          Comment

                          • nilshomer
                            Nils Homer
                            • Nov 2008
                            • 1283

                            #14
                            Originally posted by HTS View Post
                            Hi Nils,

                            It is nice to read your paper, and I hope that my question is not completely out of the context (admittedly I haven't got time to dig into the details of your implementation yet). Basically I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper-limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first). An example is that my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene with their own header, and for a given read, I want to find out how many transcripts this read is mapped to. I can do this straightforwardly with BLAT (by parsing the PSL file since I have gene/transcript information in the header) but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information as opposed to BLAT). If yes, probably I can finally stop using the old favorite BLAT as well. Please feel free to give me any suggestions/comments (also include all people who read this). Thanks a lot!

                            -- Leo
                            Absolutely! A similar example is suppose you have prior evidence of RNA splicing between two novel exons in which you only know the approximate boundaries (say within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to the artificial reference to see which join is the most likely.

                            Comment

                            • HTS
                              Member
                              • Nov 2009
                              • 24

                              #15
                              Originally posted by nilshomer View Post
                              Absolutely! A similar example is suppose you have prior evidence of RNA splicing between two novel exons in which you only know the approximate boundaries (say within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to the artificial reference to see which join is the most likely.
                              Thanks a lot for your quick reply, Nils! Could you give me a bit more details about how to do this? I remember in your paper you said when the best alignment of a read map to more than one locations on the reference equally well (in my example would be more than one cDNA sequences, which usually correspond to the same locations on the underlying genome) then you would say the read is not mapped, but I would like BFAST to output all the mappings regardless. I haven't downloaded your tool yet but will do so as soon as I finish the work at hand!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM
                              • seqadmin
                                Investigating the Gut Microbiome Through Diet and Spatial Biology
                                by seqadmin




                                The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                                02-24-2025, 06:31 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-20-2025, 05:03 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-19-2025, 07:27 AM
                              0 responses
                              18 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-03-2025, 01:15 PM
                              0 responses
                              185 views
                              0 reactions
                              Last Post seqadmin  
                              Working...