
  • An important bug in BLAST+ 2.2.26+

    Hi everyone,

    TL;DR: Always use the -max_target_seqs flag when using BLAST unless you are using default verbose output, otherwise you might not get all the hits you should.


    Example:
    Code:
    $ blastn -perc_identity 97 -query my_query.fa -db nt -out result.txt
    $ blastn -perc_identity 97 -query my_query.fa -db nt -outfmt 6 -out result.csv
    $ grep '>' result.txt |wc -l
    186
    $ wc -l result.csv
    143 result.csv
    So you get different numbers of hits from the default text output and the tabular output (-outfmt 6, saved here as result.csv). I was able to 'rescue' the missing hits by adding the -max_target_seqs flag:
    Code:
    $ blastn -perc_identity 97 -max_target_seqs 500 -query my_query.fa -db nt -outfmt 6 -out result.csv2
    $ wc -l result.csv2
    186 result.csv2
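    One caveat when comparing the counts above: a tabular report has one line per alignment (HSP), so a subject with several HSPs appears more than once, while the text report has one '>' line per subject. A rough sketch of a stricter comparison follows. The function names are made up for illustration, and it assumes the default -outfmt 6 column order, where columns 1 and 2 are the query and subject IDs.

```shell
# Sketch only: helper names are hypothetical, not part of BLAST+.

# Count distinct query/subject pairs in a tabular (-outfmt 6) report.
# Assumes the default column layout: column 1 = qseqid, column 2 = sseqid.
count_tabular_subjects() {
    cut -f1,2 "$1" | sort -u | wc -l | tr -d ' '
}

# Count subject entries in the default text report, using the same
# '>' pattern as the grep in the post above.
count_pairwise_subjects() {
    grep -c '>' "$1"
}

# Usage, with the file names from the post:
#   count_pairwise_subjects result.txt
#   count_tabular_subjects result.csv2
```

    If the two numbers still differ after setting -max_target_seqs, the difference is not explained by repeated HSP lines.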


    I confirmed this on the blast-help email helpdesk at NCBI. Their response:
    For output formats >4, -max_target_seqs should be explicitly set. In the next release, 2.2.27, you should get a commandline message to that effect
    This is true whether or not you use the -perc_identity flag as I did.
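    For scripts that call BLAST with tabular output, one way to follow that advice is to route every call through a small wrapper that always passes -max_target_seqs. This is only a sketch: the function name blastn_tabular and the MAX_TARGET_SEQS variable are hypothetical, and the fallback of 500 is simply the value from the example above, not a recommendation.

```shell
# Sketch only: blastn_tabular and MAX_TARGET_SEQS are made-up names.
# Always sets -max_target_seqs explicitly for tabular (-outfmt 6) runs.
# The limit comes from $MAX_TARGET_SEQS, falling back to 500 (the value
# used in the example in this post).
blastn_tabular() {
    blastn -max_target_seqs "${MAX_TARGET_SEQS:-500}" -outfmt 6 "$@"
}

# Usage, with the file names from the post:
#   blastn_tabular -perc_identity 97 -query my_query.fa -db nt -out result.csv2
#   MAX_TARGET_SEQS=2000 blastn_tabular -query my_query.fa -db nt -out big.tsv
```

    Callers should not pass -outfmt or -max_target_seqs themselves when using the wrapper, since it already supplies both.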

    Hope that helps someone, and doesn't affect too many different pieces of software and science.
    ben

    --
    Tyson Laboratory, Australian Centre for Ecogenomics

  • #2
    Really? That's a pretty serious one. Thanks! I'll keep it in mind.



    • #3
      Wow - that sounds rather nasty. But what shocks me is the reply from the NCBI - are they really saying that in 2.2.27+ they won't fix it, they'll just add a warning!?



      • #4
        Originally posted by maubp
        are they really saying that in 2.2.27+ they won't fix it, they'll just add a warning!?
        Potentially, we'll see. It might instead mean that it will become a required parameter, rather than just a warning. That doesn't seem perfect either though because software that runs blast under the hood will now start failing.



        • #5
          I have to say I don't really see this as a bug at all. Different reporting modes naturally take different parameter settings. Defaults for those parameters are set at a value that is reasonable and appropriate for the reporting mode chosen. It is up to the user to understand how the chosen parameters will affect the output. I will grant there may be some lack of clarity in the current documentation of the -max_target_seqs setting, but the program is working as designed, so you can't call that a "bug" and it doesn't need fixing.



          • #6
            You say undocumented 'feature', I say bug. And a nasty one as it causes silent data loss.



            • #7
              Hi,

              Do you have any idea how the unreported hits behave? Do they share some common pattern that causes them to be dropped, or is it just random?

              Thanks for the info!
