Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BLASTp parameter -dbsize problems (blastall -z)

    Hello all,
    I'm stuck at a blastp problem. I want to use the parameter -dbsize to incrementally add sequences to a blast output. Same option is available in blastall with the parameter -z.

    I did several tests before I wrote this posting. I'm trying to use this parameter. But I don't get the right e-values.

    Here an example:
    - I have a FASTA file with 33 sequences: original_seqs.fasta
    - I create a sequence database: makeblastdb -in original_seqs.fasta -dbtype prot -out blast_db
    - output: 33 sequences added
    1. Question: Does this means the database size is: 33? I think so.
    - blast the sequences against each other: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_default.tsv -evalue 1e-5

    - first use of dbsize: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_10000.tsv -evalue 1e-5 -dbsize 10000

    - next use of dbsize: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_33.tsv -evalue 1e-5 -dbsize 33

    2. Question:
    In all three outputs is a different e-value for the same compared sequences. It's clear that the output for dbsize=10000 is different. But why for dbsize=33 (as in the original by default)? Or do I have a false understanding of this blast parameter?

    Later I want to include new sequences. But the evalue should be comparable.
    Last edited by fireLog2; 12-11-2013, 10:05 AM.

  • #2
    The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size.


    From biostars:

    I don't know the answer yet, but for one, the -dbsize parameter in the blast command line is supposed to be the cumulative length of all the sequences in the database, rather than the number of sequences in the database.
    Last edited by rhinoceros; 12-11-2013, 12:02 PM.
    savetherhino.org

    Comment


    • #3
      Thank you rhinoceros for helping.

      But I already read that post and tried the cumulative length. In my example they are 13581 aminoacids in all 33 sequences. And with try and error I found that the dbsize must be around 11809 to get the same e-values.

      It should be working somehow. I read it here:
      [...]
      -z db_size is the number of proteins in the set (see “Incrementally add a genome” below)
      [...]
      Incrementally add a genome:
      Once a large all-versus-all BLAST has been completed, you may need to “incrementally” add a new proteome, without re-running the large all-versus-all BLAST. To do so:
      (1) Prepare the new proteome’s FASTA file as you did for the previous ones (Steps 4 and 5);
      (2) Make a new BLAST database that includes all the previous proteins plus the new FASTA file.;
      (3) Use the -z argument of BLAST to simulate the size of the all database, so that the statistics and scoring is compatible with the original all-v-all BLAST. Use the same -z value as was used in the original BLAST.
      [...]
      Source: Fischer, S. (2011), Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr Protoc Bioinformatics (https://www.ncbi.nlm.nih.gov/pmc/art...report=classic)

      Comment


      • #4
        Make a db with 33 sequences
        Make another db with 32 sequences
        Use the same db_size parameter, e.g. 10000 for blasts against both of these dbs. Are the evalues of the hits the same?
        savetherhino.org

        Comment


        • #5
          Thanks for the idea. Here is the result:
          If I use the same value for db_size for different blast-db, I get same evalues.

          Now is the problem partially solved. Next time, I will set at the beginning a fix value for the db-size, if I will extend it later.

          But how is the db size defined, if you want to know it afterwards?

          Comment


          • #6
            I found this year-old thread quite helpful. I did my own experiments, 22 sequences of 7723 amino acids.
            I counted amino acids this way:
            Code:
            cat goodProteins.fasta | grep -v '>' | tr -d "\n" | wc -c
            Setting -dbsize 7723 gave the exact same e-values as when -dbsize was not specified.
            Setting -dbsize 7713 gave 2 different e-values
            Setting -dbsize 7600 gave 33 different e-values
            So for me, -dbsize was the total number of amino acids.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            10 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            9 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            51 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            67 views
            0 likes
            Last Post seqadmin  
            Working...
            X