No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • BLASTp parameter -dbsize problems (blastall -z)

    Hello all,
    I'm stuck at a blastp problem. I want to use the parameter -dbsize to incrementally add sequences to a blast output. Same option is available in blastall with the parameter -z.

    I did several tests before I wrote this posting. I'm trying to use this parameter. But I don't get the right e-values.

    Here an example:
    - I have a FASTA file with 33 sequences: original_seqs.fasta
    - I create a sequence database: makeblastdb -in original_seqs.fasta -dbtype prot -out blast_db
    - output: 33 sequences added
    1. Question: Does this means the database size is: 33? I think so.
    - blast the sequences against each other: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_default.tsv -evalue 1e-5

    - first use of dbsize: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_10000.tsv -evalue 1e-5 -dbsize 10000

    - next use of dbsize: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_33.tsv -evalue 1e-5 -dbsize 33

    2. Question:
    In all three outputs is a different e-value for the same compared sequences. It's clear that the output for dbsize=10000 is different. But why for dbsize=33 (as in the original by default)? Or do I have a false understanding of this blast parameter?

    Later I want to include new sequences. But the evalue should be comparable.
    Last edited by fireLog2; 12-11-2013, 10:05 AM.

  • #2
    The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size.

    From biostars:

    I don't know the answer yet, but for one, the -dbsize parameter in the blast command line is supposed to be the cumulative length of all the sequences in the database, rather than the number of sequences in the database.
    Last edited by rhinoceros; 12-11-2013, 12:02 PM.


    • #3
      Thank you rhinoceros for helping.

      But I already read that post and tried the cumulative length. In my example they are 13581 aminoacids in all 33 sequences. And with try and error I found that the dbsize must be around 11809 to get the same e-values.

      It should be working somehow. I read it here:
      -z db_size is the number of proteins in the set (see “Incrementally add a genome” below)
      Incrementally add a genome:
      Once a large all-versus-all BLAST has been completed, you may need to “incrementally” add a new proteome, without re-running the large all-versus-all BLAST. To do so:
      (1) Prepare the new proteome’s FASTA file as you did for the previous ones (Steps 4 and 5);
      (2) Make a new BLAST database that includes all the previous proteins plus the new FASTA file.;
      (3) Use the -z argument of BLAST to simulate the size of the all database, so that the statistics and scoring is compatible with the original all-v-all BLAST. Use the same -z value as was used in the original BLAST.
      Source: Fischer, S. (2011), Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr Protoc Bioinformatics (


      • #4
        Make a db with 33 sequences
        Make another db with 32 sequences
        Use the same db_size parameter, e.g. 10000 for blasts against both of these dbs. Are the evalues of the hits the same?


        • #5
          Thanks for the idea. Here is the result:
          If I use the same value for db_size for different blast-db, I get same evalues.

          Now is the problem partially solved. Next time, I will set at the beginning a fix value for the db-size, if I will extend it later.

          But how is the db size defined, if you want to know it afterwards?


          • #6
            I found this year-old thread quite helpful. I did my own experiments, 22 sequences of 7723 amino acids.
            I counted amino acids this way:
            cat goodProteins.fasta | grep -v '>' | tr -d "\n" | wc -c
            Setting -dbsize 7723 gave the exact same e-values as when -dbsize was not specified.
            Setting -dbsize 7713 gave 2 different e-values
            Setting -dbsize 7600 gave 33 different e-values
            So for me, -dbsize was the total number of amino acids.


            Latest Articles


            • seqadmin
              Advanced Tools Transforming the Field of Cytogenomics
              by seqadmin

              At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
              09-26-2023, 06:26 AM
            • seqadmin
              How RNA-Seq is Transforming Cancer Studies
              by seqadmin

              Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
              09-07-2023, 11:15 PM





            Topics Statistics Last Post
            Started by seqadmin, 09-29-2023, 09:38 AM
            0 responses
            Last Post seqadmin  
            Started by seqadmin, 09-27-2023, 06:57 AM
            0 responses
            Last Post seqadmin  
            Started by seqadmin, 09-26-2023, 07:53 AM
            1 response
            Last Post seed_phrase_metal_storage  
            Started by seqadmin, 09-25-2023, 07:42 AM
            0 responses
            Last Post seqadmin