Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BLASTp parameter -dbsize problems (blastall -z)

    Hello all,
    I'm stuck at a blastp problem. I want to use the parameter -dbsize to incrementally add sequences to a blast output. Same option is available in blastall with the parameter -z.

    I did several tests before I wrote this posting. I'm trying to use this parameter. But I don't get the right e-values.

    Here an example:
    - I have a FASTA file with 33 sequences: original_seqs.fasta
    - I create a sequence database: makeblastdb -in original_seqs.fasta -dbtype prot -out blast_db
    - output: 33 sequences added
    1. Question: Does this means the database size is: 33? I think so.
    - blast the sequences against each other: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_default.tsv -evalue 1e-5

    - first use of dbsize: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_10000.tsv -evalue 1e-5 -dbsize 10000

    - next use of dbsize: blastp -db blast_db -query original_seqs.fasta -outfmt 6 -out all-vs-all_dbSize_33.tsv -evalue 1e-5 -dbsize 33

    2. Question:
    In all three outputs is a different e-value for the same compared sequences. It's clear that the output for dbsize=10000 is different. But why for dbsize=33 (as in the original by default)? Or do I have a false understanding of this blast parameter?

    Later I want to include new sequences. But the evalue should be comparable.
    Last edited by fireLog2; 12-11-2013, 10:05 AM.

  • #2
    The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size.


    From biostars:

    I don't know the answer yet, but for one, the -dbsize parameter in the blast command line is supposed to be the cumulative length of all the sequences in the database, rather than the number of sequences in the database.
    Last edited by rhinoceros; 12-11-2013, 12:02 PM.
    savetherhino.org

    Comment


    • #3
      Thank you rhinoceros for helping.

      But I already read that post and tried the cumulative length. In my example they are 13581 aminoacids in all 33 sequences. And with try and error I found that the dbsize must be around 11809 to get the same e-values.

      It should be working somehow. I read it here:
      [...]
      -z db_size is the number of proteins in the set (see “Incrementally add a genome” below)
      [...]
      Incrementally add a genome:
      Once a large all-versus-all BLAST has been completed, you may need to “incrementally” add a new proteome, without re-running the large all-versus-all BLAST. To do so:
      (1) Prepare the new proteome’s FASTA file as you did for the previous ones (Steps 4 and 5);
      (2) Make a new BLAST database that includes all the previous proteins plus the new FASTA file.;
      (3) Use the -z argument of BLAST to simulate the size of the all database, so that the statistics and scoring is compatible with the original all-v-all BLAST. Use the same -z value as was used in the original BLAST.
      [...]
      Source: Fischer, S. (2011), Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr Protoc Bioinformatics (https://www.ncbi.nlm.nih.gov/pmc/art...report=classic)

      Comment


      • #4
        Make a db with 33 sequences
        Make another db with 32 sequences
        Use the same db_size parameter, e.g. 10000 for blasts against both of these dbs. Are the evalues of the hits the same?
        savetherhino.org

        Comment


        • #5
          Thanks for the idea. Here is the result:
          If I use the same value for db_size for different blast-db, I get same evalues.

          Now is the problem partially solved. Next time, I will set at the beginning a fix value for the db-size, if I will extend it later.

          But how is the db size defined, if you want to know it afterwards?

          Comment


          • #6
            I found this year-old thread quite helpful. I did my own experiments, 22 sequences of 7723 amino acids.
            I counted amino acids this way:
            Code:
            cat goodProteins.fasta | grep -v '>' | tr -d "\n" | wc -c
            Setting -dbsize 7723 gave the exact same e-values as when -dbsize was not specified.
            Setting -dbsize 7713 gave 2 different e-values
            Setting -dbsize 7600 gave 33 different e-values
            So for me, -dbsize was the total number of amino acids.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Non-Coding RNA Research and Technologies
              by seqadmin




              Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

              Nobel Prize for MicroRNA Discovery
              This week,...
              10-07-2024, 08:07 AM
            • seqadmin
              Recent Developments in Metagenomics
              by seqadmin





              Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
              09-23-2024, 06:35 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Today, 06:55 AM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 10-02-2024, 04:51 AM
            0 responses
            105 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 10-01-2024, 07:10 AM
            0 responses
            113 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 09-30-2024, 08:33 AM
            1 response
            117 views
            0 likes
            Last Post EmiTom
            by EmiTom
             
            Working...
            X