  • AMCT
    Junior Member
    • Dec 2012
    • 3

    Odd_values_In_blastx_outputs

    Hi all

    I am getting some odd values in my blast outputs. The evalues in the outputs don't seem to correspond properly with the percent_ids (and number of mismatches). I am using Blast 2.2.25 and I have been submitting the jobs using the following command:

    #Resources
    #$ -pe orte 4

    #Run this command:
    blastx -query xaa -db /home/my_dir/refseq_dbs/refseq_proteins_eukaryotes.fasta -evalue 0.01 -outfmt 10 -max_target_seqs 1 -out out_a.txt -num_threads 4

    I have also tried using the -num_descriptions and the -num_alignments flags instead of the -max_target_seqs flag but this has not fixed the problem.

    Here are a few lines from the BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

    contig4822 281371382 1 80 79 0 326 87 729 808 5.00E-016 86
    contig5762 326676144 3 96 91 1 348 67 691 786 4.00E-033 142
    contig209 190194343 1 71 70 0 357 145 5 75 1.00E-021 92
    contig320 432885829 2 117 115 0 352 2 275 391 4.00E-047 189
    contig3304 348520766 1 156 145 3 466 20 1084 1237 3.00E-024 113

    Am I missing something?

    Thanks in advance!
  • mastal
    Senior Member
    • Mar 2009
    • 666

    #2
    Odd values in blastx outputs

    Are you using the old blast or blastplus?

    Your results do look odd, the third column, which should be pident, seems to have only integer values.

    What are you using to view the csv file with the results?

    • AMCT
      Junior Member
      • Dec 2012
      • 3

      #3
      Hi mastal

      Thanks for your reply. I am using MySQL to handle the results but didn't use the decimal data type for the percent_id or bitscore columns, so the decimal places have been dropped.

      Here are the same lines from the raw blast output:

      contig4822,gi|281371382|ref|NP_001163830.1|,1.25,80,79,0,326,87,729,808,5e-16,86.3
      contig5762,gi|326676144|ref|XP_001334811.4|,3.12,96,91,1,348,67,691,786,4e-33, 142
      contig209,gi|190194343|ref|NP_001121707.1|,1.41,71,70,0,357,145,5,75,1e-21,92.4
      contig320,gi|432885829|ref|XP_004074779.1|,1.71,117,115,0,352,2,275,391,4e-47, 189
      contig3304,gi|348520766|ref|XP_003447898.1|,1.28,156,145,3,466,20,1084,1237,3e-24, 113

      We are running blastx 2.2.25+
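The integer values in the first post are consistent with the decimals being rounded when the rows were loaded into integer columns. A minimal Python sketch of parsing one raw line with the numeric fields kept as floats, assuming the 12-column layout given in the header row of the first post; the helper name `parse_hit` is hypothetical, and the `round` call only mimics what the integer column appears to have done here:

```python
import csv

# Column order taken from the header row quoted in the first post.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_hit(line):
    """Parse one CSV hit line into a dict with numeric fields converted."""
    fields = next(csv.reader([line.strip()], skipinitialspace=True))
    hit = {}
    for name, value in zip(COLUMNS, fields):
        value = value.strip()
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)  # keep the decimal places
        else:
            hit[name] = value
    return hit


raw = "contig320,gi|432885829|ref|XP_004074779.1|,1.71,117,115,0,352,2,275,391,4e-47, 189"
hit = parse_hit(raw)
print(hit["pident"], hit["bitscore"])  # 1.71 189.0
print(round(hit["pident"]))            # 2, the value shown in the first post
```

Storing pident and bitscore in a DECIMAL or DOUBLE column instead of an integer one would keep the fractional part in the database as well.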

      • rhinoceros
        Senior Member
        • Apr 2013
        • 372

        #4
        Shouldn't the output have commas, since it's supposed to be comma-separated values? Also, the subject sequence ids look wrong (assuming your db is a subset of refseq_protein). Finally, why would you want only one match for each contig when, in all likelihood, many contigs ought to have numerous ORFs?

        Edit: your second output looks proper.
        savetherhino.org

        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          Calculating a bit score (from which the e-value is derived) is far more complex than just the percent identity, especially in your case, where you are doing the BLAST search in amino acid sequence space. When aligning amino acid sequences, BLAST uses a scoring matrix with weighted scores (positive and negative) for each possible pair of aligned amino acids. This is unlike alignments in nucleotide space, which simply use one fixed reward for a match and one fixed penalty for a mismatch. There are also penalties for gap opens and extensions, which affect the final score. The number of identical aligned amino acids is just one factor in the bit score calculation, so while there will be a positive correlation between them, there is not a direct linear relationship between % identity and bit score.
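To illustrate the point, here is a rough Python sketch rather than BLAST's actual implementation. The substitution scores in `TOY_SCORES` are a handful of illustrative values, not a complete real matrix, and the gap costs and all function names are assumptions made for this example. The conversion from raw score to bit score, which needs the statistical parameters of the scoring system, is left out. `expected_hits` uses the textbook relation E = m * n * 2^(-S') with plain sequence lengths, so it will not reproduce the e-values BLAST reports, which are based on effective lengths.

```python
# Toy substitution scores for a handful of residue pairs.
# Illustrative values only; a real search uses a full matrix such as BLOSUM62.
TOY_SCORES = {
    ("A", "A"): 4,
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("A", "W"): -3,
}
GAP_OPEN = 11   # assumed cost charged once when a gap starts
GAP_EXTEND = 1  # assumed cost charged for every gap column


def pair_score(a, b):
    """Look up a residue pair in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def raw_score(query, subject):
    """Sum substitution scores and gap penalties over aligned columns ('-' is a gap)."""
    score = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            if not in_gap:
                score -= GAP_OPEN
            score -= GAP_EXTEND
            in_gap = True
        else:
            score += pair_score(q, s)
            in_gap = False
    return score


def percent_identity(query, subject):
    """Identical columns as a percentage of the alignment length."""
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * identical / len(query)


def expected_hits(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for a bit score S' and a search space of m * n."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Both alignments are 75% identical, yet one scores twice as high as the other.
print(percent_identity("WWLA", "WWIA"), raw_score("WWLA", "WWIA"))  # 75.0 28
print(percent_identity("AALA", "AAIA"), raw_score("AALA", "AAIA"))  # 75.0 14
```

Because the e-value falls exponentially as the bit score rises, a difference in score like this one translates into a large difference in e-value even though the percent identity is the same.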

          • AMCT
            Junior Member
            • Dec 2012
            • 3

            #6
            Thanks for the help so far!

            I ran a second blastx with the same parameters using a subset of sequences and found that all of the output values are the same as in the first search except for the percent_ids and the number of mismatches.

            output of first search:
            contig4822,gi|281371382|ref|NP_001163830.1|,1.25,80,79,0,326,87,729,808,5e-16,86.3
            contig5762,gi|326676144|ref|XP_001334811.4|,3.12,96,91,1,348,67,691,786,4e-33, 142
            contig209,gi|190194343|ref|NP_001121707.1|,1.41,71,70,0,357,145,5,75,1e-21,92.4
            contig320,gi|432885829|ref|XP_004074779.1|,1.71,117,115,0,352,2,275,391,4e-47, 189
            contig3304,gi|348520766|ref|XP_003447898.1|,1.28,156,145,3,466,20,1084,1237,3e-24, 113

            output of second search:
            contig4822,gi|281371382|ref|NP_001163830.1|,51.25,80,39,0,323,84,729,808,5e-16,86.3
            contig5762,gi|326676144|ref|XP_001334811.4|,72.92,96,24,1,348,67,691,786,4e-33, 142
            contig209,gi|190194343|ref|NP_001121707.1|,61.97,71,27,0,357,145,5,75,1e-21,92.4
            contig320,gi|432885829|ref|XP_004074779.1|,80.34,117,23,0,352,2,275,391,4e-47, 189
            contig3304,gi|348520766|ref|XP_003447898.1|,34.62,156,93,3,466,20,1084,1237,3e-24, 113

            I have looked through the results from my other blastx searches and there also appear to be cases where the evalues and percent_ids don't correspond properly. All of the blastx searches so far have been big (thousands of input sequences against big databases), so I have been running the searches on multiple threads/computers at the same time, usually in batches of 3000-5000 sequences per input file (this is why we only want the best hit for each query for now!). Perhaps this is a scale/computing problem on our end...
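With thousands of hits per file, a comparison like this is easier to do programmatically than by eye. A small Python sketch, assuming both runs use the same 12-column CSV layout and that a query/subject pair identifies a row (with several HSPs per pair, later rows would overwrite earlier ones); the function names are made up for this example. Two of the rows quoted above are used as input:

```python
import csv

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def load_hits(text):
    """Index CSV hit lines by (qseqid, sseqid); values are kept as strings."""
    lines = (line.strip() for line in text.splitlines() if line.strip())
    hits = {}
    for fields in csv.reader(lines, skipinitialspace=True):
        row = dict(zip(COLUMNS, (f.strip() for f in fields)))
        hits[(row["qseqid"], row["sseqid"])] = row
    return hits


def changed_columns(first, second):
    """For hits present in both runs, list the columns whose values differ."""
    report = {}
    for key in first.keys() & second.keys():
        diff = [c for c in COLUMNS if first[key][c] != second[key][c]]
        if diff:
            report[key] = diff
    return report


first = load_hits("""
contig4822,gi|281371382|ref|NP_001163830.1|,1.25,80,79,0,326,87,729,808,5e-16,86.3
contig209,gi|190194343|ref|NP_001121707.1|,1.41,71,70,0,357,145,5,75,1e-21,92.4
""")
second = load_hits("""
contig4822,gi|281371382|ref|NP_001163830.1|,51.25,80,39,0,323,84,729,808,5e-16,86.3
contig209,gi|190194343|ref|NP_001121707.1|,61.97,71,27,0,357,145,5,75,1e-21,92.4
""")

for (qseqid, sseqid), cols in sorted(changed_columns(first, second).items()):
    print(qseqid, cols)
# contig209 ['pident', 'mismatch']
# contig4822 ['pident', 'mismatch', 'qstart', 'qend']
```

For the rows as quoted, the check also flags qstart and qend for contig4822 (326/87 in the first run, 323/84 in the second), in addition to pident and mismatch.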
