  • Odd_values_In_blastx_outputs

    Hi all

    I am getting some odd values in my blast outputs. The evalues in the outputs don't seem to correspond properly with the percent_ids (and number of mismatches). I am using Blast 2.2.25 and I have been submitting the jobs using the following command:

    #Resources
    #$ -pe orte 4

    #Run this command:
    blastx -query xaa -db /home/my_dir/refseq_dbs/refseq_proteins_eukaryotes.fasta -evalue 0.01 -outfmt 10 -max_target_seqs 1 -out out_a.txt -num_threads 4

    I have also tried using the -num_descriptions and the -num_alignments flags instead of the -max_target_seqs flag but this has not fixed the problem.

    Here are a few lines from the blast output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

    contig4822 281371382 1 80 79 0 326 87 729 808 5.00E-016 86
    contig5762 326676144 3 96 91 1 348 67 691 786 4.00E-033 142
    contig209 190194343 1 71 70 0 357 145 5 75 1.00E-021 92
    contig320 432885829 2 117 115 0 352 2 275 391 4.00E-047 189
    contig3304 348520766 1 156 145 3 466 20 1084 1237 3.00E-024 113

    Am I missing something?

    Thanks in advance!

  • #2
    Odd values in blastx outputs

    Are you using the old blast or blastplus?

    Your results do look odd: the third column, which should be pident, seems to have only integer values.

    What are you using to view the csv file with the results?



    • #3
      Hi mastal

      Thanks for your reply. I am using MySQL to handle the results but didn't use the decimal data type for the percent_id or bitscore columns, so the decimal places were dropped.

      Here are the same lines from the raw blast output:

      contig4822,gi|281371382|ref|NP_001163830.1|,1.25,80,79,0,326,87,729,808,5e-16,86.3
      contig5762,gi|326676144|ref|XP_001334811.4|,3.12,96,91,1,348,67,691,786,4e-33, 142
      contig209,gi|190194343|ref|NP_001121707.1|,1.41,71,70,0,357,145,5,75,1e-21,92.4
      contig320,gi|432885829|ref|XP_004074779.1|,1.71,117,115,0,352,2,275,391,4e-47, 189
      contig3304,gi|348520766|ref|XP_003447898.1|,1.28,156,145,3,466,20,1084,1237,3e-24, 113

      We are running blastx 2.2.25+
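For anyone loading this kind of output into a database or script, here is a minimal Python sketch of reading `-outfmt 10` lines while keeping pident, evalue and bitscore as floating-point values rather than integers. The column order is the one listed in the first post; the function name `parse_outfmt10` is hypothetical and not part of BLAST.

```python
import csv
import io

# Column order of the 12 fields, as listed in the first post.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_outfmt10(handle):
    """Yield one dict per hit, converting numeric fields to int or float."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        # Some fields carry a leading space (e.g. " 142"), so strip them.
        hit = dict(zip(COLUMNS, (field.strip() for field in row)))
        for name in INT_COLUMNS:
            hit[name] = int(hit[name])
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        yield hit


# Two lines copied from the raw output above.
raw = """contig4822,gi|281371382|ref|NP_001163830.1|,1.25,80,79,0,326,87,729,808,5e-16,86.3
contig5762,gi|326676144|ref|XP_001334811.4|,3.12,96,91,1,348,67,691,786,4e-33, 142
"""

hits = list(parse_outfmt10(io.StringIO(raw)))
for hit in hits:
    print(hit["qseqid"], hit["pident"], hit["evalue"], hit["bitscore"])
```

The same idea applies on the database side: storing pident and bitscore in a decimal or floating-point column avoids the truncation seen in the first post.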



      • #4
        Shouldn't the output have commas since it's supposed to be comma-separated values? Also, the subject sequence ids look wrong (assuming your db is a subset of refseq_protein). Finally, why would you want to have only one match for each contig, when in all likelihood, many contigs ought to have numerous ORFs?

        Edit: your second output looks proper.



        • #5
          Calculating a bit score (from which the e-value is derived) is far more complex than just computing the percent identity, especially in your case, where you are doing the BLAST search in amino acid sequence space. When aligning amino acid sequences, BLAST uses a scoring matrix with weighted scores (positive and negative) for each possible pair of aligned amino acids. This is unlike alignments in nucleotide space, which simply use one fixed reward for every match and one fixed penalty for every mismatch. There are also penalties for gap opens and extensions, which affect the final score. The number of identical aligned amino acids is just one factor in the bit score calculation, so while there will be a positive correlation between them, there is no direct linear relationship between % identity and bit score.
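As a rough illustration of why the e-value follows the bit score rather than the percent identity, here is a small Python sketch of the standard relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the effective query and database lengths. The two length values below are hypothetical placeholders, not the ones BLAST used for the searches in this thread, so the printed e-values are not expected to reproduce the reported ones.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical effective lengths, chosen only for illustration.
QUERY_LEN = 100               # residues in the translated query
DB_LEN = 10_000_000_000       # total residues in the database

# Bit scores taken from the raw output earlier in the thread.
for bits in (86.3, 92.4, 113.0, 142.0, 189.0):
    e_value = expected_hits(bits, QUERY_LEN, DB_LEN)
    print(f"bit score {bits:6.1f} -> E = {e_value:.1e}")
```

Percent identity does not appear anywhere in this formula: two hits with the same bit score against the same database get the same e-value, whatever their pident columns say, and each additional bit halves E.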



          • #6
            Thanks for the help so far!

            I ran a second blastx with the same parameters using a subset of sequences and found that all of the output values are the same as in the first search except for the percent_ids and the number of mismatches.

            output of first search:
            contig4822,gi|281371382|ref|NP_001163830.1|,1.25,80,79,0,326,87,729,808,5e-16,86.3
            contig5762,gi|326676144|ref|XP_001334811.4|,3.12,96,91,1,348,67,691,786,4e-33, 142
            contig209,gi|190194343|ref|NP_001121707.1|,1.41,71,70,0,357,145,5,75,1e-21,92.4
            contig320,gi|432885829|ref|XP_004074779.1|,1.71,117,115,0,352,2,275,391,4e-47, 189
            contig3304,gi|348520766|ref|XP_003447898.1|,1.28,156,145,3,466,20,1084,1237,3e-24, 113

            output of second search:
            contig4822,gi|281371382|ref|NP_001163830.1|,51.25,80,39,0,323,84,729,808,5e-16,86.3
            contig5762,gi|326676144|ref|XP_001334811.4|,72.92,96,24,1,348,67,691,786,4e-33, 142
            contig209,gi|190194343|ref|NP_001121707.1|,61.97,71,27,0,357,145,5,75,1e-21,92.4
            contig320,gi|432885829|ref|XP_004074779.1|,80.34,117,23,0,352,2,275,391,4e-47, 189
            contig3304,gi|348520766|ref|XP_003447898.1|,34.62,156,93,3,466,20,1084,1237,3e-24, 113

            I have looked through the results from my other blastx searches and there also appear to be cases where the evalues and percent_ids don't correspond properly. All of the blastx searches so far have been big (thousands of input sequences against big databases), so I have been running the searches on multiple threads/computers at the same time, usually in batches of 3000-5000 sequences per input file (this is why we only want the best hit for each query for now!). Perhaps this is a scale/computing problem on our end...
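One way to compare the two searches is to convert each pident back into a count of identical positions. The Python sketch below assumes that pident is 100 * identical / length and that every alignment column is either an identity, a mismatch, or a gap; the helper name `identity_counts` is hypothetical.

```python
def identity_counts(pident, length, mismatch):
    """Return (identical positions, gap columns) implied by a tabular hit.

    Assumes pident = 100 * identical / length and
    length = identical + mismatch + gap columns.
    """
    identical = round(pident * length / 100.0)
    gap_columns = length - identical - mismatch
    return identical, gap_columns


# (qseqid, pident, length, mismatch) copied from the two outputs above.
first_search = [
    ("contig4822", 1.25, 80, 79),
    ("contig5762", 3.12, 96, 91),
    ("contig209", 1.41, 71, 70),
    ("contig320", 1.71, 117, 115),
    ("contig3304", 1.28, 156, 145),
]
second_search = [
    ("contig4822", 51.25, 80, 39),
    ("contig5762", 72.92, 96, 24),
    ("contig209", 61.97, 71, 27),
    ("contig320", 80.34, 117, 23),
    ("contig3304", 34.62, 156, 93),
]

results = []
for (name, p1, len1, mm1), (_, p2, len2, mm2) in zip(first_search, second_search):
    id1, gaps1 = identity_counts(p1, len1, mm1)
    id2, gaps2 = identity_counts(p2, len2, mm2)
    results.append((name, id1, gaps1, id2, gaps2))
    print(f"{name}: first search {id1} identical / {gaps1} gap columns, "
          f"second search {id2} identical / {gaps2} gap columns")
```

Under those assumptions, the first search implies only 1 to 3 identical positions per alignment while the second implies 41 to 94, with the same number of gap columns in both. Each row is internally consistent, and only the split between identities and mismatches differs between the two searches.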

