Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparing e-values from two blast reports

    Hey guys,

    I'm having some trouble writing a script to compare the e-values from two separate blast reports.

    I'm trying to see where my dataset has contamination. I'm blasting a single library against a database with known contaminants and a dataset with sequences representing "good" data. I'm then comparing, with this script hopefully, the e-values for each sequence's top hits for each database to see if that sequence has contamination or not(lower e-value for the top hit in the contamination blast means contamination for that sequence).

    My first problem occured when shell arithmetic couldn't compare the scientific notation (#e-#) that blast reports use. I thought perhaps if I changed 'e-' to '.' and then cut each side I could compare that way but that had it's problems as well. Apparently shell variables can't hold numbers like 09, so when I split a number like 3.09 or 0.0, I receive an error like this:

    /compare_evalue.sh: line 7: 09: value too great for base (error token is "09")
    ./compare_evalue.sh: line 7: 0.0: syntax error: invalid arithmetic operator (error token is ".0")

    I was curious if anyone had an elegant solution to comparing values represented as scientific notation in a shell script. Below is my code so far, my problem occurs in the first 15 lines, but I posted my whole script so you can see what I'm going for.

    Code:
    #Usage: ./compare_evalues.txt clean_report contaminated_report #original_sequence_headers(just the names of the sequences)
    
    cleanFile=`awk '{print $2}' $1 | sed 's/e\-/\./'` #Get the e-value column 
    contamFile=`awk '{print $2}' $2 | sed 's/e\-/\./'` #and substitute e- for .
    for i in ${inputFile1[@]}
    do
    
    #Split the evalue into value after decimal point(represents the scientific notation's exponent) and value before decimal point.
    decimalPartClean=`cut -d. -f2 ${cleanFile[$i]}` 
    wholePartClean=`cut -d. -f1 ${cleanFile[$i]}`
    decimalPartContam=`cut -d. -f2 ${contamFile[$i]}`
    wholePartContam=`cut -d. -f1 ${contamFile[$i]}`
    
    #If the clean e-value's scientific notations exponent is larger(e-value is #ultimately smaller) than the contaminants, move the original sequence's #name to clean.txt.
    if [ $decimalClean -gt $decimalContam ]
    then
    sed -n '$i' $3 >> clean.txt
    
    #If the exponents are equal, look at the number in front of the decimal to #determine which e-value is smaller.
    else if [ $decimalClean -eq $decimalContam ]
    then
           if [ $wholePart1 -gt $wholePart2 ]
           then
           sed -n '$i' $3 >> contaminated.txt
           else
           sed -n '$i' $3 >> clean.txt
           fi
    
    #Else here represents if the second is greater than the first.
    else
    sed -n '$i' $3 >> contaminated.txt
    fi
    fi
    done
    So if given two blast reports that have been translated into tables where the second column contains evalues in the form #e-# and the e-values are:

    clean_report: 3e-144 1e-31 9e-32 8e-09 0.0 2e-29 3e-21 3e-65 6e-38 9e-51
    contaminated_report: 3e-29 5e-26 1e-16 3e-102 2e-54 7e-83 4e-79 5e-67 8e-38 2e-83

    The first, second, third, fifth, and ninth sequence headers would go to clean.txt and the fourth, sixth, seventh, eighth, and tenth would go to contaminated.txt.

  • #2
    I would ask why the e-value is so important to you. Why not compare the bit score and avoid coding a long and complicated shell script? Also, you should really consider BioPerl or BioPython for analyzing BLAST reports, even for simple tasks. Solving problems like this with shell code can be a fun challenge, but it does not make a lot of sense when you could write a couple of lines of Bio* code that is actually readable (and maintainable) in the future.
    Last edited by SES; 07-10-2012, 04:33 AM.

    Comment


    • #3
      You probably don't want to compare e-values. Since the e-value depends on the size of the database you can't directly compare between different datasets.

      Comment


      • #4
        Thanks for the replies. I'll recommend we look at bit scores instead of e-values.

        Comment


        • #5
          Originally posted by Jean View Post
          You probably don't want to compare e-values. Since the e-value depends on the size of the database you can't directly compare between different datasets.
          +1

          Very important point

          Comment


          • #6
            Originally posted by Jean View Post
            You probably don't want to compare e-values. Since the e-value depends on the size of the database you can't directly compare between different datasets.
            If a different database was used for the two BLAST reports then they would not be directly comparable on any level. I think you would have to use the same database to compare hits to the same gene or protein, or whatever.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin




              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
              04-22-2024, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-25-2024, 11:49 AM
            0 responses
            19 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-24-2024, 08:47 AM
            0 responses
            17 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            62 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Working...
            X