Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dsw0002
    Junior Member
    • Jul 2012
    • 2

    Comparing e-values from two blast reports

    Hey guys,

    I'm having some trouble writing a script to compare the e-values from two separate blast reports.

    I'm trying to see where my dataset has contamination. I'm blasting a single library against a database with known contaminants and a dataset with sequences representing "good" data. I'm then comparing, with this script hopefully, the e-values for each sequence's top hits for each database to see if that sequence has contamination or not(lower e-value for the top hit in the contamination blast means contamination for that sequence).

    My first problem occured when shell arithmetic couldn't compare the scientific notation (#e-#) that blast reports use. I thought perhaps if I changed 'e-' to '.' and then cut each side I could compare that way but that had it's problems as well. Apparently shell variables can't hold numbers like 09, so when I split a number like 3.09 or 0.0, I receive an error like this:

    /compare_evalue.sh: line 7: 09: value too great for base (error token is "09")
    ./compare_evalue.sh: line 7: 0.0: syntax error: invalid arithmetic operator (error token is ".0")

    I was curious if anyone had an elegant solution to comparing values represented as scientific notation in a shell script. Below is my code so far, my problem occurs in the first 15 lines, but I posted my whole script so you can see what I'm going for.

    Code:
    #Usage: ./compare_evalues.txt clean_report contaminated_report #original_sequence_headers(just the names of the sequences)
    
    cleanFile=`awk '{print $2}' $1 | sed 's/e\-/\./'` #Get the e-value column 
    contamFile=`awk '{print $2}' $2 | sed 's/e\-/\./'` #and substitute e- for .
    for i in ${inputFile1[@]}
    do
    
    #Split the evalue into value after decimal point(represents the scientific notation's exponent) and value before decimal point.
    decimalPartClean=`cut -d. -f2 ${cleanFile[$i]}` 
    wholePartClean=`cut -d. -f1 ${cleanFile[$i]}`
    decimalPartContam=`cut -d. -f2 ${contamFile[$i]}`
    wholePartContam=`cut -d. -f1 ${contamFile[$i]}`
    
    #If the clean e-value's scientific notations exponent is larger(e-value is #ultimately smaller) than the contaminants, move the original sequence's #name to clean.txt.
    if [ $decimalClean -gt $decimalContam ]
    then
    sed -n '$i' $3 >> clean.txt
    
    #If the exponents are equal, look at the number in front of the decimal to #determine which e-value is smaller.
    else if [ $decimalClean -eq $decimalContam ]
    then
           if [ $wholePart1 -gt $wholePart2 ]
           then
           sed -n '$i' $3 >> contaminated.txt
           else
           sed -n '$i' $3 >> clean.txt
           fi
    
    #Else here represents if the second is greater than the first.
    else
    sed -n '$i' $3 >> contaminated.txt
    fi
    fi
    done
    So if given two blast reports that have been translated into tables where the second column contains evalues in the form #e-# and the e-values are:

    clean_report: 3e-144 1e-31 9e-32 8e-09 0.0 2e-29 3e-21 3e-65 6e-38 9e-51
    contaminated_report: 3e-29 5e-26 1e-16 3e-102 2e-54 7e-83 4e-79 5e-67 8e-38 2e-83

    The first, second, third, fifth, and ninth sequence headers would go to clean.txt and the fourth, sixth, seventh, eighth, and tenth would go to contaminated.txt.
  • SES
    Senior Member
    • Mar 2010
    • 275

    #2
    I would ask why the e-value is so important to you. Why not compare the bit score and avoid coding a long and complicated shell script? Also, you should really consider BioPerl or BioPython for analyzing BLAST reports, even for simple tasks. Solving problems like this with shell code can be a fun challenge, but it does not make a lot of sense when you could write a couple of lines of Bio* code that is actually readable (and maintainable) in the future.
    Last edited by SES; 07-10-2012, 04:33 AM.

    Comment

    • Jean
      Member
      • Nov 2008
      • 37

      #3
      You probably don't want to compare e-values. Since the e-value depends on the size of the database you can't directly compare between different datasets.

      Comment

      • dsw0002
        Junior Member
        • Jul 2012
        • 2

        #4
        Thanks for the replies. I'll recommend we look at bit scores instead of e-values.

        Comment

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          Originally posted by Jean View Post
          You probably don't want to compare e-values. Since the e-value depends on the size of the database you can't directly compare between different datasets.
          +1

          Very important point

          Comment

          • SES
            Senior Member
            • Mar 2010
            • 275

            #6
            Originally posted by Jean View Post
            You probably don't want to compare e-values. Since the e-value depends on the size of the database you can't directly compare between different datasets.
            If a different database was used for the two BLAST reports then they would not be directly comparable on any level. I think you would have to use the same database to compare hits to the same gene or protein, or whatever.

            Comment

            Latest Articles

            Collapse

            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM
            • SEQadmin2
              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
              by SEQadmin2


              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
              ...
              06-02-2026, 10:05 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, Yesterday, 11:10 AM
            0 responses
            7 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            42 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-09-2026, 11:58 AM
            0 responses
            103 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-05-2026, 10:09 AM
            0 responses
            125 views
            0 reactions
            Last Post SEQadmin2  
            Working...