Hey guys,
I'm having some trouble writing a script to compare the e-values from two separate blast reports.
I'm trying to see where my dataset has contamination. I'm blasting a single library against a database with known contaminants and a dataset with sequences representing "good" data. I'm then comparing, with this script hopefully, the e-values for each sequence's top hits for each database to see if that sequence has contamination or not(lower e-value for the top hit in the contamination blast means contamination for that sequence).
My first problem occured when shell arithmetic couldn't compare the scientific notation (#e-#) that blast reports use. I thought perhaps if I changed 'e-' to '.' and then cut each side I could compare that way but that had it's problems as well. Apparently shell variables can't hold numbers like 09, so when I split a number like 3.09 or 0.0, I receive an error like this:
/compare_evalue.sh: line 7: 09: value too great for base (error token is "09")
./compare_evalue.sh: line 7: 0.0: syntax error: invalid arithmetic operator (error token is ".0")
I was curious if anyone had an elegant solution to comparing values represented as scientific notation in a shell script. Below is my code so far, my problem occurs in the first 15 lines, but I posted my whole script so you can see what I'm going for.
So if given two blast reports that have been translated into tables where the second column contains evalues in the form #e-# and the e-values are:
clean_report: 3e-144 1e-31 9e-32 8e-09 0.0 2e-29 3e-21 3e-65 6e-38 9e-51
contaminated_report: 3e-29 5e-26 1e-16 3e-102 2e-54 7e-83 4e-79 5e-67 8e-38 2e-83
The first, second, third, fifth, and ninth sequence headers would go to clean.txt and the fourth, sixth, seventh, eighth, and tenth would go to contaminated.txt.
I'm having some trouble writing a script to compare the e-values from two separate blast reports.
I'm trying to see where my dataset has contamination. I'm blasting a single library against a database with known contaminants and a dataset with sequences representing "good" data. I'm then comparing, with this script hopefully, the e-values for each sequence's top hits for each database to see if that sequence has contamination or not(lower e-value for the top hit in the contamination blast means contamination for that sequence).
My first problem occured when shell arithmetic couldn't compare the scientific notation (#e-#) that blast reports use. I thought perhaps if I changed 'e-' to '.' and then cut each side I could compare that way but that had it's problems as well. Apparently shell variables can't hold numbers like 09, so when I split a number like 3.09 or 0.0, I receive an error like this:
/compare_evalue.sh: line 7: 09: value too great for base (error token is "09")
./compare_evalue.sh: line 7: 0.0: syntax error: invalid arithmetic operator (error token is ".0")
I was curious if anyone had an elegant solution to comparing values represented as scientific notation in a shell script. Below is my code so far, my problem occurs in the first 15 lines, but I posted my whole script so you can see what I'm going for.
Code:
#Usage: ./compare_evalues.txt clean_report contaminated_report #original_sequence_headers(just the names of the sequences) cleanFile=`awk '{print $2}' $1 | sed 's/e\-/\./'` #Get the e-value column contamFile=`awk '{print $2}' $2 | sed 's/e\-/\./'` #and substitute e- for . for i in ${inputFile1[@]} do #Split the evalue into value after decimal point(represents the scientific notation's exponent) and value before decimal point. decimalPartClean=`cut -d. -f2 ${cleanFile[$i]}` wholePartClean=`cut -d. -f1 ${cleanFile[$i]}` decimalPartContam=`cut -d. -f2 ${contamFile[$i]}` wholePartContam=`cut -d. -f1 ${contamFile[$i]}` #If the clean e-value's scientific notations exponent is larger(e-value is #ultimately smaller) than the contaminants, move the original sequence's #name to clean.txt. if [ $decimalClean -gt $decimalContam ] then sed -n '$i' $3 >> clean.txt #If the exponents are equal, look at the number in front of the decimal to #determine which e-value is smaller. else if [ $decimalClean -eq $decimalContam ] then if [ $wholePart1 -gt $wholePart2 ] then sed -n '$i' $3 >> contaminated.txt else sed -n '$i' $3 >> clean.txt fi #Else here represents if the second is greater than the first. else sed -n '$i' $3 >> contaminated.txt fi fi done
clean_report: 3e-144 1e-31 9e-32 8e-09 0.0 2e-29 3e-21 3e-65 6e-38 9e-51
contaminated_report: 3e-29 5e-26 1e-16 3e-102 2e-54 7e-83 4e-79 5e-67 8e-38 2e-83
The first, second, third, fifth, and ninth sequence headers would go to clean.txt and the fourth, sixth, seventh, eighth, and tenth would go to contaminated.txt.
Comment