HI
I am running local Blast (version 2.2.6) on a mac machine, and I want to find the homologs between two protein database(named A and B) by reciprocal blast methods under python command lines.
I have succeeded in blast A against B, and B against A, which resulted in two csv files which contained the best hits from each blast. Next I would like to use these two files to do Reciprocal blast, namely to find the homologs between A and B, according to the blast e-value and bit scores.
The codes for reciprocal blast was downloaded from http://ged.msu.edu/angus/tutorials/r...cal-blast.html
and it is like this
******************************************************************************************************
#! /usr/bin/env python
import sys, csv
# number below which to discard results
E_CUTOFF = 1e-3
def load_csv_to_dict(filename):
"""
Load the CSV file into a dictionary, tying query sequences to subject
sequences.
"""
d = {}
for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
# fix the names that start with a 'gi|XXXXXXX|'
query_name = demangle_name(query_name)
subject_name = demangle_name(subject_name)
# convert the e-value into a floating point number
expect = float(expect)
if query_name not in d and expect < E_CUTOFF:
# if we don't already have an entry for this, put it in.
d[query_name] = subject_name
return d
def demangle_name(name):
"""
This functions strips off the 'gi|XXXXX|' name encoding that NCBI uses.
Note that NCBI does this automatically for subject sequences.
"""
if name.startswith('gi|'):
name = name.split('|')
name = name[2:]
name = "|".join(name)
return name
###
# This is the code that's run when you actually type 'find-reciprocal.py'
# at the command line; the above are function definitions, that define
# reusable blocks or chunks of code.
in_file_1 = sys.argv[1]
in_file_2 = sys.argv[2]
d1 = load_csv_to_dict(in_file_1)
d2 = load_csv_to_dict(in_file_2)
output = csv.writer(sys.stdout)
for seqname in d1:
seqmatch1 = d1[seqname]
seqmatch2 = d2.get(seqmatch1)
if seqmatch2 == seqname:
output.writerow([seqname, seqmatch1])
*******************************************************************************************************
However, when I used this command, an error came up like
*******************************************************************************************************
Traceback (most recent call last):
File "ngs-course/blast/find-reciprocal.py", line 49, in <module>
d1 = load_csv_to_dict(in_file_1)
File "ngs-course/blast/find-reciprocal.py", line 14, in load_csv_to_dict
for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
*******************************************************************************************************
Would anyone please help me to modify the command lines to analyse the data? It would be a great appreciation for discussion here. Thanks!
I am running local Blast (version 2.2.6) on a mac machine, and I want to find the homologs between two protein database(named A and B) by reciprocal blast methods under python command lines.
I have succeeded in blast A against B, and B against A, which resulted in two csv files which contained the best hits from each blast. Next I would like to use these two files to do Reciprocal blast, namely to find the homologs between A and B, according to the blast e-value and bit scores.
The codes for reciprocal blast was downloaded from http://ged.msu.edu/angus/tutorials/r...cal-blast.html
and it is like this
******************************************************************************************************
#! /usr/bin/env python
import sys, csv
# number below which to discard results
E_CUTOFF = 1e-3
def load_csv_to_dict(filename):
"""
Load the CSV file into a dictionary, tying query sequences to subject
sequences.
"""
d = {}
for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
# fix the names that start with a 'gi|XXXXXXX|'
query_name = demangle_name(query_name)
subject_name = demangle_name(subject_name)
# convert the e-value into a floating point number
expect = float(expect)
if query_name not in d and expect < E_CUTOFF:
# if we don't already have an entry for this, put it in.
d[query_name] = subject_name
return d
def demangle_name(name):
"""
This functions strips off the 'gi|XXXXX|' name encoding that NCBI uses.
Note that NCBI does this automatically for subject sequences.
"""
if name.startswith('gi|'):
name = name.split('|')
name = name[2:]
name = "|".join(name)
return name
###
# This is the code that's run when you actually type 'find-reciprocal.py'
# at the command line; the above are function definitions, that define
# reusable blocks or chunks of code.
in_file_1 = sys.argv[1]
in_file_2 = sys.argv[2]
d1 = load_csv_to_dict(in_file_1)
d2 = load_csv_to_dict(in_file_2)
output = csv.writer(sys.stdout)
for seqname in d1:
seqmatch1 = d1[seqname]
seqmatch2 = d2.get(seqmatch1)
if seqmatch2 == seqname:
output.writerow([seqname, seqmatch1])
*******************************************************************************************************
However, when I used this command, an error came up like
*******************************************************************************************************
Traceback (most recent call last):
File "ngs-course/blast/find-reciprocal.py", line 49, in <module>
d1 = load_csv_to_dict(in_file_1)
File "ngs-course/blast/find-reciprocal.py", line 14, in load_csv_to_dict
for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
*******************************************************************************************************
Would anyone please help me to modify the command lines to analyse the data? It would be a great appreciation for discussion here. Thanks!
Comment