Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Tsuyoshi
    Member
    • Sep 2012
    • 24

    Reciproca blast using Python

    HI
    I am running local Blast (version 2.2.6) on a mac machine, and I want to find the homologs between two protein database(named A and B) by reciprocal blast methods under python command lines.
    I have succeeded in blast A against B, and B against A, which resulted in two csv files which contained the best hits from each blast. Next I would like to use these two files to do Reciprocal blast, namely to find the homologs between A and B, according to the blast e-value and bit scores.

    The codes for reciprocal blast was downloaded from http://ged.msu.edu/angus/tutorials/r...cal-blast.html
    and it is like this
    ******************************************************************************************************

    #! /usr/bin/env python
    import sys, csv

    # number below which to discard results
    E_CUTOFF = 1e-3

    def load_csv_to_dict(filename):
    """
    Load the CSV file into a dictionary, tying query sequences to subject
    sequences.
    """
    d = {}

    for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
    # fix the names that start with a 'gi|XXXXXXX|'
    query_name = demangle_name(query_name)
    subject_name = demangle_name(subject_name)

    # convert the e-value into a floating point number
    expect = float(expect)
    if query_name not in d and expect < E_CUTOFF:
    # if we don't already have an entry for this, put it in.
    d[query_name] = subject_name

    return d

    def demangle_name(name):
    """
    This functions strips off the 'gi|XXXXX|' name encoding that NCBI uses.

    Note that NCBI does this automatically for subject sequences.
    """
    if name.startswith('gi|'):
    name = name.split('|')
    name = name[2:]
    name = "|".join(name)

    return name

    ###

    # This is the code that's run when you actually type 'find-reciprocal.py'
    # at the command line; the above are function definitions, that define
    # reusable blocks or chunks of code.

    in_file_1 = sys.argv[1]
    in_file_2 = sys.argv[2]

    d1 = load_csv_to_dict(in_file_1)
    d2 = load_csv_to_dict(in_file_2)

    output = csv.writer(sys.stdout)

    for seqname in d1:
    seqmatch1 = d1[seqname]
    seqmatch2 = d2.get(seqmatch1)

    if seqmatch2 == seqname:
    output.writerow([seqname, seqmatch1])

    *******************************************************************************************************
    However, when I used this command, an error came up like

    *******************************************************************************************************

    Traceback (most recent call last):
    File "ngs-course/blast/find-reciprocal.py", line 49, in <module>
    d1 = load_csv_to_dict(in_file_1)
    File "ngs-course/blast/find-reciprocal.py", line 14, in load_csv_to_dict
    for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
    _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

    *******************************************************************************************************

    Would anyone please help me to modify the command lines to analyse the data? It would be a great appreciation for discussion here. Thanks!
    3
    Always
    66.67%
    2
    Sometimes
    33.33%
    1
    Never
    0.00%
    0

    The poll is expired.

    Last edited by Tsuyoshi; 09-07-2012, 03:53 AM.
  • dariober
    Senior Member
    • May 2010
    • 311

    #2
    Hello,

    I don't have a file to test this but looking at the Pyhton documentation for open() (http://docs.python.org/library/functions.html#open) and at this post found by Google (http://selfsolved.com/problems/pytho...l-newline-mode), you might fix your problem by replacing csv.reader(open(filename)) with csv.reader(open(filename, "rU")) in function load_csv_to_dict(filename). That is, the new function would be:

    Code:
    def load_csv_to_dict(filename):
        """
        Load the CSV file into a dictionary, tying query sequences to subject
        sequences.
        """
        d = {}
    
        for (query_name, subject_name, score,expect) in csv.reader(open(filename, "rU")):
            # fix the names that start with a 'gi|XXXXXXX|'
            query_name = demangle_name(query_name)
            subject_name = demangle_name(subject_name)
    
            # convert the e-value into a floating point number
            expect = float(expect)
            if query_name not in d and expect < E_CUTOFF:
                # if we don't already have an entry for this, put it in.
                d[query_name] = subject_name
    
        return d
    Post a sample file to reproduce the error would help if you can.

    Best
    Dario

    Comment

    • Tsuyoshi
      Member
      • Sep 2012
      • 24

      #3
      Dear Dario,
      Thank you for your kind answer!
      Now it works well!
      Thanks!

      Comment

      • Tsuyoshi
        Member
        • Sep 2012
        • 24

        #4
        Dear Dario,
        Thank you so much!
        It works!

        Comment

        Latest Articles

        Collapse

        • SEQadmin2
          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
          by SEQadmin2


          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


          Here are nine questions we think about, in roughly the order they matter, before...
          06-18-2026, 07:11 AM
        • SEQadmin2
          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
          by SEQadmin2


          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
          ...
          06-02-2026, 10:05 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by SEQadmin2, 06-17-2026, 06:09 AM
        0 responses
        25 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-09-2026, 11:58 AM
        0 responses
        42 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-05-2026, 10:09 AM
        0 responses
        48 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-04-2026, 08:59 AM
        0 responses
        49 views
        0 reactions
        Last Post SEQadmin2  
        Working...