Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reciproca blast using Python

    HI
    I am running local Blast (version 2.2.6) on a mac machine, and I want to find the homologs between two protein database(named A and B) by reciprocal blast methods under python command lines.
    I have succeeded in blast A against B, and B against A, which resulted in two csv files which contained the best hits from each blast. Next I would like to use these two files to do Reciprocal blast, namely to find the homologs between A and B, according to the blast e-value and bit scores.

    The codes for reciprocal blast was downloaded from http://ged.msu.edu/angus/tutorials/r...cal-blast.html
    and it is like this
    ******************************************************************************************************

    #! /usr/bin/env python
    import sys, csv

    # number below which to discard results
    E_CUTOFF = 1e-3

    def load_csv_to_dict(filename):
    """
    Load the CSV file into a dictionary, tying query sequences to subject
    sequences.
    """
    d = {}

    for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
    # fix the names that start with a 'gi|XXXXXXX|'
    query_name = demangle_name(query_name)
    subject_name = demangle_name(subject_name)

    # convert the e-value into a floating point number
    expect = float(expect)
    if query_name not in d and expect < E_CUTOFF:
    # if we don't already have an entry for this, put it in.
    d[query_name] = subject_name

    return d

    def demangle_name(name):
    """
    This functions strips off the 'gi|XXXXX|' name encoding that NCBI uses.

    Note that NCBI does this automatically for subject sequences.
    """
    if name.startswith('gi|'):
    name = name.split('|')
    name = name[2:]
    name = "|".join(name)

    return name

    ###

    # This is the code that's run when you actually type 'find-reciprocal.py'
    # at the command line; the above are function definitions, that define
    # reusable blocks or chunks of code.

    in_file_1 = sys.argv[1]
    in_file_2 = sys.argv[2]

    d1 = load_csv_to_dict(in_file_1)
    d2 = load_csv_to_dict(in_file_2)

    output = csv.writer(sys.stdout)

    for seqname in d1:
    seqmatch1 = d1[seqname]
    seqmatch2 = d2.get(seqmatch1)

    if seqmatch2 == seqname:
    output.writerow([seqname, seqmatch1])

    *******************************************************************************************************
    However, when I used this command, an error came up like

    *******************************************************************************************************

    Traceback (most recent call last):
    File "ngs-course/blast/find-reciprocal.py", line 49, in <module>
    d1 = load_csv_to_dict(in_file_1)
    File "ngs-course/blast/find-reciprocal.py", line 14, in load_csv_to_dict
    for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
    _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

    *******************************************************************************************************

    Would anyone please help me to modify the command lines to analyse the data? It would be a great appreciation for discussion here. Thanks!
    3
    Always
    66.67%
    2
    Sometimes
    33.33%
    1
    Never
    0.00%
    0

    The poll is expired.

    Last edited by Tsuyoshi; 09-07-2012, 03:53 AM.

  • #2
    Hello,

    I don't have a file to test this but looking at the Pyhton documentation for open() (http://docs.python.org/library/functions.html#open) and at this post found by Google (http://selfsolved.com/problems/pytho...l-newline-mode), you might fix your problem by replacing csv.reader(open(filename)) with csv.reader(open(filename, "rU")) in function load_csv_to_dict(filename). That is, the new function would be:

    Code:
    def load_csv_to_dict(filename):
        """
        Load the CSV file into a dictionary, tying query sequences to subject
        sequences.
        """
        d = {}
    
        for (query_name, subject_name, score,expect) in csv.reader(open(filename, "rU")):
            # fix the names that start with a 'gi|XXXXXXX|'
            query_name = demangle_name(query_name)
            subject_name = demangle_name(subject_name)
    
            # convert the e-value into a floating point number
            expect = float(expect)
            if query_name not in d and expect < E_CUTOFF:
                # if we don't already have an entry for this, put it in.
                d[query_name] = subject_name
    
        return d
    Post a sample file to reproduce the error would help if you can.

    Best
    Dario

    Comment


    • #3
      Dear Dario,
      Thank you for your kind answer!
      Now it works well!
      Thanks!

      Comment


      • #4
        Dear Dario,
        Thank you so much!
        It works!

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Understanding Genetic Influence on Infectious Disease
          by seqadmin




          During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

          Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
          09-09-2024, 10:59 AM
        • seqadmin
          Addressing Off-Target Effects in CRISPR Technologies
          by seqadmin






          The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
          08-27-2024, 04:44 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Today, 06:25 AM
        0 responses
        13 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 01:02 PM
        0 responses
        12 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 09-18-2024, 06:39 AM
        0 responses
        14 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 09-11-2024, 02:44 PM
        0 responses
        14 views
        0 likes
        Last Post seqadmin  
        Working...
        X