  • Giorgio C
    replied
    Sure John,

    this is a very useful forum

  • LudesMeyers
    replied
    This is a great discussion for a novice scripter in Python and the bash/C shells. Thank you, Giorgio C, for all of your questions; they gave me food for thought. And thanks to Richard Finney for pointing out the "comm" and "awk" commands.

    This code combines my first thoughts on how to approach the problem with the "comm" command introduced later in the thread.

    Code:
    bash-3.2$ cat junk
    chr1	chr2
    chr2	chr1
    chr3	chr4
    chr5	chr4
    bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
    bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
    bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
    bash-3.2$ cat commonjunk
    chr3
    	chr4
    chr5
    Thanks everyone,
    John

  • Giorgio C
    replied
    Yes, in fact that was one of the things I had noticed: for a tab-separated file, '\t' needs to be added to the split() call in the script.

    By the way, it works well!

    Thank you again for your precious help.

    Cheers,
    Giorgio

  • dariober
    replied
    Originally posted by Giorgio C
    Thank you Dariober,

    as you said, the problem was some empty lines in the middle of the list that I had not noticed at first. Now it works well, and the 'skip blank lines' version of the script works great.

    Richard Finney, thank you for your other suggestion; it works well too.
    Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace "split()" in the script with "split('\t')". The script I posted splits lines into columns at every run of whitespace, which includes, but is not restricted to, tab characters.

    Good luck!
    Dario
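
    The difference between split() and split('\t') can be seen in a minimal sketch. The sample line is hypothetical (it is not from the thread) and assumes a field that itself contains a space:

```python
# Hypothetical input line: each tab-separated field contains a space.
line = "chr1 100\tchr2 200\n"

# split() with no argument breaks on any run of whitespace (spaces and tabs).
fields_any_whitespace = line.strip().split()

# split('\t') breaks only on tab characters, so embedded spaces survive.
fields_tab_only = line.strip().split("\t")

print(fields_any_whitespace)  # ['chr1', '100', 'chr2', '200']
print(fields_tab_only)        # ['chr1 100', 'chr2 200']
```

    For plain identifiers such as "chr1" with no embedded spaces, both calls give the same two columns.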

  • Giorgio C
    replied
    Thank you Dariober,

    as you said, the problem was some empty lines in the middle of the list that I had not noticed at first. Now it works well, and the 'skip blank lines' version of the script works great.

    Richard Finney, thank you for your other suggestion; it works well too.

  • swbarnes2
    replied
    Rather than writing your own tool, you could use BEDTools intersectBed, which can do exactly this if you convert your lists to .bed format.

  • Richard Finney
    replied
    The "comm" command is useful for this kind of task:

    Code:
    -bash-3.00$ cat junk
    chr1    chr2
    chr2    chr1
    chr3    chr4
    chr5    chr4
    -bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
    -bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
    -bash-3.00$ comm -3 junkcol1 junkcol2
    chr3
            chr4
    chr5
    ----
    check out the "comm" command with "man comm" or via search engine.
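
    For comparison, the same result can be sketched in Python with set differences. This is only an illustration, not Richard's method; the function name non_common is hypothetical, and the sketch assumes the whole list fits in memory:

```python
def non_common(rows):
    """Return (only_in_a, only_in_b) for an iterable of (a, b) pairs.

    Mirrors what `comm -3` prints: values unique to the first column
    and values unique to the second, each sorted.
    """
    rows = list(rows)  # allow generators, since we iterate twice
    col_a = {a for a, _ in rows}
    col_b = {b for _, b in rows}
    return sorted(col_a - col_b), sorted(col_b - col_a)


# Same sample data as the "junk" file above.
rows = [("chr1", "chr2"), ("chr2", "chr1"), ("chr3", "chr4"), ("chr5", "chr4")]
only_a, only_b = non_common(rows)
print(only_a)  # ['chr3', 'chr5']
print(only_b)  # ['chr4']
```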

  • dariober
    replied
    Originally posted by Giorgio C
    Thank you so much for your answer. I've tried it, but it gives me an error:

    Traceback (most recent call last):
    File "script.py", line 9, in <module>
    a= line.strip().split()[0]
    IndexError: list index out of range

    Do you know what may be the problem?
    Can you post a sample of the first few lines from your input file? It could be that some lines are empty, hence the error above.

    In fact, this version will skip blank lines:

    Code:
    python -c "
    fin= open('test.txt')
    
    col_a= set()
    for line in fin:
        if line.strip() == '':
            continue
        ## Unique elements in column A
        a= line.strip().split()[0]
        col_a.add(a)
    
    fin.seek(0)
    for line in fin:
        if line.strip() == '':
            continue
        ## Print out elements in column B not in set A
        b= line.strip().split()[1]
        if b not in col_a:
            print(b)
    fin.close()
    "
    By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?
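
    If both directions are wanted, a minimal sketch along these lines might work. The function name column_differences is hypothetical, and it assumes a tab-separated file whose unique values in both columns fit in memory; the demo writes a temporary file containing the sample data plus one blank line:

```python
import os
import tempfile


def column_differences(path, sep="\t"):
    """Return (in A but not B, in B but not A) as two sets."""
    col_a, col_b = set(), set()
    with open(path) as fin:
        for line in fin:
            fields = line.rstrip("\r\n").split(sep)
            if len(fields) < 2:
                continue  # skip blank or single-column lines
            col_a.add(fields[0])
            col_b.add(fields[1])
    return col_a - col_b, col_b - col_a


# Demo on a temporary copy of the sample data (with a blank line added).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("chr1\tchr2\nchr2\tchr1\n\nchr3\tchr4\nchr5\tchr4\n")
    demo_path = tmp.name

only_a, only_b = column_differences(demo_path)
os.remove(demo_path)

print(sorted(only_a))  # ['chr3', 'chr5']
print(sorted(only_b))  # ['chr4']
```

    Unlike the script above, this reads the file once and reports each value only once, so "chr4" is not repeated.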

  • Giorgio C
    replied
    Thank you so much for your answer. I've tried it, but it gives me an error:

    Traceback (most recent call last):
    File "script.py", line 9, in <module>
    a= line.strip().split()[0]
    IndexError: list index out of range

    Do you know what may be the problem?

  • dariober
    replied
    Hi Giorgio,

    See if this helps:

    Say you have test.txt like this (tab-separated):
    Code:
    chr1    chr2
    chr2    chr1
    chr3    chr4
    chr5    chr4
    You want the elements in column B (2nd) that are not in column A (1st), so the output would be "chr4", printed twice.

    This python script should do it. It assumes that all the unique elements in column A can fit in memory:

    Code:
    python -c "
    fin= open('test.txt')
    
    col_a= set()
    for line in fin:
        ## Unique elements in column A
        a= line.strip().split()[0]
        col_a.add(a)
    
    fin.seek(0)
    for line in fin:
        ## Print out elements in column B not in set A
        b= line.strip().split()[1]
        if b not in col_a:
            print(b)
    fin.close()
    "
    Output goes to stdout, so use > to send it to a file.

    Hope it helps and that I haven't made any mistakes!

    Dario

  • Giorgio C
    started a topic Scripting Help - Common Elements

    Scripting Help - Common Elements

    Hi all,

    I have a question and I hope someone can help me.

    I have a file with two columns, A and B, each containing a list of gene coordinates:

    I'd like to write to another file every element that is present in column A but not in B.
    (I mean I'd like to output the NON-common elements of the two columns.)


    Do you have any suggestions (a little script in Perl or Python)? Not Excel, because the list is huge.


    Thank you in advance,
    Giorgio
