**dariober** · 07-17-2012, 05:15 AM

Hi Giorgio,

See if this helps:

Say you have test.txt like this (tab-separated):

Code:

chr1    chr2
chr2    chr1
chr3    chr4
chr5    chr4

You want the elements in column B (2nd) that are not in column A (1st), so the output would be "chr4", printed twice.

This Python script should do it. It assumes that all the unique elements in column A can fit in memory:

Code:

python -c "
fin= open('test.txt')

col_a= set()
for line in fin:
    ## Unique elements in column A
    a= line.strip().split()[0]
    col_a.add(a)

fin.seek(0)
for line in fin:
    ## Print out elements in column B not in set A
    b= line.strip().split()[1]
    if b not in col_a:
        print(b)
fin.close()
"

Output goes to stdout, so use > to redirect it to a file.

Hope it helps and I haven't made any mistake!

Dario

**Giorgio C** · 07-17-2012, 05:31 AM

Thank you so much for your answer. I've tried it, but it gives me an error:

Traceback (most recent call last):
  File "script.py", line 9, in <module>
    a= line.strip().split()[0]
IndexError: list index out of range

Do you know what may be the problem?

**dariober** · 07-17-2012, 06:00 AM

Can you post a sample of the first few lines from your input file? It could be that the first line(s) are empty, hence the error above.

In fact, this version will skip blank lines:

Code:

python -c "
fin= open('test.txt')

col_a= set()
for line in fin:
    if line.strip() == '':
        continue
    ## Unique elements in column A
    a= line.strip().split()[0]
    col_a.add(a)

fin.seek(0)
for line in fin:
    if line.strip() == '':
        continue
    ## Print out elements in column B not in set A
    b= line.strip().split()[1]
    if b not in col_a:
        print(b)
fin.close()
"

By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?
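If both directions are wanted, a minimal sketch using Python set differences could look like the following. The function name `column_differences` and the inline sample rows are made up for illustration. Unlike the script above, it reports each element once rather than once per line, and it skips any line with fewer than two columns.

```python
def column_differences(lines, sep=None):
    """Return (only_in_a, only_in_b) for rows with at least two columns.

    sep=None splits on any run of whitespace; pass '\t' for strict tabs.
    """
    col_a, col_b = set(), set()
    for line in lines:
        fields = line.strip().split(sep)
        if len(fields) < 2:
            continue  # blank or incomplete line: nothing to compare
        col_a.add(fields[0])
        col_b.add(fields[1])
    # Set difference keeps elements of the left set missing from the right one
    return sorted(col_a - col_b), sorted(col_b - col_a)


# Hypothetical sample data, including one blank line
rows = ["chr1\tchr2", "chr2\tchr1", "", "chr3\tchr4", "chr5\tchr4"]
only_a, only_b = column_differences(rows)
print(only_a)  # ['chr3', 'chr5']
print(only_b)  # ['chr4']
```

To run it on a file, pass the open file object instead of the list, for example `column_differences(open('test.txt'))`.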

**Richard Finney** · 07-17-2012, 06:53 AM

The "comm" command is useful for this sort of thing:

Code:

-bash-3.00$ cat junk
chr1    chr2
chr2    chr1
chr3    chr4
chr5    chr4
-bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
-bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
-bash-3.00$ comm -3 junkcol1 junkcol2
chr3
        chr4
chr5

----
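Check out the "comm" command with "man comm" or via a search engine. With -3 it suppresses the lines common to both files, so the unindented lines above appear only in column 1 and the tab-indented line appears only in column 2.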

**swbarnes2** · 07-17-2012, 08:19 AM

Rather than write your own tool, BEDTools intersectBed can do exactly this, if you convert your lists to .bed format

**Giorgio C** · 07-18-2012, 01:04 AM

Thank you Dariober,

as you said, the problem was some empty lines in the middle of the list that I did not see at first. Now it works well, and the 'skip blank lines' version of the script works great.

Richard Finney, thank you for your other suggestion; it works well too.

**dariober** · 07-18-2012, 01:52 AM

Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace "split()" with "split('\t')" in the script. The script I posted splits lines into columns at every run of whitespace, which includes, but is not restricted to, tab characters.
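A small sketch of the difference, using a made-up row whose first field contains a space:

```python
# Hypothetical row: the first field "chr 1" contains a space
row = "chr 1\tchr2\n"

# split() with no argument breaks on any run of whitespace
by_whitespace = row.strip().split()
print(by_whitespace)  # ['chr', '1', 'chr2']

# split('\t') breaks only on tabs, so the space stays inside the field
by_tab = row.rstrip("\n").split("\t")
print(by_tab)  # ['chr 1', 'chr2']
```

With split(), index [1] would return "1" here instead of "chr2", so the wrong column would be compared.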

Good luck!
Dario

**Giorgio C** · 07-18-2012, 03:35 AM

Yes, in fact, that was one of the things I noticed: for a tab-separated file you need to add '\t' to split() in the script.

Anyway, it works well!

Thank you again for your precious help.

Cheers,
Giorgio

**LudesMeyers** · 10-05-2013, 08:58 AM

This is a great discussion for a novice scripter in Python and bash/C shells. Thank you, Giorgio C, for all of your questions. They provide food for thought. And thanks to Richard Finney for pointing out the use of the "comm" and "awk" commands.

This code combines my first thoughts on how to approach the problem (using "cut") with the "comm" command introduced above.

Code:

bash-3.2$ cat junk
chr1	chr2
chr2	chr1
chr3	chr4
chr5	chr4
bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
bash-3.2$ cat commonjunk
chr3
	chr4
chr5

Thanks everyone,
John

**Giorgio C** · 10-05-2013, 10:53 AM

Sure John,

this is a very useful forum

Scripting Help - Common Elements
