Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Giorgio C
    Member
    • Oct 2010
    • 89

    Scripting Help - Common Elements

    Hi all,

    I have a question hoping someone can help me.

    I have a file with two columns A and B with a list of gene coordinates in each of them:

    I'd like to write in another file every element that is present in column A but not in B.
    (I mean I'd like to write in output the NON common elements of the two columns).


    Do you have any suggestions ( little script in perl or python) not exel cuz the list is huge.


    Thanks you in advance,
    Giorgio
  • dariober
    Senior Member
    • May 2010
    • 311

    #2
    Hi Giorgio,

    See if this helps:

    say you have test.txt like this (tab-separated):
    Code:
    chr1    chr2
    chr2    chr1
    chr3    chr4
    chr5    chr4
    You want elements in column B (2nd) not in A (1st). So the output would be "chr4" outputted twice

    This python script should do it. It assumes that all the unique elements in column A can fit in memory:

    Code:
    python -c "
    fin= open('test.txt')
    
    col_a= set()
    for line in fin:
        ## Unique elements in column A
        a= line.strip().split()[0]
        col_a.add(a)
    
    fin.seek(0)
    for line in fin:
        ## Print out elements in column B not in set A
        b= line.strip().split()[1]
        if b not in col_a:
            print(b)
    fin.close()
    "
    Output goes to stdin so use > to send it to a file

    Hope it helps and I haven't made any mistake!

    Dario

    Comment

    • Giorgio C
      Member
      • Oct 2010
      • 89

      #3
      Thank you so much for your answer I've tried but it give me an error:

      Traceback (most recent call last):
      File "script.py", line 9, in <module>
      a= line.strip().split()[0]
      IndexError: list index out of range

      Do you know what may be the problem?

      Comment

      • dariober
        Senior Member
        • May 2010
        • 311

        #4
        Originally posted by Giorgio C View Post
        Thank you so much for your answer I've tried but it give me an error:

        Traceback (most recent call last):
        File "script.py", line 9, in <module>
        a= line.strip().split()[0]
        IndexError: list index out of range

        Do you know what may be the problem?
        Can you post a sample of the first few lines from your input file? It could be that the first line(s) are emtpy hence the error above.

        In fact, this version will skip blank lines:

        Code:
        python -c "
        fin= open('test.txt')
        
        col_a= set()
        for line in fin:
            if line.strip() == '':
                continue
            ## Unique elements in column A
            a= line.strip().split()[0]
            col_a.add(a)
        
        fin.seek(0)
        for line in fin:
            if line.strip() == '':
                continue
            ## Print out elements in column B not in set A
            b= line.strip().split()[1]
            if b not in col_a:
                print(b)
        fin.close()
        "
        By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?

        Comment

        • Richard Finney
          Senior Member
          • Feb 2009
          • 701

          #5
          "comm" command is useful for this stuff ..

          Code:
          -bash-3.00$ cat junk
          chr1    chr2
          chr2    chr1
          chr3    chr4
          chr5    chr4
          -bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
          -bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
          -bash-3.00$ comm -3 junkcol1 junkcol2
          chr3
                  chr4
          chr5
          ----
          check out the "comm" command with "man comm" or via search engine.

          Comment

          • swbarnes2
            Senior Member
            • May 2008
            • 910

            #6
            Rather than write your own tool, BEDTools intersectBed can do exactly this, if you convert your lists to .bed format

            Comment

            • Giorgio C
              Member
              • Oct 2010
              • 89

              #7
              Thank you Dariober,

              as you said it was a problem of some empty lines in the middle of the list that I did not see at first. Now it works good and however the 'skip blank script' version works greatly.

              Richard Finney thank you for your other suggestion, it works good too.

              Comment

              • dariober
                Senior Member
                • May 2010
                • 311

                #8
                Originally posted by Giorgio C View Post
                Thank you Dariober,

                as you said it was a problem of some empty lines in the middle of the list that I did not see at first. Now it works good and however the 'skip blank script' version works greatly.

                Richard Finney thank you for your other suggestion, it works good too.
                Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace in the script "split()" with "split('\t')". The script I posted splits lines into columns at every occurance of a blank space, so including, but not restricing to, tab characters.

                Good luck!
                Dario

                Comment

                • Giorgio C
                  Member
                  • Oct 2010
                  • 89

                  #9
                  Yes infact, that was one of the thing I've noticed, in the tab separeted file needs to add (\t) to the script.

                  By the way it works good !

                  Thank you again for your precious help.

                  Cheers,
                  Giorgio

                  Comment

                  • LudesMeyers
                    Junior Member
                    • Aug 2013
                    • 2

                    #10
                    This is a great discussion for a novice scriptor of Python and bash/C shells. Thank you Giorgio C for all of your questions. They provide food for my thoughts. And thanks to Richard Finney for pointing out the use of the "comm" and "awk" commands.

                    This code is a combo of my first thoughts on how to approach the problem with subsequent introduction of the "comm" command.

                    Code:
                    bash-3.2$ cat junk
                    chr1	chr2
                    chr2	chr1
                    chr3	chr4
                    chr5	chr4
                    bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
                    bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
                    bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
                    bash-3.2$ cat commonjunk
                    chr3
                    	chr4
                    chr5
                    Thanks everyone,
                    John

                    Comment

                    • Giorgio C
                      Member
                      • Oct 2010
                      • 89

                      #11
                      Sure John,

                      this is a very useful forum

                      Comment

                      Latest Articles

                      Collapse

                      • SEQadmin2
                        Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                        by SEQadmin2


                        I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                        Here are nine questions we think about, in roughly the order they matter, before...
                        06-18-2026, 07:11 AM
                      • SEQadmin2
                        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                        by SEQadmin2


                        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                        ...
                        06-02-2026, 10:05 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, Yesterday, 11:10 AM
                      0 responses
                      7 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-17-2026, 06:09 AM
                      0 responses
                      43 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-09-2026, 11:58 AM
                      0 responses
                      104 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-05-2026, 10:09 AM
                      0 responses
                      125 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...