Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Scripting Help - Common Elements

    Hi all,

    I have a question hoping someone can help me.

    I have a file with two columns A and B with a list of gene coordinates in each of them:

    I'd like to write in another file every element that is present in column A but not in B.
    (I mean I'd like to write in output the NON common elements of the two columns).


    Do you have any suggestions ( little script in perl or python) not exel cuz the list is huge.


    Thanks you in advance,
    Giorgio

  • #2
    Hi Giorgio,

    See if this helps:

    say you have test.txt like this (tab-separated):
    Code:
    chr1    chr2
    chr2    chr1
    chr3    chr4
    chr5    chr4
    You want elements in column B (2nd) not in A (1st). So the output would be "chr4" outputted twice

    This python script should do it. It assumes that all the unique elements in column A can fit in memory:

    Code:
    python -c "
    fin= open('test.txt')
    
    col_a= set()
    for line in fin:
        ## Unique elements in column A
        a= line.strip().split()[0]
        col_a.add(a)
    
    fin.seek(0)
    for line in fin:
        ## Print out elements in column B not in set A
        b= line.strip().split()[1]
        if b not in col_a:
            print(b)
    fin.close()
    "
    Output goes to stdin so use > to send it to a file

    Hope it helps and I haven't made any mistake!

    Dario

    Comment


    • #3
      Thank you so much for your answer I've tried but it give me an error:

      Traceback (most recent call last):
      File "script.py", line 9, in <module>
      a= line.strip().split()[0]
      IndexError: list index out of range

      Do you know what may be the problem?

      Comment


      • #4
        Originally posted by Giorgio C View Post
        Thank you so much for your answer I've tried but it give me an error:

        Traceback (most recent call last):
        File "script.py", line 9, in <module>
        a= line.strip().split()[0]
        IndexError: list index out of range

        Do you know what may be the problem?
        Can you post a sample of the first few lines from your input file? It could be that the first line(s) are emtpy hence the error above.

        In fact, this version will skip blank lines:

        Code:
        python -c "
        fin= open('test.txt')
        
        col_a= set()
        for line in fin:
            if line.strip() == '':
                continue
            ## Unique elements in column A
            a= line.strip().split()[0]
            col_a.add(a)
        
        fin.seek(0)
        for line in fin:
            if line.strip() == '':
                continue
            ## Print out elements in column B not in set A
            b= line.strip().split()[1]
            if b not in col_a:
                print(b)
        fin.close()
        "
        By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?

        Comment


        • #5
          "comm" command is useful for this stuff ..

          Code:
          -bash-3.00$ cat junk
          chr1    chr2
          chr2    chr1
          chr3    chr4
          chr5    chr4
          -bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
          -bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
          -bash-3.00$ comm -3 junkcol1 junkcol2
          chr3
                  chr4
          chr5
          ----
          check out the "comm" command with "man comm" or via search engine.

          Comment


          • #6
            Rather than write your own tool, BEDTools intersectBed can do exactly this, if you convert your lists to .bed format

            Comment


            • #7
              Thank you Dariober,

              as you said it was a problem of some empty lines in the middle of the list that I did not see at first. Now it works good and however the 'skip blank script' version works greatly.

              Richard Finney thank you for your other suggestion, it works good too.

              Comment


              • #8
                Originally posted by Giorgio C View Post
                Thank you Dariober,

                as you said it was a problem of some empty lines in the middle of the list that I did not see at first. Now it works good and however the 'skip blank script' version works greatly.

                Richard Finney thank you for your other suggestion, it works good too.
                Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace in the script "split()" with "split('\t')". The script I posted splits lines into columns at every occurance of a blank space, so including, but not restricing to, tab characters.

                Good luck!
                Dario

                Comment


                • #9
                  Yes infact, that was one of the thing I've noticed, in the tab separeted file needs to add (\t) to the script.

                  By the way it works good !

                  Thank you again for your precious help.

                  Cheers,
                  Giorgio

                  Comment


                  • #10
                    This is a great discussion for a novice scriptor of Python and bash/C shells. Thank you Giorgio C for all of your questions. They provide food for my thoughts. And thanks to Richard Finney for pointing out the use of the "comm" and "awk" commands.

                    This code is a combo of my first thoughts on how to approach the problem with subsequent introduction of the "comm" command.

                    Code:
                    bash-3.2$ cat junk
                    chr1	chr2
                    chr2	chr1
                    chr3	chr4
                    chr5	chr4
                    bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
                    bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
                    bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
                    bash-3.2$ cat commonjunk
                    chr3
                    	chr4
                    chr5
                    Thanks everyone,
                    John

                    Comment


                    • #11
                      Sure John,

                      this is a very useful forum

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Exploring the Dynamics of the Tumor Microenvironment
                        by seqadmin




                        The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                        07-08-2024, 03:19 PM
                      • seqadmin
                        Exploring Human Diversity Through Large-Scale Omics
                        by seqadmin


                        In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                        06-25-2024, 06:43 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 07-10-2024, 07:30 AM
                      0 responses
                      24 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 07-03-2024, 09:45 AM
                      0 responses
                      201 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 07-03-2024, 08:54 AM
                      0 responses
                      211 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 07-02-2024, 03:00 PM
                      0 responses
                      192 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X