
  • Scripting Help - Common Elements

    Hi all,

    I have a question hoping someone can help me.

    I have a file with two columns, A and B, each containing a list of gene coordinates.

    I'd like to write to another file every element that is present in column A but not in B.
    (I mean I'd like to output the non-common elements of the two columns.)


    Do you have any suggestions (a little script in Perl or Python)? Not Excel, because the list is huge.


    Thank you in advance,
    Giorgio

  • #2
    Hi Giorgio,

    See if this helps:

    say you have test.txt like this (tab-separated):
    Code:
    chr1    chr2
    chr2    chr1
    chr3    chr4
    chr5    chr4
    You want elements in column B (2nd) not in A (1st), so the output would be "chr4", printed twice.

    This python script should do it. It assumes that all the unique elements in column A can fit in memory:

    Code:
    python -c "
    fin= open('test.txt')
    
    col_a= set()
    for line in fin:
        ## Unique elements in column A
        a= line.strip().split()[0]
        col_a.add(a)
    
    fin.seek(0)
    for line in fin:
        ## Print out elements in column B not in set A
        b= line.strip().split()[1]
        if b not in col_a:
            print(b)
    fin.close()
    "
    Output goes to stdout, so use > to send it to a file.

    Hope it helps and I haven't made any mistake!

    Dario



    • #3
      Thank you so much for your answer. I've tried it, but it gives me an error:

      Traceback (most recent call last):
      File "script.py", line 9, in <module>
      a= line.strip().split()[0]
      IndexError: list index out of range

      Do you know what may be the problem?



      • #4
        Can you post a sample of the first few lines from your input file? It could be that the first line(s) are empty, hence the error above.

        In fact, this version will skip blank lines:

        Code:
        python -c "
        fin= open('test.txt')
        
        col_a= set()
        for line in fin:
            if line.strip() == '':
                continue
            ## Unique elements in column A
            a= line.strip().split()[0]
            col_a.add(a)
        
        fin.seek(0)
        for line in fin:
            if line.strip() == '':
                continue
            ## Print out elements in column B not in set A
            b= line.strip().split()[1]
            if b not in col_a:
                print(b)
        fin.close()
        "
        By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?
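For the direction originally asked about (elements in column A that are not in column B), a minimal sketch could look like the following. The function name `a_not_in_b` and its `sep` parameter are illustrative assumptions, not part of the scripts above. Like those scripts, it reads the file twice and assumes the unique elements of one column (here column B) fit in memory.

```python
def a_not_in_b(path, sep=None):
    """Yield each column-A value that never appears in column B.

    sep=None splits on any whitespace; pass sep='\t' for strictly
    tab-separated files. Repeated values in column A are yielded
    once per occurrence.
    """
    # First pass: collect the unique elements of column B.
    col_b = set()
    with open(path) as fin:
        for line in fin:
            fields = line.rstrip('\n').split(sep)
            if len(fields) > 1:  # skips blank and one-column lines
                col_b.add(fields[1])

    # Second pass: report column-A elements missing from that set.
    with open(path) as fin:
        for line in fin:
            if not line.strip():  # skip blank lines
                continue
            fields = line.rstrip('\n').split(sep)
            if fields[0] not in col_b:
                yield fields[0]
```

Looping over `a_not_in_b('test.txt')` and printing each item would then give one element per line, which can be redirected to a file with > as before.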



        • #5
          "comm" command is useful for this stuff ..

          Code:
          -bash-3.00$ cat junk
          chr1    chr2
          chr2    chr1
          chr3    chr4
          chr5    chr4
          -bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
          -bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
          -bash-3.00$ comm -3 junkcol1 junkcol2
          chr3
                  chr4
          chr5
          ----
          Check out the "comm" command with "man comm" or via a search engine.
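For comparison, the same result as "comm -3" (elements unique to either column) can be sketched in Python with set differences. This is only an illustration: the function name `unique_to_each` is made up here, and it assumes both columns fit in memory.

```python
def unique_to_each(rows):
    """Given (a, b) pairs, return two sorted lists:
    elements only in column A, and elements only in column B."""
    rows = list(rows)  # allow generators; the pairs are scanned twice
    col_a = {a for a, _ in rows}
    col_b = {b for _, b in rows}
    return sorted(col_a - col_b), sorted(col_b - col_a)


# Same data as the "junk" file above.
pairs = [('chr1', 'chr2'), ('chr2', 'chr1'), ('chr3', 'chr4'), ('chr5', 'chr4')]
only_a, only_b = unique_to_each(pairs)
print(only_a)  # elements in column A but not in B
print(only_b)  # elements in column B but not in A
```

The first list corresponds to the unindented lines of the comm output and the second to the indented ones.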



          • #6
            Rather than writing your own tool, you could use BEDTools intersectBed, which can do exactly this if you convert your lists to .bed format.



            • #7
              Thank you Dariober,

              As you said, the problem was some empty lines in the middle of the list that I did not see at first. Now it works well, and the 'skip blank lines' version of the script works great.

              Richard Finney, thank you for your other suggestion; it works well too.



              • #8
                Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace "split()" with "split('\t')" in the script. The script I posted splits lines into columns at every run of whitespace, which includes, but is not restricted to, tab characters.
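To make the difference concrete, here is a small sketch. The sample line is invented for illustration (a first column that itself contains spaces); it is not taken from Giorgio's file.

```python
# Hypothetical tab-separated line whose first column contains spaces.
line = 'chr1:100-200 gene A\tchr2:300-400\n'

# split() with no argument breaks on every run of whitespace,
# so the first column is cut into three pieces.
by_whitespace = line.strip().split()

# split('\t') breaks only on tabs, keeping each column intact.
by_tab = line.rstrip('\n').split('\t')

print(by_whitespace)
print(by_tab)
```

With split(), index 1 would be 'gene' rather than the second column; with split('\t'), index 1 is 'chr2:300-400'.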

                Good luck!
                Dario



                • #9
                  Yes, in fact that was one of the things I noticed: for a tab-separated file, '\t' needs to be added to split() in the script.

                  By the way, it works well!

                  Thank you again for your precious help.

                  Cheers,
                  Giorgio



                  • #10
                    This is a great discussion for a novice scripter in Python and the bash/C shells. Thank you, Giorgio C, for all of your questions. They provide food for thought. And thanks to Richard Finney for pointing out the use of the "comm" and "awk" commands.

                    This code is a combo of my first thoughts on how to approach the problem with subsequent introduction of the "comm" command.

                    Code:
                    bash-3.2$ cat junk
                    chr1	chr2
                    chr2	chr1
                    chr3	chr4
                    chr5	chr4
                    bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
                    bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
                    bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
                    bash-3.2$ cat commonjunk
                    chr3
                    	chr4
                    chr5
                    Thanks everyone,
                    John



                    • #11
                      Sure John,

                      this is a very useful forum
