
  • Scripting Help - Common Elements

    Hi all,

    I have a question that I hope someone can help me with.

    I have a file with two columns, A and B, each containing a list of gene coordinates.

    I'd like to write to another file every element that is present in column A but not in B.
    (That is, I'd like to output the elements the two columns do NOT have in common.)

    Do you have any suggestions (a little script in Perl or Python)? Not Excel, because the list is huge.

    Thank you in advance,
    Giorgio

  • #2
    Hi Giorgio,

    See if this helps:

    say you have test.txt like this (tab-separated):
    Code:
    chr1    chr2
    chr2    chr1
    chr3    chr4
    chr5    chr4
    Say you want the elements in column B (2nd) that are not in column A (1st). The output would then be "chr4", printed twice.

    This python script should do it. It assumes that all the unique elements in column A can fit in memory:

    Code:
    python -c "
    fin= open('test.txt')
    
    col_a= set()
    for line in fin:
        ## Unique elements in column A
        a= line.strip().split()[0]
        col_a.add(a)
    
    fin.seek(0)
    for line in fin:
        ## Print out elements in column B not in set A
        b= line.strip().split()[1]
        if b not in col_a:
            print(b)
    fin.close()
    "
    Output goes to stdout, so use > to redirect it to a file.

    Hope it helps and that I haven't made any mistakes!

    Dario



    • #3
      Thank you so much for your answer. I've tried it, but it gives me an error:

      Traceback (most recent call last):
      File "script.py", line 9, in <module>
      a= line.strip().split()[0]
      IndexError: list index out of range

      Do you know what the problem may be?



      • #4
        Can you post a sample of the first few lines of your input file? It could be that the first line(s) are empty, hence the error above.

        In fact, this version will skip blank lines:

        Code:
        python -c "
        fin= open('test.txt')
        
        col_a= set()
        for line in fin:
            if line.strip() == '':
                continue
            ## Unique elements in column A
            a= line.strip().split()[0]
            col_a.add(a)
        
        fin.seek(0)
        for line in fin:
            if line.strip() == '':
                continue
            ## Print out elements in column B not in set A
            b= line.strip().split()[1]
            if b not in col_a:
                print(b)
        fin.close()
        "
        By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?
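
        For reference, both directions can be computed in one pass with Python set differences. This is only a sketch under a few assumptions: the file is tab-separated, both columns fit in memory, and each value needs to be reported once (unlike the script above, which prints "chr4" twice). The helper name read_columns and the in-memory sample are made up for illustration.

```python
import io


def read_columns(handle, sep="\t"):
    """Collect the unique values of the first two columns, skipping blank lines."""
    col_a, col_b = set(), set()
    for line in handle:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # blank line or a row with a single column
        col_a.add(fields[0])
        col_b.add(fields[1])
    return col_a, col_b


# Hypothetical sample mirroring test.txt, with one blank line added
sample = io.StringIO("chr1\tchr2\nchr2\tchr1\n\nchr3\tchr4\nchr5\tchr4\n")
col_a, col_b = read_columns(sample)

only_in_a = sorted(col_a - col_b)  # present in A but not in B
only_in_b = sorted(col_b - col_a)  # present in B but not in A

print("A not in B:", only_in_a)
print("B not in A:", only_in_b)
```

        To run it on a real file, replace the StringIO object with open('test.txt').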



        • #5
          The "comm" command is useful for this kind of thing.

          Code:
          -bash-3.00$ cat junk
          chr1    chr2
          chr2    chr1
          chr3    chr4
          chr5    chr4
          -bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
          -bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
          -bash-3.00$ comm -3 junkcol1 junkcol2
          chr3
                  chr4
          chr5
          ----
          Check out the "comm" command with "man comm" or via a search engine.
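
          For anyone who prefers to stay in Python, the same "unique to either column" report can be approximated with a set symmetric difference. This is a rough sketch, not a replacement for comm: the function name comm_minus_3 is invented here, the two sets are typed in by hand from the example above, and Python sorts by code point, whereas comm expects input sorted in the current locale's collation order.

```python
col1 = {"chr1", "chr2", "chr3", "chr5"}  # unique values of column 1
col2 = {"chr1", "chr2", "chr4"}          # unique values of column 2


def comm_minus_3(left, right):
    """Mimic the layout of `comm -3`: left-only values at the margin,
    right-only values indented by one tab, shared values suppressed."""
    lines = []
    for value in sorted(left ^ right):  # ^ is the symmetric difference
        prefix = "" if value in left else "\t"
        lines.append(prefix + value)
    return lines


report = comm_minus_3(col1, col2)
print("\n".join(report))
```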



          • #6
            Rather than writing your own tool, you could use BEDTools intersectBed, which can do this if you convert your lists to .bed format.



            • #7
              Thank you Dariober,

              As you said, the problem was some empty lines in the middle of the list that I did not see at first. Now it works well, and the 'skip blank lines' version of the script works great.

              Richard Finney, thank you for your other suggestion; it works well too.



              • #8
                Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace "split()" with "split('\t')" in the script. The script I posted splits lines into columns at every run of whitespace, which includes, but is not restricted to, tab characters.
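
                A small illustration of the difference, using two made-up rows (the gene names and coordinates are hypothetical). Note that line.strip() would also remove a leading tab, so the sketch uses rstrip('\n') to keep an empty first column intact.

```python
# Hypothetical row whose fields contain spaces
row_with_spaces = "gene A\tgene B\n"
# Hypothetical row with an empty middle column
row_with_gap = "chr1:100-200\t\tchr2:300-400\n"

# split() with no argument breaks on any run of whitespace
spaces_default = row_with_spaces.strip().split()
gap_default = row_with_gap.strip().split()

# split('\t') breaks only on tab characters and keeps empty fields
spaces_tab = row_with_spaces.rstrip("\n").split("\t")
gap_tab = row_with_gap.rstrip("\n").split("\t")

print(spaces_default, spaces_tab)
print(gap_default, gap_tab)
```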

                Good luck!
                Dario



                • #9
                  Yes, in fact that was one of the things I noticed: for a tab-separated file, you need to add '\t' to the split() call in the script.

                  By the way, it works well!

                  Thank you again for your precious help.

                  Cheers,
                  Giorgio



                  • #10
                    This is a great discussion for a novice scripter in Python and the bash/C shells. Thank you, Giorgio C, for all of your questions. They provide food for thought. And thanks to Richard Finney for pointing out the use of the "comm" and "awk" commands.

                    This code combines my first thoughts on how to approach the problem with the "comm" command introduced later in the thread.

                    Code:
                    bash-3.2$ cat junk
                    chr1	chr2
                    chr2	chr1
                    chr3	chr4
                    chr5	chr4
                    bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
                    bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
                    bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
                    bash-3.2$ cat commonjunk
                    chr3
                    	chr4
                    chr5
                    Thanks everyone,
                    John



                    • #11
                      Sure John,

                      this is a very useful forum.
