  • shyam_la
    Member
    • Mar 2012
    • 97

    Question about editing a tabbed text file..

    I have a tabbed text file with several rows and several columns.

    If x rows have the same contents in columns A, B, and G, I want to delete (x-1) of those rows entirely and keep only one (a kind of duplicate removal).

    Is there a command in Linux or Windows to do that?
  • Heisman
    Senior Member
    • Dec 2010
    • 534

    #2
    You might have to move some fields around, but look into using "sort" and "uniq". Is this a homework question?


    • shyam_la
      Member
      • Mar 2012
      • 97

      #3
      Originally posted by Heisman View Post
      You might have to move some fields around, but look into using "sort" and "uniq". Is this a homework question?
      No. It's for an annotated mutation list I generated with SNPEff. Most mutation loci have been assigned four or five lines each, because multiple transcripts are known to occur at that locus. I want to get rid of the redundancy when the effect of the mutation is the same, irrespective of transcript.


      • Heisman
        Senior Member
        • Dec 2010
        • 534

        #4
        Originally posted by shyam_la View Post
        No. It's for an annotated mutation list I generated with SNPEff. Most mutation loci have been assigned four or five lines each, because multiple transcripts are known to occur at that locus. I want to get rid of the redundancy when the effect of the mutation is the same, irrespective of transcript.
        There's probably a better way but you could use awk like this:

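        awk -F'\t' '{print $0"\t"$1","$2","$7}' [input_file] | sort

        (here $1, $2, and $7 are the numeric positions of columns A, B, and G; awk does not understand column letters) and then pipe that into uniq and use the -f option. That might work, although I'm not sure you could easily specify which transcript you keep.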
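For readers who prefer Python, the same sort-then-keep-one idea can be sketched as below. This is only an illustration under assumptions: the function name `first_per_key` is made up for this example, and the 0-based positions 0, 1, 6 stand in for columns A, B, G of a tab-delimited file.

```python
from itertools import groupby


def first_per_key(lines, key_columns=(0, 1, 6)):
    """Sort tab-separated lines by the key columns, then keep one line per key.

    key_columns are 0-based positions (assumed here: A=0, B=1, G=6).
    """
    def key(line):
        fields = line.rstrip("\r\n").split("\t")
        return tuple(fields[i] for i in key_columns)

    kept = []
    # sorted() is stable, so within one key the original file order is kept
    # and next(group) returns the line that appeared first in the input.
    for _, group in groupby(sorted(lines, key=key), key=key):
        kept.append(next(group))
    return kept
```

Because each group is available as a whole, this form would also let you pick a specific line per key (for example a preferred transcript) instead of the first one, which is the part that is hard to control with uniq.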


        • shyam_la
          Member
          • Mar 2012
          • 97

          #5
          I am sorry, but I don't code. I am not a bioinformatician; I am an MD working as a research associate, and I have no specialised computer training.

          I did this:

          $ awk '{print $0"\t"$A","$B","$J","$P}' out.txt | sort > out_mod.txt

          $ uniq --skip-fields=1 out_mod.txt out2.txt

          awk did appear to sort the file by A, then B, then J, then P, but it also messed things up by copying all the columns A to U over and over, six times, side by side.

          uniq didn't do anything.

          Is my syntax correct?

          Thank you.


          • Heisman
            Senior Member
            • Dec 2010
            • 534

            #6
            I see. Is it possible for you to do this in excel? If you have a bunch of different files you can concatenate them together in linux and then have one file to work with in excel. If you want to add a column to each file with a sample specific ID that can be done pretty easily in linux before concatenating the files and putting it into excel.

            Otherwise, could you provide a few sample lines from one of your files that has lines you want to keep and get rid of? And then for those sample lines also provide a smaller subset of lines with the desired output? I could probably write a quick bash script to do it.


            • ucpete
              Member
              • Dec 2008
              • 35

              #7
              I'd just do it in Python. Let's say you have 10 fields on each line and you care about the 1st, 2nd, and 7th in terms of defining uniqueness.

              HTML Code:
              inf = open("yourfile.txt")
              outf = open("yourfile_unique.txt",'w')
              uniqueValues = {}  # keys that have already been written out
              for line in inf:
                  fields = line.strip().split('\t')
                  keyTuple = (fields[0],fields[1],fields[6])  # 1st, 2nd, and 7th fields
                  if keyTuple not in uniqueValues:
                      uniqueValues[keyTuple] = None
                      outf.write(line)
              Yup. That'll do it.

              EDIT: The indentations were all off-- I cleaned up a bit.
              Last edited by ucpete; 06-13-2012, 03:39 PM.


              • shyam_la
                Member
                • Mar 2012
                • 97

                #8
                Originally posted by Heisman View Post
                I see. Is it possible for you to do this in excel? If you have a bunch of different files you can concatenate them together in linux and then have one file to work with in excel. If you want to add a column to each file with a sample specific ID that can be done pretty easily in linux before concatenating the files and putting it into excel.

                Otherwise, could you provide a few sample lines from one of your files that has lines you want to keep and get rid of? And then for those sample lines also provide a smaller subset of lines with the desired output? I could probably write a quick bash script to do it.
                Yes, I have been using excel to view my results. I have only one sample in so far. So, there aren't multiple files to merge.. Just one.

                Just one list of mutations. I am experimenting with the different tools and callers to get a pipeline at the moment. Using the Exome manual here for pre processing and MuTect from Broad gave excellent mutation calls. After annotation, the type of mutations expected (UV signature) were found in huge amounts and also some of the genes to be mutated in this type of tumor were found mutated. I think I have a viable pipeline to run things through, once more sequences start coming in..

                Anyway, story aside - few lines as you asked..

                1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033493 NM_033493.ex.18 3 SYNONYMOUS_CODING D/D gaC/gaT 40 1 2310
                1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033492 NM_033492.ex.18 3 SYNONYMOUS_CODING D/D gaC/gaT 40 1 2337
                1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033486 NM_033486.ex.18 3 SYNONYMOUS_CODING D/D gaC/gaT 40 1 2343
                1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033487 NM_033487.ex.16 3 UTR_5_PRIME: 380 bases from TSS


                There are columns A to U in there. If columns A, B, J, O, P, S, and T are the same, as in the first three lines of the example above, I want only one line to be retained and the remaining two to be discarded.

                Thank you.

                PS: Three columns are mostly empty; that's why you see fewer columns than A to U there.
                Last edited by shyam_la; 06-13-2012, 03:53 PM. Reason: To make it clear, why there were fewer columns,..
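The request above (keep one line per combination of columns A, B, J, O, P, S, T) could be sketched in Python roughly as follows. Everything named here is an assumption for illustration: the helper names `column_index` and `dedupe`, a tab-delimited file, and 21 columns (A to U). Short lines are padded with empty strings so that mostly empty trailing columns do not break the lookup.

```python
def column_index(letter):
    """Convert a spreadsheet-style column letter (A..Z) to a 0-based index."""
    return ord(letter.upper()) - ord("A")


def dedupe(lines, key_letters="ABJOPST", n_columns=21):
    """Keep the first line for each combination of values in the key columns.

    n_columns=21 corresponds to columns A..U (an assumption about the file).
    """
    indices = [column_index(c) for c in key_letters]
    seen = set()
    kept = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Pad short lines so every key column exists, even if it is empty.
        fields += [""] * (n_columns - len(fields))
        key = tuple(fields[i] for i in indices)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept
```

To apply it to a file, one could read the lines with `open("yourfile.txt")`, pass them to `dedupe`, and write the returned lines to a new file; the original order of the retained lines is preserved.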


                • shyam_la
                  Member
                  • Mar 2012
                  • 97

                  #9
                  Originally posted by ucpete View Post
                  I'd just do it in Python. Let's say you have 10 fields on each line and you care about the 1st, 2nd, and 7th in terms of defining uniqueness.

                  HTML Code:
                  inf = open("yourfile.txt")
                  outf = open("yourfile_unique.txt",'w')
                  uniqueValues = {}
                  for line in inf:
                      fields = line.strip().split('\t')
                      keyTuple = (fields[0],fields[1],fields[6])
                      if keyTuple not in uniqueValues:
                          uniqueValues[keyTuple] = None
                          outf.write(line)
                  Yup. That'll do it.

                  EDIT: The indentations were all off-- I cleaned up a bit.
                  I don't code; I'm not a programmer. I just installed Python 2.7.
                  Can do it only if you are willing to take me through it step by step!! :P


                  • ucpete
                    Member
                    • Dec 2008
                    • 35

                    #10
                    Type "python" from within the directory containing your file. Enter the above line-by-line, replacing "yourfile.txt" with whatever your file name is, and give it a descriptive output file name as well (not "yourfile_unique.txt"). Hit ctrl-d to exit python and boom, you got what you wanted. Then go to the python website and look for the basic tutorials. Once you've read a little bit there, come back and try to understand the code above. I'm a biologist and it took me about one week of practice to be able to write code like that above to accomplish simple tasks quickly. If you're not using computers to do your research, you're probably doing it wrong.


                    • shyam_la
                      Member
                      • Mar 2012
                      • 97

                      #11
                      Originally posted by ucpete View Post
                      Type "python" from within the directory containing your file. Enter the above line-by-line, replacing "yourfile.txt" with whatever your file name is, and give it a descriptive output file name as well (not "yourfile_unique.txt"). Hit ctrl-d to exit python and boom, you got what you wanted. Then go to the python website and look for the basic tutorials. Once you've read a little bit there, come back and try to understand the code above. I'm a biologist and it took me about one week of practice to be able to write code like that above to accomplish simple tasks quickly. If you're not using computers to do your research, you're probably doing it wrong.
                      I believe it's Ctrl-Z in Windows. Anyway, I did what you said, line by line, and got this:

                      Traceback (most recent call last):
                      File "<stdin>", line 3, in <module>
                      IndexError: list index out of range

                      If you're not using computers to do your research, you're probably doing it wrong - Agreed. I am using it; I am just not coding, yet.
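An IndexError on the line that builds the key means at least one line split into fewer fields than the highest index used. A small diagnostic along these lines could show which lines are short before attempting the deduplication. It is a sketch only; the function names `field_count_report` and `short_lines` are invented for this example, and a tab delimiter is assumed.

```python
from collections import Counter


def field_count_report(lines, delimiter="\t"):
    """Return a Counter mapping number-of-fields to number-of-lines."""
    counts = Counter()
    for line in lines:
        counts[len(line.rstrip("\r\n").split(delimiter))] += 1
    return counts


def short_lines(lines, needed, delimiter="\t"):
    """Return (line_number, field_count) for lines with fewer than `needed` fields."""
    result = []
    for number, line in enumerate(lines, start=1):
        n = len(line.rstrip("\r\n").split(delimiter))
        if n < needed:
            result.append((number, n))
    return result
```

Calling `field_count_report(open("yourfile.txt"))` gives a quick overview; if nearly every line reports a single field, the file is probably not tab-delimited at all.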


                      • ucpete
                        Member
                        • Dec 2008
                        • 35

                        #12
                        You'll be coding soon! It seems like your file might have a header line. Try this instead:

                        HTML Code:
                        inf = open("yourfile.txt")
                        outf = open("yourfile_unique.txt",'w')
                        uniqueValues = {}
                        for line in inf:
                            fields = line.strip().split('\t')
                            if len(fields) > 6:  # skip lines with fewer than 7 fields
                                keyTuple = (fields[0],fields[1],fields[6])
                                if keyTuple not in uniqueValues:
                                    uniqueValues[keyTuple] = None
                                    outf.write(line)


                        • shyam_la
                          Member
                          • Mar 2012
                          • 97

                          #13
                          There is no header line. At least none that is visible in Excel.


                          • shyam_la
                            Member
                            • Mar 2012
                            • 97

                            #14
                            This is the error now:

                            Traceback (most recent call last):
                            File "<stdin>", line 4, in <module>
                            IndexError: list index out of range


                            • ucpete
                              Member
                              • Dec 2008
                              • 35

                              #15
                              I'm guessing now that your file isn't tab-delimited and is space-delimited instead. Try changing the split('\t') call to just split(). I can't really help without much more information, sorry.
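How the line is split matters when some columns are empty, as mentioned earlier in the thread. The sketch below (the function name `compare_splits` is made up for this illustration) shows three variants side by side: `strip()` followed by `split('\t')` drops empty columns at the end of a line, and a bare `split()` collapses every empty column, so positions shift. Whether either effect is the actual cause of the error here cannot be confirmed from the thread.

```python
def compare_splits(line):
    """Return the field lists produced by three ways of splitting one line."""
    return {
        # As in the posted code: strip() also removes trailing tabs.
        "strip_then_tab": line.strip().split("\t"),
        # Remove only the line ending, so trailing empty columns survive.
        "keep_trailing_tabs": line.rstrip("\r\n").split("\t"),
        # Split on any run of whitespace; empty columns disappear entirely.
        "any_whitespace": line.split(),
    }


example = compare_splits("1\t1653142\t\tA\t\t\n")
print(len(example["strip_then_tab"]))      # 4
print(len(example["keep_trailing_tabs"]))  # 6
print(len(example["any_whitespace"]))      # 3
```

If empty columns need to keep their positions, the middle variant is the one that preserves them.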

