
**kmcarr** · 02-08-2012, 10:15 AM

Originally posted by Aeolus Huios

I want to compare the first line with the next line of a file containing a single column. For example, a file contains:
NM_1
NM_1
NM_1
NM_2
NM_2
NM_3
NM_4
NM_5
NM_5
NM_5
NM_5
I want to get output as
1 NM_1
2 NM_1
3 NM_1
1 NM_2
2 NM_2
1 NM_3
1 NM_4
1 NM_5
2 NM_5
3 NM_5
4 NM_5
separated by tabs.

First question, do you really need every occurrence of each identical element written to the output? It seems to me that it would be just as informative, easier to read/parse and more compact if your output only contained one line for each unique element with the count for that element. For example:

Code:

3 NM_1
2 NM_2
1 NM_3
1 NM_4
4 NM_5
etc....

The way I would do this is with the Unix command uniq. Since uniq only collapses adjacent identical lines, its input needs to be sorted, so I always run sort first; it's never good to assume that your input is already sorted. By default uniq collapses all identical lines into a single line, and adding the -c option also outputs the number of times each element occurred in the original file.

Code:

sort inputFile | uniq -c > outputFile
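
To see why the sort step matters, here is a small sketch on made-up input where identical lines are not adjacent. The sample data and variable names are only for illustration.

```shell
# Hypothetical sample input: identical lines are NOT next to each other.
input=$(mktemp)
printf 'NM_2\nNM_1\nNM_2\nNM_1\nNM_1\n' > "$input"

# uniq alone only merges neighbouring duplicates, so NM_1 and NM_2
# each appear more than once in this result.
unsorted_counts=$(uniq -c "$input")

# After sort, identical lines are adjacent and each element is counted once.
sorted_counts=$(sort "$input" | uniq -c)

printf 'uniq -c alone:\n%s\n\n' "$unsorted_counts"
printf 'sort | uniq -c:\n%s\n' "$sorted_counts"
rm -f "$input"
```

The first result has four lines (NM_2, NM_1, NM_2, NM_1), while the second has one line per unique element.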

I should note that uniq prints leading spaces before the count, and the separator between the count and the element is a space, not a tab. The output for the above example would look like:

Code:

   3 NM_1
   2 NM_2
   1 NM_3
   1 NM_4
   4 NM_5

You could clean these up by adding sed and tr to the command pipeline:

Code:

sort inputFile | uniq -c | sed -e 's/^ *//' | tr ' ' '\t' > outputFile

This produces output that looks like:

3	NM_1
2	NM_2
1	NM_3
1	NM_4
4	NM_5
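
One caveat with the sed/tr cleanup: tr turns every space into a tab, including any spaces inside an element. That is harmless for IDs like NM_1, but if the lines could contain spaces, a possible alternative is to let awk replace only the separator after the count. This is a sketch, and count_tsv is a made-up helper name.

```shell
# count_tsv: hypothetical helper, prints "count<TAB>element" for each
# unique line of the file given as $1.
count_tsv() {
    sort "$1" | uniq -c |
        # Save the count, strip the padded count and its trailing space
        # from the line, then print the count, a tab, and the rest.
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}

# Example run on made-up data, where one element contains a space.
sample=$(mktemp)
printf 'b x\na\nb x\n' > "$sample"
count_tsv "$sample"
rm -f "$sample"
```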

**Richard Finney** · 02-08-2012, 10:25 AM

awk '{ if ($0 != prev) k = 1; else k++; print k "\t" $0; prev = $0 }' file.txt
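
The one-liner above resets the counter whenever a line differs from the previous one, so it relies on identical lines being adjacent. If that cannot be assumed, one possible variant keeps a separate counter per element in an awk array and leaves the input order unchanged. This is a sketch; running_count and seen are names chosen for illustration.

```shell
# running_count: hypothetical helper, prints "n<TAB>line" where n is how
# many times that exact line has been seen so far in the file given as $1.
running_count() {
    awk '{ n = ++seen[$0]; print n "\t" $0 }' "$1"
}

# Example run on made-up data with non-adjacent duplicates.
sample=$(mktemp)
printf 'NM_1\nNM_2\nNM_1\n' > "$sample"
running_count "$sample"
rm -f "$sample"
```

On input where identical lines are already grouped, as in the original question, this gives the same numbering as the one-liner.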

**adaptivegenome** · 02-08-2012, 10:59 AM

sort | uniq -c

This is the most elegant solution.

**Aeolus Huios** · 02-08-2012, 11:39 AM

Hi kmcarr, Gege,

Thanks a lot for the replies, but I already know how to use the Linux uniq command to get the frequency of repeated data. I want the output in the format I described.
:-)

Hi Richard,

Let me try it; I will reply back to you after a while.
Thanks a lot. :-)

With regards,
Aeolus

**adaptivegenome** · 02-08-2012, 03:13 PM

Originally posted by Aeolus Huios

Hi kmcarr, Gege,

Thanks a lot for the replies, but I already know how to use the Linux uniq command to get the frequency of repeated data. I want the output in the format I described.
:-)

Hi Richard,

Let me try it; I will reply back to you after a while.
Thanks a lot. :-)

With regards,
Aeolus

Certainly you can write a long script. But you could also do a "sort | uniq -c", then use "cut" to grab the columns individually and "paste" to reassemble them however you want, with whatever delimiter you want. That makes it just a couple of Unix commands.
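
A rough sketch of that cut/paste idea, assuming the elements are the single-column IDs from the question. The function name and temporary files are made up for illustration. Note that uniq pads the count with leading spaces, so they are stripped first, and that this gives one line per unique element rather than a running count.

```shell
# counts_cut_paste: hypothetical helper, prints "count<TAB>element" for
# each unique line of the file given as $1.
counts_cut_paste() {
    tmp=$(mktemp)
    # Strip uniq's leading padding so the count is the first
    # space-delimited field.
    sort "$1" | uniq -c | sed 's/^ *//' > "$tmp"
    cut -d' ' -f1 "$tmp" > "$tmp.count"    # the counts
    cut -d' ' -f2- "$tmp" > "$tmp.name"    # the elements
    # paste joins corresponding lines with a tab by default;
    # -d selects a different delimiter.
    paste "$tmp.count" "$tmp.name"
    rm -f "$tmp" "$tmp.count" "$tmp.name"
}

# Example run on made-up data.
sample=$(mktemp)
printf 'NM_1\nNM_1\nNM_2\n' > "$sample"
counts_cut_paste "$sample"
rm -f "$sample"
```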

**Aeolus Huios** · 02-08-2012, 10:06 PM

Hi Richard,

Once again, thanks a lot. It works very well. :-)
Can you recommend a good guide book or online articles on AWK commands?
I would be very grateful.

With regards,
Pawan


To get the no. of repeats and along with the repeated element
