  • #16
    Originally posted by rkk View Post
    command has to identify min and max value from col1 values.. and then bin that into 100bp regions...
    I am afraid your bins would then look like this:

    Code:
    10175-10275 8
    10276-10375 1
    10376-10475 1
    10476-10575 2



    • #17
      Once the minimum value is identified, round it down to the nearest 100. For example, here the minimum is 10175, so the bins should start at 10100. Hope this helps.
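      A one-liner sketch of that rounding (assuming, as in the posts below, that the data sit in a file called input):

```shell
# Find the minimum of column 1 and round it down to the nearest 100
awk 'NR==1 || $1 < min { min = $1 } END { print int(min/100)*100 }' input
```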



      • #18
        Originally posted by rkk View Post
        I should use that command in LINUX...

        Now, I have another issue

        I have a file like the following. I need to bin the first column into 100 bp regions and count the second-column values for each bin:
        10175 1
        10179 1
        10189 1
        10191 1
        10201 1
        10243 1
        10249 1
        10262 1
        10313 1
        10414 1
        10485 1
        10499 1

        The output should be something like this..

        10101-10200 4
        10201-10300 4
        10301-10400 1
        10401-10500 3

        Can someone help with this..

        Thanks in advance..
        @rkk,

        Your problem can be solved in two ways.

        Code:
        Solution 1 (considering the minimum and maximum values from col1):
        
        cat input
        10175	1
        10179	1
        10189	1
        10191	1
        10201	1
        10243	1
        10249	1
        10262	1
        10313	1
        10414	1
        10485	1
        10499	1


        Code:
        awk 'NR == 1 {max=$1 ; min=$1} $1 >= max {max = $1} $1 <= min {min = $1} END { print min"\t"max}' input | awk '{ print $1, i=$1+100;while(i++<$2) print i, i+=99}' > intermediate
        Code:
        cat intermediate
        
        10175 10275
        10276 10375
        10376 10475
        10476 10575
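        As a side note, the min/max pass and the bin generation can also be merged into a single awk call (a sketch that should reproduce the intermediate file above):

```shell
awk 'NR==1 {min=max=$1}                       # initialise on the first line
     {if ($1<min) min=$1; if ($1>max) max=$1} # track column-1 min and max
     END {
       print min, min+100                     # first bin starts at the minimum
       for (i=min+101; i<=max; i+=100)        # subsequent 100 bp bins
          print i, i+99
     }' input > intermediate
```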
        Now, consider the above intermediate file and run the following code

        Code:
        awk 'NR==FNR{                        # first file: the intermediate bin list
           C[NR]=$1 " " $2                   # store each bin as "start end"
           L[C[NR]]=0                        # initialise its count
           next
        }
        {                                    # second file: the data
         for (t in C) {
            split(C[t],v," ")
            if($1>=v[1] && $1<=v[2])         # position falls inside this bin
               L[C[t]]+=$2                   # add its count to the bin total
         }
        }
        END {
           for(i=1;i in C;i++)               # print bins in their original order
               print C[i] " " L[C[i]]
        }' intermediate input > output

        Code:
        cat output
        
        10175 10275 8
        10276 10375 1
        10376 10475 1
        10476 10575 2


        ###########################################


        Code:
        Solution 2 (rounding the minimum and maximum of column 1 down/up to the nearest 100):
        
        cat input
        10175	1
        10179	1
        10189	1
        10191	1
        10201	1
        10243	1
        10249	1
        10262	1
        10313	1
        10414	1
        10485	1
        10499	1
        Code:
        awk '{       
            min=$1<min||!min?$1:min
            max=$1>max||!max?$1:max
        }      
        END {
          s=int(min/100)*100
          e=int(max/100)*100+100
          print s " " s+100
          for(i=s+101;i<e;i+=100)
             print i " " i+99
        }' input > intermediate
        Code:
        cat intermediate
        10100 10200
        10201 10300
        10301 10400
        10401 10500

        Now, consider the above intermediate file and run the following code

        Code:
        awk 'NR==FNR{
           C[NR]=$1 " " $2
           L[C[NR]]=0
           next
        }
        {
         for (t in C) {
            split(C[t],v," ")
            if($1>=v[1] && $1<=v[2])
               L[C[t]]+=$2
         }
        }
        END {
           for(i=1;i in C;i++)
               print C[i] " " L[C[i]]
        }' intermediate input > output
        Code:
        cat output
        
        10100 10200 4
        10201 10300 4
        10301 10400 1
        10401 10500 3



        • #19
          Originally posted by rkk View Post
          Hello,

          I have a file like the following

          chr1 1234
          chr1 2345
          chr2 94837
          chr2 73457

          how can I split this data into two files

          chr1.txt

          chr1 1234
          chr1 2345

          chr2.txt

          chr2 94837
          chr2 73457

          Thanks in advance.
          What about a simple grep ?

          grep 'chr1' FILE > chr1.txt
          grep 'chr2' FILE > chr2.txt
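          One caveat worth hedging: on a file with more chromosomes, a plain pattern such as chr1 also matches chr10 through chr19. grep -w restricts the match to whole words:

```shell
grep -w 'chr1' FILE > chr1.txt   # -w: 'chr1' no longer matches chr10, chr11, ...
```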
          Francois Sabot, PhD

          Be realistic. Demand the Impossible.
          www.wikiposon.org



          • #20
            Originally posted by francois.sabot View Post
            What about a simple grep ?

            grep 'chr1' FILE > chr1.txt
            grep 'chr2' FILE > chr2.txt
            Francois,

            Grep is a handy tool, but you have to repeat that command for every chromosome in your first column. With awk, a single command does the whole task at once.

            After all, life is too short; no one wants to sit there typing out each chromosome, least of all me.



            • #21
              Originally posted by gokhulkrishnakilaru View Post
              Code:
              awk '{print > $1".txt"}' input
              This is the correct and best answer to the thread's original question. The other awk command that was posted at almost the same time has a space in the output file name after "$1"; it should not change anything, but if you got an error, try it as quoted here.
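              One practical caveat, not an issue with this small example: with many distinct values in column 1, some awk implementations can run out of open file descriptors. A sketch that avoids this by sorting first and closing each file when its group ends:

```shell
sort -k1,1 input | awk '$1 != prev { close(out)          # finish the previous file
                                     out = $1 ".txt"     # open one file per key
                                     prev = $1 }
                        { print > out }'
```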

              As for the second problem: since you already know the resolution you want, you don't need to compute min and max. Everything in one step:

              Code:
              awk '{bin[int($1/100)]+=$2}END{for (i in bin)print i*100+1"-"(i+1)*100,bin[i]}' input
              This line should give exactly the output you want. Pipe it through sort -n if needed, and/or change the "-" separator.
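              For instance, the complete pipeline on the posted input might look like this (sort -n added only for ordered output):

```shell
awk '{bin[int($1/100)]+=$2} END{for (i in bin) print i*100+1"-"(i+1)*100, bin[i]}' input | sort -n
```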



              • #22
                Originally posted by syfo View Post
                The other awk command that was posted almost at the same time has a space in the output file name after "$1", it should not change anything but if you got an error try it as quoted here.
                The space in $1 ".txt" is perfectly valid and cannot cause any problems. When you concatenate strings in awk, you separate them with spaces on the right-hand side: http://www.gnu.org/software/gawk/man...atenation.html
                Leaving the space out in this case does not cause a problem either; still, it is better practice to keep a space between concatenated strings. For example, when you concatenate several awk variables you must separate them with spaces: v3 = v1 v2. Of course, v3 = v1v2 will not work, since v1v2 is parsed as a single variable name.
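                A quick demonstration of both forms (purely illustrative):

```shell
awk 'BEGIN { v1 = "chr"; v2 = "1"
             v3 = v1 v2          # variable concatenation: space is required
             print v3 ".txt"     # with a string literal, space is optional...
             print v3".txt" }'   # ...this works too
```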



                • #23
                  OK good, thanks Alex for the clarification. Both commands should work then; I don't see any reason for an error either. Maybe try \awk instead of awk, in case an alias or shell function is shadowing it?

                  Rkk, let me know if there is any issue with my one-liner for your second task.



                  • #24
                    Originally posted by francois.sabot View Post
                    What about a simple grep ?

                    grep 'chr1' FILE > chr1.txt
                    grep 'chr2' FILE > chr2.txt
                    A more generic grep solution could be something like
                    Code:
                    for i in $(cut -d" " -f1 input | sort -u); do grep -w "$i" input > "$i.txt"; done
                    But the awk alternative is better.

