  • syfo
    replied
    Originally posted by francois.sabot View Post
    What about a simple grep ?

    grep 'chr1' FILE > chr1.txt
    grep 'chr2' FILE > chr2.txt
    A more generic grep solution could be something like
    Code:
    for i in $(cut -d" " -f1 input | sort -u); do grep -w "$i" input > "$i.txt"; done
    But the awk alternative is better.



  • syfo
    replied
    OK good, thanks Alex for the clarification. Both commands should work then; I don't see any reason for an error either. Maybe try \awk instead of awk, in case awk is aliased to something else?

    Rkk, let me know if there is any issue with my one-liner for your second task.



  • alexdobin
    replied
    Originally posted by syfo View Post
    The other awk command that was posted almost at the same time has a space in the output file name after "$1", it should not change anything but if you got an error try it as quoted here.
    The space in $1 ".txt" is perfectly valid and cannot cause any problems. When you concatenate strings in awk, you separate them by spaces in the right-hand side: http://www.gnu.org/software/gawk/man...atenation.html
    Leaving the space out in this case does not cause a problem; however, it is better practice to put a space between concatenated strings. For example, if you concatenate several awk variables, you must put a space between them: v3=v1 v2. Of course, v3=v1v2 will not work, because awk reads v1v2 as the name of a single, different variable.
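A minimal sketch of the concatenation point above, assuming any POSIX-style awk. The variable names joined and unset_var are only illustrative.

```shell
# Compare "v1 v2" (concatenation) with "v1v2" (one different, uninitialized variable).
out=$(awk 'BEGIN {
    v1 = "chr"
    v2 = "1"
    joined = v1 v2        # two variables separated by a space: concatenated to "chr1"
    unset_var = v1v2      # a single variable named v1v2, never assigned: empty string
    print "[" joined "]"
    print "[" unset_var "]"
}')
printf '%s\n' "$out"
```

The first line printed is [chr1] and the second is [], since v1v2 was never set.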



  • syfo
    replied
    Originally posted by gokhulkrishnakilaru View Post
    Code:
    awk '{print > $1".txt"}' input
    This is the correct and the best answer to the original question of the thread. The other awk command that was posted almost at the same time has a space in the output file name after "$1", it should not change anything but if you got an error try it as quoted here.

    As for the second problem, since you already know the resolution you want you don't need to compute min and max. Everything in one step:

    Code:
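    awk '{bin[int(($1-1)/100)]+=$2}END{for (i in bin)print i*100+1"-"(i+1)*100,bin[i]}' input
    This line should give exactly the output you want; the -1 keeps a position that is an exact multiple of 100 (e.g. 10200) in the bin that ends at it. The order of for (i in bin) is not guaranteed, so pipe the output into sort -n if needed, and/or change the separator "-".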
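For readers who prefer it spelled out, here is a commented sketch of the same one-pass binning, run on the sample data from the question. It assumes 1-based positions, so it subtracts 1 before dividing and a position such as 10200 would land in 10101-10200, matching the requested labels. The width variable and the binned file name are my own additions.

```shell
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' '10175 1' '10179 1' '10189 1' '10191 1' '10201 1' '10243 1' \
              '10249 1' '10262 1' '10313 1' '10414 1' '10485 1' '10499 1' > input

awk -v width=100 '
{
    b = int(($1 - 1) / width)   # bin index for a 1-based position
    sum[b] += $2                # accumulate column 2 per bin
}
END {
    for (b in sum) {            # iteration order is unspecified, hence the sort below
        lo = b * width + 1
        hi = (b + 1) * width
        print lo "-" hi, sum[b]
    }
}' input | sort -n > binned
cat binned
```

Because the bin is computed directly from each position, no intermediate file with bin boundaries is needed. Bins with no data are simply not printed.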



  • gokhulkrishnakilaru
    replied
    Originally posted by francois.sabot View Post
    What about a simple grep ?

    grep 'chr1' FILE > chr1.txt
    grep 'chr2' FILE > chr2.txt
    Francois,

    Grep is a handy tool, but you have to repeat that command for each chromosome in your first column. With awk, a single command, run once, does the whole task.

    After all, time is precious. No one wants to sit there typing out each chromosome, at least I don't.



  • francois.sabot
    replied
    Originally posted by rkk View Post
    Hello,

    I have a file like the following

    chr1 1234
    chr1 2345
    chr2 94837
    chr2 73457

    how can I split this data into two files

    chr1.txt

    chr1 1234
    chr1 2345

    chr2.txt

    chr2 94837
    chr2 73457

    Thanks in advance.
    What about a simple grep ?

    grep 'chr1' FILE > chr1.txt
    grep 'chr2' FILE > chr2.txt
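One caveat with the plain pattern, sketched below on made-up data (the chr10 line is hypothetical, not from the original file): an unanchored 'chr1' also matches chr10, chr11 and so on. Anchoring the pattern to the start of the line and including the separator avoids that, assuming the columns are separated by a single space.

```shell
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' 'chr1 1234' 'chr10 555' 'chr2 94837' > FILE

loose=$(grep -c 'chr1' FILE)      # unanchored: counts the chr1 line and the chr10 line
strict=$(grep -c '^chr1 ' FILE)   # anchored at line start, followed by the space separator
echo "unanchored matches: $loose"
echo "anchored matches:   $strict"
```

For a tab-separated file the separator in the pattern would have to be changed accordingly.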



  • gokhulkrishnakilaru
    replied
    Originally posted by rkk View Post
    I should use that command in LINUX...

    Now, I have another issue

    I have a file like following..I need to bin the first column in 100bp regions and count the second column value for that bin
    10175 1
    10179 1
    10189 1
    10191 1
    10201 1
    10243 1
    10249 1
    10262 1
    10313 1
    10414 1
    10485 1
    10499 1

    The output should be something like this..

    10101-10200 4
    10201-10300 4
    10301-10400 1
    10401-10500 3

    Can someone help with this..

    Thanks in advance..
    @rkk,

    Your input can have two solutions

    Code:
    Solution 1 (considering your minimum and maximum value from col1):
    
    cat input
    10175	1
    10179	1
    10189	1
    10191	1
    10201	1
    10243	1
    10249	1
    10262	1
    10313	1
    10414	1
    10485	1
    10499	1


    Code:
    awk 'NR == 1 {max=$1 ; min=$1} $1 >= max {max = $1} $1 <= min {min = $1} END { print min"\t"max}' input | awk '{ print $1, i=$1+100;while(i++<$2) print i, i+=99}' > intermediate
    Code:
    cat intermediate
    
    10175 10275
    10276 10375
    10376 10475
    10476 10575
    Now, consider the above intermediate file and run the following code

    Code:
    awk 'NR==FNR{
       C[NR]=$1 " " $2
       L[C[NR]]=0
       next
    }
    {
     for (t in C) {
        split(C[t],v," ")
        if($1>=v[1] && $1<=v[2])
           L[C[t]]+=$2
     }
    }
    END {
       for(i=1;i in C;i++)
           print C[i] " " L[C[i]]
    }' intermediate input > output

    Code:
    cat output
    
    10175 10275 8
    10276 10375 1
    10376 10475 1
    10476 10575 2


    ###########################################


    Code:
    Solution 2 (Considering minimum value and nearest 100 and maximum value and nearest 100 from column1):
    
    cat input
    10175	1
    10179	1
    10189	1
    10191	1
    10201	1
    10243	1
    10249	1
    10262	1
    10313	1
    10414	1
    10485	1
    10499	1
    Code:
    awk '{       
        min=$1<min||!min?$1:min
        max=$1>max||!max?$1:max
    }      
    END {
      s=int(min/100)*100
      e=int(max/100)*100+100
      print s " " s+100
      for(i=s+101;i<e;i+=100)
         print i " " i+99
    }' input > intermediate
    Code:
    cat intermediate
    10100 10200
    10201 10300
    10301 10400
    10401 10500

    Now, consider the above intermediate file and run the following code

    Code:
    awk 'NR==FNR{
       C[NR]=$1 " " $2
       L[C[NR]]=0
       next
    }
    {
     for (t in C) {
        split(C[t],v," ")
        if($1>=v[1] && $1<=v[2])
           L[C[t]]+=$2
     }
    }
    END {
       for(i=1;i in C;i++)
           print C[i] " " L[C[i]]
    }' intermediate input > output
    Code:
    cat output
    
    10100 10200 4
    10201 10300 4
    10301 10400 1
    10401 10500 3



  • rkk
    replied
    Once the minimum value is identified, it should be rounded down to the nearest 100. For example, in this case the min value is 10175, so the first bin should start at 10100. Hope this helps.



  • gokhulkrishnakilaru
    replied
    Originally posted by rkk View Post
    command has to identify min and max value from col1 values.. and then bin that into 100bp regions...
    I am afraid then your bins would be like this

    Code:
    10175-10275 8
    10276-10375 1
    10376-10475 1
    10476-10575 2



  • rkk
    replied
    command has to identify min and max value from col1 values.. and then bin that into 100bp regions...



  • gokhulkrishnakilaru
    replied
    Originally posted by rkk View Post
    I should use that command in LINUX...

    Now, I have another issue

    I have a file like following..I need to bin the first column in 100bp regions and count the second column value for that bin
    10175 1
    10179 1
    10189 1
    10191 1
    10201 1
    10243 1
    10249 1
    10262 1
    10313 1
    10414 1
    10485 1
    10499 1

    The output should be something like this..

    10101-10200 4
    10201-10300 4
    10301-10400 1
    10401-10500 3

    Can someone help with this..

    Thanks in advance..
    Do you already know your bins?

    If not, what are your start values and end values to consider bins at 100bp?



  • rkk
    replied
    I should use that command in LINUX...

    Now, I have another issue

    I have a file like following..I need to bin the first column in 100bp regions and count the second column value for that bin
    10175 1
    10179 1
    10189 1
    10191 1
    10201 1
    10243 1
    10249 1
    10262 1
    10313 1
    10414 1
    10485 1
    10499 1

    The output should be something like this..

    10101-10200 4
    10201-10300 4
    10301-10400 1
    10401-10500 3

    Can someone help with this..

    Thanks in advance..



  • gene_x
    replied
    Originally posted by gokhulkrishnakilaru View Post
    Code:
    awk '{print > $1".txt"}' input
    $1 refers to the first column.

    For each distinct value in column 1, awk prints the line (print) to another file (>) named after that value ($1".txt").
    I understand printing to another file named after the column. What I don't get is where the separation based on the first column's contents happens.



  • gokhulkrishnakilaru
    replied
    Originally posted by gene_x View Post
    Good to learn an easier way to do this. Can you explain a bit how it works?

    Code:
    awk '{print > $1".txt"}' input
    $1 refers to the first column.

    For each distinct value in column 1, awk prints the line (print) to another file (>) named after that value ($1".txt").
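To make the per-line behaviour explicit, here is the same command written out in steps, as a sketch on the sample data from the original question. The variable name out is only for illustration. awk runs the action once per input line, so the file name is rebuilt from that line's first column every time, and that is where the separation happens.

```shell
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' 'chr1 1234' 'chr1 2345' 'chr2 94837' 'chr2 73457' > input

awk '{
    out = $1 ".txt"    # file name built from column 1 of the current line
    print > out        # ">" truncates a file the first time it is opened, then keeps appending
}' input

cat chr1.txt
cat chr2.txt
```

Lines do not need to be sorted or grouped by chromosome: each line goes to whichever file its own first column names.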



  • gokhulkrishnakilaru
    replied
    Originally posted by rkk View Post
    $head -5 test.txt

    1 9992
    1 9992
    1 9993
    1 9994
    1 9994


    $awk '{print > $1 ".txt"}' test.txt

    awk: syntax error at source line 1
    context is
    {print > $1 >>> ".txt" <<<
    awk: illegal statement at source line 1

    This is what I get for my test.txt file
    Where are you running it?

    Are you on a Linux server or in your Mac's terminal?

    Try using nawk or gawk instead of awk.
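Another thing that may be worth trying, offered as a hedged sketch rather than a confirmed diagnosis: some non-GNU awk implementations reject an unparenthesized concatenation after ">", which would match the syntax error shown above. Wrapping the file name expression in parentheses removes the ambiguity, and as far as I know it is accepted by gawk, nawk and the BSD awk alike. The last data line below is made up, to give a second key.

```shell
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' '1 9992' '1 9992' '1 9993' '2 10001' > test.txt

# Parentheses make it unambiguous that the whole concatenation is the output file name.
awk '{ print > ($1 ".txt") }' test.txt

cat 1.txt
cat 2.txt
```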

