Hello,
My FASTA file contains one long consensus sequence (gigabases long) in which the actual sequences are separated by runs of 'n' padding, like this:
actgggacnnnnnnnnnnnnnnnnnnnnnnnnnnactgacgtggattgc
aatnnnnnnnnnnnnnnnaccaattggatagagaccnnn
I've searched fairly thoroughly online, and the solutions I found came close but did not solve my problem. For example, people describe splitting the long sequence into smaller files, but I do not want that. I want the runs of n removed and all the resulting sequences kept in the same file, ideally with the start and end position of each sequence (counted over the whole padded sequence, line breaks ignored) in its header.
Desired output (header format: >seqname_start_end):
>seq1_1_8
actgggac
>seq2_35_52
actgacgtggattgcaat
>seq3_68_85
accaattggatagagacc
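To make the request concrete, here is a rough sketch of the logic I have in mind, written in Python rather than Perl. It is only an illustration under some assumptions: the file holds a single record, the padding character is always n or N, positions are 1-based and inclusive, and the names split_on_n and write_fasta are made up for this example. It reads the input line by line, so the full gigabase sequence never has to be held in memory, although each n-free stretch is collected in memory before it is written.

```python
import re

# One token per maximal run of padding (n/N) or of non-padding characters.
RUN = re.compile(r"[nN]+|[^nN]+")

def split_on_n(lines):
    """Yield (start, end, sequence) for every stretch without n/N.

    Positions are 1-based, inclusive, and counted over the whole
    padded sequence with line breaks ignored.
    """
    pos = 0        # bases consumed on previous lines
    start = None   # start position of the stretch being collected
    parts = []     # pieces of the current stretch (it may span lines)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the original header line
        for m in RUN.finditer(line):
            if m.group()[0] in "nN":
                # Padding closes the stretch that was open, if any.
                if start is not None:
                    yield start, pos + m.start(), "".join(parts)
                    start, parts = None, []
            else:
                if start is None:
                    start = pos + m.start() + 1
                parts.append(m.group())
        pos += len(line)
    if start is not None:
        # The input ended without trailing padding.
        yield start, pos, "".join(parts)

def write_fasta(lines, out):
    """Write each stretch as a FASTA record named >seqK_start_end."""
    for k, (start, end, seq) in enumerate(split_on_n(lines), 1):
        out.write(f">seq{k}_{start}_{end}\n{seq}\n")
```

It could be used by opening the input file and calling write_fasta(input_handle, sys.stdout), or by passing any other writable file handle as the second argument. Each sequence is written on a single line, so very long stretches would need wrapping added.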
If anyone could point me towards the right tool (BioPerl, etc.) or give me some pseudocode in Perl, I would appreciate it.
Thanks.