Hi all!
After several days of searching and testing, I'm finally here asking for help...
I have a large multi-FASTA file from RNA-seq which contains isoforms.
The file looks like the example below (truncated for the example).
In that example, the sequence ">comp32_c0_seq1 len=365 path=[18710:0-364]" is unique, while the sequences ">comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]" and ">comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]" are isoforms (note the "seq1" and "seq2").
Code:
>comp32_c0_seq1 len=365 path=[18710:0-364]
CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
>comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp36_c0_seq1 len=275 path=[22213:0-274]
CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
Typically, what I want is a small script that is able to:
1) compare ID #1 with IDs #2, #3, #4...
2) if ID #1 is unique, move on to ID #2
3) if ID #1 is not unique, compare its length (taken from the "len=XXX" string in the header or from the sequence itself, it doesn't matter) with the lengths of the other isoforms of that ID, and finally keep only the longest.
Note that some sequences have more than 10 isoforms...
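A minimal sketch of the steps above, in Python with only the standard library. It assumes that isoforms of the same sequence share an ID that differs only in the trailing "_seqN" suffix (as in the example), and it measures the length from the sequence itself rather than from "len=XXX". The function names (read_fasta, group_id, longest_isoforms) and the 60-character output line width are illustrative choices, not part of any existing tool.

```python
import re
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_id(header):
    """Return the ID shared by all isoforms (assumes a trailing _seqN suffix)."""
    fields = header.split()
    seq_id = fields[0] if fields else ""
    return re.sub(r"_seq\d+$", "", seq_id)


def longest_isoforms(records):
    """Keep one record per group: the longest sequence (first one wins ties)."""
    best = {}
    for header, seq in records:
        key = group_id(header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


def main(in_path, out_path, width=60):
    with open(in_path) as fin:
        kept = longest_isoforms(read_fasta(fin))
    with open(out_path, "w") as fout:
        for header, seq in kept:
            fout.write(">" + header + "\n")
            for i in range(0, len(seq), width):
                fout.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: script.py input.fasta output.fasta", file=sys.stderr)
```

Because only one record per group is held in memory, the number of isoforms per sequence (10 or more) does not matter. Unique sequences form a group of one and are written unchanged.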
Is this feasible?
I would really appreciate it if you could give me a hand with this!
Thanks!
Gabriel.