Hi, I have a FASTA file with sequences like the one below. It is an annotated file: each sequence name includes the function name and the organism.
I want to do this kind of filtering:
1> Extract the sequence name ("mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]") to a text file. It would be better to separate the fields with commas, that is, to make three columns: ID, function, and organism.
2> After I create the above text file, I can choose the organisms that I want to keep. Then filter the FASTA file, so that I get all the sequences I need for those particular organisms.
Is there any software or Unix command like grep/awk that can do this?
Code:
>mgm4510423.3|contig02227|RefSeq|73954f841ecd7c512c5428ed1b1a747e accession=[NP_559375.1],function=[carbamate kinase],organism=[Pyrobaculum aerophilum str. IM2]
AAGAAACGTCGACGTAGCCGCCAGAGTCGTGGCAGGGgTAATGCAGGGAGGCCACCAGGTGGTGGTGACGCACGGCAACGGGCCCCAGGTGGGCTACCTGGCGgAGTTGCaGAgaGACAACGGCACATTTCGGCTGGACGCCCTAAACGCCATGACGCaGGGgATGCTCGGCTACTTCCTTGTCTCTGCGCTTGATAAATACTTAGGCAGGGGGAGGGCCGCGGCTTTGGTGACCAGAGTCGAGGTGGACTGCGACGACCCGGCTTTTaaagaCCCGACcAAGTTCATAGGTCCCCTATACGGCAAGGAaCaGgCTGAGGCCCTCGCACAGAGGTACGGGTGGCAGTTTAGGCAAGACCCAAGAGGAGGCTGGCgtCGCGTCGTCGCGTCGCCTACGCCGCTCAGAAtcGTGGAGATAGAGGCCGTAAAGaGGTTGCTGgACGCGgGTTTCGTCGTTGTGGCGgCGGGCGGCGGCGGTaTACCGCTCTGCGGAGACAGAgaCGTAGAGGGGGTTATAGACAAGGACTTGGCCTCTTCTCTCCTCGCTGTGGAGCTCGGCGCGGACTTCTTCATGATGCTGACCGACATAGACGCCGTCTACCTAAACTACGGGAaGCCGAACCAGAGGAGGCTAGACAGCGTAGGGGCTGACGAGCTGGAGAGGTATTTCGCCGAtGGCcACTTCCCGCCGGGCTCCATGGGGCCGAAGGTGCAGGCCGCGATAAACTTCGTGAAacAAAcGGggaGAaGGGCGGCCATCGGGGCGCTGGAGGAGGGCTAtGACGtGTTCAGGGGAATAAAGGGGACCCAGGTgACGCCTTAGAGCTCGTTTATTGGCTTTTCGTATTCCTCCCTcTtCtGGAGGTCTCGgATCTTgACTACGCCGCGCTCCAGCTCTTTCTTGCCGATTATGATTAGGtACCGCGTGCCTATCTTCAAGATGTATTCAAAGGCCTcTTTtAGGCTTTTCTCGCCCAGCTCCACAGCCACGCTGAAGCCTGCGCTCCTCAGCTTCTtcGCAACTGCCACGGCCTGCGGGTACGCCTCgTCGTCGAAGATGTAGATGTAGTAGTCCAGCGGCTTCTCCACGTTGTGGAGCCCcACGgCCTCcATAAACcTCTCAACGCCGATGGCGAaCCCCAGCGCCGgCGtCtttACGCCGCTGTAGAGCT
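For what it's worth, the two steps described above could be sketched in Python roughly as follows. This is only a minimal sketch, not a tested pipeline: it assumes every header sits on its own line and follows the exact `accession=[...],function=[...],organism=[...]` layout of the example record, and the function names (`parse_header`, `read_fasta`, `write_header_table`, `filter_by_organism`) as well as the file names in the usage comment are made up for illustration.

```python
import csv
import re

# Matches headers shaped like the example record only (assumption);
# headers with a different layout are skipped by the functions below.
HEADER_RE = re.compile(
    r"^>?(?P<id>\S+)\s+accession=\[(?P<accession>.*?)\],"
    r"function=\[(?P<function>.*?)\],organism=\[(?P<organism>.*)\]\s*$"
)


def parse_header(line):
    """Return (id, function, organism) for a header line, or None if it does not match."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return m.group("id"), m.group("function"), m.group("organism")


def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined into one string."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None and line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def write_header_table(fasta_lines, out_handle):
    """Step 1: write a three-column CSV (ID, function, organism)."""
    writer = csv.writer(out_handle)  # quotes fields that themselves contain commas
    writer.writerow(["ID", "function", "organism"])
    for header, _ in read_fasta(fasta_lines):
        fields = parse_header(header)
        if fields is not None:
            writer.writerow(fields)


def filter_by_organism(fasta_lines, wanted):
    """Step 2: yield only the records whose organism is in `wanted` (exact match)."""
    wanted = set(wanted)
    for header, seq in read_fasta(fasta_lines):
        fields = parse_header(header)
        if fields is not None and fields[2] in wanted:
            yield header, seq


# Hypothetical usage (file names are placeholders):
#
# with open("input.fasta") as fh, open("headers.csv", "w", newline="") as out:
#     write_header_table(fh, out)
#
# keep = ["Pyrobaculum aerophilum str. IM2"]
# with open("input.fasta") as fh, open("filtered.fasta", "w") as out:
#     for header, seq in filter_by_organism(fh, keep):
#         out.write(f"{header}\n{seq}\n")
```

The `csv` module is used instead of joining fields by hand because function names in annotations may contain commas themselves; the writer quotes such fields so the table still has three columns.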