Hi all,
I'm using pfam_scan.pl to annotate some sequences from a gene family. I'd like to take the output of my pfam annotation, which looks like this...
...and append the annotations to the headers of my original FASTA file (don't worry, I ran Pfam on an unaligned version), which looks like this:
The final product should look like this (all relevant annotations appended after the sequence name, and nothing added if the sequence wasn't annotated):
I feel like this should be a relatively simple awk script, but I'm not sure how to do it. Here's what I've come up with so far, but it isn't working, and it would only keep the header from the FASTA file anyway, not the sequence:
Any assistance would be *greatly* appreciated!!
Thanks,
Kevin
Code (pfam_scan.pl output):
# pfam_scan.pl, run at Mon May 5 14:19:29 2014
#
# Copyright (c) 2009 Genome Research Ltd
# Freely distributed under the GNU
# General Public License
#
# Authors: Jaina Mistry ([email protected]), John Tate ([email protected]),
# Rob Finn ([email protected])
#
# This is free software; you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation; either version 2 of the License, or (at your option) any later version.
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
# details.
#
# You should have received a copy of the GNU General Public License along with
# this program. If not, see <http://www.gnu.org/licenses/>.
# = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
# query sequence file: result_AA.fas.unaligned
# cpu number specified: 1
# searching against: /media/kmkocot/Sclerite/blast_dbs/Pfam27/Pfam-A.hmm, with cut off --cut_ga
# resolve clan overlaps: on
# predict active sites: off
# searching against: /media/kmkocot/Sclerite/blast_dbs/Pfam27/Pfam-B.hmm, with cut off -E 0.001 --domE 0.001
# = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
#
# <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>

AENT|01543_gene_01269_len_492_2 19 72 19 72 PF00219.13 IGFBP Domain 1 53 53 24.8 2.1e-05 1 No_clan
#HMM CppCtee.CpeepprCpegvslvldgcgCCkvCarqegesCg...veterCakgLrC
#MATCH C++C+++ C + p C g+ cgCC+vCa+ +g++Cg + + rC kgL+C
#PP 99**99856.68899***9666..99***************74446789********
#SEQ CRRCDKSkC-KAPVGCRGGTVT--GICGCCNVCAKVKGQKCGgrwNMLGRCDKGLTC
#CS -----HH.-.-------SEEE-..------EEE---------...--S-------EE

AENT|01543_gene_01269_len_492_2 20 57 13 70 PB003492 Pfam-B_3492 Pfam-B 197 233 377 21.3 0.00017 NA NA
#HMM tRCDVSkCPsP.sCPGGYVPDRCNCCLVCAaaEGeACG
#MATCH RCD SkC +P +C GG V C CC VCA+ G+ CG
#PP 5**********99************************9
#SEQ RRCDKSKCKAPvGCRGGTVTGICGCCNVCAKVKGQKCG

AENT|01543_gene_01269_len_492_2 29 83 12 93 PB000053 Pfam-B_53 Pfam-B 276 332 720 27.2 1.6e-06 NA NA
#HMM lklqlkggvttevvgccpvcarvedeisggaedilskvdkGrmsqevvlcevvvdea
#MATCH + + gg t +gcc vca+v++ +gg + +l+++dkG++ q+ ++ d +
#PP 34568999999**********************************9988888..444
#SEQ APVGCRGGTVTGICGCCNVCAKVKGQKCGGRWNMLGRCDKGLTCQKEFTGKP--DRR
#CS ---------------------------------------------------------

CAPI|Contig14_4 29 76 26 76 PF00219.13 IGFBP Domain 4 53 53 21.5 0.00022 1 No_clan
#HMM Ctee..CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
#MATCH C ++ C + pp C+e ++ cgCC vC ge C++ ++C +gL+C
#PP 5444547.68889*999776...9****************************
#SEQ CVNHptC-QAPPVCEEYGRE---LCGCCDVCKLGFGEVCNSRNAPCMSGLVC
#CS --HH..-.-------SEEE-..------EEE-----------S-------EE

HNAG|Contig2818_10 28 76 23 76 PF00219.13 IGFBP Domain 4 53 53 21.5 0.00023 1 No_clan
#HMM Ctee...CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
#MATCH C + Cp+ +p+C e + cgCC vC ++ge+C++ ++C +gL+C
#PP 3222445985.5569887666...7****************************
#SEQ CLNVqttCPP-TPECHEYGRR---LCGCCDVCKLELGETCNNGNAPCMSGLKC
#CS --HH...-.-------SEEE-..------EEE-----------S-------EE

LRUG|comp11289_c0_seq1_18 23 73 23 73 PF00219.13 IGFBP Domain 1 53 53 25.6 1.2e-05 1 No_clan
#HMM CppCtee..CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
#MATCH C C Cp ++p+C+e ++ +cgCC vC eg++C ++ ++C +gL C
#PP 77785557787.7788***9777...****************************9
#SEQ CISCASLppCP-PRPDCQEYGRK---QCGCCDVCNLPEGRNCSTYSQPCLSGLLC
#CS -----HH..-.-------SEEE-..------EEE-----------S-------EE

PFUC|pfu_aug1.0_374.1_29219.t1_26 25 81 19 81 PF00219.13 IGFBP Domain 1 53 53 25.1 1.6e-05 1 No_clan
#HMM CppCtee.......C.peepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
#MATCH Cp C + C ++ + C++ c+CC +Car+ ge+C t+rCa+gL C
#PP 788866555555555345566665554....46****************************
#SEQ CPSCGKLttsglpdCtKHLDIGCERVR----RPCSCCTTCARNIGETCSGRTPRCASGLMC
#CS -----HH.......-..-------SEEE-..------EEE-----------S-------EE

SCON|comp34952_c0_seq1_28 32 77 26 77 PF00219.13 IGFBP Domain 7 53 53 25.0 1.8e-05 1 No_clan
#HMM e...CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
#MATCH Cp +pp C+e + cgCC vC ge C++ ++C++gLrC
#PP 244487.6777*999777...8****************************
#SEQ YpptCP-TPPICEEYGRV---LCGCCDVCKLAFGEVCNSWNAPCKTGLRC
#CS H...-.-------SEEE-..------EEE-----------S-------EE

Patella_vulgata_HE962376.1_30 25 81 25 81 PF00219.13 IGFBP Domain 1 53 53 30.4 3.9e-07 1 No_clan
#HMM CppCtee........CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
#MATCH C+pC ++ Cp+ p++C++ + cgCC vCa + g+ Cg rC+kgL+C
#PP 99998888999999***********9666....****************************
#SEQ CAPCLNHrflprkriCPKLPKTCEPASKP----CGCCPVCAGKVGDRCGRFQVRCEKGLTC
#CS -----HH........-.-------SEEE-..------EEE-----------S-------EE
Code (original FASTA file):
>AENT|01543_gene_01269_len_492_1
MLYSISEFNKFLLV-----CSLIICCHCNLYSLELVGTFSI-------LHYF-------QSLSLTSTTEFHSADTASWPTIWFPC---KFFLTGQTFVAPPKHIPPAS-----AL--LSLNLSANITTTTYACNCPSTATNWGFT----FRLITPSAT-QGNDTANKSKKYDNSK--HLLGKQRLYFCLYTITRFCL------------------------------------------
>AENT|01543_gene_01269_len_492_2
---MFRIVVFLALI-----CSVVAL----------------------------------SCRR---------------------C-------DKS--KCKA--PVGCR-----GGTVTGICGCCNVCAKVKGQKCGGRWNMLGRC----DKGLTCQKE-----FTGKPDRR------PGSGV-----CRVKFSCTC-------------------------------------------
>CNIT|CNIT_1987399994_17D08_3
-----RNPGYVWVL-----FAVVLF--------------AA-------FSSL-------KALR---------------------C---AR--PADL-VCPPR-PDCTE-----YG--QELCGFCDVCRLSVGAPC-DAWKA--PC----ESHLVCRTA-EGGDYIGRPPWNL-----DHSGV-----CSIPDPR---------------------------------------------
>CAPI|Contig14_4
--MPTMKSVVFHSV-----VAVILL-----------------------AAAT-------ESLR---------------------C---GC-VNHP--TCQAP-PVCEE-----YG--RELCGCCDVCKLGFGEVC-NSRNA--PC----MSGLVCLAP-DGQVYGERPLWHLF----EVQGV-----CVKLPPSEVSV-----------------------------------------
>CAPI|Contig14_5
MYLANFTTSKIIMP-----PFSVSS------------KPAI-------LRQL-------KHLH-------------------TDT---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPSGAKQTSPDIH--------GA----LRELHTSPN-PSLQTSQHPQSSR-----PYSSQ-----TGGA------------------------------------------------
>CAPI|Contig10_6
--------MPPFSV-----SSKAAI-----------------------LRHL-------KHLH-----------------T--DA---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPXGRXSRQVQTYMVRYVN--YT----LLQIQVCRH-HS-IRKALAHILRKLEELDMLGG---------------------------------------------------------
>LRUG|comp11289_c0_seq1_19
-------RSHLQQL-----SNQIDD--------------AP-------LGAL-------HSLA---------------------V-------HRYL-PGEWY-TKV-------VFRETRQMVCHIKVQINRVANTWNSFDPLGDCTHHNNRTVSCRIPGNPVSVDKGVERHKI----YSSGY-----AQQRSHTRL-------------------------------------------
>FCAU|comp43822_c0_seq1_len_518_7
-----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
>FCAU|comp43822_c0_seq2_len_507_8
-----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
>FCAU|comp43822_c0_seq3_len_454_9
---MAMEKSVYLRG-----VLVVAL-------------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
>HNAG|Contig2818_10
-----MARAILLLV-----CLTSTL--------------VS-------LSMV-------DCLR---------------------C---SC-LNVQT-TCPPT-PECHE-----YG--RRLCGCCDVCKLELGETC-NNGNA--PC----MSGLKCNTS-EG-LFDGRPPWFMF----DAEGQ-----CVDQ------------------------------------------------
>LHYA|Contig1125_11
------MNSLVALL-----SMVIVG--------------AL-------AGGY-------DCPD-------------------DDC---PV-------TCPEY-GDCID-----MR--SYPCACCADCIKPVGEDCSGEFVS---C----DNGLLCNEK-HICVVHADMTEAAR----QRRGI-----HKK-------------------------------------------------
>LRUG|comp46482_c0_seq1_12
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp46482_c0_seq2_13
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp46482_c0_seq3_14
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
>LRUG|comp46482_c0_seq4_15
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
>LRUG|comp46482_c0_seq5_16
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp46482_c0_seq6_17
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp11289_c0_seq1_18
-----MDKLPLLLF-----ITVYGF--------------SV-------VHSL-------SCIS---------------------C------ASLP--PCPPR-PDCQE-----YG--RKQCGCCDVCNLPEGRNC-STYSQ--PC----LSGLLCDTP-SG-AFHGKPPWYTI----HLEGT-----CVQPENAMHPMGHRLFG-----------------------------------
>LRUG|comp47698_c0_seq1_20
LHFSQDSHGGLNAV-------HISP-----------------------IPSLQTSQQPQSRRP---------------------Y---SL-QSGVGAQVPWK-FTQEH-----RRHSTDGRGINVAVRQIIRKSS--------RA----ISNVIFLRI-VWIWNDGSAILYCTVILKSFRR------HSQHVSLVCL------------------------------------------
>LGIG|171051_21
-----MVIYVLLPIGSENRCAVTPTKL-----------YKILKKCAGSVNGFILKNDVGGCKP---------------------C-------PAVP-NCAPL-SRKY------CVVKRRPCGCCDECAGRHKDPC-DRYSV--PC----DDQFECVND-KGYGLKHIENDL------DFHGV-----CRFRARKGQFPYISRRSRPYIIKG----------------------------
>LGIG|228219_22
------MQWLLLTI-----LALATL--------------GS-------VAAL-------SCRQ---------------------C-------QPDH-ECPAL-PNDGK-----CHPARRPCSCCDECAGLRGDDC-GPFTA--RC----HPDLVCVNE-NG-EEKETVQWHE-----KFKGV-----CKRSKAERAERACKRLNQLFRLFNSTNGRPGRFLRRWLKRLYKRCLAKYNVN
>LGIG|228220_23
----MASVTIYMIL-----ILSVTS--------------VV-------FSLS--------CVG---------------------C-------DKAA-PCPLL-PETKE-----CFKARAPCACCDTCASGLGAEC-GALKI--RC----HPDYVCVNK-DG-VEKVMIPWFMM----GFKGT-----CMPTGTGKIV------------------------------------------
>LGIG|152660_24
----MASMIKLSIL-----CSMIAT-----------------------VTSL-------SCVA---------------------C-------PKDQ-VCDPL-PESAE-----CFPAKAACACCKTCAGRFGDKC-STLSV--RC----HPDFVCVNE-DG-VELSSVPWYTF----DFRGI-----CVRDRCPEPSTGGDGGIVPLPVGK----------------------------
>LGIG|238970_25
-----MKFGVGFLL-----SCLVAL-----------NTVQN-------MLAL-------SCLP---------------------C-------DFDTLKCSPL-PDDDD-----CFPAYTPCGCCPQCAGEEDDFC-DNFTV--RC----HPDLVCVNA-TG-FEKKFVYWYEF----DFKGT-----CQESELETE-----------------------YEYEYEENETKK--------
>PFUC|pfu_aug1.0_374.1_29219.t1_26
--MRNLRFSFFVIS-----VIGVVI--------------CD-------AGRH--------CPS---------------------C--------GKL-TTSGL-PDCTKHLDIGCERVRRPCSCCTTCARNIGETC-SGRTP--RC----ASGLMCVNG-HGEALKTIPRNMR-----HYKGV-----CQNVEVCPVVVENLEVDDRRFGSDHDSSRV----------------------
>PFUC|pfu_aug1.0_374.1_29219.t1_27
---SHSAGVMIRTK-----TTVIHL----------------------------------QILN------------------NHRA---YFYILTDSFVMPHI-PGDCF-----KGLSMAIYTHQSTCT----PRC-SSGAS--------FSDISSTSS-------TTRAWSANTLTANVQMF-----GTIRQS----------------------------------------------
>SCON|comp34952_c0_seq1_28
--MVAMKSVVLYSV-----AMAIFF-----------------------TLGA-------ESLR---------------------C---SCGLYPP--TCPTP-PICEE-----YG--RVLCGCCDVCKLAFGEVC-NSWNA--PC----KTGLRCLTS-DGQVYNGRPPWFKF----SEEGV-----CVQLPRGSPDQ-----------------------------------------
>WARG|GJN0W6B01BRRYQ_29
--MSYTAPRLAATT-----CFVVALVL-----------LQI-----SEVSSL-------RCLP---------------------C-------APDV-ECPTL-PDDCQ-----PT--KRPCGCCPECKGKVGAQC-SNMGVELRV----GSDVCQQAW-SG-HSCRQMALLV-----GFKG----------------------------------------------------------
>Patella_vulgata_HE962376.1_30
------MKTLFLHI-----CVVLVVIV-----------VTG-------SDAL-------SCAP---------------------CLNHRF-LPRKR-ICPKL-PKTCE-----PA--SKPCGCCPVCAGKVGDRC-GRFQV--RC----EKGLTCQSQSEPTSLLGAYTISFNY---LRQGI-----CRKP------------------------------------------------
Code (desired output):
>AENT|01543_gene_01269_len_492_1
MLYSISEFNKFLLV-----CSLIICCHCNLYSLELVGTFSI-------LHYF-------QSLSLTSTTEFHSADTASWPTIWFPC---KFFLTGQTFVAPPKHIPPAS-----AL--LSLNLSANITTTTYACNCPSTATNWGFT----FRLITPSAT-QGNDTANKSKKYDNSK--HLLGKQRLYFCLYTITRFCL------------------------------------------
>AENT|01543_gene_01269_len_492_2 IGFBP Pfam-B_3492 Pfam-B_53
---MFRIVVFLALI-----CSVVAL----------------------------------SCRR---------------------C-------DKS--KCKA--PVGCR-----GGTVTGICGCCNVCAKVKGQKCGGRWNMLGRC----DKGLTCQKE-----FTGKPDRR------PGSGV-----CRVKFSCTC-------------------------------------------
>CNIT|CNIT_1987399994_17D08_3
-----RNPGYVWVL-----FAVVLF--------------AA-------FSSL-------KALR---------------------C---AR--PADL-VCPPR-PDCTE-----YG--QELCGFCDVCRLSVGAPC-DAWKA--PC----ESHLVCRTA-EGGDYIGRPPWNL-----DHSGV-----CSIPDPR---------------------------------------------
>CAPI|Contig14_4 IGFBP
--MPTMKSVVFHSV-----VAVILL-----------------------AAAT-------ESLR---------------------C---GC-VNHP--TCQAP-PVCEE-----YG--RELCGCCDVCKLGFGEVC-NSRNA--PC----MSGLVCLAP-DGQVYGERPLWHLF----EVQGV-----CVKLPPSEVSV-----------------------------------------
>CAPI|Contig14_5
MYLANFTTSKIIMP-----PFSVSS------------KPAI-------LRQL-------KHLH-------------------TDT---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPSGAKQTSPDIH--------GA----LRELHTSPN-PSLQTSQHPQSSR-----PYSSQ-----TGGA------------------------------------------------
>CAPI|Contig10_6
--------MPPFSV-----SSKAAI-----------------------LRHL-------KHLH-----------------T--DA---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPXGRXSRQVQTYMVRYVN--YT----LLQIQVCRH-HS-IRKALAHILRKLEELDMLGG---------------------------------------------------------
>LRUG|comp11289_c0_seq1_19
-------RSHLQQL-----SNQIDD--------------AP-------LGAL-------HSLA---------------------V-------HRYL-PGEWY-TKV-------VFRETRQMVCHIKVQINRVANTWNSFDPLGDCTHHNNRTVSCRIPGNPVSVDKGVERHKI----YSSGY-----AQQRSHTRL-------------------------------------------
>FCAU|comp43822_c0_seq1_len_518_7
-----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
>FCAU|comp43822_c0_seq2_len_507_8
-----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
>FCAU|comp43822_c0_seq3_len_454_9
---MAMEKSVYLRG-----VLVVAL-------------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
>HNAG|Contig2818_10 IGFBP
-----MARAILLLV-----CLTSTL--------------VS-------LSMV-------DCLR---------------------C---SC-LNVQT-TCPPT-PECHE-----YG--RRLCGCCDVCKLELGETC-NNGNA--PC----MSGLKCNTS-EG-LFDGRPPWFMF----DAEGQ-----CVDQ------------------------------------------------
>LHYA|Contig1125_11
------MNSLVALL-----SMVIVG--------------AL-------AGGY-------DCPD-------------------DDC---PV-------TCPEY-GDCID-----MR--SYPCACCADCIKPVGEDCSGEFVS---C----DNGLLCNEK-HICVVHADMTEAAR----QRRGI-----HKK-------------------------------------------------
>LRUG|comp46482_c0_seq1_12
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp46482_c0_seq2_13
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp46482_c0_seq3_14
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
>LRUG|comp46482_c0_seq4_15
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
>LRUG|comp46482_c0_seq5_16
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp46482_c0_seq6_17
-----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
>LRUG|comp11289_c0_seq1_18 IGFBP
-----MDKLPLLLF-----ITVYGF--------------SV-------VHSL-------SCIS---------------------C------ASLP--PCPPR-PDCQE-----YG--RKQCGCCDVCNLPEGRNC-STYSQ--PC----LSGLLCDTP-SG-AFHGKPPWYTI----HLEGT-----CVQPENAMHPMGHRLFG-----------------------------------
>LRUG|comp47698_c0_seq1_20
LHFSQDSHGGLNAV-------HISP-----------------------IPSLQTSQQPQSRRP---------------------Y---SL-QSGVGAQVPWK-FTQEH-----RRHSTDGRGINVAVRQIIRKSS--------RA----ISNVIFLRI-VWIWNDGSAILYCTVILKSFRR------HSQHVSLVCL------------------------------------------
>LGIG|171051_21
-----MVIYVLLPIGSENRCAVTPTKL-----------YKILKKCAGSVNGFILKNDVGGCKP---------------------C-------PAVP-NCAPL-SRKY------CVVKRRPCGCCDECAGRHKDPC-DRYSV--PC----DDQFECVND-KGYGLKHIENDL------DFHGV-----CRFRARKGQFPYISRRSRPYIIKG----------------------------
>LGIG|228219_22
------MQWLLLTI-----LALATL--------------GS-------VAAL-------SCRQ---------------------C-------QPDH-ECPAL-PNDGK-----CHPARRPCSCCDECAGLRGDDC-GPFTA--RC----HPDLVCVNE-NG-EEKETVQWHE-----KFKGV-----CKRSKAERAERACKRLNQLFRLFNSTNGRPGRFLRRWLKRLYKRCLAKYNVN
>LGIG|228220_23
----MASVTIYMIL-----ILSVTS--------------VV-------FSLS--------CVG---------------------C-------DKAA-PCPLL-PETKE-----CFKARAPCACCDTCASGLGAEC-GALKI--RC----HPDYVCVNK-DG-VEKVMIPWFMM----GFKGT-----CMPTGTGKIV------------------------------------------
>LGIG|152660_24
----MASMIKLSIL-----CSMIAT-----------------------VTSL-------SCVA---------------------C-------PKDQ-VCDPL-PESAE-----CFPAKAACACCKTCAGRFGDKC-STLSV--RC----HPDFVCVNE-DG-VELSSVPWYTF----DFRGI-----CVRDRCPEPSTGGDGGIVPLPVGK----------------------------
>LGIG|238970_25
-----MKFGVGFLL-----SCLVAL-----------NTVQN-------MLAL-------SCLP---------------------C-------DFDTLKCSPL-PDDDD-----CFPAYTPCGCCPQCAGEEDDFC-DNFTV--RC----HPDLVCVNA-TG-FEKKFVYWYEF----DFKGT-----CQESELETE-----------------------YEYEYEENETKK--------
>PFUC|pfu_aug1.0_374.1_29219.t1_26 IGFBP
--MRNLRFSFFVIS-----VIGVVI--------------CD-------AGRH--------CPS---------------------C--------GKL-TTSGL-PDCTKHLDIGCERVRRPCSCCTTCARNIGETC-SGRTP--RC----ASGLMCVNG-HGEALKTIPRNMR-----HYKGV-----CQNVEVCPVVVENLEVDDRRFGSDHDSSRV----------------------
>PFUC|pfu_aug1.0_374.1_29219.t1_27
---SHSAGVMIRTK-----TTVIHL----------------------------------QILN------------------NHRA---YFYILTDSFVMPHI-PGDCF-----KGLSMAIYTHQSTCT----PRC-SSGAS--------FSDISSTSS-------TTRAWSANTLTANVQMF-----GTIRQS----------------------------------------------
>SCON|comp34952_c0_seq1_28 IGFBP
--MVAMKSVVLYSV-----AMAIFF-----------------------TLGA-------ESLR---------------------C---SCGLYPP--TCPTP-PICEE-----YG--RVLCGCCDVCKLAFGEVC-NSWNA--PC----KTGLRCLTS-DGQVYNGRPPWFKF----SEEGV-----CVQLPRGSPDQ-----------------------------------------
>WARG|GJN0W6B01BRRYQ_29
--MSYTAPRLAATT-----CFVVALVL-----------LQI-----SEVSSL-------RCLP---------------------C-------APDV-ECPTL-PDDCQ-----PT--KRPCGCCPECKGKVGAQC-SNMGVELRV----GSDVCQQAW-SG-HSCRQMALLV-----GFKG----------------------------------------------------------
>Patella_vulgata_HE962376.1_30 IGFBP
------MKTLFLHI-----CVVLVVIV-----------VTG-------SDAL-------SCAP---------------------CLNHRF-LPRKR-ICPKL-PKTCE-----PA--SKPCGCCPVCAGKVGDRC-GRFQV--RC----EKGLTCQSQSEPTSLLGAYTISFNY---LRQGI-----CRKP------------------------------------------------
Code (my attempt so far):
sed '/^#/d' result_pfamA_and_B_annotation.txt > annotation.txt
sed -i '/^$/d' annotation.txt
sed -i 's/>//g' result_AA.fas
awk -F'\t' -v OFS=' ' '
NR==FNR { a[$1]=$0; next }
{if (a[$1]) { print a[$1],$7}}' result_AA.fas annotation.txt
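For comparison, here is a minimal sketch of one way the merge could be done in a single awk pass. It is not a tested solution for the real files, and it rests on a few assumptions: the hit lines are whitespace-delimited (as they appear in the sample output, not tab-delimited), the <hmm name> is the 7th column, each name should be listed once per sequence, and the FASTA headers still start with ">". The function name annotate_fasta and the demo file names hits.txt and seqs.fas are hypothetical, and the demo sequences are shortened. The annotation file is read first so that the FASTA file, sequences included, is what gets printed.

```shell
#!/bin/sh
# Sketch only: append Pfam family names to matching FASTA headers.
# annotate_fasta is a hypothetical helper name, not part of pfam_scan.pl.
annotate_fasta() {
    # $1 = pfam_scan.pl output, $2 = FASTA file (headers still start with ">")
    awk '
        # First file: collect the hmm name column (field 7) per sequence ID.
        NR == FNR {
            if ($0 ~ /^#/ || NF < 7) next    # skip comments, alignment lines, blanks
            # Assumption: each family name is listed once per sequence.
            if (!seen[$1, $7]++) ann[$1] = ann[$1] " " $7
            next
        }
        # Second file: headers get their annotations appended, if any.
        /^>/ {
            id = substr($1, 2)               # ID is the first word without ">"
            print $0 ann[id]
            next
        }
        { print }                            # sequence lines pass through unchanged
    ' "$1" "$2"
}

# Tiny demo with hit lines taken from the post and shortened sequences.
tmp=$(mktemp -d)
cat > "$tmp/hits.txt" <<'EOF'
# comment lines like this one are ignored
AENT|01543_gene_01269_len_492_2 19 72 19 72 PF00219.13 IGFBP Domain 1 53 53 24.8 2.1e-05 1 No_clan
#HMM CppCtee.CpeepprCpegvslvldgcgCCkvCarqegesCg...veterCakgLrC

AENT|01543_gene_01269_len_492_2 20 57 13 70 PB003492 Pfam-B_3492 Pfam-B 197 233 377 21.3 0.00017 NA NA
AENT|01543_gene_01269_len_492_2 29 83 12 93 PB000053 Pfam-B_53 Pfam-B 276 332 720 27.2 1.6e-06 NA NA
CAPI|Contig14_4 29 76 26 76 PF00219.13 IGFBP Domain 4 53 53 21.5 0.00022 1 No_clan
EOF
cat > "$tmp/seqs.fas" <<'EOF'
>AENT|01543_gene_01269_len_492_1
MLYSISEFNKFLLV
>AENT|01543_gene_01269_len_492_2
---MFRIVVFLALI
>CAPI|Contig14_4
--MPTMKSVVFHSV
EOF

result=$(annotate_fasta "$tmp/hits.txt" "$tmp/seqs.fas")
printf '%s\n' "$result"
rm -rf "$tmp"
```

On the real data this would be called as something like `annotate_fasta result_pfamA_and_B_annotation.txt result_AA.fas > annotated.fas` (annotated.fas is a made-up output name), using a copy of result_AA.fas from before the `sed -i 's/>//g'` step. No sed pre-cleaning should be needed, because comment and blank lines are skipped inside awk.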