  • kmkocot
    Member
    • Jun 2009
    • 51

    PFAM annotation - need help with awk scripting

    Hi all,

    I'm using pfam_scan.pl to annotate some sequences from a gene family. I'd like to take the output of my pfam annotation, which looks like this...

    Code:
    # pfam_scan.pl,  run at Mon May  5 14:19:29 2014
    #
    # Copyright (c) 2009 Genome Research Ltd
    # Freely distributed under the GNU 
    # General Public License
    #
    # Authors: Jaina Mistry ([email protected]), John Tate ([email protected]), 
    #          Rob Finn ([email protected])
    #
    # This is free software; you can redistribute it and/or modify it under
    # the terms of the GNU General Public License as published by the Free Software
    # Foundation; either version 2 of the License, or (at your option) any later version.
    # This program is distributed in the hope that it will be useful, but WITHOUT
    # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
    # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
    # details.
    #
    # You should have received a copy of the GNU General Public License along with
    # this program. If not, see <http://www.gnu.org/licenses/>. 
    # = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
    #      query sequence file: result_AA.fas.unaligned
    #     cpu number specified: 1
    #        searching against: /media/kmkocot/Sclerite/blast_dbs/Pfam27/Pfam-A.hmm, with cut off --cut_ga
    #    resolve clan overlaps: on
    #     predict active sites: off
    #        searching against: /media/kmkocot/Sclerite/blast_dbs/Pfam27/Pfam-B.hmm, with cut off -E 0.001 --domE 0.001
    # = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
    #
    # <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>
    
    AENT|01543_gene_01269_len_492_2       19     72     19     72 PF00219.13  IGFBP             Domain     1    53    53     24.8   2.1e-05   1 No_clan  
    #HMM       CppCtee.CpeepprCpegvslvldgcgCCkvCarqegesCg...veterCakgLrC
    #MATCH     C++C+++ C + p  C  g+      cgCC+vCa+ +g++Cg   + + rC kgL+C
    #PP        99**99856.68899***9666..99***************74446789********
    #SEQ       CRRCDKSkC-KAPVGCRGGTVT--GICGCCNVCAKVKGQKCGgrwNMLGRCDKGLTC
    #CS        -----HH.-.-------SEEE-..------EEE---------...--S-------EE
    AENT|01543_gene_01269_len_492_2       20     57     13     70 PB003492    Pfam-B_3492       Pfam-B   197   233   377     21.3   0.00017  NA NA      
    #HMM       tRCDVSkCPsP.sCPGGYVPDRCNCCLVCAaaEGeACG
    #MATCH      RCD SkC +P +C GG V   C CC VCA+  G+ CG
    #PP        5**********99************************9
    #SEQ       RRCDKSKCKAPvGCRGGTVTGICGCCNVCAKVKGQKCG
    AENT|01543_gene_01269_len_492_2       29     83     12     93 PB000053    Pfam-B_53         Pfam-B   276   332   720     27.2   1.6e-06  NA NA      
    #HMM       lklqlkggvttevvgccpvcarvedeisggaedilskvdkGrmsqevvlcevvvdea
    #MATCH       + + gg  t  +gcc vca+v++  +gg + +l+++dkG++ q+    ++  d +
    #PP        34568999999**********************************9988888..444
    #SEQ       APVGCRGGTVTGICGCCNVCAKVKGQKCGGRWNMLGRCDKGLTCQKEFTGKP--DRR
    #CS        ---------------------------------------------------------
    CAPI|Contig14_4                       29     76     26     76 PF00219.13  IGFBP             Domain     4    53    53     21.5   0.00022   1 No_clan  
    #HMM       Ctee..CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C ++  C + pp C+e  ++    cgCC vC    ge C++  ++C +gL+C
    #PP        5444547.68889*999776...9****************************
    #SEQ       CVNHptC-QAPPVCEEYGRE---LCGCCDVCKLGFGEVCNSRNAPCMSGLVC
    #CS        --HH..-.-------SEEE-..------EEE-----------S-------EE
    HNAG|Contig2818_10                    28     76     23     76 PF00219.13  IGFBP             Domain     4    53    53     21.5   0.00023   1 No_clan  
    #HMM       Ctee...CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C +    Cp+ +p+C e  +     cgCC vC  ++ge+C++  ++C +gL+C
    #PP        3222445985.5569887666...7****************************
    #SEQ       CLNVqttCPP-TPECHEYGRR---LCGCCDVCKLELGETCNNGNAPCMSGLKC
    #CS        --HH...-.-------SEEE-..------EEE-----------S-------EE
    LRUG|comp11289_c0_seq1_18             23     73     23     73 PF00219.13  IGFBP             Domain     1    53    53     25.6   1.2e-05   1 No_clan  
    #HMM       CppCtee..CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C  C     Cp ++p+C+e  ++   +cgCC vC   eg++C ++ ++C +gL C
    #PP        77785557787.7788***9777...****************************9
    #SEQ       CISCASLppCP-PRPDCQEYGRK---QCGCCDVCNLPEGRNCSTYSQPCLSGLLC
    #CS        -----HH..-.-------SEEE-..------EEE-----------S-------EE
    PFUC|pfu_aug1.0_374.1_29219.t1_26     25     81     19     81 PF00219.13  IGFBP             Domain     1    53    53     25.1   1.6e-05   1 No_clan  
    #HMM       CppCtee.......C.peepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     Cp C +        C ++ +  C++        c+CC +Car+ ge+C   t+rCa+gL C
    #PP        788866555555555345566665554....46****************************
    #SEQ       CPSCGKLttsglpdCtKHLDIGCERVR----RPCSCCTTCARNIGETCSGRTPRCASGLMC
    #CS        -----HH.......-..-------SEEE-..------EEE-----------S-------EE
    SCON|comp34952_c0_seq1_28             32     77     26     77 PF00219.13  IGFBP             Domain     7    53    53     25.0   1.8e-05   1 No_clan  
    #HMM       e...CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH         Cp +pp C+e  +     cgCC vC    ge C++  ++C++gLrC
    #PP        244487.6777*999777...8****************************
    #SEQ       YpptCP-TPPICEEYGRV---LCGCCDVCKLAFGEVCNSWNAPCKTGLRC
    #CS        H...-.-------SEEE-..------EEE-----------S-------EE
    Patella_vulgata_HE962376.1_30         25     81     25     81 PF00219.13  IGFBP             Domain     1    53    53     30.4   3.9e-07   1 No_clan  
    #HMM       CppCtee........CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C+pC ++        Cp+ p++C++  +     cgCC vCa + g+ Cg    rC+kgL+C
    #PP        99998888999999***********9666....****************************
    #SEQ       CAPCLNHrflprkriCPKLPKTCEPASKP----CGCCPVCAGKVGDRCGRFQVRCEKGLTC
    #CS        -----HH........-.-------SEEE-..------EEE-----------S-------EE
    ...and append the annotations to the headers of my original fasta file (don't worry, I used an unaligned one to actually run pfam), which looks like this:
    Code:
    >AENT|01543_gene_01269_len_492_1
    MLYSISEFNKFLLV-----CSLIICCHCNLYSLELVGTFSI-------LHYF-------QSLSLTSTTEFHSADTASWPTIWFPC---KFFLTGQTFVAPPKHIPPAS-----AL--LSLNLSANITTTTYACNCPSTATNWGFT----FRLITPSAT-QGNDTANKSKKYDNSK--HLLGKQRLYFCLYTITRFCL------------------------------------------
    >AENT|01543_gene_01269_len_492_2
    ---MFRIVVFLALI-----CSVVAL----------------------------------SCRR---------------------C-------DKS--KCKA--PVGCR-----GGTVTGICGCCNVCAKVKGQKCGGRWNMLGRC----DKGLTCQKE-----FTGKPDRR------PGSGV-----CRVKFSCTC-------------------------------------------
    >CNIT|CNIT_1987399994_17D08_3
    -----RNPGYVWVL-----FAVVLF--------------AA-------FSSL-------KALR---------------------C---AR--PADL-VCPPR-PDCTE-----YG--QELCGFCDVCRLSVGAPC-DAWKA--PC----ESHLVCRTA-EGGDYIGRPPWNL-----DHSGV-----CSIPDPR---------------------------------------------
    >CAPI|Contig14_4
    --MPTMKSVVFHSV-----VAVILL-----------------------AAAT-------ESLR---------------------C---GC-VNHP--TCQAP-PVCEE-----YG--RELCGCCDVCKLGFGEVC-NSRNA--PC----MSGLVCLAP-DGQVYGERPLWHLF----EVQGV-----CVKLPPSEVSV-----------------------------------------
    >CAPI|Contig14_5
    MYLANFTTSKIIMP-----PFSVSS------------KPAI-------LRQL-------KHLH-------------------TDT---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPSGAKQTSPDIH--------GA----LRELHTSPN-PSLQTSQHPQSSR-----PYSSQ-----TGGA------------------------------------------------
    >CAPI|Contig10_6
    --------MPPFSV-----SSKAAI-----------------------LRHL-------KHLH-----------------T--DA---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPXGRXSRQVQTYMVRYVN--YT----LLQIQVCRH-HS-IRKALAHILRKLEELDMLGG---------------------------------------------------------
    >LRUG|comp11289_c0_seq1_19
    -------RSHLQQL-----SNQIDD--------------AP-------LGAL-------HSLA---------------------V-------HRYL-PGEWY-TKV-------VFRETRQMVCHIKVQINRVANTWNSFDPLGDCTHHNNRTVSCRIPGNPVSVDKGVERHKI----YSSGY-----AQQRSHTRL-------------------------------------------
    >FCAU|comp43822_c0_seq1_len_518_7
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq2_len_507_8
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq3_len_454_9
    ---MAMEKSVYLRG-----VLVVAL-------------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >HNAG|Contig2818_10
    -----MARAILLLV-----CLTSTL--------------VS-------LSMV-------DCLR---------------------C---SC-LNVQT-TCPPT-PECHE-----YG--RRLCGCCDVCKLELGETC-NNGNA--PC----MSGLKCNTS-EG-LFDGRPPWFMF----DAEGQ-----CVDQ------------------------------------------------
    >LHYA|Contig1125_11
    ------MNSLVALL-----SMVIVG--------------AL-------AGGY-------DCPD-------------------DDC---PV-------TCPEY-GDCID-----MR--SYPCACCADCIKPVGEDCSGEFVS---C----DNGLLCNEK-HICVVHADMTEAAR----QRRGI-----HKK-------------------------------------------------
    >LRUG|comp46482_c0_seq1_12
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq2_13
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq3_14
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq4_15
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq5_16
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq6_17
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp11289_c0_seq1_18
    -----MDKLPLLLF-----ITVYGF--------------SV-------VHSL-------SCIS---------------------C------ASLP--PCPPR-PDCQE-----YG--RKQCGCCDVCNLPEGRNC-STYSQ--PC----LSGLLCDTP-SG-AFHGKPPWYTI----HLEGT-----CVQPENAMHPMGHRLFG-----------------------------------
    >LRUG|comp47698_c0_seq1_20
    LHFSQDSHGGLNAV-------HISP-----------------------IPSLQTSQQPQSRRP---------------------Y---SL-QSGVGAQVPWK-FTQEH-----RRHSTDGRGINVAVRQIIRKSS--------RA----ISNVIFLRI-VWIWNDGSAILYCTVILKSFRR------HSQHVSLVCL------------------------------------------
    >LGIG|171051_21
    -----MVIYVLLPIGSENRCAVTPTKL-----------YKILKKCAGSVNGFILKNDVGGCKP---------------------C-------PAVP-NCAPL-SRKY------CVVKRRPCGCCDECAGRHKDPC-DRYSV--PC----DDQFECVND-KGYGLKHIENDL------DFHGV-----CRFRARKGQFPYISRRSRPYIIKG----------------------------
    >LGIG|228219_22
    ------MQWLLLTI-----LALATL--------------GS-------VAAL-------SCRQ---------------------C-------QPDH-ECPAL-PNDGK-----CHPARRPCSCCDECAGLRGDDC-GPFTA--RC----HPDLVCVNE-NG-EEKETVQWHE-----KFKGV-----CKRSKAERAERACKRLNQLFRLFNSTNGRPGRFLRRWLKRLYKRCLAKYNVN
    >LGIG|228220_23
    ----MASVTIYMIL-----ILSVTS--------------VV-------FSLS--------CVG---------------------C-------DKAA-PCPLL-PETKE-----CFKARAPCACCDTCASGLGAEC-GALKI--RC----HPDYVCVNK-DG-VEKVMIPWFMM----GFKGT-----CMPTGTGKIV------------------------------------------
    >LGIG|152660_24
    ----MASMIKLSIL-----CSMIAT-----------------------VTSL-------SCVA---------------------C-------PKDQ-VCDPL-PESAE-----CFPAKAACACCKTCAGRFGDKC-STLSV--RC----HPDFVCVNE-DG-VELSSVPWYTF----DFRGI-----CVRDRCPEPSTGGDGGIVPLPVGK----------------------------
    >LGIG|238970_25
    -----MKFGVGFLL-----SCLVAL-----------NTVQN-------MLAL-------SCLP---------------------C-------DFDTLKCSPL-PDDDD-----CFPAYTPCGCCPQCAGEEDDFC-DNFTV--RC----HPDLVCVNA-TG-FEKKFVYWYEF----DFKGT-----CQESELETE-----------------------YEYEYEENETKK--------
    >PFUC|pfu_aug1.0_374.1_29219.t1_26
    --MRNLRFSFFVIS-----VIGVVI--------------CD-------AGRH--------CPS---------------------C--------GKL-TTSGL-PDCTKHLDIGCERVRRPCSCCTTCARNIGETC-SGRTP--RC----ASGLMCVNG-HGEALKTIPRNMR-----HYKGV-----CQNVEVCPVVVENLEVDDRRFGSDHDSSRV----------------------
    >PFUC|pfu_aug1.0_374.1_29219.t1_27
    ---SHSAGVMIRTK-----TTVIHL----------------------------------QILN------------------NHRA---YFYILTDSFVMPHI-PGDCF-----KGLSMAIYTHQSTCT----PRC-SSGAS--------FSDISSTSS-------TTRAWSANTLTANVQMF-----GTIRQS----------------------------------------------
    >SCON|comp34952_c0_seq1_28
    --MVAMKSVVLYSV-----AMAIFF-----------------------TLGA-------ESLR---------------------C---SCGLYPP--TCPTP-PICEE-----YG--RVLCGCCDVCKLAFGEVC-NSWNA--PC----KTGLRCLTS-DGQVYNGRPPWFKF----SEEGV-----CVQLPRGSPDQ-----------------------------------------
    >WARG|GJN0W6B01BRRYQ_29
    --MSYTAPRLAATT-----CFVVALVL-----------LQI-----SEVSSL-------RCLP---------------------C-------APDV-ECPTL-PDDCQ-----PT--KRPCGCCPECKGKVGAQC-SNMGVELRV----GSDVCQQAW-SG-HSCRQMALLV-----GFKG----------------------------------------------------------
    >Patella_vulgata_HE962376.1_30
    ------MKTLFLHI-----CVVLVVIV-----------VTG-------SDAL-------SCAP---------------------CLNHRF-LPRKR-ICPKL-PKTCE-----PA--SKPCGCCPVCAGKVGDRC-GRFQV--RC----EKGLTCQSQSEPTSLLGAYTISFNY---LRQGI-----CRKP------------------------------------------------
    ...so that I get a final product that looks like this (I want all relevant annotations appended after the name of the sequence, with nothing added for sequences that weren't annotated):
    Code:
    >AENT|01543_gene_01269_len_492_1
    MLYSISEFNKFLLV-----CSLIICCHCNLYSLELVGTFSI-------LHYF-------QSLSLTSTTEFHSADTASWPTIWFPC---KFFLTGQTFVAPPKHIPPAS-----AL--LSLNLSANITTTTYACNCPSTATNWGFT----FRLITPSAT-QGNDTANKSKKYDNSK--HLLGKQRLYFCLYTITRFCL------------------------------------------
    >AENT|01543_gene_01269_len_492_2 IGFBP Pfam-B_3492 Pfam-B_53
    ---MFRIVVFLALI-----CSVVAL----------------------------------SCRR---------------------C-------DKS--KCKA--PVGCR-----GGTVTGICGCCNVCAKVKGQKCGGRWNMLGRC----DKGLTCQKE-----FTGKPDRR------PGSGV-----CRVKFSCTC-------------------------------------------
    >CNIT|CNIT_1987399994_17D08_3
    -----RNPGYVWVL-----FAVVLF--------------AA-------FSSL-------KALR---------------------C---AR--PADL-VCPPR-PDCTE-----YG--QELCGFCDVCRLSVGAPC-DAWKA--PC----ESHLVCRTA-EGGDYIGRPPWNL-----DHSGV-----CSIPDPR---------------------------------------------
    >CAPI|Contig14_4 IGFBP
    --MPTMKSVVFHSV-----VAVILL-----------------------AAAT-------ESLR---------------------C---GC-VNHP--TCQAP-PVCEE-----YG--RELCGCCDVCKLGFGEVC-NSRNA--PC----MSGLVCLAP-DGQVYGERPLWHLF----EVQGV-----CVKLPPSEVSV-----------------------------------------
    >CAPI|Contig14_5
    MYLANFTTSKIIMP-----PFSVSS------------KPAI-------LRQL-------KHLH-------------------TDT---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPSGAKQTSPDIH--------GA----LRELHTSPN-PSLQTSQHPQSSR-----PYSSQ-----TGGA------------------------------------------------
    >CAPI|Contig10_6
    --------MPPFSV-----SSKAAI-----------------------LRHL-------KHLH-----------------T--DA---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPXGRXSRQVQTYMVRYVN--YT----LLQIQVCRH-HS-IRKALAHILRKLEELDMLGG---------------------------------------------------------
    >LRUG|comp11289_c0_seq1_19
    -------RSHLQQL-----SNQIDD--------------AP-------LGAL-------HSLA---------------------V-------HRYL-PGEWY-TKV-------VFRETRQMVCHIKVQINRVANTWNSFDPLGDCTHHNNRTVSCRIPGNPVSVDKGVERHKI----YSSGY-----AQQRSHTRL-------------------------------------------
    >FCAU|comp43822_c0_seq1_len_518_7
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq2_len_507_8
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq3_len_454_9
    ---MAMEKSVYLRG-----VLVVAL-------------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >HNAG|Contig2818_10 IGFBP
    -----MARAILLLV-----CLTSTL--------------VS-------LSMV-------DCLR---------------------C---SC-LNVQT-TCPPT-PECHE-----YG--RRLCGCCDVCKLELGETC-NNGNA--PC----MSGLKCNTS-EG-LFDGRPPWFMF----DAEGQ-----CVDQ------------------------------------------------
    >LHYA|Contig1125_11
    ------MNSLVALL-----SMVIVG--------------AL-------AGGY-------DCPD-------------------DDC---PV-------TCPEY-GDCID-----MR--SYPCACCADCIKPVGEDCSGEFVS---C----DNGLLCNEK-HICVVHADMTEAAR----QRRGI-----HKK-------------------------------------------------
    >LRUG|comp46482_c0_seq1_12
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq2_13
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq3_14
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq4_15
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq5_16
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq6_17
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp11289_c0_seq1_18 IGFBP
    -----MDKLPLLLF-----ITVYGF--------------SV-------VHSL-------SCIS---------------------C------ASLP--PCPPR-PDCQE-----YG--RKQCGCCDVCNLPEGRNC-STYSQ--PC----LSGLLCDTP-SG-AFHGKPPWYTI----HLEGT-----CVQPENAMHPMGHRLFG-----------------------------------
    >LRUG|comp47698_c0_seq1_20
    LHFSQDSHGGLNAV-------HISP-----------------------IPSLQTSQQPQSRRP---------------------Y---SL-QSGVGAQVPWK-FTQEH-----RRHSTDGRGINVAVRQIIRKSS--------RA----ISNVIFLRI-VWIWNDGSAILYCTVILKSFRR------HSQHVSLVCL------------------------------------------
    >LGIG|171051_21
    -----MVIYVLLPIGSENRCAVTPTKL-----------YKILKKCAGSVNGFILKNDVGGCKP---------------------C-------PAVP-NCAPL-SRKY------CVVKRRPCGCCDECAGRHKDPC-DRYSV--PC----DDQFECVND-KGYGLKHIENDL------DFHGV-----CRFRARKGQFPYISRRSRPYIIKG----------------------------
    >LGIG|228219_22
    ------MQWLLLTI-----LALATL--------------GS-------VAAL-------SCRQ---------------------C-------QPDH-ECPAL-PNDGK-----CHPARRPCSCCDECAGLRGDDC-GPFTA--RC----HPDLVCVNE-NG-EEKETVQWHE-----KFKGV-----CKRSKAERAERACKRLNQLFRLFNSTNGRPGRFLRRWLKRLYKRCLAKYNVN
    >LGIG|228220_23
    ----MASVTIYMIL-----ILSVTS--------------VV-------FSLS--------CVG---------------------C-------DKAA-PCPLL-PETKE-----CFKARAPCACCDTCASGLGAEC-GALKI--RC----HPDYVCVNK-DG-VEKVMIPWFMM----GFKGT-----CMPTGTGKIV------------------------------------------
    >LGIG|152660_24
    ----MASMIKLSIL-----CSMIAT-----------------------VTSL-------SCVA---------------------C-------PKDQ-VCDPL-PESAE-----CFPAKAACACCKTCAGRFGDKC-STLSV--RC----HPDFVCVNE-DG-VELSSVPWYTF----DFRGI-----CVRDRCPEPSTGGDGGIVPLPVGK----------------------------
    >LGIG|238970_25
    -----MKFGVGFLL-----SCLVAL-----------NTVQN-------MLAL-------SCLP---------------------C-------DFDTLKCSPL-PDDDD-----CFPAYTPCGCCPQCAGEEDDFC-DNFTV--RC----HPDLVCVNA-TG-FEKKFVYWYEF----DFKGT-----CQESELETE-----------------------YEYEYEENETKK--------
    >PFUC|pfu_aug1.0_374.1_29219.t1_26 IGFBP
    --MRNLRFSFFVIS-----VIGVVI--------------CD-------AGRH--------CPS---------------------C--------GKL-TTSGL-PDCTKHLDIGCERVRRPCSCCTTCARNIGETC-SGRTP--RC----ASGLMCVNG-HGEALKTIPRNMR-----HYKGV-----CQNVEVCPVVVENLEVDDRRFGSDHDSSRV----------------------
    >PFUC|pfu_aug1.0_374.1_29219.t1_27
    ---SHSAGVMIRTK-----TTVIHL----------------------------------QILN------------------NHRA---YFYILTDSFVMPHI-PGDCF-----KGLSMAIYTHQSTCT----PRC-SSGAS--------FSDISSTSS-------TTRAWSANTLTANVQMF-----GTIRQS----------------------------------------------
    >SCON|comp34952_c0_seq1_28 IGFBP
    --MVAMKSVVLYSV-----AMAIFF-----------------------TLGA-------ESLR---------------------C---SCGLYPP--TCPTP-PICEE-----YG--RVLCGCCDVCKLAFGEVC-NSWNA--PC----KTGLRCLTS-DGQVYNGRPPWFKF----SEEGV-----CVQLPRGSPDQ-----------------------------------------
    >WARG|GJN0W6B01BRRYQ_29
    --MSYTAPRLAATT-----CFVVALVL-----------LQI-----SEVSSL-------RCLP---------------------C-------APDV-ECPTL-PDDCQ-----PT--KRPCGCCPECKGKVGAQC-SNMGVELRV----GSDVCQQAW-SG-HSCRQMALLV-----GFKG----------------------------------------------------------
    >Patella_vulgata_HE962376.1_30 IGFBP
    ------MKTLFLHI-----CVVLVVIV-----------VTG-------SDAL-------SCAP---------------------CLNHRF-LPRKR-ICPKL-PKTCE-----PA--SKPCGCCPVCAGKVGDRC-GRFQV--RC----EKGLTCQSQSEPTSLLGAYTISFNY---LRQGI-----CRKP------------------------------------------------
    I feel like this should be a relatively simple awk script, but I'm not sure how to do it. Here's what I've come up with so far. It isn't working, and it wouldn't take the sequence from the fasta file anyway, only the header:
    Code:
    sed '/^#/d' result_pfamA_and_B_annotation.txt > annotation.txt
    sed -i '/^$/d' annotation.txt
    sed -i 's/>//g' result_AA.fas
    awk -F'\t' -v OFS=' ' '
        NR==FNR     { a[$1]=$0; next } 
       {if (a[$1]) { print a[$1],$7}}' result_AA.fas annotation.txt
    Any assistance would be *greatly* appreciated!!

    Thanks,
    Kevin
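
One possible approach, offered as a minimal sketch rather than a tested solution for this exact dataset. The attempt above likely fails for three reasons: it splits on tabs although the pfam_scan.pl output shown appears to be space-padded (so $7 would be empty), it strips the ">" from the fasta so headers can no longer be told apart from sequence lines, and it loads the fasta first and then prints one line per hit instead of collecting all hits for a sequence. The sketch below reads the annotation file first, collects the <hmm name> column (field 7) per sequence ID, then walks the unmodified fasta and appends the collected names to matching headers. The function name annotate_fasta is hypothetical, and it assumes the fasta header ID (the text after ">" up to the first space) matches the <seq id> column exactly.

```shell
# annotate_fasta ANNOTATION_FILE FASTA_FILE
# Hypothetical helper: appends every <hmm name> (field 7 of pfam_scan.pl
# output) to the matching FASTA header; other lines pass through unchanged.
annotate_fasta() {
    awk '
        # First file: pfam_scan.pl output, split on runs of whitespace.
        NR == FNR {
            # Skip blank lines and comment/alignment lines (#, #HMM, #SEQ ...).
            if (NF > 0 && $1 !~ /^#/)
                names[$1] = names[$1] " " $7   # keep hits in file order
            next
        }
        # Second file: FASTA header lines start with ">".
        $1 ~ /^>/ {
            id = substr($1, 2)                 # drop the leading ">"
            if (id in names)
                print $0 names[id]             # header plus collected names
            else
                print                          # unannotated header, as is
            next
        }
        # Sequence lines are printed unchanged.
        { print }
    ' "$1" "$2"
}
```

It could be called as: annotate_fasta result_pfamA_and_B_annotation.txt result_AA.fas > result_AA.annotated.fas (the output file name is made up). The fasta must still contain its ">" characters, so the sed -i 's/>//g' step should be skipped, and the two sed clean-up steps on the annotation file are not needed either. If a sequence has the same domain more than once, this sketch appends the name once per hit.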
