PFAM annotation - need help with awk scripting

    Hi all,

    I'm using pfam_scan.pl to annotate some sequences from a gene family. I'd like to take the output of my pfam annotation, which looks like this...

    Code:
    # pfam_scan.pl,  run at Mon May  5 14:19:29 2014
    #
    # Copyright (c) 2009 Genome Research Ltd
    # Freely distributed under the GNU 
    # General Public License
    #
    # Authors: Jaina Mistry ([email protected]), John Tate ([email protected]), 
    #          Rob Finn ([email protected])
    #
    # This is free software; you can redistribute it and/or modify it under
    # the terms of the GNU General Public License as published by the Free Software
    # Foundation; either version 2 of the License, or (at your option) any later version.
    # This program is distributed in the hope that it will be useful, but WITHOUT
    # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
    # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
    # details.
    #
    # You should have received a copy of the GNU General Public License along with
    # this program. If not, see <http://www.gnu.org/licenses/>. 
    # = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
    #      query sequence file: result_AA.fas.unaligned
    #     cpu number specified: 1
    #        searching against: /media/kmkocot/Sclerite/blast_dbs/Pfam27/Pfam-A.hmm, with cut off --cut_ga
    #    resolve clan overlaps: on
    #     predict active sites: off
    #        searching against: /media/kmkocot/Sclerite/blast_dbs/Pfam27/Pfam-B.hmm, with cut off -E 0.001 --domE 0.001
    # = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
    #
    # <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>
    
    AENT|01543_gene_01269_len_492_2       19     72     19     72 PF00219.13  IGFBP             Domain     1    53    53     24.8   2.1e-05   1 No_clan  
    #HMM       CppCtee.CpeepprCpegvslvldgcgCCkvCarqegesCg...veterCakgLrC
    #MATCH     C++C+++ C + p  C  g+      cgCC+vCa+ +g++Cg   + + rC kgL+C
    #PP        99**99856.68899***9666..99***************74446789********
    #SEQ       CRRCDKSkC-KAPVGCRGGTVT--GICGCCNVCAKVKGQKCGgrwNMLGRCDKGLTC
    #CS        -----HH.-.-------SEEE-..------EEE---------...--S-------EE
    AENT|01543_gene_01269_len_492_2       20     57     13     70 PB003492    Pfam-B_3492       Pfam-B   197   233   377     21.3   0.00017  NA NA      
    #HMM       tRCDVSkCPsP.sCPGGYVPDRCNCCLVCAaaEGeACG
    #MATCH      RCD SkC +P +C GG V   C CC VCA+  G+ CG
    #PP        5**********99************************9
    #SEQ       RRCDKSKCKAPvGCRGGTVTGICGCCNVCAKVKGQKCG
    AENT|01543_gene_01269_len_492_2       29     83     12     93 PB000053    Pfam-B_53         Pfam-B   276   332   720     27.2   1.6e-06  NA NA      
    #HMM       lklqlkggvttevvgccpvcarvedeisggaedilskvdkGrmsqevvlcevvvdea
    #MATCH       + + gg  t  +gcc vca+v++  +gg + +l+++dkG++ q+    ++  d +
    #PP        34568999999**********************************9988888..444
    #SEQ       APVGCRGGTVTGICGCCNVCAKVKGQKCGGRWNMLGRCDKGLTCQKEFTGKP--DRR
    #CS        ---------------------------------------------------------
    CAPI|Contig14_4                       29     76     26     76 PF00219.13  IGFBP             Domain     4    53    53     21.5   0.00022   1 No_clan  
    #HMM       Ctee..CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C ++  C + pp C+e  ++    cgCC vC    ge C++  ++C +gL+C
    #PP        5444547.68889*999776...9****************************
    #SEQ       CVNHptC-QAPPVCEEYGRE---LCGCCDVCKLGFGEVCNSRNAPCMSGLVC
    #CS        --HH..-.-------SEEE-..------EEE-----------S-------EE
    HNAG|Contig2818_10                    28     76     23     76 PF00219.13  IGFBP             Domain     4    53    53     21.5   0.00023   1 No_clan  
    #HMM       Ctee...CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C +    Cp+ +p+C e  +     cgCC vC  ++ge+C++  ++C +gL+C
    #PP        3222445985.5569887666...7****************************
    #SEQ       CLNVqttCPP-TPECHEYGRR---LCGCCDVCKLELGETCNNGNAPCMSGLKC
    #CS        --HH...-.-------SEEE-..------EEE-----------S-------EE
    LRUG|comp11289_c0_seq1_18             23     73     23     73 PF00219.13  IGFBP             Domain     1    53    53     25.6   1.2e-05   1 No_clan  
    #HMM       CppCtee..CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C  C     Cp ++p+C+e  ++   +cgCC vC   eg++C ++ ++C +gL C
    #PP        77785557787.7788***9777...****************************9
    #SEQ       CISCASLppCP-PRPDCQEYGRK---QCGCCDVCNLPEGRNCSTYSQPCLSGLLC
    #CS        -----HH..-.-------SEEE-..------EEE-----------S-------EE
    PFUC|pfu_aug1.0_374.1_29219.t1_26     25     81     19     81 PF00219.13  IGFBP             Domain     1    53    53     25.1   1.6e-05   1 No_clan  
    #HMM       CppCtee.......C.peepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     Cp C +        C ++ +  C++        c+CC +Car+ ge+C   t+rCa+gL C
    #PP        788866555555555345566665554....46****************************
    #SEQ       CPSCGKLttsglpdCtKHLDIGCERVR----RPCSCCTTCARNIGETCSGRTPRCASGLMC
    #CS        -----HH.......-..-------SEEE-..------EEE-----------S-------EE
    SCON|comp34952_c0_seq1_28             32     77     26     77 PF00219.13  IGFBP             Domain     7    53    53     25.0   1.8e-05   1 No_clan  
    #HMM       e...CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH         Cp +pp C+e  +     cgCC vC    ge C++  ++C++gLrC
    #PP        244487.6777*999777...8****************************
    #SEQ       YpptCP-TPPICEEYGRV---LCGCCDVCKLAFGEVCNSWNAPCKTGLRC
    #CS        H...-.-------SEEE-..------EEE-----------S-------EE
    Patella_vulgata_HE962376.1_30         25     81     25     81 PF00219.13  IGFBP             Domain     1    53    53     30.4   3.9e-07   1 No_clan  
    #HMM       CppCtee........CpeepprCpegvslvldgcgCCkvCarqegesCgveterCakgLrC
    #MATCH     C+pC ++        Cp+ p++C++  +     cgCC vCa + g+ Cg    rC+kgL+C
    #PP        99998888999999***********9666....****************************
    #SEQ       CAPCLNHrflprkriCPKLPKTCEPASKP----CGCCPVCAGKVGDRCGRFQVRCEKGLTC
    #CS        -----HH........-.-------SEEE-..------EEE-----------S-------EE
    ...and append the annotations to the headers of my original fasta file (don't worry, I used an unaligned one to actually run pfam), which looks like this:
    Code:
    >AENT|01543_gene_01269_len_492_1
    MLYSISEFNKFLLV-----CSLIICCHCNLYSLELVGTFSI-------LHYF-------QSLSLTSTTEFHSADTASWPTIWFPC---KFFLTGQTFVAPPKHIPPAS-----AL--LSLNLSANITTTTYACNCPSTATNWGFT----FRLITPSAT-QGNDTANKSKKYDNSK--HLLGKQRLYFCLYTITRFCL------------------------------------------
    >AENT|01543_gene_01269_len_492_2
    ---MFRIVVFLALI-----CSVVAL----------------------------------SCRR---------------------C-------DKS--KCKA--PVGCR-----GGTVTGICGCCNVCAKVKGQKCGGRWNMLGRC----DKGLTCQKE-----FTGKPDRR------PGSGV-----CRVKFSCTC-------------------------------------------
    >CNIT|CNIT_1987399994_17D08_3
    -----RNPGYVWVL-----FAVVLF--------------AA-------FSSL-------KALR---------------------C---AR--PADL-VCPPR-PDCTE-----YG--QELCGFCDVCRLSVGAPC-DAWKA--PC----ESHLVCRTA-EGGDYIGRPPWNL-----DHSGV-----CSIPDPR---------------------------------------------
    >CAPI|Contig14_4
    --MPTMKSVVFHSV-----VAVILL-----------------------AAAT-------ESLR---------------------C---GC-VNHP--TCQAP-PVCEE-----YG--RELCGCCDVCKLGFGEVC-NSRNA--PC----MSGLVCLAP-DGQVYGERPLWHLF----EVQGV-----CVKLPPSEVSV-----------------------------------------
    >CAPI|Contig14_5
    MYLANFTTSKIIMP-----PFSVSS------------KPAI-------LRQL-------KHLH-------------------TDT---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPSGAKQTSPDIH--------GA----LRELHTSPN-PSLQTSQHPQSSR-----PYSSQ-----TGGA------------------------------------------------
    >CAPI|Contig10_6
    --------MPPFSV-----SSKAAI-----------------------LRHL-------KHLH-----------------T--DA---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPXGRXSRQVQTYMVRYVN--YT----LLQIQVCRH-HS-IRKALAHILRKLEELDMLGG---------------------------------------------------------
    >LRUG|comp11289_c0_seq1_19
    -------RSHLQQL-----SNQIDD--------------AP-------LGAL-------HSLA---------------------V-------HRYL-PGEWY-TKV-------VFRETRQMVCHIKVQINRVANTWNSFDPLGDCTHHNNRTVSCRIPGNPVSVDKGVERHKI----YSSGY-----AQQRSHTRL-------------------------------------------
    >FCAU|comp43822_c0_seq1_len_518_7
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq2_len_507_8
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq3_len_454_9
    ---MAMEKSVYLRG-----VLVVAL-------------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >HNAG|Contig2818_10
    -----MARAILLLV-----CLTSTL--------------VS-------LSMV-------DCLR---------------------C---SC-LNVQT-TCPPT-PECHE-----YG--RRLCGCCDVCKLELGETC-NNGNA--PC----MSGLKCNTS-EG-LFDGRPPWFMF----DAEGQ-----CVDQ------------------------------------------------
    >LHYA|Contig1125_11
    ------MNSLVALL-----SMVIVG--------------AL-------AGGY-------DCPD-------------------DDC---PV-------TCPEY-GDCID-----MR--SYPCACCADCIKPVGEDCSGEFVS---C----DNGLLCNEK-HICVVHADMTEAAR----QRRGI-----HKK-------------------------------------------------
    >LRUG|comp46482_c0_seq1_12
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq2_13
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq3_14
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq4_15
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq5_16
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq6_17
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp11289_c0_seq1_18
    -----MDKLPLLLF-----ITVYGF--------------SV-------VHSL-------SCIS---------------------C------ASLP--PCPPR-PDCQE-----YG--RKQCGCCDVCNLPEGRNC-STYSQ--PC----LSGLLCDTP-SG-AFHGKPPWYTI----HLEGT-----CVQPENAMHPMGHRLFG-----------------------------------
    >LRUG|comp47698_c0_seq1_20
    LHFSQDSHGGLNAV-------HISP-----------------------IPSLQTSQQPQSRRP---------------------Y---SL-QSGVGAQVPWK-FTQEH-----RRHSTDGRGINVAVRQIIRKSS--------RA----ISNVIFLRI-VWIWNDGSAILYCTVILKSFRR------HSQHVSLVCL------------------------------------------
    >LGIG|171051_21
    -----MVIYVLLPIGSENRCAVTPTKL-----------YKILKKCAGSVNGFILKNDVGGCKP---------------------C-------PAVP-NCAPL-SRKY------CVVKRRPCGCCDECAGRHKDPC-DRYSV--PC----DDQFECVND-KGYGLKHIENDL------DFHGV-----CRFRARKGQFPYISRRSRPYIIKG----------------------------
    >LGIG|228219_22
    ------MQWLLLTI-----LALATL--------------GS-------VAAL-------SCRQ---------------------C-------QPDH-ECPAL-PNDGK-----CHPARRPCSCCDECAGLRGDDC-GPFTA--RC----HPDLVCVNE-NG-EEKETVQWHE-----KFKGV-----CKRSKAERAERACKRLNQLFRLFNSTNGRPGRFLRRWLKRLYKRCLAKYNVN
    >LGIG|228220_23
    ----MASVTIYMIL-----ILSVTS--------------VV-------FSLS--------CVG---------------------C-------DKAA-PCPLL-PETKE-----CFKARAPCACCDTCASGLGAEC-GALKI--RC----HPDYVCVNK-DG-VEKVMIPWFMM----GFKGT-----CMPTGTGKIV------------------------------------------
    >LGIG|152660_24
    ----MASMIKLSIL-----CSMIAT-----------------------VTSL-------SCVA---------------------C-------PKDQ-VCDPL-PESAE-----CFPAKAACACCKTCAGRFGDKC-STLSV--RC----HPDFVCVNE-DG-VELSSVPWYTF----DFRGI-----CVRDRCPEPSTGGDGGIVPLPVGK----------------------------
    >LGIG|238970_25
    -----MKFGVGFLL-----SCLVAL-----------NTVQN-------MLAL-------SCLP---------------------C-------DFDTLKCSPL-PDDDD-----CFPAYTPCGCCPQCAGEEDDFC-DNFTV--RC----HPDLVCVNA-TG-FEKKFVYWYEF----DFKGT-----CQESELETE-----------------------YEYEYEENETKK--------
    >PFUC|pfu_aug1.0_374.1_29219.t1_26
    --MRNLRFSFFVIS-----VIGVVI--------------CD-------AGRH--------CPS---------------------C--------GKL-TTSGL-PDCTKHLDIGCERVRRPCSCCTTCARNIGETC-SGRTP--RC----ASGLMCVNG-HGEALKTIPRNMR-----HYKGV-----CQNVEVCPVVVENLEVDDRRFGSDHDSSRV----------------------
    >PFUC|pfu_aug1.0_374.1_29219.t1_27
    ---SHSAGVMIRTK-----TTVIHL----------------------------------QILN------------------NHRA---YFYILTDSFVMPHI-PGDCF-----KGLSMAIYTHQSTCT----PRC-SSGAS--------FSDISSTSS-------TTRAWSANTLTANVQMF-----GTIRQS----------------------------------------------
    >SCON|comp34952_c0_seq1_28
    --MVAMKSVVLYSV-----AMAIFF-----------------------TLGA-------ESLR---------------------C---SCGLYPP--TCPTP-PICEE-----YG--RVLCGCCDVCKLAFGEVC-NSWNA--PC----KTGLRCLTS-DGQVYNGRPPWFKF----SEEGV-----CVQLPRGSPDQ-----------------------------------------
    >WARG|GJN0W6B01BRRYQ_29
    --MSYTAPRLAATT-----CFVVALVL-----------LQI-----SEVSSL-------RCLP---------------------C-------APDV-ECPTL-PDDCQ-----PT--KRPCGCCPECKGKVGAQC-SNMGVELRV----GSDVCQQAW-SG-HSCRQMALLV-----GFKG----------------------------------------------------------
    >Patella_vulgata_HE962376.1_30
    ------MKTLFLHI-----CVVLVVIV-----------VTG-------SDAL-------SCAP---------------------CLNHRF-LPRKR-ICPKL-PKTCE-----PA--SKPCGCCPVCAGKVGDRC-GRFQV--RC----EKGLTCQSQSEPTSLLGAYTISFNY---LRQGI-----CRKP------------------------------------------------
    The goal is a final product that looks like this, with all relevant annotations appended after the name of the sequence and nothing added if that sequence wasn't annotated:
    Code:
    >AENT|01543_gene_01269_len_492_1
    MLYSISEFNKFLLV-----CSLIICCHCNLYSLELVGTFSI-------LHYF-------QSLSLTSTTEFHSADTASWPTIWFPC---KFFLTGQTFVAPPKHIPPAS-----AL--LSLNLSANITTTTYACNCPSTATNWGFT----FRLITPSAT-QGNDTANKSKKYDNSK--HLLGKQRLYFCLYTITRFCL------------------------------------------
    >AENT|01543_gene_01269_len_492_2 IGFBP Pfam-B_3492 Pfam-B_53
    ---MFRIVVFLALI-----CSVVAL----------------------------------SCRR---------------------C-------DKS--KCKA--PVGCR-----GGTVTGICGCCNVCAKVKGQKCGGRWNMLGRC----DKGLTCQKE-----FTGKPDRR------PGSGV-----CRVKFSCTC-------------------------------------------
    >CNIT|CNIT_1987399994_17D08_3
    -----RNPGYVWVL-----FAVVLF--------------AA-------FSSL-------KALR---------------------C---AR--PADL-VCPPR-PDCTE-----YG--QELCGFCDVCRLSVGAPC-DAWKA--PC----ESHLVCRTA-EGGDYIGRPPWNL-----DHSGV-----CSIPDPR---------------------------------------------
    >CAPI|Contig14_4 IGFBP
    --MPTMKSVVFHSV-----VAVILL-----------------------AAAT-------ESLR---------------------C---GC-VNHP--TCQAP-PVCEE-----YG--RELCGCCDVCKLGFGEVC-NSRNA--PC----MSGLVCLAP-DGQVYGERPLWHLF----EVQGV-----CVKLPPSEVSV-----------------------------------------
    >CAPI|Contig14_5
    MYLANFTTSKIIMP-----PFSVSS------------KPAI-------LRQL-------KHLH-------------------TDT---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPSGAKQTSPDIH--------GA----LRELHTSPN-PSLQTSQHPQSSR-----PYSSQ-----TGGA------------------------------------------------
    >CAPI|Contig10_6
    --------MPPFSV-----SSKAAI-----------------------LRHL-------KHLH-----------------T--DA---SLGGNFT--HTPWT-SNKCH-----SG--LSPYTCPXGRXSRQVQTYMVRYVN--YT----LLQIQVCRH-HS-IRKALAHILRKLEELDMLGG---------------------------------------------------------
    >LRUG|comp11289_c0_seq1_19
    -------RSHLQQL-----SNQIDD--------------AP-------LGAL-------HSLA---------------------V-------HRYL-PGEWY-TKV-------VFRETRQMVCHIKVQINRVANTWNSFDPLGDCTHHNNRTVSCRIPGNPVSVDKGVERHKI----YSSGY-----AQQRSHTRL-------------------------------------------
    >FCAU|comp43822_c0_seq1_len_518_7
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq2_len_507_8
    -----MAMEKTVYL-----CGVLAVVL-----------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >FCAU|comp43822_c0_seq3_len_454_9
    ---MAMEKSVYLRG-----VLVVAL-------------FTV-------FSSL-------DALS---------------------C---YR--PRDL-VCPSR-PNCTE-----YG--RELCGFCDLCRLTVGDPC-DSWKA--PC----ESNLVCRTD-DGNVYFGRPPWFLMLEQTSVPGV-----CTVPDTPVNSTIS---------------------------------------
    >HNAG|Contig2818_10 IGFBP
    -----MARAILLLV-----CLTSTL--------------VS-------LSMV-------DCLR---------------------C---SC-LNVQT-TCPPT-PECHE-----YG--RRLCGCCDVCKLELGETC-NNGNA--PC----MSGLKCNTS-EG-LFDGRPPWFMF----DAEGQ-----CVDQ------------------------------------------------
    >LHYA|Contig1125_11
    ------MNSLVALL-----SMVIVG--------------AL-------AGGY-------DCPD-------------------DDC---PV-------TCPEY-GDCID-----MR--SYPCACCADCIKPVGEDCSGEFVS---C----DNGLLCNEK-HICVVHADMTEAAR----QRRGI-----HKK-------------------------------------------------
    >LRUG|comp46482_c0_seq1_12
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq2_13
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq3_14
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq4_15
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFGI-----------------------------------------------
    >LRUG|comp46482_c0_seq5_16
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp46482_c0_seq6_17
    -----MARELFLIV-----CLTVTL--------------TS-------LPMI-------ECLR---------------------C---SC-VNFQG-TCAPT-PDC-E-----VG--RRLCGCCDVCKLEFGEVC-NAWNP--PC----ETGLQCITP-EG-SFDGRPNWWM-----RYEGE-----CGFD------------------------------------------------
    >LRUG|comp11289_c0_seq1_18 IGFBP
    -----MDKLPLLLF-----ITVYGF--------------SV-------VHSL-------SCIS---------------------C------ASLP--PCPPR-PDCQE-----YG--RKQCGCCDVCNLPEGRNC-STYSQ--PC----LSGLLCDTP-SG-AFHGKPPWYTI----HLEGT-----CVQPENAMHPMGHRLFG-----------------------------------
    >LRUG|comp47698_c0_seq1_20
    LHFSQDSHGGLNAV-------HISP-----------------------IPSLQTSQQPQSRRP---------------------Y---SL-QSGVGAQVPWK-FTQEH-----RRHSTDGRGINVAVRQIIRKSS--------RA----ISNVIFLRI-VWIWNDGSAILYCTVILKSFRR------HSQHVSLVCL------------------------------------------
    >LGIG|171051_21
    -----MVIYVLLPIGSENRCAVTPTKL-----------YKILKKCAGSVNGFILKNDVGGCKP---------------------C-------PAVP-NCAPL-SRKY------CVVKRRPCGCCDECAGRHKDPC-DRYSV--PC----DDQFECVND-KGYGLKHIENDL------DFHGV-----CRFRARKGQFPYISRRSRPYIIKG----------------------------
    >LGIG|228219_22
    ------MQWLLLTI-----LALATL--------------GS-------VAAL-------SCRQ---------------------C-------QPDH-ECPAL-PNDGK-----CHPARRPCSCCDECAGLRGDDC-GPFTA--RC----HPDLVCVNE-NG-EEKETVQWHE-----KFKGV-----CKRSKAERAERACKRLNQLFRLFNSTNGRPGRFLRRWLKRLYKRCLAKYNVN
    >LGIG|228220_23
    ----MASVTIYMIL-----ILSVTS--------------VV-------FSLS--------CVG---------------------C-------DKAA-PCPLL-PETKE-----CFKARAPCACCDTCASGLGAEC-GALKI--RC----HPDYVCVNK-DG-VEKVMIPWFMM----GFKGT-----CMPTGTGKIV------------------------------------------
    >LGIG|152660_24
    ----MASMIKLSIL-----CSMIAT-----------------------VTSL-------SCVA---------------------C-------PKDQ-VCDPL-PESAE-----CFPAKAACACCKTCAGRFGDKC-STLSV--RC----HPDFVCVNE-DG-VELSSVPWYTF----DFRGI-----CVRDRCPEPSTGGDGGIVPLPVGK----------------------------
    >LGIG|238970_25
    -----MKFGVGFLL-----SCLVAL-----------NTVQN-------MLAL-------SCLP---------------------C-------DFDTLKCSPL-PDDDD-----CFPAYTPCGCCPQCAGEEDDFC-DNFTV--RC----HPDLVCVNA-TG-FEKKFVYWYEF----DFKGT-----CQESELETE-----------------------YEYEYEENETKK--------
    >PFUC|pfu_aug1.0_374.1_29219.t1_26 IGFBP
    --MRNLRFSFFVIS-----VIGVVI--------------CD-------AGRH--------CPS---------------------C--------GKL-TTSGL-PDCTKHLDIGCERVRRPCSCCTTCARNIGETC-SGRTP--RC----ASGLMCVNG-HGEALKTIPRNMR-----HYKGV-----CQNVEVCPVVVENLEVDDRRFGSDHDSSRV----------------------
    >PFUC|pfu_aug1.0_374.1_29219.t1_27
    ---SHSAGVMIRTK-----TTVIHL----------------------------------QILN------------------NHRA---YFYILTDSFVMPHI-PGDCF-----KGLSMAIYTHQSTCT----PRC-SSGAS--------FSDISSTSS-------TTRAWSANTLTANVQMF-----GTIRQS----------------------------------------------
    >SCON|comp34952_c0_seq1_28 IGFBP
    --MVAMKSVVLYSV-----AMAIFF-----------------------TLGA-------ESLR---------------------C---SCGLYPP--TCPTP-PICEE-----YG--RVLCGCCDVCKLAFGEVC-NSWNA--PC----KTGLRCLTS-DGQVYNGRPPWFKF----SEEGV-----CVQLPRGSPDQ-----------------------------------------
    >WARG|GJN0W6B01BRRYQ_29
    --MSYTAPRLAATT-----CFVVALVL-----------LQI-----SEVSSL-------RCLP---------------------C-------APDV-ECPTL-PDDCQ-----PT--KRPCGCCPECKGKVGAQC-SNMGVELRV----GSDVCQQAW-SG-HSCRQMALLV-----GFKG----------------------------------------------------------
    >Patella_vulgata_HE962376.1_30 IGFBP
    ------MKTLFLHI-----CVVLVVIV-----------VTG-------SDAL-------SCAP---------------------CLNHRF-LPRKR-ICPKL-PKTCE-----PA--SKPCGCCPVCAGKVGDRC-GRFQV--RC----EKGLTCQSQSEPTSLLGAYTISFNY---LRQGI-----CRKP------------------------------------------------
    I feel like this should be a relatively simple awk script, but I'm not sure how to do it. Here's what I've come up with so far. It isn't working, and it wouldn't take the sequences from the fasta file anyway, only the headers:
    Code:
    sed '/^#/d' result_pfamA_and_B_annotation.txt > annotation.txt
    sed -i '/^$/d' annotation.txt
    sed -i 's/>//g' result_AA.fas
    awk -F'\t' -v OFS=' ' '
        NR==FNR     { a[$1]=$0; next } 
       {if (a[$1]) { print a[$1],$7}}' result_AA.fas annotation.txt
    Any assistance would be *greatly* appreciated!!

    Thanks,
    Kevin
