  • Extracting information from EMBL flat file

    Hey guys,
    I have a file containing the proteome of C. elegans that I retrieved from UniProt as an EMBL flat file, like this:

    EMBL_FLAT_FILE_CELEGANS
    Code:
    ID   14331_CAEEL             Reviewed;         248 AA.
    AC   P41932; Q21537;
    DT   01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
    DT   22-JUL-2008, sequence version 2.
    DT   28-NOV-2012, entry version 95.
    DE   RecName: Full=14-3-3-like protein 1;
    DE   AltName: Full=Partitioning defective protein 5;
    GN   Name=par-5; Synonyms=ftt-1; ORFNames=M117.2;
    OS   Caenorhabditis elegans.
    OC   Eukaryota; Metazoa; Ecdysozoa; Nematoda; Chromadorea; Rhabditida;
    OC   Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
    DR   GO; GO:0005938; C:cell cortex; IDA:WormBase.
    DR   GO; GO:0005634; C:nucleus; IDA:WormBase.
    DR   GO; GO:0045167; P:asymmetric protein localization involved in cell fate determination; IMP:WormBase.
    DR   GO; GO:0001708; P:cell fate specification; IMP:WormBase.
    DR   GO; GO:0043053; P:dauer entry; IMP:WormBase.
    DR   GO; GO:0008340; P:determination of adult lifespan; IMP:WormBase.
    DR   GO; GO:0009792; P:embryo development ending in birth or egg hatching; IMP:WormBase.
    DR   GO; GO:0000132; P:establishment of mitotic spindle orientation; IMP:WormBase.
    DR   GO; GO:0030590; P:first cell cycle pseudocleavage; IMP:WormBase.
    DR   GO; GO:0035188; P:hatching; IMP:WormBase.
    DR   GO; GO:0007126; P:meiosis; IMP:WormBase.
    DR   GO; GO:0002009; P:morphogenesis of an epithelium; IMP:WormBase.
    DR   GO; GO:0009949; P:polarity specification of anterior/posterior axis; IMP:WormBase.
    DR   GO; GO:0035046; P:pronuclear migration; IMP:WormBase.
    DR   GO; GO:0006898; P:receptor-mediated endocytosis; IMP:WormBase.
    DR   GO; GO:0007346; P:regulation of mitotic cell cycle; IMP:WormBase.
    DR   GO; GO:0010070; P:zygote asymmetric cell division; IMP:WormBase.
    SQ   SEQUENCE   248 AA;  28191 MW;  ABBE0DA27D9341AF CRC64;
         MSDTVEELVQ RAKLAEQAER YDDMAAAMKK VTEQGQELSN EERNLLSVAY KNVVGARRSS
         WRVISSIEQK TEGSEKKQQL AKEYRVKVEQ ELNDICQDVL KLLDEFLIVK AGAAESKVFY
         LKMKGDYYRY LAEVASEDRA AVVEKSQKAY QEALDIAKDK MQPTHPIRLG LALNFSVFYY
         EILNTPEHAC QLAKQAFDDA IAELDTLNED SYKDSTLIMQ LLRDNLTLWT SDVGAEDQEQ
         EGNQEAGN
    //
    NOTE: the file shown here is shortened.


    I also have another file with many gene full names, and I would like to extract the GO information for these genes from the EMBL flat file. In other words, does anyone here have a script that reads my file of gene full names (one per line), finds each name in the EMBL flat file, and extracts its GO terms? The desired output is the gene full name followed by its gene ontology entries, separated by commas (including every ontology).

    OUTPUT
    Code:
    GENE_A,GO; GO:0001708; P:cell fate specification; IMP:WormBase, GO; GO:0043053; P:dauer entry; IMP:WormBase,GO; GO:0008340; P:determination of adult lifespan; IMP:WormBase,GO; GO:0009792; P:embryo development ending in birth or egg hatching; IMP:WormBase, GO; GO:0000132; P:establishment of mitotic spindle orientation; IMP:WormBase
    If you guys have other ideas it would be nice!

    Cheers.

  • #2
    Do you know any scripting/programming language? Both BioPerl and Biopython (and likely other libraries too) could assist you with their EMBL parsers - although in this case you could do this without a full parser.
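
    As a rough illustration of the "without a full parser" route, here is a minimal Python sketch using only the standard library. It assumes that the "gene full name" is either the `DE   RecName: Full=` value or the `GN   Name=` value (the original post does not say which), that matching is exact and case-sensitive, and that each entry ends with `//`. The function names `index_go_terms` and `format_rows` are made up for this example.

```python
import re

# Patterns for the two line types treated as "names" (an assumption, see above).
# The character class stops at ';' or '{' so evidence tags are not captured.
REC_NAME = re.compile(r"^DE\s+RecName: Full=([^;{]+)")
GENE_NAME = re.compile(r"^GN\s+Name=([^;{]+)")


def index_go_terms(lines):
    """Return a dict mapping each name found in an entry to that entry's GO lines."""
    index = {}
    names, go_terms = [], []
    for raw in lines:
        line = raw.strip()  # tolerate indentation such as in a pasted forum post
        match = REC_NAME.match(line) or GENE_NAME.match(line)
        if match:
            names.append(match.group(1).strip())
        elif line.startswith("DR   GO;"):
            # Drop the 5-character "DR   " prefix and the trailing full stop.
            go_terms.append(line[5:].rstrip("."))
        elif line == "//":
            # End of entry: attach the collected GO lines to every name seen.
            for name in names:
                index.setdefault(name, []).extend(go_terms)
            names, go_terms = [], []
    return index


def format_rows(query_names, index):
    """Yield one comma-separated row per query name: name, then its GO lines."""
    for name in query_names:
        name = name.strip()
        if name:
            yield ",".join([name] + index.get(name, []))
```

    To run it on real data, pass open file handles, for example `index_go_terms(open("uniprot_celegans.dat"))` and `format_rows(open("gene_names.txt"), index)`; both file names are placeholders. Names with no match come out as a row containing only the name, which makes missing hits easy to spot.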

    • #3
      biomaRt (R/bioconductor): http://www.bioconductor.org/packages...l/biomaRt.html

      Code:
      library( biomaRt )
      
      uniprot = useMart( "unimart" );
      uniprot = useDataset( "uniprot", uniprot );
      
      # inspect these to see which filters (search fields) and attributes (returned columns) are available
      
      filters = listFilters( uniprot );
      attributes = listAttributes( uniprot )
      
      useFilter = c( "accession" );
      useAttributes = c( "accession", "gene_name", "go_id", "go_name" );
      
      query = "P41932";
      df = getBM( mart=uniprot, values=c(query), filters=useFilter, attributes=useAttributes )
      
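      # build one comma-separated line: gene name followed by each GO term
      # (seq_len() avoids a bogus 1:0 loop when the query returns no rows)
      s = sprintf( "%s", df[1,2] );
      for( i in seq_len( nrow( df ) ) ) {
              s = sprintf( "%s,GO; %s; %s;", s, df[i,3], df[i,4] );
      }
      cat( s, "\n" );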
      If you have a text file full of accessions and want output with 1 gene per line:

      Code:
      query = read.table( "queryfile.txt" );
      # assume 1st column is accession
      
      query = as.character( query[,1] );
      
      mdf = getBM( mart=uniprot, values=query, filters=useFilter, attributes=useAttributes )
      
      # collect one output line per accession found in the results
      uniqueAccs = unique( sort( as.character( mdf[,1] ) ) );
      outvec = vector( mode="character", length=0 );
      for( acc in uniqueAccs ) {
              # rows for this accession: one row per GO term
              df = mdf[ mdf[,1] == acc, ];
              s = sprintf( "%s", df[1,2] );
              for( i in seq_len( nrow( df ) ) ) {
                      s = sprintf( "%s,GO; %s; %s;", s, df[i,3], df[i,4] );
              }
              outvec = c( outvec, s );
      }
      write.table( outvec, "myoutfile.txt", quote=FALSE, row.names=FALSE, col.names=FALSE );
      (the second code snippet depends on the preamble from the first)

      EDIT: I realize I did not answer your question, but this will get the job done without any need for downloading embl files.
      Last edited by jiaco; 11-30-2012, 06:05 AM.
