  • Extracting information from EMBL flat file

    Hey guys,
    I have the proteome set of C. elegans, which I retrieved from UniProt as an EMBL flat file like this:

    EMBL_FLAT_FILE_CELEGANS
    Code:
    ID   14331_CAEEL             Reviewed;         248 AA.
    AC   P41932; Q21537;
    DT   01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
    DT   22-JUL-2008, sequence version 2.
    DT   28-NOV-2012, entry version 95.
    DE   RecName: Full=14-3-3-like protein 1;
    DE   AltName: Full=Partitioning defective protein 5;
    GN   Name=par-5; Synonyms=ftt-1; ORFNames=M117.2;
    OS   Caenorhabditis elegans.
    OC   Eukaryota; Metazoa; Ecdysozoa; Nematoda; Chromadorea; Rhabditida;
    OC   Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
    DR   GO; GO:0005938; C:cell cortex; IDA:WormBase.
    DR   GO; GO:0005634; C:nucleus; IDA:WormBase.
    DR   GO; GO:0045167; P:asymmetric protein localization involved in cell fate determination; IMP:WormBase.
    DR   GO; GO:0001708; P:cell fate specification; IMP:WormBase.
    DR   GO; GO:0043053; P:dauer entry; IMP:WormBase.
    DR   GO; GO:0008340; P:determination of adult lifespan; IMP:WormBase.
    DR   GO; GO:0009792; P:embryo development ending in birth or egg hatching; IMP:WormBase.
    DR   GO; GO:0000132; P:establishment of mitotic spindle orientation; IMP:WormBase.
    DR   GO; GO:0030590; P:first cell cycle pseudocleavage; IMP:WormBase.
    DR   GO; GO:0035188; P:hatching; IMP:WormBase.
    DR   GO; GO:0007126; P:meiosis; IMP:WormBase.
    DR   GO; GO:0002009; P:morphogenesis of an epithelium; IMP:WormBase.
    DR   GO; GO:0009949; P:polarity specification of anterior/posterior axis; IMP:WormBase.
    DR   GO; GO:0035046; P:pronuclear migration; IMP:WormBase.
    DR   GO; GO:0006898; P:receptor-mediated endocytosis; IMP:WormBase.
    DR   GO; GO:0007346; P:regulation of mitotic cell cycle; IMP:WormBase.
    DR   GO; GO:0010070; P:zygote asymmetric cell division; IMP:WormBase.
    SQ   SEQUENCE   248 AA;  28191 MW;  ABBE0DA27D9341AF CRC64;
         MSDTVEELVQ RAKLAEQAER YDDMAAAMKK VTEQGQELSN EERNLLSVAY KNVVGARRSS
         WRVISSIEQK TEGSEKKQQL AKEYRVKVEQ ELNDICQDVL KLLDEFLIVK AGAAESKVFY
         LKMKGDYYRY LAEVASEDRA AVVEKSQKAY QEALDIAKDK MQPTHPIRLG LALNFSVFYY
         EILNTPEHAC QLAKQAFDDA IAELDTLNED SYKDSTLIMQ LLRDNLTLWT SDVGAEDQEQ
         EGNQEAGN
    //
    NOTE: the file shown here is shortened.


    I also have another file with a lot of full gene names, and I would like to extract the GO information for these genes from the EMBL flat file. In other words, does anyone here have a script that reads my file of full gene names (one per line), finds each name in the EMBL flat file, and extracts its GO terms? The desired output is the full gene name followed by its gene ontology entries, separated by commas (including every ontology).

    OUTPUT
    Code:
    GENE_A,GO; GO:0001708; P:cell fate specification; IMP:WormBase, GO; GO:0043053; P:dauer entry; IMP:WormBase,GO; GO:0008340; P:determination of adult lifespan; IMP:WormBase,GO; GO:0009792; P:embryo development ending in birth or egg hatching; IMP:WormBase, GO; GO:0000132; P:establishment of mitotic spindle orientation; IMP:WormBase
    If you guys have other ideas, that would be nice too!

    Cheers.

  • #2
    Do you know any scripting/programming language? Both BioPerl and Biopython (and likely other libraries too) could assist you with their EMBL parsers - although in this case you could do this without a full parser.
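
    As a rough illustration of the "without a full parser" route, here is a minimal Python sketch. It assumes the line codes (DE, DR, //) start in column 1 as in a normal UniProt flat file, that the names in your list match the "RecName: Full=" value exactly, and that only the first RecName of each entry matters. The function names and the sample data are made up for this example.

```python
import io

PREFIX = "RecName: Full="


def parse_go_by_full_name(handle):
    """Map each entry's first 'RecName: Full=' name to its 'DR   GO;' payloads."""
    go_by_name = {}
    name, go_terms = None, []
    for line in handle:
        line = line.rstrip("\n")
        # Two-letter line code, then the data starting at column 6.
        code, payload = line[:2], line[5:].strip()
        if code == "DE" and name is None and payload.startswith(PREFIX):
            name = payload[len(PREFIX):].rstrip(";")
        elif code == "DR" and payload.startswith("GO;"):
            # Drop the trailing period so fields join cleanly with commas.
            go_terms.append(payload.rstrip("."))
        elif line.startswith("//"):
            # End of entry: store what was collected and reset.
            if name is not None:
                go_by_name.setdefault(name, []).extend(go_terms)
            name, go_terms = None, []
    return go_by_name


def format_rows(names, go_by_name):
    """Build 'name,GO; ...,GO; ...' rows for every listed name that has GO terms."""
    rows = []
    for name in names:
        terms = go_by_name.get(name)
        if terms:
            rows.append(",".join([name] + terms))
    return rows


# Shortened sample entry (hypothetical test data based on the post above).
SAMPLE = """\
ID   14331_CAEEL             Reviewed;         248 AA.
DE   RecName: Full=14-3-3-like protein 1;
DE   AltName: Full=Partitioning defective protein 5;
DR   GO; GO:0005938; C:cell cortex; IDA:WormBase.
DR   GO; GO:0005634; C:nucleus; IDA:WormBase.
//
"""

if __name__ == "__main__":
    go_by_name = parse_go_by_full_name(io.StringIO(SAMPLE))
    for row in format_rows(["14-3-3-like protein 1"], go_by_name):
        print(row)
```

    For real data you would pass an open file handle for the flat file instead of the in-memory sample, and read the name list from your second file with one stripped line per name. Names without a match are silently skipped here; you may prefer to report them.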



    • #3
      biomaRt (R/bioconductor): http://www.bioconductor.org/packages...l/biomaRt.html

      Code:
      library( biomaRt )
      
      uniprot = useMart( "unimart" );
      uniprot = useDataset( "uniprot", uniprot );
      
      # inspect these two tables for other fields to search on (filters) and to retrieve (attributes)
      
      filters = listFilters( uniprot );
      attributes = listAttributes( uniprot )
      
      useFilter = c( "accession" );
      useAttributes = c( "accession", "gene_name", "go_id", "go_name" );
      
      query = "P41932";
      df = getBM( mart=uniprot, values=c(query), filters=useFilter, attributes=useAttributes )
      
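      # start with the gene name, then append one "GO; id; name;" field per result row
      # (seq_len avoids looping over 1:0 when the query returns no rows)
      s = sprintf( "%s", df[1,2] );
      for( i in seq_len( nrow( df ) ) ) {
              s = sprintf( "%s,GO; %s; %s;", s, df[i,3], df[i,4] );
      }
      print( s );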
      If you have a text file full of accessions and want output with 1 gene per line:

      Code:
      query = read.table( "queryfile.txt" );
      # assume 1st column is accession
      
      query = as.character( query[,1] );
      
      mdf = getBM( mart=uniprot, values=query, filters=useFilter, attributes=useAttributes )
      
      uniqueAccs = unique( sort( as.character( mdf[,1] ) ) );
      outvec = vector( mode="character", length=0 );
      for( acc in uniqueAccs ) {
              df = mdf[ mdf[,1] == acc, ];
              # gene name first, then one "GO; id; name;" field per row for this accession
              s = sprintf( "%s", df[1,2] );
              for( i in seq_len( nrow( df ) ) ) {
                      s = sprintf( "%s,GO; %s; %s;", s, df[i,3], df[i,4] );
              }
              outvec = c( outvec, s );
      }
      write.table( outvec, "myoutfile.txt", quote=F, row.names=F, col.names=F );
      (the second code snippet depends on the preamble from the first)

      EDIT: I realize I did not answer your question, but this will get the job done without any need for downloading embl files.
      Last edited by jiaco; 11-30-2012, 06:05 AM.
