Creating a Custom Genome
========================

SeqMonk comes with a large selection of mammalian genomes annotated by the
Ensembl project, but it is possible to work with other genomes in SeqMonk.

Genomes in SeqMonk are stored as EMBL format sequence files (with the actual
sequence removed - leaving just the annotation). It is therefore possible to
create new genomes from any annotated EMBL files.


Creating your sequence files
============================

The easiest way to create a new genome is to download an annotated EMBL
formatted file for the organism you're interested in and then make some
minor changes to the file.

The EMBL format is the standard format produced by the EMBL nucleotide
sequence database at:

http://www.ebi.ac.uk/embl/

Specifically there is a list of completed genomes at:

http://www.ebi.ac.uk/genomes/

..which contains links to EMBL formatted files of the final annotated genome
assemblies.

You should download the files which make up your genome of interest and save
each of them with a .dat file extension (eg ecoli.dat).


Modifying your sequence files
=============================

There is one change you need to make to get SeqMonk to recognise your files,
and a couple of optional changes which might make your life easier.

1) Changing the accession
-------------------------

SeqMonk requires that the accession line of each EMBL file has a certain
structure in order to be able to recognise and import the file. The accession
line will be very near the top of your file and will start with AC.

For example this is the header which shows the start of the E.coli genome,
and the third line is the accession.

ID   CP000948; SV 1; circular; genomic DNA; STD; PRO; 4686137 BP.
XX
AC   CP000948;
XX
PR   Project:20079;
XX
DT   26-MAR-2008 (Rel. 95, Created)
DT   06-JUN-2008 (Rel. 96, Last updated, Version 4)
XX
DE   Escherichia coli str. K12 substr. DH10B, complete genome.
XX
KW   .
XX
OS   Escherichia coli str. K-12 substr. DH10B
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.

The format for the accession line is a set of 6 fields separated by colons.
The fields are:

1 - The word 'chromosome' (always the same)
2 - Assembly name (for bacteria you might want to use the strain name)
3 - The name of the chromosome (for bacteria this might be 'Genome' for the
    main genome or the name of a plasmid)
4 - The starting base of this sequence (normally 1)
5 - The last base of this sequence (normally the sequence length)
6 - The direction of the sequence (should always be 1)

So for example in the above header a suitable accession line would be

AC   chromosome:K12DH10B:Genome:1:4686137:1

...which would replace the existing AC line.

2) Removing the sequence
------------------------

Since SeqMonk only uses the genome files to get annotation information you
can remove the actual sequence to save disk space. If you choose to leave the
sequence in place the file will still work, but it will take up more room.

To delete the sequence simply remove everything from just after the SQ line
up to (but not including) the end of file marker (//).

eg:

SQ   Sequence 4686137 BP; 1153640 A; 1190880 C; 1188801 G; 1152814 T; 2 other;
     agcttttcat tctgactgca acgggcaata tgtctctgtg tggattaaaa aaagagtgtc        60
     tgatagcagc ttctgaactg gttacctgcc gtgagtaaat taaaatttta ttgacttagg       120
     tcactaaata ctttaaccaa tataggcata gcgcacagac agataaaaat tacagagtac       180
     acaacatcca tgaaacgcat tagcaccacc attaccacca ccatcaccat taccacaggt       240

     [lots more sequence]

     atcaactgca agctttacgc gaacgagcca tgacattgct gacgactctg gcagtggcag   4686000
     atgacataaa actggtcgac tggttacaac aacgcctggg gcttttagag caacgagaca   4686060
     cggcaatgtt gcaccgtttg ctgcatgata ttgaaaaaaa tatcaccaaa taaaaaacgc   4686120
     cttagtaagt atttttc                                                  4686137
//

..would become:

SQ   Sequence 4686137 BP; 1153640 A; 1190880 C; 1188801 G; 1152814 T; 2 other;
//

3) Changing annotation tags
---------------------------

There are two pieces of information SeqMonk tries to extract from every
feature - a name and a description. It does this by reading the annotation
tags attached to each feature and working out which one to use. Depending on
how your genome has been annotated you may find that SeqMonk doesn't pick the
tag you would prefer it to use.

For the description SeqMonk requires a tag called 'description'. If this
isn't present then no description will be associated with the feature. If you
have a description for your features in a tag with another name then you
should rename that tag to 'description' so that SeqMonk will recognise it.

For example in the following feature the field you probably want to use for
the description is the 'product' tag:

FT   CDS             337..2799
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /gene="thrA"
FT                   /locus_tag="ECDH10B_0002"
FT                   /product="fused aspartokinase I and homoserine
FT                   dehydrogenase I"
FT                   /db_xref="GOA:B1XBC7"
FT                   /db_xref="InterPro:IPR018042"
FT                   /db_xref="UniProtKB/TrEMBL:B1XBC7"
FT                   /protein_id="ACB01207.1"
FT                   /translation="MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITN
FT                   TLSWKLGV"

You could therefore do a find / replace, replacing every instance of /product
with /description, to produce:

FT   CDS             337..2799
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /gene="thrA"
FT                   /locus_tag="ECDH10B_0002"
FT                   /description="fused aspartokinase I and homoserine
FT                   dehydrogenase I"
FT                   /db_xref="GOA:B1XBC7"
FT                   /db_xref="InterPro:IPR018042"
FT                   /db_xref="UniProtKB/TrEMBL:B1XBC7"
FT                   /protein_id="ACB01207.1"
FT                   /translation="MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITN
FT                   TLSWKLGV"

For the gene name SeqMonk has a range of tag names it can use, and these have
a preferred order. The tag which is found farthest up this list will be used
to annotate your feature:

1 = /name
2 = /db_xref="MarkerSymbol:"
3 = /gene
4 = /note="exon_id"
5 = /standard_name
6 = /note
7 = /source

If the tag containing your gene names isn't in this list, or if another tag
higher up the list would be used instead, then you should rename your tag to
/name so it will definitely take precedence.

If in the above example you wanted the 'locus_tag' field to be the name of
the feature you could replace /locus_tag with /name to give:

FT   CDS             337..2799
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /gene="thrA"
FT                   /name="ECDH10B_0002"
FT                   /description="fused aspartokinase I and homoserine
FT                   dehydrogenase I"
FT                   /db_xref="GOA:B1XBC7"
FT                   /db_xref="InterPro:IPR018042"
FT                   /db_xref="UniProtKB/TrEMBL:B1XBC7"
FT                   /protein_id="ACB01207.1"
FT                   /translation="MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITN
FT                   TLSWKLGV"


Installing your genome
======================

Once you have created and modified your genome files you can install them
into your genomes folder. The genomes folder is initially created under the
SeqMonk installation directory, but you can change its location in your
preferences to any directory you like.

The Genomes folder is a structured folder which has three levels:

Genomes
   |
   --Species
        |
        --Assembly
             |
             -- Annotation files

If you wanted to install a genome for Escherichia coli K12 you would create
the following directory structure:

Genomes
   |
   --Escherichia coli
        |
        -- K12

..and then save all of your EMBL format .dat files into the K12 directory.

Once you have done this the genome should appear in the list of installed
genomes when you create a new project.
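
If you have many files to convert, the three edits described above and the
installation step can be scripted instead of done by hand. The following is
only a minimal Python sketch, not part of SeqMonk. The function names
(accession_line, convert_embl, install_genome) are made up for illustration.
It assumes each file holds a single EMBL record, that the ID line ends with
the sequence length in the form "4686137 BP.", and that the ID line comes
before the AC line, as in the header shown earlier.

```python
import re
from pathlib import Path


def accession_line(assembly, chromosome, length):
    """Build the six-field AC line: chromosome:assembly:name:start:end:strand."""
    return f"AC   chromosome:{assembly}:{chromosome}:1:{length}:1"


def convert_embl(text, assembly, chromosome,
                 description_tag="product", name_tag=None):
    """Return the text of one EMBL record with the three edits applied."""
    out = []
    length = None
    in_sequence = False
    for line in text.splitlines():
        if in_sequence:
            # Step 2: drop sequence lines, but keep the // end marker.
            if line.startswith("//"):
                in_sequence = False
                out.append(line)
            continue
        if line.startswith("ID "):
            # Assumption: the ID line ends with the length, e.g. "4686137 BP."
            match = re.search(r"(\d+) BP\.", line)
            if match:
                length = int(match.group(1))
        elif line.startswith("AC "):
            # Step 1: replace the accession with the structured form.
            if length is None:
                raise ValueError("no sequence length found on the ID line")
            line = accession_line(assembly, chromosome, length)
        elif line.startswith("SQ "):
            # The SQ header itself is kept; everything after it is skipped.
            in_sequence = True
        elif line.startswith("FT "):
            # Step 3: rename a tag only where it starts a qualifier, so
            # text inside quoted values is left alone.
            line = re.sub(rf"^(FT\s+/){re.escape(description_tag)}=",
                          r"\g<1>description=", line)
            if name_tag:
                line = re.sub(rf"^(FT\s+/){re.escape(name_tag)}=",
                              r"\g<1>name=", line)
        out.append(line)
    return "\n".join(out) + "\n"


def install_genome(genomes_dir, species, assembly, filename, text):
    """Write the converted text to Genomes/<species>/<assembly>/<filename>."""
    target_dir = Path(genomes_dir) / species / assembly
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_text(text)
    return target
```

For the E.coli file used in the examples, calling convert_embl with the
assembly "K12DH10B", the chromosome name "Genome" and name_tag="locus_tag"
would write the same AC line as shown in step 1 and rename /product and
/locus_tag as in step 3. The result could then be passed to install_genome
with your genomes folder, "Escherichia coli", "K12" and "ecoli.dat". Check
the output by eye before relying on it, since files from other sources may
not match the assumptions listed above.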