SeqMonk: Export features (e.g. CpG Islands) from Ensembl for import into SeqMonk?

  • SeqMonk: Export features (e.g. CpG Islands) from Ensembl for import into SeqMonk?


    I am using SeqMonk to view reads mapped to a genome. I would like to import an extra annotation set (CpG Islands) to use in my analysis as is described in the SeqMonk help file under the heading "Importing Extra Annotation".

    I would like to use the CpG Island annotation set that is displayed in the Ensembl Genome Browser, but I can't figure out how to download just the CpG Island data track as a GFF file from Ensembl.

    Can anyone tell me how to download just the CpG Island data from the Ensembl Genome Browser (or Ensembl FTP) in GFF format so that I can use it in SeqMonk?

    Thanks in advance,

  • #2
    I don't think there is an easy way to do this from within Ensembl. You can use their BioMart interface to bulk download gene-related information, but this doesn't work for other feature types. Their recommendation is to use their Perl API to pull down this kind of data, but if that's not something you're comfortable with then I guess that's not much help.

    You can actually get at this kind of data much more easily from UCSC. Their table browser system allows you to export any of the annotation tracks into a simple text format which should be easy to import into SeqMonk.
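For anyone who wants the Table Browser export in GFF form, here is a minimal Python sketch of the conversion. It assumes the export is the CpG island table saved with all fields, tab-separated, with a header line starting with `#`, and columns in the order bin, chrom, chromStart, chromEnd, name, and so on. The function name `ucsc_cpg_to_gff`, the source label "UCSC" and the feature type "CpG_island" are illustrative choices, not anything SeqMonk requires, and I have not checked how SeqMonk displays the last column. The conversion shifts the start by one (UCSC starts are 0-based, GFF is 1-based) and optionally strips the "chr" prefix, which may be needed if your genome uses Ensembl-style chromosome names.

```python
import csv
import io


def ucsc_cpg_to_gff(table_text, strip_chr_prefix=True):
    """Convert a tab-separated UCSC CpG island export to a list of GFF lines.

    Assumed column order: bin, chrom, chromStart, chromEnd, name, ...
    """
    gff_lines = []
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        chrom, start, end, name = row[1], int(row[2]), int(row[3]), row[4]
        if strip_chr_prefix and chrom.startswith("chr"):
            chrom = chrom[3:]  # "chr1" -> "1"
        # UCSC starts are 0-based and ends are exclusive; GFF is 1-based
        # and inclusive, so only the start needs shifting.
        fields = [chrom, "UCSC", "CpG_island", str(start + 1), str(end),
                  ".", ".", ".", name]
        gff_lines.append("\t".join(fields))
    return gff_lines


if __name__ == "__main__":
    # Hypothetical single-row export, for illustration only.
    example = (
        "#bin\tchrom\tchromStart\tchromEnd\tname\n"
        "585\tchr1\t28735\t29810\tCpG: 116\n"
    )
    for line in ucsc_cpg_to_gff(example):
        print(line)
```

Note that this only changes the file format. It does nothing about differences between assemblies, so the coordinates are only meaningful if the UCSC track and your SeqMonk genome are the same assembly.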

    As an aside, which genome are you using? CpG islands should be a standard track in the latest releases of genomes which contain this track.


    • #3

      Thank you for the information. Sorry for the delay in my response.

      I agree with you that the UCSC Table Browser is a great resource, and I have used it before for exporting specific tracks, including CpG Islands.

      The reason I wanted to get the CpG Island information from Ensembl was that I have imported a custom genome (pig; Sus scrofa 10.2) into SeqMonk by modifying the EMBL formatted files from as you had described in your help file "Creating a Custom Genome".

      Currently, the UCSC Table Browser supports the Nov. 2009 assembly (SGSC Sscrofa 9.2/Sscrofa2), so I wasn't sure if the CpG Islands exported from UCSC would be compatible with the current S. scrofa 10.2 genome.

      I have to admit that the differences in nomenclature for genomes of the same species from NCBI, Ensembl, etc. are still confusing to me, even though I have tried on numerous occasions to determine compatibility. For this reason, I wanted to obtain all data that I am going to put into SeqMonk from the same place. Yes, it's ignorance on my part, but I don't want to risk generating erroneous results.

      Thank you for your fast response and advice. If you have any more input, I would be glad to hear it.



      • #4
        For this you'd have to go to the Ensembl API, though as pre.ensembl isn't in a release yet I'm not actually sure how you'd connect to that database to be able to run queries.

        Hopefully the pig assembly will make it into a full Ensembl release soon, at which point we'll add it to our list of supported genomes and you'll have the CpG island tracks present.


        • #5
          Thanks, Simon. If I figure out how, I will post the method I used here in case others want to obtain similar data.



          • #6
            Hello jjw,

            I've just read this:

            Originally posted by jjw14

            I have imported a custom genome (pig; Sus scrofa 10.2) into SeqMonk by modifying the EMBL formatted files from as you had described in your help file "Creating a Custom Genome".

            I am currently trying to do the same thing and could use some tips.
            I have downloaded the EMBL files (from 0.dat to 7000.dat) into my Genome directory. SeqMonk seems to open them without much trouble, even though each file contains several scaffolds and AC lines.
            SeqMonk assigns many of the scaffolds to their specific chromosomes, so the genome is almost recreated. But other scaffolds are not assigned to any chromosome, and SeqMonk treats them as very small independent chromosomes.

            My questions are: do you get the same result? If not, how did you modify the files to get a fully assembled genome?

            Thanks for your help.


            • #7
              Which genome are you trying to use? I've just seen that the pig 10.2 assembly is now released into the main Ensembl, so I've just kicked off the processing scripts to add it to the supported genomes in SeqMonk. It should be there late tonight or early tomorrow.

              In general you can use the EMBL files exported by Ensembl, but you only want to use the contigs which form part of the main chromosomes. There are a number of short scaffolds which aren't included in the main assembly (normally with names ending in _random), and it is these which will mess up the genome building in SeqMonk because it will treat each of these as a separate chromosome. In the API you can pull down slices only of type 'chromosome', but from the exported EMBL files you'll need to look at the names of the chromosomes and filter out those which aren't actually part of the main assembly.
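As a rough illustration of that filtering step, here is a minimal Python sketch. It rests on some assumptions: each record in the exported file ends with a `//` line (standard for EMBL flat files), and the first `AC` line of a record looks like `AC   chromosome:ASSEMBLY:NAME:START:END:STRAND`, so the sequence name is the third colon-separated field. Check a few of your own files before relying on that layout. Rather than testing for a `_random` suffix, the sketch keeps only names from an explicit list you supply, which is a swapped-in but stricter approach. All function names here are hypothetical.

```python
def split_records(text):
    """Yield EMBL records as lists of lines; each record ends with '//'.

    Any trailing lines without a closing '//' are dropped.
    """
    record = []
    for line in text.splitlines():
        record.append(line)
        if line.strip() == "//":
            yield record
            record = []


def sequence_name(record_lines):
    """Return the sequence name from the record's first AC line, or None.

    Assumed layout: "AC   chromosome:ASSEMBLY:NAME:START:END:STRAND".
    """
    for line in record_lines:
        if line.startswith("AC"):
            parts = line[2:].strip().rstrip(";").split(":")
            # Fall back to the whole accession if the layout differs.
            return parts[2] if len(parts) >= 3 else parts[0]
    return None


def keep_main_chromosomes(text, wanted):
    """Return EMBL text containing only records whose name is in `wanted`."""
    kept = []
    for record in split_records(text):
        if sequence_name(record) in wanted:
            kept.extend(record)
    return "\n".join(kept) + ("\n" if kept else "")
```

You would call `keep_main_chromosomes` on the contents of each exported file, passing a set of the chromosome names you want to keep (for example `{"1", "2", "X"}`), and write out any non-empty result to a new file in the genome folder.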


              • #8
                The Sus scrofa 10.2 genome assembly should now be available as a supported genome.


                • #9
                  Thanks for adding this genome, and for your quick answers.