  • SeqMonk: Export features (e.g. CpG Islands) from Ensembl for import into SeqMonk?

    Hello,

    I am using SeqMonk to view reads mapped to a genome. I would like to import an extra annotation set (CpG Islands) to use in my analysis as is described in the SeqMonk help file under the heading "Importing Extra Annotation".

    I would like to use the CpG Island annotation set that is displayed in the Ensembl Genome Browser, but I can't figure out how to download just the CpG Island data track as a GFF file from Ensembl.

    Can anyone tell me how to download just the CpG Island data from the Ensembl Genome Browser (or Ensembl FTP) in GFF format so that I can use it in SeqMonk?

    Thanks in advance,
    jjw

  • #2
    I don't think there is an easy way to do this from within Ensembl. You can use their BioMart interface to bulk download gene-related information, but this doesn't work for other feature types. Their recommendation is to use their Perl API to pull down this kind of data, but if that's not something you're comfortable with then I guess that's not much help.

    You can actually get at this kind of data much more easily from UCSC. Their table browser system allows you to export any of the annotation tracks into a simple text format which should be easy to import into SeqMonk.
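As a rough illustration of the UCSC route, here is a minimal sketch that turns a tab-delimited Table Browser export of the CpG island table into GFF-style lines. It assumes the export begins with a header line naming at least the `chrom`, `chromStart`, `chromEnd` and `name` columns (as UCSC's `cpgIslandExt` table does); the function name `ucsc_cpg_to_gff` and the `source`/`feature` labels are made up for this example. UCSC tables use 0-based, half-open coordinates, so the start is shifted by one to give the 1-based, inclusive coordinates GFF expects.

```python
import csv


def ucsc_cpg_to_gff(table_text, source="UCSC", feature="CpG_island"):
    """Convert a tab-delimited UCSC table export into a list of GFF lines.

    Assumes the first line is a header such as
    "#bin<TAB>chrom<TAB>chromStart<TAB>chromEnd<TAB>name<TAB>...".
    """
    lines = table_text.splitlines()
    header = lines[0].lstrip("#").split("\t")
    gff_lines = []
    for row in csv.DictReader(lines[1:], fieldnames=header, delimiter="\t"):
        start = int(row["chromStart"]) + 1  # 0-based start -> 1-based start
        end = int(row["chromEnd"])          # half-open end == inclusive end
        gff_lines.append("\t".join([
            row["chrom"], source, feature,
            str(start), str(end),
            ".", ".", ".",                  # score, strand, phase not used
            "Name=" + row["name"],
        ]))
    return gff_lines
```

The chromosome names in the output are whatever UCSC uses (for example `chr1`), so they may need adjusting to match the names in the genome you have loaded.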

    As an aside, which genome are you using? CpG islands should be a standard track in the latest releases of genomes which contain this track.


    • #3
      Simon,

      Thank you for the information. Sorry for the delay in my response.

      I agree with you that the UCSC Table Browser is a great resource, and I have used it before for exporting specific tracks, including CpG Islands.

      The reason I wanted to get the CpG Island information from Ensembl was that I have imported a custom genome (pig; Sus scrofa 10.2) into SeqMonk by modifying the EMBL-formatted files, as you had described in your help file "Creating a Custom Genome".

      Currently, the UCSC Table Browser supports the Nov. 2009 assembly (SGSC Sscrofa 9.2/Sscrofa2), so I wasn't sure if the CpG Islands exported from UCSC would be compatible with the current S. scrofa 10.2 genome.

      I have to admit that the differences in nomenclature for genomes of the same species from NCBI, Ensembl, etc. are still confusing to me, even though I have tried on numerous occasions to determine compatibility. For this reason, I wanted to obtain all data that I am going to put into SeqMonk from the same place. Yes, it's ignorance on my part, but I don't want to risk generating erroneous results.

      Thank you for your fast response and advice. If you have any more input, I would be glad to hear it.

      jjw


      • #4
        For this you'd have to go to the Ensembl API, though as pre.ensembl isn't in a release yet I'm not actually sure how you'd connect to that database to be able to run queries.

        Hopefully the pig assembly will make it into a full Ensembl release soon, at which point we'll add it to our list of supported genomes and you'll have the CpG island tracks present.


        • #5
          Thanks, Simon. If I figure out how, I will post the method I used here in case others want to obtain similar data.

          jjw


          • #6
            Hello jjw,

            I've just read this:

            Originally posted by jjw14

            I have imported a custom genome (pig; Sus scrofa 10.2) into SeqMonk by modifying the EMBL-formatted files, as you had described in your help file "Creating a Custom Genome".

            jjw
            I am currently trying to do the same thing and could use some tips.
            I have downloaded the EMBL files (0.dat to 7000.dat) into my Genome directory. SeqMonk seems able to open them without much trouble, even though each file contains several scaffolds and AC lines.
            SeqMonk assigns many of the scaffolds to their chromosomes, so the genome is almost recreated. However, some scaffolds are not assigned to any chromosome, and SeqMonk treats each of them as a very small independent chromosome.

            My questions are: do you get the same result? If not, how did you modify the files to get a fully assembled genome?

            Thanks for your help.


            • #7
              Which genome are you trying to use? I've just seen that the pig 10.2 assembly has now been released into the main Ensembl, so I've just kicked off the processing scripts to add it to the supported genomes in SeqMonk. It should be there late tonight or early tomorrow.

              In general you can use the EMBL files exported by Ensembl, but you only want to use the contigs which form part of the main chromosomes. There are a number of short scaffolds which aren't included in the main assembly (normally with names ending in _random), and it is these which will mess up the genome building in SeqMonk, because it will treat each of them as a separate chromosome. In the API you can pull down only slices of type 'chromosome', but with the exported EMBL files you'll need to look at the names of the chromosomes and filter out those which aren't actually part of the main assembly.
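The filtering described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes each EMBL record ends with a `//` line and that the `AC` line has the colon-separated layout `coord_system:assembly:name:start:end:strand` (for example `AC   chromosome:Sscrofa10.2:1:1:100:1`), which should be checked against your own files. The function names are hypothetical and are not part of SeqMonk or the Ensembl API.

```python
def split_embl_records(text):
    """Yield each EMBL record, i.e. everything up to and including a '//' line."""
    record = []
    for line in text.splitlines(keepends=True):
        record.append(line)
        if line.strip() == "//":
            yield "".join(record)
            record = []


def is_main_chromosome(record):
    """Heuristic check on the first AC line of a record.

    Assumed AC layout: coord_system:assembly:name:start:end:strand.
    Keeps records whose coord_system is 'chromosome' and whose name
    does not end in '_random'.
    """
    for line in record.splitlines():
        if line.startswith("AC"):
            fields = line[2:].strip().rstrip(";").split(":")
            if len(fields) < 3:
                return False
            coord_system, name = fields[0], fields[2]
            return coord_system == "chromosome" and not name.endswith("_random")
    return False


def filter_embl(text):
    """Return only the records that look like part of the main assembly."""
    return "".join(r for r in split_embl_records(text) if is_main_chromosome(r))
```

One way to use it would be to read each .dat file, pass its contents through `filter_embl`, and write any non-empty result to a new genome folder, leaving the original downloads untouched.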


              • #8
                The Sus scrofa 10.2 genome assembly should now be available as a supported genome.


                • #9
                  Thanks for adding this genome, and for your quick answers.
