  • SeqMonk: Export features (e.g. CpG Islands) from Ensembl for import into SeqMonk?

    Hello,

    I am using SeqMonk to view reads mapped to a genome. I would like to import an extra annotation set (CpG Islands) to use in my analysis as is described in the SeqMonk help file under the heading "Importing Extra Annotation".

    I would like to use the CpG Island annotation set that is displayed in the Ensembl Genome Browser, but I can't figure out how to download just the CpG Island data track as a GFF file from Ensembl.

    Can anyone tell me how to download just the CpG Island data from the Ensembl Genome Browser (or Ensembl FTP) in GFF format so that I can use it in SeqMonk?

    Thanks in advance,
    jjw

  • #2
    I don't think there is an easy way to do this from within Ensembl. You can use their BioMart interface to bulk download gene related information, but this doesn't work for other feature types. Their recommendation is to use their Perl API to pull down this kind of data, but if that's not something you're comfortable with then I guess that's not much help.

    You can actually get at this kind of data much more easily from UCSC. Their table browser system allows you to export any of the annotation tracks into a simple text format which should be easy to import into SeqMonk.
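As a rough illustration of the UCSC route, here is a minimal Python sketch that turns a tab-delimited Table Browser export into GFF-style lines. It assumes the export keeps its `#`-prefixed header row with `chrom`, `chromStart`, `chromEnd` and `name` columns (as in UCSC's cpgIslandExt table), that UCSC starts are 0-based, and that the target genome names chromosomes without the `chr` prefix. The function name, the `CpG_island` feature label and the `Name=` attribute are made up for this example, and the exact GFF flavour SeqMonk expects should be checked against its help file.

```python
import csv


def ucsc_table_to_gff(table_text, source="UCSC", feature="CpG_island",
                      strip_chr_prefix=True):
    """Convert a tab-delimited UCSC table dump into a list of GFF lines.

    Assumes the first non-blank line is a header such as
    '#bin<TAB>chrom<TAB>chromStart<TAB>chromEnd<TAB>name...'.
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].lstrip("#").split("\t")
    gff_lines = []
    for row in csv.DictReader(lines[1:], fieldnames=header, delimiter="\t"):
        chrom = row["chrom"]
        if strip_chr_prefix and chrom.startswith("chr"):
            chrom = chrom[3:]  # 'chr1' -> '1' (assumed target naming)
        start = int(row["chromStart"]) + 1  # 0-based start -> 1-based
        end = int(row["chromEnd"])          # end is already inclusive in 1-based terms
        name = (row.get("name") or "").strip()
        # Nine GFF columns: seqname, source, feature, start, end,
        # score, strand, frame, attributes
        gff_lines.append("\t".join([
            chrom, source, feature, str(start), str(end),
            ".", ".", ".", f"Name={name}",
        ]))
    return gff_lines


if __name__ == "__main__":
    example = (
        "#bin\tchrom\tchromStart\tchromEnd\tname\n"
        "585\tchr1\t28735\t29810\tCpG: 116\n"
    )
    for line in ucsc_table_to_gff(example):
        print(line)
```

The coordinates only make sense if the UCSC track and the genome loaded in SeqMonk come from the same assembly; the conversion does not translate between assemblies.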

    As an aside, which genome are you using? CpG islands should be a standard track in the latest releases of genomes which contain this track.



    • #3
      Simon,

      Thank you for the information. Sorry for the delay in my response.

      I agree with you that the UCSC Table Browser is a great resource, and I have used it before for exporting specific tracks, including CpG Islands.

      The reason I wanted to get the CpG Island information from Ensembl was that I have imported a custom genome (pig; Sus scrofa 10.2) into SeqMonk by modifying the EMBL formatted files from as you had described in your help file "Creating a Custom Genome".

      Currently, the UCSC Table Browser supports the Nov. 2009 assembly (SGSC Sscrofa 9.2/Sscrofa2), so I wasn't sure if the CpG Islands exported from UCSC would be compatible with the current S. scrofa 10.2 genome.

      I have to admit that the differences in nomenclature for genomes of the same species from NCBI, Ensembl, etc. are still confusing to me, even though I have tried on numerous occasions to determine compatibility. For this reason, I wanted to obtain all data that I am going to put into SeqMonk from the same place. Yes, it's ignorance on my part, but I don't want to risk generating erroneous results.

      Thank you for your fast response and advice. If you have any more input, I would be glad to hear it.

      jjw



      • #4
        For this you'd have to go to the Ensembl API, though as pre.ensembl isn't in a release yet I'm not actually sure how you'd connect to that database to be able to run queries.

        Hopefully the pig assembly will make it into a full Ensembl release soon, at which point we'll add it to our list of supported genomes and you'll have the CpG island tracks present.



        • #5
          Thanks, Simon. If I figure out how, I will post the method I used here in case others want to obtain similar data.

          jjw



          • #6
            Hello jjw,

            I've just read this:

            Originally posted by jjw14

            I have imported a custom genome (pig; Sus scrofa 10.2) into SeqMonk by modifying the EMBL formatted files from as you had described in your help file "Creating a Custom Genome".

            jjw
            I am currently trying to do the same thing and could use some tips.
            I have downloaded the EMBL files (from 0.dat to 7000.dat) into my Genome directory. SeqMonk seems to open them without much trouble, even though each file contains several scaffolds and AC lines.
            SeqMonk assigns many of the scaffolds to their specific chromosomes, so the genome is almost recreated. But some other scaffolds are not assigned to any chromosome, and SeqMonk treats them as very small independent chromosomes.

            My questions are: do you get the same result? If not, how did you modify the files to get a fully assembled genome?

            Thanks for your help.



            • #7
              Which genome are you trying to use? I've just seen that the pig 10.2 assembly is now released into the main Ensembl, so I've just kicked off the processing scripts to add it to the supported genomes in SeqMonk. It should be there late tonight or early tomorrow.

              In general you can use the EMBL files exported by Ensembl, but you only want to use the contigs which form part of the main chromosomes. There are a number of short scaffolds which aren't included in the main assembly (normally with names ending in _random), and it is these which will mess up the genome building in SeqMonk, because it will treat each of them as a separate chromosome. In the API you can pull down slices only of type 'chromosome', but from the exported EMBL files you'll need to look at the names of the chromosomes and filter out those which aren't actually part of the main assembly.
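The filtering described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: that each EMBL record ends with a `//` line, and that its `AC` line has the colon-separated form `type:assembly:name:start:end:strand` (for example `chromosome:Sscrofa10.2:1:1:100:1`). Check a few records in your own files before relying on it. The function names are invented for this example and are not part of SeqMonk or the Ensembl API.

```python
def split_embl_records(text):
    """Yield each EMBL record (terminated by a '//' line) as one string."""
    record = []
    for line in text.splitlines(keepends=True):
        record.append(line)
        if line.strip() == "//":
            yield "".join(record)
            record = []
    # Keep a trailing record that lacks its terminator, if it has content
    if any(part.strip() for part in record):
        yield "".join(record)


def accession_fields(record):
    """Return the colon-separated fields of the first AC line, or []."""
    for line in record.splitlines():
        if line.startswith("AC"):
            return line[2:].strip().rstrip(";").split(":")
    return []


def is_main_chromosome(record):
    """True if the AC line (assumed format) describes a main chromosome."""
    fields = accession_fields(record)
    if not fields or fields[0] != "chromosome":
        return False  # scaffolds, contigs, or records without an AC line
    name = fields[2] if len(fields) > 2 else ""
    return not name.endswith("_random")


def keep_main_chromosomes(text):
    """Return the EMBL text with only main-chromosome records kept."""
    return "".join(r for r in split_embl_records(text) if is_main_chromosome(r))
```

Running `keep_main_chromosomes` over the contents of each downloaded .dat file and writing the result to a new file would leave only the records that belong to the main assembly.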



              • #8
                The Sus scrofa 10.2 genome assembly should now be available as a supported genome.



                • #9
                  Thanks for adding this genome, and for your quick answers.
