  • rebrendi
    ng
    • May 2008
    • 78

    #16
    It seems that instead of the "gene start" and "gene end" columns, I should be looking at the "5' UTR Start" and "3' UTR End" columns. But these columns are not available, for whatever reason: when I select them in the BioMart web query, the output file contains only empty values for "5' UTR Start" and "3' UTR End".

    Any suggestions?

    Comment

    • laura
      Senior Member
      • Sep 2008
      • 151

      #17
      Can you give us an actual example?

      Comment

      • rebrendi
        ng
        • May 2008
        • 78

        #18
        For example, have a look at S. cerevisiae, gene YAL001C:
        BioMart gives its boundaries as 147594-151166,
        whereas the experimentally confirmed TSS-TTS for this gene are 147531-151187.

        (It's on the "-" strand, so the TTS comes first, then the TSS; either way, they differ from BioMart's "gene start" and "gene end" by several dozen bp. It's like this with almost all genes that I tested. I could not compare with BioMart's 5'-UTR and 3'-UTR, because BioMart returns empty values for these columns.)
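As a minimal sketch of the arithmetic behind "several dozen bp", the snippet below uses only the four coordinates quoted above. The function name `utr_offsets` and the strand encoding (1 / -1) are illustrative choices, not something BioMart or SGD provides.

```python
# Coordinates for YAL001C as quoted in this thread (minus strand).
biomart_start, biomart_end = 147594, 151166  # BioMart "gene start" / "gene end"
tts, tss = 147531, 151187                    # experimentally confirmed TTS and TSS


def utr_offsets(gene_start, gene_end, tss, tts, strand):
    """Return (5' offset, 3' offset) in bp between gene bounds and TSS/TTS."""
    if strand == -1:
        # On the minus strand the TSS lies at the high coordinate.
        return tss - gene_end, gene_start - tts
    return gene_start - tss, tts - gene_end


five_prime, three_prime = utr_offsets(biomart_start, biomart_end, tss, tts, strand=-1)
print(five_prime, three_prime)  # 21 63
```

So, for this gene, the reported boundaries fall 21 bp short of the TSS and 63 bp short of the TTS.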
        Last edited by rebrendi; 03-03-2012, 03:16 AM.

        Comment

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #19
          The BioMart boundaries coincide with those from Ensembl, which is to be expected. Ensembl mentions that for S. cerevisiae it just imports data from the Saccharomyces Genome Database (SGD). If you go to the SGD website, you get the same coordinates that you found in BioMart, but you'll notice that you can instead search for YAL001C_5UTR and YAL001C_3UTR as landmarks. Those seem to give (more or less) the coordinates that you listed. That suggests that this is due to a quirk of how SGD structures its data.

          If this is correct, then you might want to just parse whatever SGD has available. It'd then be good if someone notified Ensembl. I haven't seen this sort of thing happen with human or mouse data.

          Comment

          • laura
            Senior Member
            • Sep 2008
            • 151

            #20
            dpryan is right: the S. cerevisiae data is an import, so it might be an oddity of how SGD stores the data rather than of Ensembl.

            Comment

            • rebrendi
              ng
              • May 2008
              • 78

              #21
              dpryan,
              How do you know that this does not happen with mouse or human? Could you please tell me exactly how you download the 5'-UTR coordinates from BioMart? Or do you mean that for mouse/human, but not for yeast, "gene start" and "gene end" mean the 5'-UTR and 3'-UTR boundaries?

              Comment

              • laura
                Senior Member
                • Sep 2008
                • 151

                #22
                For Ensembl-annotated species, gene start is the 5'-most coordinate and gene end is the 3'-most coordinate. For many species this means the UTRs are included.
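A rough sketch of how such bounds map onto TSS and TTS positions, under two assumptions: that the exported start is never greater than the end (with the strand given separately as 1 or -1), and that the bounds already include the UTRs. The helper name `tss_tts_from_bounds` is hypothetical.

```python
def tss_tts_from_bounds(start, end, strand):
    """Map strand-agnostic bounds (start <= end) to (TSS, TTS) positions.

    Assumes strand is 1 or -1 and that the bounds already include the UTRs.
    """
    if start > end:
        raise ValueError("expected start <= end")
    if strand == 1:
        return start, end
    if strand == -1:
        # Minus strand: transcription begins at the high coordinate.
        return end, start
    raise ValueError("strand must be 1 or -1")


# The experimentally confirmed YAL001C span from earlier in the thread.
print(tss_tts_from_bounds(147531, 151187, -1))  # (151187, 147531)
```

If the bounds cover only the ORF, as appears to be the case for the imported yeast data, the same function returns the ends of the coding region instead.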

                Comment

                • rebrendi
                  ng
                  • May 2008
                  • 78

                  #23
                  Hmm, it would be nice to know whether that is so for mouse and human, actually...

                  Comment

                  • dpryan
                    Devon Ryan
                    • Jul 2011
                    • 3478

                    #24
                    You can trivially see the differences between the Ensembl mouse/human and S. cerevisiae genomes by looking at the genome browser. The Ensembl mouse/human genomes have obvious UTRs, but that's not the case for S. cerevisiae. You can also see this in the mouse and human GTF files, which I assume are the source of the BioMart information (I don't actually use it myself).
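One way to make the GTF check concrete is to list which genes carry any UTR records at all. This is only a sketch: the feature names `five_prime_utr`, `three_prime_utr` and `UTR` are assumptions about how a given file labels UTRs, and the example records are invented for illustration.

```python
import re
from collections import defaultdict

# Feature names assumed for UTR records; check what your file actually uses.
UTR_FEATURES = {"five_prime_utr", "three_prime_utr", "UTR"}


def utr_features_by_gene(gtf_lines):
    """Return {gene_id: set of UTR feature types} from GTF-formatted lines."""
    found = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        feature, attributes = fields[2], fields[8]
        match = re.search(r'gene_id "([^"]+)"', attributes)
        if match and feature in UTR_FEATURES:
            found[match.group(1)].add(feature)
    return dict(found)


# Made-up records for illustration only; not real annotation.
example = [
    '1\tsrc\tfive_prime_utr\t100\t150\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A";',
    '1\tsrc\tCDS\t151\t400\t.\t+\t0\tgene_id "GENE_A"; transcript_id "TX_A";',
    '1\tsrc\tCDS\t900\t1200\t.\t-\t0\tgene_id "GENE_B"; transcript_id "TX_B";',
]
print(utr_features_by_gene(example))  # {'GENE_A': {'five_prime_utr'}}
```

An open file handle can be passed in place of `example`. Genes missing from the result have no UTR records under the assumed feature names.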

                    Comment

                    • laura
                      Senior Member
                      • Sep 2008
                      • 151

                      #25
                      Human and mouse are Ensembl-annotated species, so yes, this is true for human and mouse. Go and look.

                      Comment

                      • rebrendi
                        ng
                        • May 2008
                        • 78

                        #26
                        Ok, great, thank you guys!

                        Comment

                        • rebrendi
                          ng
                          • May 2008
                          • 78

                          #27
                          There is still something that I do not understand, though:

                          For example, let's take the Ensembl mouse annotation: around 95,000 entries.

                          Now, let's look at some other database that contains all known mouse UTRs, e.g. http://utrdb.ba.itb.cnr.it/home/statistics
                          It has only around 25,000 entries for mouse.

                          It seems that determining the TSS position is technically more difficult than just defining the ORF.

                          Now, can someone explain which values are used in the BioMart output file for mouse, which contains ~95,000 entries, if only ~25,000 genes have had their TSS experimentally characterized?

                          How would I tell which "gene start" is the real gene start and which "gene start" is just the start of the ORF?

                          Comment

                          • dpryan
                            Devon Ryan
                            • Jul 2011
                            • 3478

                            #28
                            Originally posted by rebrendi:
                            There is still something that I do not understand, though:

                            For example, let's take the Ensembl mouse annotation: around 95,000 entries.

                            Now, let's look at some other database that contains all known mouse UTRs, e.g. http://utrdb.ba.itb.cnr.it/home/statistics
                            It has only around 25,000 entries for mouse.

                            It seems that determining the TSS position is technically more difficult than just defining the ORF.

                            Now, can someone explain which values are used in the BioMart output file for mouse, which contains ~95,000 entries, if only ~25,000 genes have had their TSS experimentally characterized?

                            How would I tell which "gene start" is the real gene start and which "gene start" is just the start of the ORF?

                            Have you read how Ensembl generates its annotation? Have you then compared it to how the database you linked to was created? You should be able to answer your own question.

                            Comment

                            • rebrendi
                              ng
                              • May 2008
                              • 78

                              #29
                              Originally posted by dpryan:
                              Have you read how Ensembl generates its annotation? Have you then compared it to how the database you linked to was created? You should be able to answer your own question.

                              You mean that the former is created automatically plus manually, and the latter is created manually? OK, but that does not answer my question.

                              I cannot check every individual gene the way I did in the example above, which is how I found out that the S. cerevisiae genes are annotated somewhat differently from the other species. I am just looking for a simple way to download a data set that contains all TSS coordinates (not the ORF coordinates).

                              Comment

                              • dpryan
                                Devon Ryan
                                • Jul 2011
                                • 3478

                                #30
                                Originally posted by rebrendi:
                                You mean that the former is created automatically plus manually, and the latter is created manually? OK, but that does not answer my question.

                                I cannot check every individual gene the way I did in the example above, which is how I found out that the S. cerevisiae genes are annotated somewhat differently from the other species. I am just looking for a simple way to download a data set that contains all TSS coordinates (not the ORF coordinates).

                                Neither of them is manually created, and they draw on only partly overlapping datasets (well, one gets its data only from EMBL/GenBank). Your question regarding downloading TSS coordinates was already answered for mouse and human. Most other genomes are probably the same. Some, such as S. cerevisiae, aren't created by Ensembl and so could be different.

                                Unless you're downloading hundreds of genomes, it's not a problem to quickly check a couple genes to make sure the dataset is what you think it is. That's a good thing to do anyway for any dataset you don't produce yourself. Frankly, you could have done that between when you wrote your last message and my reply.
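A quick check of this kind could be sketched as below: for a handful of transcripts, compare the transcript bounds with the coding-region bounds and see whether anything lies outside the ORF. The function name `annotated_utrs`, the row layout and the example coordinates are all hypothetical. Real values would come from whichever export you are checking.

```python
def annotated_utrs(tx_start, tx_end, cds_start, cds_end, strand):
    """Return (has_5prime_utr, has_3prime_utr) for one transcript.

    All coordinates are genomic with start <= end; strand is 1 or -1.
    """
    low_side = cds_start > tx_start   # untranslated bases below the CDS
    high_side = tx_end > cds_end      # untranslated bases above the CDS
    # On the minus strand the 5' end is the high-coordinate side.
    return (low_side, high_side) if strand == 1 else (high_side, low_side)


# Hypothetical rows: (name, tx_start, tx_end, cds_start, cds_end, strand).
rows = [
    ("tx_with_utrs", 1000, 5000, 1200, 4800, 1),
    ("tx_orf_only", 7000, 9000, 7000, 9000, -1),
]
for name, *coords in rows:
    print(name, annotated_utrs(*coords))
```

A transcript that comes back as `(False, False)` has bounds identical to its ORF, which is the pattern to watch for in imported annotations.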

                                Comment
