  • pauboher
    Junior Member
    • Apr 2013
    • 5

    Clustering annotated sequences based on their GO terms

    Dear all,

    I have a set of 10000 sequences from an RNAseq experiment, annotated with GO terms. I would like to cluster the sequences into biologically meaningful groups using the GO term information for each sequence. Is there any software to do that?

    Pau

    Thank you!
  • Apexy
    Member
    • Apr 2011
    • 62

    #2
    Hi,
    Out of curiosity, are these (10000) sequences already clustered (de novo clustering), or do they represent only the sequences with GO terms from your assembly?

    I had written down something exactly like this on my to-do list. I bookmarked this page to explore in the future. I have not used it, but it might help:


    The grouping algorithm is based on the hypothesis that similar annotations should have similar gene members.

    HTH


    • pauboher
      Junior Member
      • Apr 2013
      • 5

      #3
      Hi Apexy,

      No, I have not yet clustered the sequences. I have just annotated the sequences obtained from the assembly using Blast2go. Now I would like to go one step further and cluster the sequences based on their GO terms, in order to obtain groups of genes involved in similar functions.
      Thanks, I have already had a look at the DAVID website. I think it could be a good option, but the website only accepts 3000 sequences at a time and I would like to cluster all the sequences at the same time. I will keep on searching for alternative websites.

      Thank you for your answer!!

      Pau


      • rhinoceros
        Senior Member
        • Apr 2013
        • 372

        #4
        You could probably write some small bash script. What kind of separators are you using in your headers? Which field is GO? Are there line-breaks in your sequences?
        savetherhino.org


        • Apexy
          Member
          • Apr 2011
          • 62

          #5
          Hi,
          It is better that you have not done any clustering on them before annotation, since de novo clustering sometimes assigns different transcripts from paralogous genes to the same cluster, and for species with extensive gene duplication this can be a nightmare. Do these functional labels come from annotation transfer with BLAST, with InterPro, or both in Blast2go? Can I also ask what fraction of the entire assembly these 10,000 sequences represent, and which database Blast2go was set to if you used BLAST?

          @rhinoceros, a cluster should be defined by the degree of overlap in the GO terms shared by sequences. This certainly introduces a new challenge: what threshold of shared GO terms is required to put sequences in one cluster? Do you mean using cat, cut, sort and grep in a loop to write a clustering algorithm?
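To make the overlap idea concrete, here is a minimal Python sketch rather than an established tool. It assumes the overlap is measured with the Jaccard index and that sequences are linked by single linkage whenever the index reaches a chosen threshold; both choices, the function names, and the toy annotations are illustrative assumptions, not something proposed in this thread.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared GO terms divided by all GO terms in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_go(annotations, threshold=0.5):
    """Single-linkage grouping: link two sequences when their GO-term
    Jaccard index is at least `threshold`, then return the connected groups."""
    parent = {seq: seq for seq in annotations}

    def find(x):
        # Follow parent pointers to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s1, s2 in combinations(annotations, 2):
        if jaccard(annotations[s1], annotations[s2]) >= threshold:
            parent[find(s1)] = find(s2)

    groups = {}
    for seq in annotations:
        groups.setdefault(find(seq), []).append(seq)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy input: sequence ID -> set of GO terms.
annotations = {
    "seq1": {"GO:0006412", "GO:0003735", "GO:0005840"},
    "seq2": {"GO:0006412", "GO:0003735"},
    "seq3": {"GO:0006810", "GO:0016020"},
    "seq4": {"GO:0006810", "GO:0016020", "GO:0005215"},
    "seq5": {"GO:0008152"},
}

clusters = cluster_by_go(annotations, threshold=0.5)
for members in clusters:
    print(members)
```

The threshold is exactly the open question raised above: at 0.5 the toy data gives two pairs and one singleton, while at 0.9 every sequence stays alone. The all-pairs loop is quadratic, so 10,000 sequences mean roughly 50 million comparisons.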

          Thanks,


          • rhinoceros
            Senior Member
            • Apr 2013
            • 372

            #6
            Originally posted by Apexy:
            Hi,
            @rhinoceros, a cluster should be defined by the degree of overlap in the GO terms shared by sequences. This certainly introduces a new challenge: what threshold of shared GO terms is required to put sequences in one cluster? Do you mean using cat, cut, sort and grep in a loop to write a clustering algorithm?
            I thought the aim was to sort sequences so that in file Z there would be all the sequences that had GO X in their header. It's not really clustering at all but sorting. But anyway, maybe I misunderstood OP.
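The sorting described here could look roughly like the Python sketch below. The header layout (`>id|GO:...;GO:...`) is purely an assumption, since the actual separators and fields were never specified in this thread, and `group_by_go` is a hypothetical helper name. It returns the groups in memory instead of writing one file per GO term.

```python
import re
from collections import defaultdict

# GO identifiers are "GO:" followed by seven digits.
GO_PATTERN = re.compile(r"GO:\d{7}")


def group_by_go(fasta_text):
    """Map each GO term to the FASTA records whose header mentions it."""
    groups = defaultdict(list)
    header, seq_lines = None, []

    def flush():
        # Store the record collected so far under every GO term in its header.
        if header is None:
            return
        record = header + "\n" + "".join(seq_lines)
        for term in sorted(set(GO_PATTERN.findall(header))):
            groups[term].append(record)

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, seq_lines = line, []
        elif line:
            seq_lines.append(line)  # joins sequences that span several lines
    flush()
    return dict(groups)


# Hypothetical input; the header format is an assumption.
fasta = """>seq1|GO:0006412;GO:0003735
ATGGCC
TTAA
>seq2|GO:0006412
ATGCCC
>seq3
ATGAAA
"""

groups = group_by_go(fasta)
for term, records in sorted(groups.items()):
    print(term, [r.split("\n")[0] for r in records])
```

Note that a sequence carrying several GO terms lands in several groups, and a sequence without any GO term in its header is dropped.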
            Last edited by rhinoceros; 04-29-2013, 02:55 AM.
            savetherhino.org


            • Apexy
              Member
              • Apr 2011
              • 62

              #7
              Originally posted by rhinoceros:
              I thought the aim was to sort sequences so that in file Z there would be all the sequences that had GO X in their header. It's not really clustering at all but sorting. But anyway, maybe I misunderstood OP.
              This would have been an appealing solution if each sequence had only one GO term.


              • pauboher
                Junior Member
                • Apr 2013
                • 5

                #8
                Hi Apexy and rhinoceros, thank you for your information. Yes, Apexy is right in the sense that each sequence has more than one GO term, and this makes the process more complex. The annotations come from GO terms, motifs (InterProScan) and enzyme codes. All of them came from the top 10 hits of a blastx search against the nr database from NCBI, with a threshold of 10e-6.
                From 16000 sequences I got significant BLAST hits for 14000 sequences. Then for these sequences I performed the different annotation steps and I got around 10000 annotated. Now, as you say, I want to cluster these 10000 sequences using the information coming from the annotations. I tried DAVID and BABELOMICS, but they have some limitations on the number of sequences they can run each time. I was wondering if there could be any program based on R or UNIX to do that locally...
