  • rkneue
    Junior Member
    • Jan 2014
    • 5

    The missing tool in bioinformatics

    Hi to all SEQAnswers forum members.
    My name's Robert, and I'm a stem cell researcher and bioinformatics developer. My unit and I work daily with next-generation sequencing technologies (ChIP-Seq, RNA-Seq, RIP-Seq, Bis-Seq, and a couple of techniques we've developed), and we've found that in most cases (except for standardized procedures such as read mapping), writing our own custom tools works better than relying on third-party software.
    Now we're planning to start an ambitious open-source project. The aim is to develop a tool/framework that enables the ordered, non-redundant integration of genomic/epigenomic data from the billions of records currently available on the internet. First, we plan to create a unique and complete gene annotation by extensive cross-referencing between all the annotations currently available (e.g. ENSEMBL, NCBI, VEGA, etc.). Next, we would like to enable the integration and handling of most ChIP-Seq data available from ENCODE and GEO Datasets, with quality checks to discard all low-quality datasets (which are currently very abundant).
    Now, before starting the work, we would like to ask you all for suggestions, ideas, and what you would expect from a tool like this.
    If you can spend a couple of minutes to help us (helping you), we would really appreciate it.
    All my best

    Robert
  • rkneue
    Junior Member
    • Jan 2014
    • 5

    #2
    Maybe the better question is: would a tool like this be useful for genome researchers? Are there any other tools or frameworks you need? We want to build something really useful to the scientific community, and you all are the scientific community, so let's hear your suggestions and ideas!

    Comment

    • gringer
      David Eccles (gringer)
      • May 2011
      • 845

      #3
      we would like to ask you all for suggestions, ideas, and what you would expect from a tool like this.
      A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
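To make the request concrete, here is a minimal sketch of "find all matches with some allowance for error", assuming that "error" means substitutions only (Hamming distance) and that the target fits in memory. The function name `find_approx` and the `max_mismatches` parameter are illustrative, and a brute-force scan like this would not scale to the database sizes discussed later in the thread.

```python
def find_approx(query, target, max_mismatches=1):
    """Return (start, mismatches) for every window of target that matches
    query with at most max_mismatches substitutions (no indels)."""
    hits = []
    for start in range(len(target) - len(query) + 1):
        mismatches = 0
        for q, t in zip(query, target[start:start + len(query)]):
            if q != t:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # window exceeds the error budget, skip it
        else:
            # Loop finished without break: window is within the budget.
            hits.append((start, mismatches))
    return hits


# One exact hit at offset 2 and one single-mismatch hit at offset 8.
print(find_approx("ACGT", "TTACGTTTACCTAA", max_mismatches=1))
```

Supporting insertions and deletions would need an edit-distance method instead of this position-by-position comparison.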

      Comment

      • biznatch
        Senior Member
        • Nov 2010
        • 124

        #4
        Originally posted by gringer View Post
        A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
        Like BLAST?

        Comment

        • mattanswers
          Member
          • Oct 2009
          • 65

          #5
          Arabidopsis

          Will it include Arabidopsis data?

          Comment

          • gringer
            David Eccles (gringer)
            • May 2011
            • 845

            #6
            Originally posted by biznatch View Post
            Like BLAST?
            Yes, a bit like BLAST, but it will need a substantially altered algorithm to work with the massive amounts of sequence data that would be in the database described by rkneue. BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.
            Last edited by gringer; 01-08-2014, 04:12 PM.

            Comment

            • usad
              Member
              • Sep 2009
              • 53

              #7
              Hi
              so if I understand it correctly: a) a gene annotation pipeline and b) a compendium of well-evaluated data?

              a) I would be careful because annotation is not annotation (you might want to look deeply into the Evidence Code Ontology, ECO), and there are pipelines/tools like this that also partially take ECO codes, or simple GO codes, into account (if this is what you meant). We do something not too dissimilar ourselves for plants (Mercator). And there is the whole field of phylogenomics.
              I'd rather settle for useful information than the information overload you are now exposed to, like "expressed in 50 tissues", or rather showing a signal on some chips in these, effectively being non-information (plant researchers will likely know what I mean). Also, the whole "similar to a protein shown to be similar to..." is not really helpful at all times and can be misleading. (Coming from the plant side, neuronal and angiogenesis proteins are always ---- interesting and a good example)

              b) Not in the ChIP-Seq field (yet???), but Genevestigator collects expression data and is quite nice. BUT not open source.

              Cheers
              björn
              PS Hope this helped and was not completely off topic

              Comment

              • rkneue
                Junior Member
                • Jan 2014
                • 5

                #8
                Hi all, and thank you for your replies.
                gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.

                mattanswers: Once the core is properly written, adding new organisms will not be a problem, so ideally, my answer is yes.

                usad: a) Not exactly what I meant. We are not trying to build a gene annotation pipeline, but a comprehensive annotation of "already annotated" genes across different databases. For example, a gene X may be annotated as NR_000001 in RefSeq with a single isoform, ENSG00000000001 in Ensembl with multiple isoforms, not annotated in VEGA, annotated in LNCipedia as XXXXX, etc.
                Providing an automatically updatable cross-referencing database of gene annotations could be really useful, since in most cases finding the correspondence between different databases is a really annoying task.
                b) Yeah, Genevestigator may be an idea... But yes, it's commercial.
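One way to picture the cross-referencing idea in a) is as grouping identifiers that are linked to each other, directly or through a third database. The sketch below is only an illustration under a strong assumption: that pairwise "same gene" links between databases are already available. How those links would actually be derived is not specified in this thread. The function name `build_xref`, the `lncRNA_db` label and the second pair of IDs are hypothetical; the first IDs are the placeholders from the example above.

```python
from collections import defaultdict


def build_xref(links):
    """Group (database, identifier) pairs connected by pairwise links,
    using a small union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return list(groups.values())


# Placeholder links: gene X is reachable from every database that lists it.
links = [
    (("RefSeq", "NR_000001"), ("Ensembl", "ENSG00000000001")),
    (("Ensembl", "ENSG00000000001"), ("lncRNA_db", "XXXXX")),
    (("RefSeq", "NR_000002"), ("Ensembl", "ENSG00000000002")),  # hypothetical second gene
]
for group in build_xref(links):
    print(sorted(group))
```

Transitive grouping like this merges everything that shares any link, so one wrong link can fuse two genes; a real tool would probably need evidence checks before accepting a link.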

                Comment

                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.
                  Not really. BLAT still has the indexing problem at its core: everything that is in the database needs to be indexed (at least for subsequences) at a compression level of around 4X (e.g. 2bit encoding). The speed of the actual search is irrelevant if the database cannot be indexed for the search to be carried out.
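For readers unfamiliar with the "around 4X" figure: storing each base in 2 bits instead of one 8-bit character packs four bases per byte. The sketch below shows only that packing step, assuming a plain A/C/G/T alphabet; the code assignment is arbitrary, the function names are illustrative, and ambiguous bases such as N are deliberately not handled here.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment
BASES = "ACGT"


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


seq = "ACGTACGTACGTACGT"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
```

The sequence length has to be stored separately, because the last byte may be only partly filled.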

                  Comment

                  • rkneue
                    Junior Member
                    • Jan 2014
                    • 5

                    #10
                    I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?

                    Comment

                    • usad
                      Member
                      • Sep 2009
                      • 53

                      #11
                      Ah OK, I see; yeah, different names for the same thing is a major bummer. But instead of a data warehouse concept, maybe you could achieve the same thing by using some AJAXian data collector doing this on the fly when the user queries the data?
                      Many years ago Biomoby allowed such aggregating services. Of course the problem with this approach is that of the weakest link.

                      b

                      Comment

                      • gringer
                        David Eccles (gringer)
                        • May 2011
                        • 845

                        #12
                        Originally posted by gringer View Post
                        BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.
                        Originally posted by rkneue View Post
                        I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?
                        A sequence database for the "genomic/epigenomic data from the billions of records currently available on the internet". You will need to index a few petabases of sequence data for that to happen, and I don't expect that either BLAST or BLAT will work well for that.

                        Comment

                        • rkneue
                          Junior Member
                          • Jan 2014
                          • 5

                          #13
                          We don't expect to work with sequences in that case. Working with genomic coordinates is the best choice, since you can extract sequences on the fly from an indexed multi-fasta in a few ms.
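As an illustration of coordinate-based extraction, the sketch below builds a small per-sequence index (length, byte offset, bases per line, bytes per line) and uses it to seek straight to a region. It is a simplified stand-in for what indexed-FASTA tools do, not their implementation. It assumes every line of a sequence except the last has the same length, and the names `build_index` and `fetch` are illustrative.

```python
import io


def build_index(handle):
    """Scan a FASTA opened in binary mode and record, per sequence,
    [length, offset of first base, bases per line, bytes per line]."""
    index, name, pos = {}, None, 0
    for line in handle:
        next_pos = pos + len(line)
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = [0, next_pos, 0, 0]
        elif name is not None:
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:  # first sequence line defines the layout
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos = next_pos
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) of one sequence via seek()."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + (end // line_bases) * line_width + end % line_bases
    handle.seek(first)
    raw = handle.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Tiny in-memory multi-FASTA standing in for a file on disk.
fasta = io.BytesIO(b">chr1 demo\nACGTACGT\nACGTAC\n>chr2\nTTTTGGGG\nCC\n")
idx = build_index(fasta)
print(fetch(fasta, idx, "chr1", 6, 10))
```

Only the requested bytes are read, which is why lookups stay fast regardless of file size, provided the file sits on storage that supports cheap seeks.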

                          Comment

                          • kredens
                            Junior Member
                            • Apr 2014
                            • 1

                            #14
                            what about compression?

                            I mean, random access compressed information...
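One common way to get random access into compressed data is to compress fixed-size blocks independently and keep a table of where each compressed block starts, so a query only decompresses the blocks it touches. The sketch below shows that idea with the standard `zlib` module. The block size and the function names are arbitrary choices for illustration, and the in-memory layout is not any particular file format.

```python
import zlib


def compress_blocks(data, block_size=65536):
    """Compress data in independent blocks; return (payload, offsets),
    where offsets[k] is the start of compressed block k in payload."""
    blobs = [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    return b"".join(blobs), offsets


def read_range(payload, offsets, start, end, block_size=65536):
    """Return uncompressed bytes [start, end), decompressing only the
    blocks that overlap the requested range."""
    if start >= end or len(offsets) < 2:
        return b""
    first = start // block_size
    last = min((end - 1) // block_size, len(offsets) - 2)
    chunk = b"".join(
        zlib.decompress(payload[offsets[k]:offsets[k + 1]])
        for k in range(first, last + 1)
    )
    skip = start - first * block_size
    return chunk[skip:skip + (end - start)]


data = b"ACGT" * 1000
payload, offsets = compress_blocks(data, block_size=256)
print(read_range(payload, offsets, 250, 270, block_size=256))
```

Smaller blocks mean less wasted decompression per query but a worse compression ratio, so the block size is a trade-off to tune.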

                            Comment
