**rkneue** · 01-08-2014, 01:13 AM

Maybe the best question is: will a tool like this be useful for genome researchers? Are there any other tools or frameworks you need? We want to build something really useful for the scientific community, and you all are the scientific community, so let's go with your suggestions and ideas!

**gringer** · 01-08-2014, 12:15 PM

we would like to ask you all for suggestions, ideas, and what you will expect from a tool like this.

A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.
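As a rough illustration of "all matches with some allowance for error", here is a minimal sketch. It assumes the error allowance means a maximum number of substitutions (no insertions or deletions), and it scans the target naively, so it shows the idea only and would not scale to large databases. The function name `find_approximate` and the parameter `max_mismatches` are made up for this example.

```python
def find_approximate(query, target, max_mismatches=1):
    """Return 0-based start positions where query matches target
    with at most max_mismatches substitutions (no indels)."""
    if not query:
        return []
    hits = []
    m = len(query)
    for start in range(len(target) - m + 1):
        mismatches = 0
        for q, t in zip(query, target[start:start + m]):
            if q != t:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many errors, give up on this window
        else:
            # Loop finished without break: window is within the allowance.
            hits.append(start)
    return hits


print(find_approximate("ACGT", "TTACGTTTACCTAA", max_mismatches=1))
```

Real tools avoid this window-by-window scan by indexing the database first, which is where the scaling question comes in.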

**biznatch** · 01-08-2014, 01:57 PM

Originally posted by gringer

A search tool. Given an arbitrary sequence, find all matches to that sequence with some allowance for error.

Like BLAST?

**mattanswers** · 01-08-2014, 03:56 PM

Arabidopsis

Will it include Arabidopsis data?

**gringer** · 01-08-2014, 04:10 PM

Originally posted by biznatch

Like BLAST?

Yes, a bit like BLAST, but it will need a substantially altered algorithm to work with the massive amounts of sequence data that would be in the database described by rkneue. BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.

**usad** · 01-08-2014, 08:01 PM

Hi
so if I understand it correctly: a) a gene annotation pipeline and b) a compendium of well-evaluated data?

a) I would be careful, because annotation is not annotation (you might want to look deeply into the Evidence Code Ontology, ECO), and there are pipelines/tools like this that also partially take ECO codes, or simple GO codes, into account (if this is what you meant). We do something not too dissimilar ourselves for plants (Mercator). And there is the whole field of phylogenomics.
I'd rather settle for useful information than the information overload that you are now exposed to, like "expressed in 50 tissues", or rather "showing a signal on some chips in these", which is effectively non-information (plant researchers will likely know what I mean). Also, the whole "similar to a protein shown to be similar to..." is not always helpful and can be misleading. (Coming from the plant side, neuronal and angiogenesis proteins are always ---- interesting and a good example.)

b) Not in the ChIP-seq field (yet???), but Genevestigator collects expression data and is quite nice. BUT it is not open source.

Cheers
björn
PS Hope this helped and was not completely off topic

**rkneue** · 01-09-2014, 02:27 AM

Hi all, and thank you for your replies.
gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.

mattanswers: Once the core is properly written, adding new organisms will not be a problem, so ideally, my answer is yes.

usad: a) Not exactly what I meant. We are not trying to build a gene annotation pipeline, but a comprehensive annotation of "already annotated" genes across different databases. For example, a gene X may be annotated as NR_000001 in RefSeq with a single isoform, ENSG00000000001 in Ensembl with multiple isoforms, not annotated in VEGA, annotated in LNCipedia as XXXXX, etc.
Providing an automatically updatable cross-referencing database of gene annotations could be really useful, since in most cases finding the correspondence between different databases is a really annoying task.
b) Yeah, Genevestigator may be an idea... but yes, it's commercial.
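To make the cross-referencing idea in a) concrete, here is a minimal in-memory sketch. It assumes each gene is stored as one record mapping database names to identifiers, with `None` meaning "not annotated there". The class `GeneXref` and its methods are hypothetical names for illustration, not part of any existing tool, and the identifiers are the placeholder ones from the example above.

```python
class GeneXref:
    """Toy cross-reference table: look up a gene by any known identifier."""

    def __init__(self):
        self._by_id = {}  # identifier -> shared record for that gene

    def add_gene(self, ids_by_database):
        """Register one gene given a {database: identifier or None} mapping."""
        record = dict(ids_by_database)
        for identifier in record.values():
            if identifier is not None:
                self._by_id[identifier] = record

    def translate(self, identifier, database):
        """Return the gene's identifier in `database`, or None if the
        identifier is unknown or the gene is not annotated there."""
        record = self._by_id.get(identifier)
        if record is None:
            return None
        return record.get(database)


xref = GeneXref()
xref.add_gene({
    "RefSeq": "NR_000001",         # placeholder ID from the post
    "Ensembl": "ENSG00000000001",  # placeholder ID from the post
    "VEGA": None,                  # not annotated in VEGA
})
print(xref.translate("NR_000001", "Ensembl"))
```

The hard part in practice is not the lookup but keeping the records in sync as each source database releases new versions, which this sketch does not address.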

**gringer** · 01-09-2014, 04:07 AM

gringer: What you'd like to do can be done easily with the BLAT algorithm. BLAT is much faster than BLAST, and allows mismatches and spliced mapping.

Not really. BLAT still has the indexing problem at its core: everything that is in the database needs to be indexed (at least for subsequences) at a compression level of around 4X (e.g. 2bit encoding). The speed of the actual search is irrelevant if the database cannot be indexed for the search to be carried out.
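For anyone unfamiliar with where the "around 4X" comes from: a base stored as one text character takes 8 bits, while four possible bases only need 2 bits each. The sketch below packs four bases per byte. The bit assignment (A=0, C=1, G=2, T=3) is an arbitrary choice for this example and is not claimed to match any particular file format; ambiguous bases such as N are not handled.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary mapping for this sketch
BASES = "ACGT"


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 2 * (3 - i % 4)  # first base goes in the top two bits
        packed[i // 4] |= CODES[base] << shift
    return bytes(packed)


def unpack_2bit(packed, length):
    """Recover the first `length` bases from a packed byte string."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (3 - i % 4))) & 3]
        for i in range(length)
    )


seq = "ACGTACGTA"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
assert unpack_2bit(packed, len(seq)) == seq
```

At this density, 10^15 bases (one petabase) still occupy 2.5 × 10^14 bytes, i.e. 250 TB, before any index structures are added on top.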

**rkneue** · 01-09-2014, 07:23 AM

I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?

**usad** · 01-09-2014, 09:54 AM

Ah ok, I see. Yeah, different names for the same thing are a major bummer. But instead of a data warehouse concept, maybe you could realize the same thing by using some AJAXian data collector that does this on the fly when the user queries the data?
Many years ago BioMOBY allowed such aggregating services. Of course, the problem with this approach is that of the weakest link.

b

**gringer** · 01-09-2014, 08:28 PM

Originally posted by gringer

BLAST currently works fairly well on many gigabases of sequence data. I don't expect it will have the same success on terabases or petabases of sequence data.

Originally posted by rkneue

I don't really understand in which cases you cannot index a database. What kind of sequence search are you interested in?

A sequence database for the "genomic/epigenomic data from the billions of informations actually available on internet". You will need to index a few petabases of sequence data for that to happen, and I don't expect that either BLAST or BLAT will work well for that.

**rkneue** · 01-10-2014, 07:57 AM

We don't expect to work with sequences in that case. Working with genomic coordinates is the best choice, since you can extract sequences on the fly from an indexed multi-fasta in a few ms.
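A minimal sketch of what "extract sequences on the fly from an indexed multi-fasta" can look like. It assumes a well-formed FASTA file in which every sequence line of a record has the same width (except the last) and there are no blank lines. The index stores, per record, the byte offset of the first base and the line layout, which is the same idea as the `.fai` index written by `samtools faidx`. The function names `build_index` and `fetch` are made up for this example.

```python
def build_index(path):
    """Scan a FASTA file once and record where each sequence starts."""
    index = {}
    name = None
    offset = 0  # byte offset of the current line
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = {"length": 0, "offset": offset + len(line),
                               "line_bases": None, "line_bytes": None}
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry["line_bases"] is None:
                    entry["line_bases"] = bases      # bases per full line
                    entry["line_bytes"] = len(line)  # including newline
                entry["length"] += bases
            offset += len(line)
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of record `name` (0-based, half-open)."""
    entry = index[name]
    end = min(end, entry["length"])
    if start < 0 or start >= end:
        return ""
    lb, lby = entry["line_bases"], entry["line_bytes"]
    first = entry["offset"] + (start // lb) * lby + start % lb
    last = entry["offset"] + ((end - 1) // lb) * lby + (end - 1) % lb + 1
    with open(path, "rb") as handle:
        handle.seek(first)  # jump straight to the first requested base
        chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

Usage would be along the lines of `index = build_index("genome.fa")` followed by `fetch("genome.fa", index, "chr1", 1000, 1050)`, where the file and record names are placeholders. Only the requested bytes are read, so the lookup cost does not depend on the size of the file.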

**kredens** · 04-28-2014, 04:40 AM

what about compression?

I mean, random-access compressed information...

The missing tool in bioinformatics