The trouble with any non-centralised solution is consistency. Databases like Ensembl and UCSC have been around for a long time, as have the sequence archives (and long may they continue). How many labs have the resources to put up their data pretty much forever? I suspect the answer is very few.
-
Originally posted by Nix:
The problem with both of these approaches is the lack of a formal language to associate annotation with the data. Species, genome version, and file format are often anyone's guess. Every group encodes this information differently, so making use of such data requires manual download and interpretation.
Don't believe me? Try intersecting ChIP-seq peak calls from 20 different datasets hosted by UCSC, Ensembl, and GMOD. That's at minimum one day of work by a skilled bioinformatician. Now do it again and again with each new release. Analysis applications that speak DAS can do this in minutes with no human error.
DAS provides the "glue language" (in XML) for brokering queries and returning slices of data based on coordinates or feature searches. It is defined by a community-vetted specification with several server and client implementations. Among other things, these define how you represent each species, its genome, and its coordinate system.
Another key advantage of DAS is that it allows one to separate data distribution from data use. I love the UCSC data tracks but can't always use their browser to analyze them.
I think we'd be lucky if labs even provided more than a list of variants if they were forced to host data themselves. That's the whole reason for having a centralized host. Most people these days know how to deal with FTP or HTTP, so that's what we'd end up with.
SRA was a good idea, but with a clunky implementation. A group, government or academic, ought to pick up the mantle and host the world's genomic data. Hey, it could be you. Host it all on a DAS/2 server.
Comment
-
central solution
This is where I wish use of our institutional libraries would come to mind. They have and maintain the long-term academic infrastructure and understand inter-institutional standards (and publishing standards) well. Why can't institutional biological repositories that host open-access scientific research also include next-gen sequencing data? If a set of deposit and access standards were worked out and agreed upon (as is now being discussed in this thread, and which could be the subject of a collaborative grant proposal for start-up funding), library consortia could be forged and linked up to handle specialized sequence data deposit and access, large and small. This is truly the kind of non-profit academic research purpose that serves the good and advancement of society and science. It builds on our existing, traditional academic resources.
Comment
-
Originally posted by Joann:
This is where I wish use of our institutional libraries would come to mind....
For those "centralists" in the group: in theory one can pull data off one DAS server and host it on one's own. Thus a centralized DAS server could be built that continuously updates its repository from other DAS servers. This kind of defeats the purpose of a Distributed Annotation System, though.
Large groups (Institutions, Universities, Journals, NIH (if they can fund it)) would be the best final repositories for genomic data. The SRA was swamped almost from the start. I think the only way to keep up with the deluge is to distribute the data.
Comment
-
Can DAS or another off-the-shelf system address the concern for security? US government-sponsored research carries some weighty patient privacy restrictions. I'm not sure that dishing up BAMs via FTP by just "setting up something as simple as installing apache" is going to work. I'd like it to, but I'm thinking someone's going to say "no go". I hope there's an easing of the "must lock down data; only high priests can even think about looking at the data" mentality. But ... we're not there yet.
Last edited by Richard Finney; 02-19-2011, 10:47 AM.
Comment
-
The ENA will continue to accept open access data, and the EGA will continue to accept human data with consent agreements and Data Access Committees. Let's hope these changes don't stop people releasing data into the public domain. Certainly all the 1000 Genomes data will remain freely available to everyone.
Comment
-
For international database collaborations (INSDC)
Quoting directly from the ENA web page:
"The European Nucleotide Archive (ENA) accepts data generated by next-generation sequencing methodologies such as 454, Illumina Genome Analyzer and ABI SOLiD into the Sequence Read Archive (SRA). ENA works in close collaboration with the NCBI and DDBJ as part of the International Nucleotide Sequence Database Collaboration (INSDC). All submitted public data is exchanged between the partners on a daily basis. All three partners use the same data and metadata formats.
For all questions and enquiries please contact [email protected]."
Comment
-
Originally posted by Nix:
Why aren't people hosting and publishing their own data? There's no need to centralize this activity.
You are responsible for providing the plasmids, cell lines, etc. used in your papers, so why not the genomic data too, in both its raw and processed form?
The question was met with pretty much universal dismay and disagreement. I later realised why: it costs money, time and effort to host data. NCBI were promising to do this for everyone, essentially solving all their problems. Why would you agree to take the hard route of storing it yourself (and, harder still, agree with all the other centres to do it in a uniform manner) when NCBI will take the data off your hands for free and do all the hard work for you?
Comment
-
Originally posted by Richard Finney:
Can DAS or another off-the-shelf system address the concern for security?
As for getting NCBI to take care of the problem, good luck. The SRA was swamped from day one. I doubt SRA 2.0 can do any better without a significant increase in resources, which currently are slated to go to the Dept of Defense (4.7% budget increase for 2012!).
If the ENA is willing to host all of the US data, great, but if I'm not mistaken they still don't provide a programmatic way of accessing analysis results (BAM files, variant calls, enrichment tracks, etc.). Neither did the SRA, for that matter.
I believe our scientific community can do better.
Comment
-
To me, the easiest way to access data is a hierarchical FTP/HTTP directory containing all the fastq/sra/bam files, with a top-level tab-delimited file briefly describing each file: the batch, species if applicable, type of data (metagenomics, RNA-seq, ChIP-seq, targeted, exome, whole-genome sequencing), sample name, number of sequences, average read length, barcode length, etc. (possibly also submission date). This is similar to what the 1000g project provides, but probably a little more comprehensive. A single XML file would also be fine, though I am happier with a tab-delimited file.
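To make the idea concrete, here is a rough Python sketch of how a client might read such a top-level tab-delimited file and pick out the files it wants. The column names and file paths below are made up for illustration; they are not taken from the 1000g index or from any agreed standard.

```python
import csv
import io

# Hypothetical manifest: the columns and paths are illustrative only.
MANIFEST = (
    "file\tspecies\tdata_type\tsample\tnum_reads\tavg_read_len\n"
    "batch1/s1.bam\tHomo sapiens\tChIP-seq\tS1\t1200000\t36\n"
    "batch1/s2.fastq\tHomo sapiens\tRNA-seq\tS2\t5400000\t75\n"
    "batch2/s3.bam\tMus musculus\tChIP-seq\tS3\t900000\t36\n"
)


def load_manifest(text):
    """Parse the tab-delimited manifest into a list of dicts, one per file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Convert the numeric columns so they can be filtered or summed.
        row["num_reads"] = int(row["num_reads"])
        row["avg_read_len"] = float(row["avg_read_len"])
        rows.append(row)
    return rows


def select(rows, **criteria):
    """Return the rows whose columns equal every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


rows = load_manifest(MANIFEST)
chip_files = [r["file"] for r in select(rows, data_type="ChIP-seq")]
print(chip_files)
```

In practice the manifest text would be fetched from the top of the FTP/HTTP directory rather than embedded in the script.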
Comment
-
People seem to be arguing different points here, although possibly it indicates the problem SRA tried to solve wasn't the primary issue the community faces. I'm not sure.
Anyway SRA was designed to store the primary data. That was originally trace files, but later just the raw calls and confidence values. The purpose was to allow any analysis to be rerun on the input data so we can reproduce results or "upgrade" results by using a newer set of analysis tools.
More recently the discussions here seem to be centred around storage and retrieval of analysis results: aligned BAM files, SNP VCF files, etc. Heng has been involved in a variety of formats here to tackle such things (BAM, tab-delimited "tabix" indexed files, etc.). Samtools also has existing code to download portions of BAM files on the fly via HTTP or FTP for any specific region, so it works neatly with existing web protocols. I'm unsure of security and SSL concerns, but either way it's a solid start.
Obviously the two scenarios aren't quite the same, but with careful consideration perhaps they can be merged. For example, if the aligned BAM files contain all qualities (there have been discussions about only storing qualities for sites that differ from the reference), mark duplicates via flags rather than removal, have the original qualities instead of recalibrated ones, and store all unmapped reads, then and only then can we extract the primary data back from the aligned BAM.
Is it worth it? Perhaps not. Sorted BAM is an ok format for storing primary data (as it's relatively compact, although not the best, and well understood), but some groups will want to do so much processing of their bams they'll need a second copy anyway.
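As a rough illustration of what getting the primary data back from an aligned file involves, here is a minimal Python sketch that turns one SAM text line back into a FASTQ record. It assumes the conditions listed above hold, that no bases were hard-clipped, and that original qualities, if recalibrated, were kept in the OQ tag. The function name is made up, and pairing information (read 1 / read 2) is ignored for brevity.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_REVERSE = 0x10         # read is stored reverse-complemented
FLAG_SECONDARY = 0x100      # secondary alignment of a read
FLAG_SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(line):
    """Return a FASTQ record for one SAM line, or None if the line is a
    secondary or supplementary alignment (the read appears elsewhere)."""
    fields = line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return None
    # Prefer the original qualities if recalibration moved them to OQ.
    for tag in fields[11:]:
        if tag.startswith("OQ:Z:"):
            qual = tag[5:]
            break
    # Reads aligned to the reverse strand are stored reverse-complemented,
    # so undo that to recover what came off the machine.
    if flag & FLAG_REVERSE:
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Duplicate-flagged and unmapped reads are deliberately kept: they are
# part of the primary data even if downstream analysis ignores them.
example = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIHG\tOQ:Z:ABCD"
print(sam_line_to_fastq(example), end="")
```

Even with all of this, the original read order is lost once the file is sorted, which is one reason some groups may still want a second copy.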
Comment
-
My preference is to store raw reads (without trimming, alignment-based recalibration or duplicate removal) in the BAM format, in the order in which the reads come off the sequencing machines. Reads may optionally be mapped. I do see the advantages of keeping sorted alignments, but even if no information is lost, unsorted reads are more convenient if we want to redo the alignment. It would be good to keep both unsorted and sorted data, but this leads to duplication and may deviate from the intention of SRA.
Anyway, I agree with James that the most important goal of SRA is to store the primary data. Alignment without the loss of primary data, though preferred, comes only second. I do not think SNPs and other annotations should go into SRA. These are the objective of a third-party database, not SRA.
Last edited by lh3; 02-21-2011, 10:42 AM.
Comment
-
Short Read Archive reprieve!
Sequence Read Archive (SRA) is still in service.
Recently, NCBI announced that due to budget constraints, it would be discontinuing its Sequence Read Archive (SRA) and Trace Archive repositories for high-throughput sequence data. However, NIH has since committed interim funding for SRA in its current form until October 1, 2011. In addition, NCBI has been working with staff from other NIH Institutes and NIH grantees to develop an approach to continue archiving a widely used subset of next generation sequencing data after October 1, 2011.
We now plan to continue handling sequencing data associated with:
RNA-Seq, ChIP-Seq, and epigenomic data that are submitted to GEO
Genomic and Transcriptomic assemblies that are submitted to GenBank
Genomic assemblies to GenBank/WGS
16S ribosomal RNA data associated with metagenomics that are submitted to GenBank
In addition, NCBI will continue to provide access to existing SRA and Trace Archive data for the foreseeable future. NCBI is also continuing to discuss with NIH Institutes approaches for handling other next-generation sequencing data associated with specific large-scale studies.
Comment