**laura** · 02-18-2011, 02:35 PM

The trouble with any non-centralised solution is consistency. Databases like Ensembl and UCSC have been around for a long time, as have the sequence archives (and long may they continue). How many labs have the resources to put up their data pretty much forever? I suspect the answer to that question is very few.

**Michael.James.Clark** · 02-18-2011, 03:00 PM

Originally posted by Nix

The problem with both of these approaches is the lack of a formal language to associate annotation with the data. Species, genome version, file format are often anyone's guess. Every group encodes this information differently. Thus making use of such data requires manual download and interpretation.

Don't believe me? Try intersecting ChIP-seq peak calls from 20 different datasets hosted by UCSC, Ensembl, and GMOD. That's at minimum 1 day of work by a skilled bioinformatician. Now do it again and again with each new release. Analysis applications that speak DAS can do this in minutes with no human error.

DAS provides the "glue language" (in XML) for brokering queries and returning slices of data based on coordinates or feature searches. It is defined by a community vetted specification with several server and client implementations. Among other things these define how you represent each species, their genome, and their coordinate system.
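As a rough illustration of what a coordinate-based DAS query looks like, here is a minimal Python sketch that builds a DAS 1.x `features` request URL for one genomic range. The server address and data source name are placeholders rather than real endpoints, and the helper name `das_features_url` is invented for this example.

```python
from urllib.parse import quote


def das_features_url(server, source, segment, start, stop):
    """Build a DAS 1.x 'features' request for one coordinate range.

    DAS 1.x addresses a slice of a reference sequence as
    <server>/das/<source>/features?segment=<id>:<start>,<stop>
    """
    if start > stop:
        raise ValueError("start must not exceed stop")
    base = server.rstrip("/")
    return (
        f"{base}/das/{quote(source)}/features"
        f"?segment={quote(str(segment))}:{int(start)},{int(stop)}"
    )


# Hypothetical server and source, for illustration only.
url = das_features_url("http://das.example.org", "hg19", "chr1", 1000, 2000)
print(url)
```

A client would fetch that URL and parse the XML response. Because every server answers the same query shape, the same client code can be pointed at many data sources.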

Another key advantage of DAS is that it allows one to separate data distribution from data use. I love the UCSC data tracks but can't always use their browser to analyze them.

Honestly, if I have to choose between setting up and maintaining my own server and uploading my data to a centralized depository, there really is no choice at all.

I think we'd be lucky if labs would even provide you with more than a list of variants if they were forced to host data themselves. That's the whole reason for having a centralized host. Most people these days know how to deal with FTP or HTTP, so that's what we'd end up with.

SRA was a good idea, but with a clunky implementation. A group, government or academic, ought to pick up the mantle and host the world's genomic data. Hey, it could be you. Host it all on a DAS/2 server.

**Joann** · 02-18-2011, 03:04 PM

central solution

This is where I wish use of our institutional libraries would come to mind. They have and maintain the long term academic infrastructure and well understand inter-institutional standards (also publishing standards). Why can't institutional biological repositories hosting for open access scientific research purposes include next gen sequencing data? If a set of deposit and access standards were worked out and agreed upon (as is now being discussed in this thread) (and that could be the beneficiary of a collaborative grant proposal for start-up funding) library consortia could be forged and linked-up to accomplish specialized sequence data deposit/access, large and small. This is truly the kind of non-profit academic research purpose that is for the good and advancement of society and science. It enhances and builds upon using our existing, traditional academic resources.

**Nix** · 02-18-2011, 03:37 PM

Originally posted by Joann

This is where I wish use of our institutional libraries would come to mind....

Funny you should mention this. I've shared many a beer with librarians over a campfire and they are very keen on doing just this sort of thing. They are way into the proper annotation and curation of data too.

For those "centralists" in the group... In theory one can pull data off one DAS server and host it on one's own. Thus a centralized DAS server could be built that continuously updates its repository from other DAS servers. This kind of defeats the purpose of a Distributed Annotation System, though.

Large groups (Institutions, Universities, Journals, NIH (if they can fund it)) would be the best final repositories for genomic data. The SRA was swamped almost from the start. I think the only way to keep up with the deluge is to distribute the data.

**laura** · 02-19-2011, 07:46 AM

The NCBI is part of the United States National Library of Medicine.

**Richard Finney** · 02-19-2011, 10:07 AM

Can DAS or another off-the-shelf system address the concern for security? US government sponsored research has some weighty patient privacy restrictions. I'm not sure that dishing up BAMs via FTP by just "setting up something as simple as installing apache" is going to work. I'd like it to, but I'm thinking someone's going to say "no go". I hope there's an easing of the "must lock down data; only high priests can even think about looking at the data" mentality. But ... we're not there yet.

**laura** · 02-19-2011, 02:56 PM

The ENA will continue to accept open access data, and the EGA will continue to accept human data with consent agreements and Data Access Committees. Let's hope these changes don't stop people releasing data into the public domain. Certainly all the 1000 Genomes data will remain freely available for everyone.

**Joann** · 02-20-2011, 04:25 AM

For international database collaborations (INSDC)

Quoting directly from the ENA web page:

"The European Nucleotide Archive (ENA) accepts data generated by next-generation sequencing methodologies such as 454, Illumina Genome Analyzer and ABI SOLiD into the Sequence Read Archive (SRA). ENA works in close collaboration with the NCBI and DDBJ as part of the International Nucleotide Sequence Database Collaboration (INSDC). All submitted public data is exchanged between the partners on a daily basis. All three partners use the same data and metadata formats.

For all questions and enquiries please contact [email protected]."

**jkbonfield** · 02-21-2011, 01:47 AM

Originally posted by Nix

Why aren't people hosting and publishing their own data? There's no need to centralize this activity.

You are responsible for providing plasmids, cell lines, etc. used in your papers. Why not the genomic data too, in both its raw and processed form?

I was at a meeting at NCBI a few years back, before SRA got off the ground, to discuss how it should all work and explain things to the main sequencing centres. I dared to ask why people didn't want a federated service instead, or at least a central store for metadata with redirections to the labs' own stores of data.

The question was met with pretty much universal dismay and disagreement. I later realised why - it costs money, time and effort to host data. NCBI were promising to do this for everyone, essentially solving all their problems. Why would you agree to taking the hard route of storing it yourself (and harder still agreeing with all the other centres to do it in a uniform manner) when NCBI will take the data off your hands for free and do all the hard work for you?

**Nix** · 02-21-2011, 08:39 AM

Originally posted by Richard Finney

Can DAS or another off-the-shelf system address the concern for security?

DAS servers are web apps and as such can take advantage of the same security protocols worked out for banks and hospitals (ssl, https, digest authentication, vpn). With our GenoPub server, visibility of each dataset is set to either the owner, the lab, particular defined collaborators of the lab, the institute, or the public.
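To make the visibility levels concrete, here is a small Python sketch of how such a per-dataset check might look. It is not GenoPub's actual code: the class names, the `can_view` helper, and the assumption that each level includes the narrower ones (for example, lab members can also see collaborator-visible data) are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    OWNER = "owner"
    LAB = "lab"
    COLLABORATORS = "collaborators"
    INSTITUTE = "institute"
    PUBLIC = "public"


@dataclass
class User:
    name: str
    lab: str
    institute: str


@dataclass
class Dataset:
    owner: str
    lab: str
    institute: str
    visibility: Visibility
    collaborators: set = field(default_factory=set)


def can_view(user, ds):
    """Return True if `user` (None for an anonymous visitor) may see `ds`."""
    if ds.visibility is Visibility.PUBLIC:
        return True
    if user is None:
        return False
    if user.name == ds.owner:
        return True
    if ds.visibility is Visibility.OWNER:
        return False
    same_lab = user.lab == ds.lab and user.institute == ds.institute
    if ds.visibility is Visibility.LAB:
        return same_lab
    if ds.visibility is Visibility.COLLABORATORS:
        return same_lab or user.name in ds.collaborators
    if ds.visibility is Visibility.INSTITUTE:
        return user.institute == ds.institute
    return False
```

In a real deployment this check would sit behind an authenticated HTTPS session, so the transport security mentioned above and the per-dataset rule work together.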

As for getting NCBI to take care of the problem: good luck. The SRA was swamped from day one. I doubt SRA 2.0 can do any better without a significant increase in resources, which currently are slated to go to the Dept of Defense (4.7% budget increase for 2012!).

If the ENA is willing to host all of the US data, great, but if I'm not mistaken they still don't provide a programmatic way of accessing analysis results (BAM files, variant calls, enrichment tracks, etc.). Neither did the SRA, for that matter.

I believe our scientific community can do better.

**lh3** · 02-21-2011, 08:58 AM

To me, the easiest way to access data is a hierarchical FTP/HTTP directory containing all the fastq/sra/bam files, with a top-level TAB-delimited file briefly describing each file: the batch, species if applicable, type of data (metagenomics, RNA-seq, ChIP-seq, targeted, exome, whole-genome sequencing), sample name, number of sequences, average read length, barcode length, etc. (possibly also submission date). This is similar to what the 1000g project provides, but probably a little more comprehensive. A single XML file would also be fine, though I am happier with a TAB-delimited file.
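A minimal Python sketch of why a TAB-delimited index is convenient: a few lines of standard-library code are enough to load it and filter for the files you want. The column names and the sample rows below are made up for illustration. They are not the layout of any existing archive.

```python
import csv
import io

# Hypothetical index contents; a real one would be fetched from the
# top level of the FTP/HTTP directory.
SAMPLE_INDEX = (
    "path\tbatch\tspecies\tdata_type\tsample\tn_reads\tavg_read_len\n"
    "b1/s1.bam\tb1\tHomo sapiens\tChIP-seq\tS1\t1200000\t36\n"
    "b1/s2.fastq.gz\tb1\tHomo sapiens\tRNA-seq\tS2\t8000000\t76\n"
    "b2/s3.bam\tb2\tMus musculus\tChIP-seq\tS3\t950000\t36\n"
)


def read_index(handle):
    """Parse the TAB-delimited index into a list of dicts, one per file."""
    return list(csv.DictReader(handle, delimiter="\t"))


def select(rows, **criteria):
    """Keep rows whose columns match every given column=value pair."""
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in criteria.items())
    ]


rows = read_index(io.StringIO(SAMPLE_INDEX))
for row in select(rows, data_type="ChIP-seq", species="Homo sapiens"):
    print(row["path"], row["n_reads"])
```

The same file also opens directly in a spreadsheet or with `cut` and `grep`, which is much of the appeal over XML.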

**jkbonfield** · 02-21-2011, 09:56 AM

People seem to be arguing different points here, although possibly it indicates the problem SRA tried to solve wasn't the primary issue the community faces. I'm not sure.

Anyway SRA was designed to store the primary data. That was originally trace files, but later just the raw calls and confidence values. The purpose was to allow any analysis to be rerun on the input data so we can reproduce results or "upgrade" results by using a newer set of analysis tools.

More recently the discussions here seem to be centred around storage and retrieval of analysis results: aligned BAM files, SNP VCF files, etc. Heng has been involved in a variety of formats here to tackle such things (BAM, tab-delimited "tabix" indexed files, etc). Samtools also has existing code to download portions of BAM files on the fly via HTTP or FTP for any specific region, so it works neatly with existing web protocols. I'm unsure of security and SSL concerns, but either way it's a solid start.
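For anyone who hasn't tried the remote access, a small Python sketch of how one might drive it: `samtools view` accepts a URL in place of a local path plus a region string, provided the `.bai` index sits next to the BAM on the server. The URL below is a placeholder and the helper name is invented for this example. The command is only assembled here, not executed.

```python
import shlex


def remote_region_cmd(bam_url, chrom, start, end):
    """Build a `samtools view` command that fetches one region of a remote BAM.

    Assumes samtools is installed and that an index (<bam_url>.bai) is
    published alongside the BAM, so only the needed blocks are downloaded.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    region = f"{chrom}:{start}-{end}"
    # -b asks for BAM output instead of SAM text.
    return ["samtools", "view", "-b", bam_url, region]


# Hypothetical URL, for illustration only.
cmd = remote_region_cmd("http://data.example.org/sample1.bam", "chr1", 1000, 2000)
print(shlex.join(cmd))
# To actually run it, pass `cmd` to subprocess.run(..., check=True)
# and redirect stdout to a file.
```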

Obviously the two scenarios aren't quite the same, but with careful consideration perhaps they can be merged. For example, if the aligned BAM files contain all qualities (there have been discussions about only storing qualities for sites that differ from the reference), mark duplicates via flags rather than removing them, keep the original qualities instead of recalibrated ones, and store all unmapped reads, then and only then can we extract the primary data back from the aligned BAM.
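A sketch of what that extraction involves, written against plain SAM text lines so it needs no libraries. It assumes the original qualities, where recalibration happened, are kept in an `OQ:Z:` tag (a common convention), and that reads were not hard-clipped. The function name is invented for this example.

```python
# Complement table for reverse-strand reads.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_line):
    """Turn one SAM alignment line back into a FASTQ record, or None.

    Secondary and supplementary lines are skipped so each read is
    emitted once. Duplicate-flagged (0x400) and unmapped (0x4) reads
    are kept, since they are part of the primary data.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x900:  # 0x100 secondary, 0x800 supplementary
        return None
    for tag in fields[11:]:
        if tag.startswith("OQ:Z:"):  # original, pre-recalibration qualities
            qual = tag[5:]
    if flag & 0x10:  # stored reverse-complemented; restore read orientation
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        name += "/1"
    elif flag & 0x80:
        name += "/2"
    return f"@{name}\n{seq}\n+\n{qual}\n"


line = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIH#\tOQ:Z:ABCD"
print(sam_to_fastq(line), end="")
```

If any of the conditions above is violated (duplicates removed, qualities dropped, unmapped reads discarded), no amount of post-processing like this gets the original input back.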

Is it worth it? Perhaps not. Sorted BAM is an ok format for storing primary data (as it's relatively compact, although not the best, and well understood), but some groups will want to do so much processing of their bams they'll need a second copy anyway.

**lh3** · 02-21-2011, 10:39 AM

My preference is to store raw reads (without trimming, alignment-based recalibration or duplicate removal) in the BAM format, in the order in which the reads come off sequencing machines. Reads may be optionally mapped. I do see the advantages of keeping sorted alignment, but even if no information is lost, unsorted reads are more convenient if we want to redo alignment. It would be good to keep unsorted and sorted data, but this leads to duplicates and may deviate from the intention of SRA.

Anyway, I agree with James that the most important goal of SRA is to store the primary data. Alignment without the loss of primary data, though preferred, comes only in the second place. I do not think SNPs and other annotations should go into SRA. These are the objective of a 3rd-party database, not SRA.

**Joann** · 03-29-2011, 12:02 PM

More Discussion

Editorial at Genome Biology

"Closure of the NCBI SRA and implications for the long-term future of genomics data storage".

doi:10.1186/gb-2011-12-3-402

**nickloman** · 05-11-2011, 07:01 AM

Short Read Archive reprieve!

Home - SRA - NCBI

http://www.ncbi.nlm.nih.gov/sra/

Sequence Read Archive (SRA) is still in service.
Recently, NCBI announced that due to budget constraints, it would be discontinuing its Sequence Read Archive (SRA) and Trace Archive repositories for high-throughput sequence data. However, NIH has since committed interim funding for SRA in its current form until October 1, 2011. In addition, NCBI has been working with staff from other NIH Institutes and NIH grantees to develop an approach to continue archiving a widely used subset of next generation sequencing data after October 1, 2011.

We now plan to continue handling sequencing data associated with:

- RNA-Seq, ChIP-Seq, and epigenomic data that are submitted to GEO
- Genomic and Transcriptomic assemblies that are submitted to GenBank
- Genomic assemblies to GenBank/WGS
- 16S ribosomal RNA data associated with metagenomics that are submitted to GenBank

In addition, NCBI will continue to provide access to existing SRA and Trace Archive data for the foreseeable future. NCBI is also continuing to discuss with NIH Institutes approaches for handling other next-generation sequencing data associated with specific large-scale studies.
