  • #46
    The trouble with any non-centralised solution is consistency. Databases like Ensembl and UCSC have been around for a long time, as have the sequence archives (and long may they continue). How many labs have the resources to host their data pretty much forever? I suspect the answer to that question is very few.


    • #47
      Originally posted by Nix View Post
      The problem with both of these approaches is the lack of a formal language to associate annotation with the data. Species, genome version, and file format are often anyone's guess. Every group encodes this information differently, so making use of such data requires manual download and interpretation.

      Don't believe me? Try intersecting ChIP-seq peak calls from 20 different datasets hosted by UCSC, Ensembl, and GMOD. That's at minimum 1 day of work by a skilled bioinformatician. Now do it again and again with each new release. Analysis applications that speak DAS can do this in minutes with no human error.

      DAS provides the "glue language" (in XML) for brokering queries and returning slices of data based on coordinates or feature searches. It is defined by a community vetted specification with several server and client implementations. Among other things these define how you represent each species, their genome, and their coordinate system.
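      As a rough sketch of what that XML "glue" looks like, here is a hypothetical DASGFF-style features response. The server URL, feature ids, and values below are invented for illustration; the authoritative element names and attributes are defined in the DAS specification itself:

```xml
<!-- Hypothetical response to a coordinate query such as
     http://example.org/das/hg18/features?segment=chr1:100000,200000 -->
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/hg18/features">
    <SEGMENT id="chr1" start="100000" stop="200000" version="hg18">
      <FEATURE id="peak_001" label="ChIP-seq peak">
        <TYPE id="binding_site">binding site</TYPE>
        <METHOD id="MACS">peak calling</METHOD>
        <START>150200</START>
        <END>150650</END>
        <SCORE>87.5</SCORE>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
```

      Because the species, genome build, and coordinate system travel with the response, a client can intersect slices from many servers without guessing at file conventions.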

      Another key advantage of DAS is that it allows one to separate data distribution from data use. I love the UCSC data tracks but can't always use their browser to analyze them.
      Honestly, if I have to choose between setting up and maintaining my own server and uploading my data to a centralized depository, there really is no choice at all.

      I think we'd be lucky if labs would even provide you with more than a list of variants if they were forced to host data themselves. That's the whole reason for having a centralized host. Most people these days know how to deal with FTP or HTTP, so that's what we'd end up with.

      SRA was a good idea with a clunky implementation. A group, government or academic, ought to pick up the mantle and host the world's genomic data. Hey, it could be you. Host it all on a DAS/2 server.
      Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
      Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
      Projects: U87MG whole genome sequence [Website] [Paper]


      • #48
        central solution

        This is where I wish our institutional libraries would come to mind. They build and maintain the long-term academic infrastructure and well understand inter-institutional standards (publishing standards too). Why can't institutional repositories hosting open-access scientific research include next-gen sequencing data? If a set of deposit and access standards were worked out and agreed upon (as is now being discussed in this thread), with start-up funding from a collaborative grant proposal, library consortia could be forged and linked up to accomplish specialized sequence data deposit and access, large and small. This is truly the kind of non-profit academic research purpose that advances society and science, and it builds upon our existing, traditional academic resources.


        • #49
          Originally posted by Joann View Post
          This is where I wish use of our institutional libraries would come to mind....
          Funny you should mention this. I've shared many a beer with librarians over a campfire and they are very keen on doing just this sort of thing. They are way into the proper annotation and curation of data too.

          For those "centralists" in the group... In theory one can pull data off of one DAS server and host it on your own. Thus a centralized DAS server could be built that continuously updates its repository from other DAS servers. This kind of defeats the purpose of a Distributed Annotation System though.

          Large groups (Institutions, Universities, Journals, NIH (if they can fund it)) would be the best final repositories for genomic data. The SRA was swamped almost from the start. I think the only way to keep up with the deluge is to distribute the data.


          • #50
            The NCBI is part of the United States National Library of Medicine.


            • #51
              Can DAS or another off-the-shelf system address the concern for security? US government-sponsored research carries some weighty patient-privacy restrictions. I'm not sure that dishing up BAMs via FTP by just "setting up something as simple as installing apache" is going to work. I'd like it to, but I'm thinking someone's going to say "no go". I hope there's an easing of the "must lock down data; only high priests can even think about looking at the data" mentality. But we're not there yet.
              Last edited by Richard Finney; 02-19-2011, 10:47 AM.


              • #52
                The ENA will continue to accept open-access data, and the EGA will continue to accept human data with consent agreements and Data Access Committees. Let's hope these changes don't stop people releasing data into the public domain; certainly all the 1000 Genomes data will remain freely available to everyone.


                • #53
                  For international database collaborations (INSDC)

                  Quoting directly from the ENA web page:

                  "The European Nucleotide Archive (ENA) accepts data generated by next-generation sequencing methodologies such as 454, Illumina Genome Analyzer and ABI SOLiD into the Sequence Read Archive (SRA). ENA works in close collaboration with the NCBI and DDBJ as part of the International Nucleotide Sequence Database Collaboration (INSDC). All submitted public data is exchanged between the partners on a daily basis. All three partners use the same data and metadata formats.

                  For all questions and enquiries please contact [email protected]."


                  • #54
                    Originally posted by Nix View Post
                    Why aren't people hosting and publishing their own data? There's no need to centralize this activity.

                    You are responsible for providing the plasmids, cell lines, etc. used in your papers, so why not the genomic data too, in both its raw and processed form?
                    I was at a meeting at NCBI a few years back, before SRA got off the ground, to discuss how it should all work and explain things to the main sequencing centres. I dared to ask why people didn't want a federated service instead, or at least a central store for metadata with redirections to the labs' own stores of data.

                    The question was met with pretty much universal dismay and disagreement. I later realised why: it costs money, time, and effort to host data. NCBI were promising to do this for everyone, essentially solving all their problems. Why would you take the hard route of storing it yourself (and the harder route still of agreeing with all the other centres to do it in a uniform manner) when NCBI will take the data off your hands for free and do all the hard work for you?


                    • #55
                      Originally posted by Richard Finney View Post
                      Can DAS or another off-the-shelf system address the concern for security?
                      DAS servers are web apps and as such can take advantage of the same security protocols worked out for banks and hospitals (SSL, HTTPS, digest authentication, VPN). With our GenoPub server, visibility of each dataset is set to either the owner, the lab, particular defined collaborators of the lab, the institute, or the public.
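                      Tiered visibility of that sort boils down to a simple access check. The sketch below is hypothetical (the tier names and record fields are invented, not GenoPub's actual code), but it shows the shape of the idea:

```python
from enum import Enum

class Visibility(Enum):
    """Hypothetical visibility tiers, narrowest to widest."""
    OWNER = 1
    LAB = 2
    COLLABORATORS = 3
    INSTITUTE = 4
    PUBLIC = 5

def can_view(dataset, user):
    """Return True if `user` may see `dataset` under its visibility tier."""
    v = dataset["visibility"]
    if v is Visibility.PUBLIC:
        return True
    if v is Visibility.INSTITUTE:
        return user["institute"] == dataset["institute"]
    if v is Visibility.COLLABORATORS:
        # Named collaborators, plus everyone in the owning lab.
        return (user["name"] in dataset["collaborators"]
                or user["lab"] == dataset["lab"])
    if v is Visibility.LAB:
        return user["lab"] == dataset["lab"]
    return user["name"] == dataset["owner"]
```

                      The server-side enforcement point is what matters: the same HTTPS endpoint serves everyone, and the check decides per request what each authenticated user can see.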

                      As for getting NCBI to take care of the problem: good luck. The SRA was swamped from day one. I doubt SRA 2.0 can do any better without a significant increase in resources, which currently are slated to go to the Dept of Defense (4.7% budget increase for 2012!).

                      If the ENA is willing to host all of the US data, great, but, if I'm not mistaken, they still don't provide a programmatic way of accessing analysis results (BAM files, variant calls, enrichment tracks, etc.). Neither did the SRA, for that matter.

                      I believe our scientific community can do better.


                      • #56
                        To me, the easiest way to access data is a hierarchical FTP/HTTP directory containing all the fastq/sra/bam files, plus a top-level TAB-delimited file briefly describing each file: the batch, species (if applicable), type of data (metagenomics, RNA-seq, ChIP-seq, targeted, exome, or whole-genome sequencing), sample name, number of sequences, average read length, barcode length, and possibly submission date. Something similar to what the 1000g project is providing, but probably a little more comprehensive. A single XML file would also be fine, though I am happier with a TAB-delimited file.
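                        A sketch of what consuming such an index might look like. The column names and paths below are invented for illustration; the point is that a plain TAB-delimited file needs nothing beyond stdlib CSV parsing:

```python
import csv
import io

# Hypothetical top-level index describing each file in the archive;
# the columns and paths are invented for illustration.
INDEX = """\
path\tspecies\tdata_type\tsample\tn_reads\tavg_read_len
batch1/s1.fastq.gz\tHomo sapiens\tRNA-seq\tNA12878\t42000000\t76
batch1/s2.bam\tMus musculus\tChIP-seq\tmESC_rep1\t18500000\t36
"""

def load_index(text):
    """Parse the tab-delimited index into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = load_index(INDEX)

# Selecting all ChIP-seq files for download is a one-liner.
chip = [r["path"] for r in rows if r["data_type"] == "ChIP-seq"]
```

                        In practice the same loop would read the index straight off the FTP/HTTP directory and then fetch only the files whose metadata matches.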


                        • #57
                          People seem to be arguing different points here, although possibly it indicates the problem SRA tried to solve wasn't the primary issue the community faces. I'm not sure.

                          Anyway SRA was designed to store the primary data. That was originally trace files, but later just the raw calls and confidence values. The purpose was to allow any analysis to be rerun on the input data so we can reproduce results or "upgrade" results by using a newer set of analysis tools.

                          More recently the discussions here seem to be centred around storage and retrieval of analysis results: aligned BAM files, SNP VCF files, etc. Heng has been involved in a variety of formats here to tackle such things (BAM, tab-delimited "tabix"-indexed files, etc.). Samtools also has existing code to download portions of BAM files over HTTP or FTP on the fly for any specific region, so it works neatly with existing web protocols. I'm unsure of security and SSL concerns, but either way it's a solid start.
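                          That region-fetch trick rests on ordinary HTTP range requests: the BAM index tells the client which byte span it needs, and the server returns just that slice. This stdlib-only sketch demonstrates the mechanism against a throwaway in-process server (no samtools or real BAM involved; the data is a stand-in byte string):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = bytes(range(256)) * 4  # stand-in for an indexed BAM on a web server

class RangeHandler(BaseHTTPRequestHandler):
    """Serve DATA, honouring simple `bytes=start-end` Range headers."""
    def do_GET(self):
        rng = self.headers.get("Range")
        if rng and rng.startswith("bytes="):
            start, end = (int(x) for x in rng[6:].split("-"))
            body = DATA[start:end + 1]
            self.send_response(206)  # Partial Content
            self.send_header("Content-Range",
                             f"bytes {start}-{end}/{len(DATA)}")
        else:
            body = DATA
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client that knows (from an index) it needs bytes 16-31 asks for
# exactly that slice instead of downloading the whole file.
url = f"http://127.0.0.1:{server.server_port}/reads.bam"
req = urllib.request.Request(url, headers={"Range": "bytes=16-31"})
with urllib.request.urlopen(req) as resp:
    chunk = resp.read()  # only the requested bytes travel the wire

server.shutdown()
```

                          Any stock web server that supports range requests can therefore serve random access into BAMs with no special software on the hosting side.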

                          The two scenarios aren't quite the same, but with careful consideration perhaps they can be merged. For example, if the aligned BAM files contain all qualities (there have been discussions about only storing qualities for sites that differ from the reference), mark duplicates via flags rather than removal, keep the original qualities instead of recalibrated ones, and store all unmapped reads, then and only then can we extract the primary data back from the aligned BAM.

                          Is it worth it? Perhaps not. Sorted BAM is an ok format for storing primary data (as it's relatively compact, although not the best, and well understood), but some groups will want to do so much processing of their bams they'll need a second copy anyway.


                          • #58
                            My preference is to store raw reads (without trimming, alignment-based recalibration or duplicate removal) in the BAM format, in the order in which the reads come off sequencing machines. Reads may be optionally mapped. I do see the advantages of keeping sorted alignment, but even if no information is lost, unsorted reads are more convenient if we want to redo alignment. It would be good to keep unsorted and sorted data, but this leads to duplicates and may deviate from the intention of SRA.

                            Anyway, I agree with James that the most important goal of SRA is to store the primary data. Alignment without loss of the primary data, though preferred, comes only in second place. I do not think SNPs and other annotations should go into SRA; they are the business of a third-party database, not of SRA.
                            Last edited by lh3; 02-21-2011, 10:42 AM.


                            • #59
                              More Discussion

                              Editorial at Genome Biology

                              "Closure of the NCBI SRA and implications for the long-term future of genomics data storage".



                              • #60
                                Short Read Archive reprieve!

                                Sequence Read Archive (SRA) is still in service.
                                Recently, NCBI announced that due to budget constraints, it would be discontinuing its Sequence Read Archive (SRA) and Trace Archive repositories for high-throughput sequence data. However, NIH has since committed interim funding for SRA in its current form until October 1, 2011. In addition, NCBI has been working with staff from other NIH Institutes and NIH grantees to develop an approach to continue archiving a widely used subset of next generation sequencing data after October 1, 2011.

                                We now plan to continue handling sequencing data associated with:

                                • RNA-Seq, ChIP-Seq, and epigenomic data that are submitted to GEO
                                • Genomic and transcriptomic assemblies that are submitted to GenBank
                                • Genomic assemblies submitted to GenBank/WGS
                                • 16S ribosomal RNA data associated with metagenomics that are submitted to GenBank
                                In addition, NCBI will continue to provide access to existing SRA and Trace Archive data for the foreseeable future. NCBI is also continuing to discuss with NIH Institutes approaches for handling other next-generation sequencing data associated with specific large-scale studies.

