Are we sure this is real? I hope that one or more private companies have the foresight to step up to the plate on this. The commercial potential would be enormous. They just have to be big enough to cover the enormous overhead of the data. The ad revenue alone would be incentive enough. Can we make a collective appeal to say... Google?
-
Originally posted by nickloman: Where will you submit your data now?
Comment
-
Hmmm, I wonder how this sits with the following article though?
President Obama Proposes Budget Increases for NIH, CDC, NSF, and FDA
For me this is very worrying as it represents a big change in the way in which biodata is managed. NCBI, EBI and DDBJ have *always* managed public, biological data. That's what they do and we love them for it. If the NCBI pull out of it now, even if it is just the SRA (just? just the largest collection of data from one of the most exciting technologies on the planet right now...), it's a worrying development.
Comment
-
Originally posted by Michael.James.Clark: I'm also not convinced about simply sharing variants. While it's true that it will save a lot of storage space, variants are not inherently comparable. Sequencing platform plays a role, but even more important are the significant improvements in alignment and variant detection over the past few years. Realign and re-call variants on the Watson genome and I bet you'll end up with vastly different numbers from what were reported, for example. But if you just have the variants, you can't realign and re-call, and therefore you can't really use that data for a true comparison.
Comment
-
I never liked the SRA. It was incredibly difficult to get data out of it - not to mention to know what datasets you are getting!
Another thing - why this bloated SRF format?
Why aren't we just uploading bam files?
They come already with the read quality scores, aligned and compressed. You can then load it up into a viewer and easily see what the authors saw in their results.
And if you like, you can extract the sequences (Bam to Fastq) and realign them yourself with your favorite aligner.
Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.
Comment
-
Originally posted by NGSfan: Why aren't we just uploading bam files?
Specifically, the SAM/BAM format can be variable between aligners and/or options; for instance, if hard clipping is enabled, you would not be able to get back the full fastq from BAM; nor would you if the sequences had been trimmed before aligning.
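To illustrate the hard-clipping point, here is a minimal sketch that inspects a CIGAR string and reports how many bases were hard clipped, i.e. are absent from the stored sequence. The function names are made up for this example; only the CIGAR operation codes come from the SAM specification. It cannot detect reads that were trimmed before alignment, since that leaves no trace in the record.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def hard_clipped_bases(cigar: str) -> int:
    """Return the number of bases removed by hard clipping (H)."""
    if cigar == "*":  # CIGAR not available for this record
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "H")


def stored_sequence_length(cigar: str) -> int:
    """Bases kept in SEQ: the ops M, I, S, = and X consume the read."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


if __name__ == "__main__":
    for cigar in ("50M", "10S40M", "5H45M"):
        lost = hard_clipped_bases(cigar)
        kept = stored_sequence_length(cigar)
        print(f"{cigar}: {kept} bases stored, {lost} lost to hard clipping")
```

A soft-clipped read (S) keeps its full sequence in the record, so only records with H operations lose bases when converted back to fastq.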
Comment
-
Originally posted by NGSfan: I never liked the SRA. It was incredibly difficult to get data out of it - not to mention to know what datasets you are getting!
Another thing - why this bloated SRF format?
Why aren't we just uploading bam files?
SRF was a temporary thing really and I think even before this announcement NCBI were preferring BAM submissions.
Basically view SRF as analogous to what AB1/SCF/ZTR were for capillary trace data. Without storing trace data people wouldn't have been able to obtain old data, recall with Phred (which was far better than the original ABI software) and reassemble. This did happen using the old public trace archives, albeit very rarely and probably not at all at the end.
So it was felt in the very early days of the "next gen" sequencing technologies that storing the trace data would allow for third party applications to be developed that improved on the instruments' own software. This also did happen - eg swift and AYB - but once again very few people attempted to apply these newer tools to old published data sets.
Hence SRF's days were numbered and I'm pleased to see it retired. (I'd dispute the "bloated" bit - it's heavily compressed and sometimes even comparable in size to the more extreme bloated BAMs with recalibrated confidence values and secondary calls + confidences. It's just that it contains a LOT of data which we no longer deem as valuable.) Of course given that it was nigh on impossible to actually obtain the raw traces out of NCBI, which only offered easy access to fastq, it was rather pointless them ever offering to store traces in the first place.
I'd also be interested to know the access patterns of these data sets though. I suspect they have a severe drop off based on age. Eg recent data sets may get accessed a lot, but then after a year there's very little - maybe none at all. This suggests a staged data aging policy would work, perhaps ending with totally off-line storage for old dormant data sets. Attempting to keep everything online forever just isn't going to work when "everything" is an exponentially growing quantity. It was just a matter of time before people realised SRA couldn't be viable long term without some major rethinking of data aging policies.
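As a rough sketch of what a staged aging policy could look like, the snippet below assigns a data set to a storage tier based on how long ago it was last accessed. The tier names and the one-year and three-year thresholds are assumptions picked purely for illustration; no archive is known to use these values.

```python
from datetime import date, timedelta

# Hypothetical thresholds, ordered from most to least accessible tier.
TIERS = [
    (timedelta(days=365), "online"),        # accessed within the last year
    (timedelta(days=3 * 365), "nearline"),  # dormant for one to three years
]


def storage_tier(last_accessed: date, today: date) -> str:
    """Pick a storage tier from the time since the last access."""
    idle = today - last_accessed
    for limit, tier in TIERS:
        if idle <= limit:
            return tier
    return "offline"  # anything older goes to off-line storage


if __name__ == "__main__":
    today = date(2011, 2, 16)
    for accessed in (date(2011, 1, 1), date(2009, 6, 1), date(2005, 1, 1)):
        print(accessed, "->", storage_tier(accessed, today))
```

Real access logs would be needed to choose sensible thresholds; the point is only that the rule itself is simple once the access pattern is known.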
Comment
-
Official news release from NCBI:
They are also discontinuing the peptidome repository, not that we NGSers care about that.
Comment
-
plan B?
According to the official announcement:
Over the next several months, NCBI will be working with staff from NIH Institutes that fund large-scale sequencing efforts to develop an approach for future access to and storage of the existing data.
So if the sequencing was performed for an NIH funded project at a large scale facility, open access to and storage of data (including existing data) is going to be discussed as described above.
As such, the most pressing need for plan B at this point appears to encompass non-NIH funded sequencing where journal article publication of summary results would still involve a public data repository obligation.
Comment
-
Well, at least it's not a rumor anymore. Something will have to come along to fill that void, though.
Doesn't SRF predate BAM? At least the final version of the SAM standard. I seem to recall discussions on the BAM format still going on well after the SRA's establishment.
BAM is certainly a fairly obvious option. I'm not sure how much space it saves compared to compressed FASTQs (if any?).
Originally posted by mwatson: Isn't the proposal to store variants to store them in such a way that the original read can be reconstructed?
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Comment
-
Originally posted by NGSfan: Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.
That still leaves the non-aligned reads and the quality scores, so it's not a magical solution but it's still a big step forward.
There have got to be some decent efforts going in that direction or is there something trivial that I'm missing?
Comment
-
EMBL-EBI will continue to support SRA for raw data storage: http://www.ebi.ac.uk/ena/SRA_announcement_Feb_2011.pdf
Originally posted by fpepin: But you can do that losslessly as well if you have the same reference genome: store start position, end position, differences with reference.
That still leaves the non-aligned reads and the quality scores, so it's not a magical solution but it's still a big step forward.
There have got to be some decent efforts going in that direction or is there something trivial that I'm missing?
Said processed data is not comparable with new analyses that use different alignment algorithms and variant callers.
That right there is why the set of variants is not adequate. It cannot be adequately re-analyzed or re-assessed.
Once we're at a point where the community is satisfied that the set of detected variants is never going to improve in sensitivity and specificity, then we can store only variants. Until then, however, we ought to be storing raw data. Once we are at that point, older data should be brought up to those standards as well, really.
Comment
-
Originally posted by Michael.James.Clark: Storing only variants is downstream of alignment and variant calling. It is inherently not raw data. The service SRA provided was basically storing raw data.
Comment
-
fpepin is perfectly correct about the compression approach. We have been using this approach in the Goby alignment format for about two years (see http://campagnelab.org/software/goby...ing-with-goby/). We store only how the reads differ from the reference, and the read sequence can be reconstructed quickly when the reference sequence is given. This is how we reconstruct read sequences to display in IGV (you need the development version of IGV to view Goby alignments at this stage).
We do not store unmapped reads in default configuration, but the format makes adding new fields very easy, so we could easily add optional fields for unmapped reads and corresponding quality scores. Without the unmapped reads, we typically obtain files 80-90% smaller than BAM files.
Last edited by Fabien Campagne; 02-16-2011, 07:57 PM.
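For readers who want to see the general idea in code, here is a toy sketch of reference-based read storage: keep the start position, the read length and the bases that differ from the reference, then rebuild the read from the reference on demand. This is not the Goby format or its API; the function names are invented for the example, and it handles only substitutions in an ungapped alignment (no indels, no clipping, no quality scores).

```python
from typing import List, Tuple

Diff = Tuple[int, str]  # (offset within the read, base observed in the read)


def encode_read(reference: str, start: int, read: str) -> Tuple[int, int, List[Diff]]:
    """Encode an ungapped alignment as (start, length, mismatches)."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return start, len(read), diffs


def decode_read(reference: str, start: int, length: int, diffs: List[Diff]) -> str:
    """Rebuild the read by copying the reference and applying the mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    read = "GTTCGT"  # aligns at position 2 with one mismatch
    start, length, diffs = encode_read(reference, 2, read)
    print(start, length, diffs)
    print(decode_read(reference, start, length, diffs) == read)
```

The round trip is lossless for the sequence as long as the exact same reference is available at decode time, which is the condition fpepin mentions.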
Comment