**GERALD** · 02-14-2011, 04:47 PM

Are we sure this is real? I hope that one or more private companies have the foresight to step up to the plate on this. The commercial potential would be enormous. They just have to be big enough to cover the enormous overhead of the data. The ad revenue alone would be incentive enough. Can we make a collective appeal to say... Google?

**flxlex** · 02-14-2011, 09:54 PM

Originally posted by nickloman

Where will you submit your data now?

The European nucleotide archive?

ENA Browser

http://www.ebi.ac.uk/ena/about/page.php?page=sra_submissions

**mwatson** · 02-16-2011, 01:00 AM

Hmmm, I wonder how this sits with the following article though?

President Obama Proposes Budget Increases for NIH, CDC, NSF, and FDA

For me this is very worrying as it represents a big change in the way in which biodata is managed. NCBI, EBI and DDBJ have *always* managed public, biological data. That's what they do and we love them for it. If the NCBI pull out of it now, even if it is just the SRA (just? just the largest collection of data from one of the most exciting technologies on the planet right now...), it's a worrying development.

**mwatson** · 02-16-2011, 01:39 AM

Originally posted by Michael.James.Clark

I'm also not convinced about simply sharing variants. While it's true that it will save a lot of storage space, variants are not inherently comparable. Sequencing platform plays a role, but even more important are the substantial improvements in alignment and variant detection over the past few years. Realign and re-call variants on the Watson genome and I bet you'll end up with vastly different numbers from what were reported, for example. But if you just have the variants, you can't realign and re-call, and therefore you can't really use that data for a true comparison.

Isn't the proposal to store variants in such a way that the original reads can be reconstructed?

**NGSfan** · 02-16-2011, 05:25 AM

I never liked the SRA. It was incredibly difficult to get data out of it - not to mention to know what datasets you are getting!

Another thing - why this bloated SRF format?

Why aren't we just uploading bam files?

They come already with the read quality scores, aligned and compressed. You can then load it up into a viewer and easily see what the authors saw in their results.

And if you like, you can extract the sequences (Bam to Fastq) and realign them yourself with your favorite aligner.

Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.

**mwatson** · 02-16-2011, 05:36 AM

Originally posted by NGSfan

Why aren't we just uploading bam files?

I think there is a problem with archiving derived data rather than raw.

Specifically, the SAM/BAM format can be variable between aligners and/or options; for instance, if hard clipping is enabled, you would not be able to get back the full fastq from BAM; nor would you if the sequences had been trimmed before aligning.
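To make the hard-clipping point concrete, here is a minimal sketch that inspects a CIGAR string and reports how many bases were hard clipped, i.e. how many bases of the original read are no longer present in the record. The function names are made up for illustration; the only thing taken from the SAM format is the meaning of the CIGAR operations (`H` bases are absent from the stored sequence, `S` bases are still present).

```python
import re

# One CIGAR element is a count followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '10H90M' into [(10, 'H'), (90, 'M')]."""
    return [(int(count), op) for count, op in CIGAR_ELEMENT.findall(cigar)]


def hard_clipped_bases(cigar):
    """Number of read bases removed from the record by hard clipping."""
    return sum(count for count, op in parse_cigar(cigar) if op == "H")


def stored_sequence_length(cigar):
    """Number of read bases still stored in the record's sequence field.

    M, I, S, = and X all consume bases that are present in the record;
    H (hard clip), D, N and P do not.
    """
    return sum(count for count, op in parse_cigar(cigar) if op in "MIS=X")


def is_fully_recoverable(cigar):
    """True if no bases were hard clipped from this alignment record."""
    return hard_clipped_bases(cigar) == 0
```

A record with `10H90M` keeps only 90 of the original 100 bases, whereas `10S90M` keeps all 100. This check says nothing about trimming done before alignment, which leaves no trace in the CIGAR string at all.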

**jkbonfield** · 02-16-2011, 06:11 AM

Originally posted by NGSfan

I never liked the SRA. It was incredibly difficult to get data out of it - not to mention to know what datasets you are getting!

Another thing - why this bloated SRF format?

Why aren't we just uploading bam files?

Firstly, 100% agreed on the difficulty of extracting data!

SRF was a temporary thing really and I think even before this announcement NCBI were preferring BAM submissions.

Basically, view SRF as analogous to what AB1/SCF/ZTR was for capillary trace data. Without storing trace data people wouldn't have been able to obtain old data, recall with Phred (which was far better than the original ABI software) and reassemble. This did happen using the old public trace archives, albeit very rarely and probably not at all at the end.

So it was felt in the very early days of the "next gen" sequencing technologies that storing the trace data would allow for third party applications to be developed that improved on the instruments' own software. This also did happen - e.g. swift and AYB - but once again very few people attempted to apply these newer tools to old published data sets.

Hence SRF's days were numbered and I'm pleased to see it retired. (I'd dispute the "bloated" bit - it's heavily compressed and sometimes even comparable in size to the more extreme bloated BAMs with recalibrated confidence values and secondary calls + confidences. It's just that it contains a LOT of data which we no longer deem as valuable.) Of course, given that it was nigh on impossible to actually obtain the raw traces out of NCBI, which only offered easy access to fastq, it was rather pointless them ever offering to store traces in the first place.

I'd also be interested to know the access patterns of these data sets though. I suspect they have a severe drop-off based on age. E.g. recent data sets may get accessed a lot, but then after a year there's very little - maybe none at all. This suggests a staged data aging policy would work, perhaps ending with totally off-line storage for old dormant data sets. Attempting to keep everything online forever just isn't going to work when "everything" is an exponentially growing quantity. It was just a matter of time before people realised SRA couldn't be viable long term without some major rethinking of data aging policies.

**kmcarr** · 02-16-2011, 01:50 PM

Official news release from NCBI:

http://www.ncbi.nlm.nih.gov/About/news/16feb2011

They are also discontinuing the peptidome repository, not that we NGSers care about that.

**Richard Finney** · 02-16-2011, 02:14 PM

Plan B ?

What's Plan B?

**Joann** · 02-16-2011, 03:04 PM

plan B?

According to the official announcement:

Over the next several months, NCBI will be working with staff from NIH Institutes that fund large-scale sequencing efforts to develop an approach for future access to and storage of the existing data.

So if the sequencing was performed for an NIH-funded project at a large-scale facility, open access to and storage of data (including existing data) is going to be discussed as described above.

As such, the most pressing need for plan B at this point appears to encompass non-NIH funded sequencing where journal article publication of summary results would still involve a public data repository obligation.

**Michael.James.Clark** · 02-16-2011, 04:00 PM

Well, at least it's not a rumor anymore. Something will have to come along to fill that void, though.

Doesn't SRF predate BAM? At least the final version of the SAM standard. I seem to recall discussions on the BAM format still going on well after the SRA's establishment.

BAM is certainly a fairly obvious option. I'm not sure how much space it saves compared to compressed FASTQs (if any?).

Originally posted by mwatson

Isn't the proposal to store variants in such a way that the original reads can be reconstructed?

How do we do that without storing the original read off the machine?

**fpepin** · 02-16-2011, 04:46 PM

Originally posted by NGSfan

Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.

But you can do that losslessly as well if you have the same reference genome: store start position, end position, differences with reference.

That still leaves the non-aligned reads and the quality scores, so it's not a magical solution but it's still a big step forward.

There have got to be some decent efforts going in that direction, or is there something trivial that I'm missing?

**Michael.James.Clark** · 02-16-2011, 05:07 PM

EMBL-EBI will continue to support SRA for raw data storage: http://www.ebi.ac.uk/ena/SRA_announcement_Feb_2011.pdf

Originally posted by fpepin

But you can do that losslessly as well if you have the same reference genome: store start position, end position, differences with reference.

That still leaves the non-aligned reads and the quality scores, so it's not a magical solution but it's still a big step forward.

There have got to be some decent efforts going in that direction, or is there something trivial that I'm missing?

Storing only variants is downstream of alignment and variant calling. It is inherently not raw data. The service SRA provided was basically storing raw data.

Said processed data is not comparable with new analyses that use different alignment algorithms and variant callers.

That right there is why the set of variants is not adequate. It cannot be adequately re-analyzed or re-assessed.

Once we're at a point where the community is satisfied that the set of detected variants is never going to improve in sensitivity and specificity, then we can store only variants. Until then, however, we ought to be storing raw data. Once we are at that point, older data should be brought up to those standards as well, really.

**fpepin** · 02-16-2011, 05:41 PM

Originally posted by Michael.James.Clark

Storing only variants is downstream of alignment and variant calling. It is inherently not raw data. The service SRA provided was basically storing raw data.

I'm talking about using it as a compression feature. Imagine a toy example where 90% of the reads map exactly to the reference (no SNPs/indels) and the rest don't map at all. A position and a length are cheaper to store than a 100bp read. Then keep the other 10% raw and you've just had a 10-fold compression or so. Since many projects share the reference genome, that only has to be stored once. Of course, the 90% compression is wildly optimistic once you consider SNPs, quality scores, etc. Still, you should be able to get some pretty good rates.
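For a rough feel of the numbers in the toy example, here is a back-of-the-envelope sketch. The byte costs are assumptions chosen for illustration (one byte per raw base, a handful of bytes for a position plus a length); quality scores and the one-off cost of storing the reference are ignored.

```python
def compression_ratio(read_len, mapped_fraction, bytes_per_mapped_read,
                      bytes_per_base=1.0):
    """Ratio of raw size to compressed size, averaged per read.

    Exactly mapped reads cost `bytes_per_mapped_read` each (position plus
    length); unmapped reads are kept raw at `bytes_per_base` per base.
    """
    raw = read_len * bytes_per_base
    compressed = (mapped_fraction * bytes_per_mapped_read
                  + (1.0 - mapped_fraction) * raw)
    return raw / compressed


# Upper bound: mapped reads cost nothing, so only the raw 10% remains.
best_case = compression_ratio(100, 0.9, bytes_per_mapped_read=0)

# Assumed 5 bytes per mapped read (4-byte position + 1-byte length).
with_positions = compression_ratio(100, 0.9, bytes_per_mapped_read=5)
```

Under these assumptions the 10-fold figure is the ceiling reached only when a mapped read costs next to nothing; at 5 bytes per mapped read the ratio is closer to 7-fold, and the unmapped 10% dominates the compressed size either way.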

**Fabien Campagne** · 02-16-2011, 07:38 PM

fpepin is perfectly correct about the compression approach. We have been using this approach in the Goby alignment format for about two years (see http://campagnelab.org/software/goby...ing-with-goby/). We store only how the reads differ from the reference, and the read sequence can be reconstructed quickly when the reference sequence is given. This is how we reconstruct read sequences to display in IGV (you need the development version of IGV to view Goby alignments at this stage).

We do not store unmapped reads in the default configuration, but the format makes adding new fields very easy, so we could easily add optional fields for unmapped reads and corresponding quality scores. Without the unmapped reads, we typically obtain files 80-90% smaller than BAM files.
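As an illustration of the general idea only (this is not the Goby format, and the function names are invented for the example), a read aligned without gaps can be stored as a start position, a length, and the list of bases that differ from the reference, then rebuilt on demand:

```python
def encode_read(reference, start, read):
    """Encode an ungapped read as (start, length, substitutions).

    Each substitution is (offset_in_read, read_base). Indels are not
    handled in this sketch.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base)
             for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return start, len(read), diffs


def decode_read(reference, start, length, diffs):
    """Rebuild the read sequence from the reference and stored differences."""
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
encoded = encode_read(reference, 2, "GTTCG")   # one mismatch vs. "GTACG"
restored = decode_read(reference, *encoded)
```

A read that matches the reference exactly has an empty difference list, which is where the savings come from. Quality scores and unmapped reads would still need to be stored separately, as noted above.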
