  • vaughn
    replied
    Does anyone here have a good estimate on the storage footprint and bandwidth numbers for the SRA?



  • jkbonfield
    replied
    There is still room for shrinking BAM size too, even with the existing compression used. It's pretty trivial to reduce them by 30% or more (sometimes up to 50%, depending on the data) without any complex tricks or slow algorithms - even without using references. However, it's not the huge step change we need for the future, rather just a technique to make use of whenever BAM 2.0 comes along.

    The quality budget concept listed in the referenced paper makes a lot of sense. We know quality values are large and difficult to compress as they contain a lot of variability. However, downgrading their precision (do we need 40 discrete values? how about 20, or 5?) and also restricting qualities to only the interesting areas (bases that differ and known SNP locations) seems like a logical step towards losing the data we care about least.
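
    For illustration, here is a minimal Python sketch of that kind of quality binning. The four bins and their representative values are made up for the example, not the scheme from the paper:

    Code:
    # Illustrative quality binning: map Phred+33 quality characters onto a few
    # representative values, trading precision for compressibility.
    BINS = [(0, 2, 1), (3, 14, 8), (15, 29, 22), (30, 41, 37)]  # (low, high, representative)

    def bin_quality(qual_string, offset=33):
        out = []
        for ch in qual_string:
            q = ord(ch) - offset
            for lo, hi, rep in BINS:
                if lo <= q <= hi:
                    out.append(chr(rep + offset))
                    break
            else:
                out.append(chr(BINS[-1][2] + offset))  # clamp anything outside the bins
        return "".join(out)

    print(bin_quality("IIIIHHH###"))  # only two distinct symbols remain
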
    Last edited by jkbonfield; 02-18-2011, 01:31 AM.



  • Fabien Campagne
    replied
    Regarding Heng's 4), yes, 80-90% smaller than BAM is when quality scores are stored only for those bases that differ between the read and the reference, not for any flanking read sequence. This is quite useful for counting applications (e.g., RNA-Seq and ChIP-Seq). However, for variant discovery there are open questions. I am not aware of papers that compared performance when considering all base qualities vs. only quality scores for those bases that differ from the reference. Since base quality scores are usually strongly correlated with the position of the base in the read, it is not clear how much signal is left in the base quality score once you account for the positional effect. It would be interesting to evaluate this carefully, since so much storage can be saved by leaving out quality scores for bases that match the reference.
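
    For illustration, here is a small Python sketch of the "qualities only where the read differs" idea. The function sparse_qualities is made up for this example, ignores indels, and is not Goby's or the paper's actual code:

    Code:
    # Keep quality scores only at positions where the read differs from the
    # reference; positions that match get a single placeholder value.
    def sparse_qualities(read, ref_segment, quals, placeholder="!"):
        positions, kept = [], []
        degraded = list(placeholder * len(read))
        for i, (r, g) in enumerate(zip(read, ref_segment)):
            if r != g:
                positions.append(i)
                kept.append(quals[i])
                degraded[i] = quals[i]
        return positions, kept, "".join(degraded)

    read = "ACGTACGTAC"
    ref  = "ACGTTCGTAC"   # one mismatch, at position 4
    qual = "IIIIHIIIII"
    print(sparse_qualities(read, ref, qual))
    # -> ([4], ['H'], '!!!!H!!!!!')
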



  • lh3
    replied
    A few comments.

    1) SRF predates SAM/BAM

    2) The SRF compression algorithm is definitely more advanced and specialized than the BAM compression. SRF is usually larger simply because it stores far more information.

    3) BAM is almost certainly larger than compressed fastq in size.

    4) In Ewan Birney's paper, they also put a lot of effort into compressing base quality, which is the hardest part. Their algorithm is lossy when considering quality, but we can hardly compress the data down to 1/2 with a lossless algorithm. I guess being "80-90%" smaller than BAM refers to compression without full base quality.
    Last edited by lh3; 02-17-2011, 04:07 PM.



  • Fabien Campagne
    replied
    Thanks laura for a reference that should help tie this discussion together. We have used the reference in Goby only to compress alignments (Michael.James.Clark's points are well taken; this is not what SRA was storing), while this paper illustrates how the reference information can be used to also compress reads (what fpepin was referring to, I think).



  • laura
    replied
    With respect to efficient storage of raw data, people may be interested in:

    [link to a Genome Research paper on reference-based compression of sequencing data]



  • fpepin
    replied
    Originally posted by Michael.James.Clark View Post
    SRA's purpose was to hold all the raw data--the sequence off the machine plus strand and base qualities (and some optional meta information).

    [...]

    But simply saving start and end positions plus variations from the reference genome as fpepin suggested is not adequate to completely reconstruct the raw reads, I'm afraid.
    I think we are talking about the same thing. My point is that you could have significant savings in storage requirements by basing it off a reference when available.

    The quality scores are indeed a part that would not be affected by such a scheme. I mentioned the issues above, but I should have made it clearer. There are probably some clever ways to get some of the way there, but I kind of doubt we'd be able to get near the same level of compression as with the reads.

    It could very well be that it's not worth the effort to design such a scheme at this point, as you'd have added computational and administrative costs that might not be worth the saved space.



  • Michael.James.Clark
    replied
    Originally posted by Fabien Campagne View Post
    fpepin is perfectly correct about the compression approach. We have been using this approach in the Goby alignment format for about two years (see http://campagnelab.org/software/goby...ing-with-goby/). We store only how the reads differ from the reference, and the read sequence can be reconstructed quickly when the reference sequence is given. This is how we reconstruct read sequences to display in IGV (you need the development version of IGV to view Goby alignments at this stage).

    We do not store unmapped reads in default configuration, but the format makes adding new fields very easy, so we could easily add optional fields for unmapped reads and corresponding quality scores. Without the unmapped reads, we typically obtain files 80-90% smaller than BAM files.
    I think there is a disconnect here where some of us are talking about post-aligned data and some of us are talking about raw data off the machine (which is what SRA stored--basically compressed FASTQ files). SRA's purpose was to hold all the raw data--the sequence off the machine plus strand and base qualities (and some optional meta information).

    We want to keep that information for posterity for many reasons, not the least of which includes the ability to bring old data "up to date" and make it comparable to new data.

    Now, can you save space by not storing the bases that do not differ from the reference? Absolutely. I'd wager one could probably reduce the size of the data by about 50%, which is fantastic.

    But simply saving start and end positions plus variations from the reference genome as fpepin suggested is not adequate to completely reconstruct the raw reads, I'm afraid.



  • Fabien Campagne
    replied
    fpepin is perfectly correct about the compression approach. We have been using this approach in the Goby alignment format for about two years (see http://campagnelab.org/software/goby...ing-with-goby/). We store only how the reads differ from the reference, and the read sequence can be reconstructed quickly when the reference sequence is given. This is how we reconstruct read sequences to display in IGV (you need the development version of IGV to view Goby alignments at this stage).

    We do not store unmapped reads in default configuration, but the format makes adding new fields very easy, so we could easily add optional fields for unmapped reads and corresponding quality scores. Without the unmapped reads, we typically obtain files 80-90% smaller than BAM files.
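
    For readers who want the idea in concrete terms, here is a toy Python sketch of storing a read as a position plus its differences and rebuilding it from the reference. The encode/decode functions are made up for illustration (substitutions only, no indels or strand) and are not Goby's actual on-disk format:

    Code:
    # Toy reference-based read storage: keep only (start, length, substitutions)
    # and rebuild the read sequence from the reference on demand.
    def encode(read, ref, start):
        subs = [(i, b) for i, b in enumerate(read) if b != ref[start + i]]
        return (start, len(read), subs)

    def decode(record, ref):
        start, length, subs = record
        seq = list(ref[start:start + length])
        for offset, base in subs:
            seq[offset] = base
        return "".join(seq)

    ref  = "GATTACAGATTACAGATTACA"
    read = "ACAGATTCCA"            # aligns at position 4 with one substitution
    rec  = encode(read, ref, 4)
    print(rec)                     # (4, 10, [(7, 'C')])
    assert decode(rec, ref) == read
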
    Last edited by Fabien Campagne; 02-16-2011, 07:57 PM.



  • fpepin
    replied
    Originally posted by Michael.James.Clark View Post
    Storing only variants is downstream of alignment and variant calling. It is inherently not raw data. The service SRA provided was basically storing raw data.
    I'm talking about using it as a compression feature. Imagine a toy example where 90% of the reads map exactly to the reference (no SNPs/indels) and the rest doesn't map at all. A position and a length are much easier to store than a 100bp read. Then keep the other 10% raw and you've just gotten roughly 10-fold compression. Since many projects share the reference genome, that only has to be stored once. Of course, the 90% compression figure is wildly optimistic once you consider SNPs, quality scores, etc. Still, you should be able to get some pretty good rates.
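
    A rough back-of-the-envelope version of that toy example in Python (the per-record byte counts are assumptions, not measurements):

    Code:
    # 90% of reads stored as (position, length), 10% kept as raw 100 bp sequence;
    # quality scores and any further per-record overhead are ignored here.
    n_reads    = 1_000_000
    read_len   = 100        # bytes per raw read at 1 byte per base
    ref_record = 5 + 1      # ~5 bytes for a genome position plus 1 byte for length

    raw_size   = n_reads * read_len
    mixed_size = int(0.9 * n_reads) * ref_record + int(0.1 * n_reads) * read_len
    print(raw_size / mixed_size)   # ~6.5x here; in the ballpark of the ~10x above
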
    Last edited by fpepin; 02-16-2011, 07:51 PM. Reason: typo



  • Michael.James.Clark
    replied
    EMBL-EBI will continue to support SRA for raw data storage: http://www.ebi.ac.uk/ena/SRA_announcement_Feb_2011.pdf

    Originally posted by fpepin View Post
    But you can do that losslessly as well if you have the same reference genome: store start position, end position, differences with reference.

    That still leaves the non-aligned reads and the quality scores, so it's not a magical solution, but it's still a big step forward.

    There have got to be some decent efforts going in that direction, or is there something trivial that I'm missing?
    Storing only variants is downstream of alignment and variant calling. It is inherently not raw data. The service SRA provided was basically storing raw data.

    Said processed data is not comparable with new analyses that use different alignment algorithms and variant callers.

    That right there is why the set of variants is not adequate. It cannot be adequately re-analyzed or re-assessed.

    Once we're at a point where the community is satisfied that the set of detected variants is never going to improve in sensitivity and specificity, then we can store only variants. Until then, however, we ought to be storing raw data. Once we are at that point, older data should be brought up to those standards as well, really.



  • fpepin
    replied
    Originally posted by NGSfan View Post
    Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.
    But you can do that losslessly as well if you have the same reference genome: store start position, end position, differences with reference.

    That still leaves the non-aligned reads and the quality scores, so it's not a magical solution, but it's still a big step forward.

    There have got to be some decent efforts going in that direction, or is there something trivial that I'm missing?



  • Michael.James.Clark
    replied
    Well, at least it's not a rumor anymore. Something will have to come along to fill that void, though.

    Doesn't SRF predate BAM? At least the final version of the SAM standard. I seem to recall discussions on the BAM format still going on well after the SRA's establishment.

    BAM is certainly a fairly obvious option. I'm not sure how much space it saves compared to compressed FASTQs (if any?).

    Originally posted by mwatson View Post
    Isn't the proposal to store variants to store them in such a way that the original read can be reconstructed?
    How do we do that without storing the original read off the machine?



  • Joann
    replied
    plan B?

    According to the official announcement:

    Over the next several months, NCBI will be working with staff from NIH Institutes that fund large-scale sequencing efforts to develop an approach for future access to and storage of the existing data.

    So if the sequencing was performed for an NIH-funded project at a large-scale facility, open access to and storage of data (including existing data) is going to be discussed as described above.

    As such, the most pressing need for a plan B at this point appears to be for non-NIH-funded sequencing, where journal publication of summary results would still involve a public data repository obligation.



  • Richard Finney
    replied
    Plan B ?

    What's Plan B?

