  • #31
    Originally posted by Fabien Campagne:
    fpepin is perfectly correct about the compression approach. We have been using this approach in the Goby alignment format for about two years (see http://campagnelab.org/software/goby...ing-with-goby/). We store only how the reads differ from the reference, and the read sequence can be reconstructed quickly when the reference sequence is given. This is how we reconstruct read sequences to display in IGV (you need the development version of IGV to view Goby alignments at this stage).

    We do not store unmapped reads in default configuration, but the format makes adding new fields very easy, so we could easily add optional fields for unmapped reads and corresponding quality scores. Without the unmapped reads, we typically obtain files 80-90% smaller than BAM files.
    I think there is a disconnect here where some of us are talking about post-aligned data and some of us are talking about raw data off the machine (which is what SRA stored--basically compressed FASTQ files). SRA's purpose was to hold all the raw data--the sequence off the machine plus strand and base qualities (and some optional meta information).

    We want to keep that information for posterity for many reasons, not the least of which includes the ability to bring old data "up to date" and make it comparable to new data.

    Now, can you save space by not storing the bases that do not differ from the reference? Absolutely. I'd wager one could probably reduce the size of the data by about 50%, which is fantastic.

    But simply saving start and end positions plus variations from the reference genome as fpepin suggested is not adequate to completely reconstruct the raw reads, I'm afraid.
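    The diff-against-reference scheme being debated can be sketched in a few lines. This is a toy illustration of the idea (not the Goby or SRA format), and it shows both halves of the argument: a mapped read reconstructs perfectly given the reference, while unmapped reads and quality scores would still need to be stored separately.

```python
def compress_read(read, ref, pos):
    """Store a read as (position, length, list of diffs from the reference)."""
    diffs = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    return (pos, len(read), diffs)

def reconstruct_read(record, ref):
    """Rebuild the original read sequence from the compact record."""
    pos, length, diffs = record
    bases = list(ref[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)

ref = "ACGTACGTACGT"
read = "CGTTCG"                     # aligns at pos 1 with one mismatch
rec = compress_read(read, ref, 1)
print(rec)                          # -> (1, 6, [(3, 'T')])
assert reconstruct_read(rec, ref) == read
```

    A read matching the reference stores only its position and length; the storage cost grows with the number of mismatches, which is why well-mapped data compresses so well under this scheme.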
    Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
    Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
    Projects: U87MG whole genome sequence [Website] [Paper]



    • #32
      Originally posted by Michael.James.Clark:
      SRA's purpose was to hold all the raw data--the sequence off the machine plus strand and base qualities (and some optional meta information).

      [...]

      But simply saving start and end positions plus variations from the reference genome as fpepin suggested is not adequate to completely reconstruct the raw reads, I'm afraid.
      I think we are talking about the same thing. My point is that you could get significant savings in storage requirements by basing it on a reference when one is available.

      The quality scores are indeed a part that would not be affected by such a scheme. I mentioned the issues above, but I should have made that clearer. There are probably some clever ways to get part of the way there, but I doubt we'd reach anywhere near the same level of compression as with the reads.

      It could very well be that such a scheme is not worth the effort to design at this point, as the added computational and administrative costs might not justify the saved space.



      • #33
        With respect to efficient storage of raw data, people may be interested in [linked paper in a peer-reviewed genome sciences journal]



        • #34
          Thanks laura for a reference that should help tie this discussion together. We have used the reference in Goby only to compress alignments (Michael.James.Clark's points are well taken; this is not what SRA was storing), while this paper illustrates how the reference information can also be used to compress the reads themselves (what fpepin was referring to, I think).



          • #35
            A few comments.

            1) SRF predates SAM/BAM

            2) The SRF compression algorithm is definitely more advanced and specialized than the BAM compression. SRF is usually larger simply because it stores far more information.

            3) BAM is almost certainly larger than compressed fastq in size.

            4) In Ewan Birney's paper, they also put a lot of effort into compressing base qualities, which are the hardest part. Their algorithm is lossy for qualities; with a lossless algorithm we can hardly compress the data down to half its size. I guess being "80-90% smaller" than BAM refers to compression without full base qualities.
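            To see why lossless halving is hard: qualities spread over the ~40-value Phred range carry several bits of entropy each, so even an ideal codec needs well over half a byte per value. A back-of-envelope sketch (the distribution below is invented for illustration, not measured from any run):

```python
import math

# Invented quality-value distribution: 4 common values plus 30 rarer ones,
# spread over the ~40-value Phred range.
probs = [0.10] * 4 + [0.02] * 30
assert abs(sum(probs) - 1.0) < 1e-9

# Shannon entropy in bits per quality value: the lossless lower bound.
entropy = -sum(p * math.log2(p) for p in probs)
print(round(entropy, 2))  # -> 4.72
```

            At roughly 4.7 bits per value, an ideal lossless coder still uses about 59% of a byte per quality, which is broadly why lossy quality schemes enter the picture.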
            Last edited by lh3; 02-17-2011, 04:07 PM.



            • #36
              Regarding Heng's point 4): yes, 80-90% smaller than BAM is when quality scores are stored only for those bases that differ between the read and the reference, not for any flanking read sequence. This is quite useful for counting applications (e.g., RNA-seq and ChIP-seq). However, for variant discovery there are open questions. I am not aware of papers that have compared performance when considering all base qualities vs. only the quality scores for bases that differ from the reference.

              Since base quality usually correlates strongly with the position of the base in the read, it is not clear how much signal is left in the quality score once you account for the positional effect. It would be interesting to evaluate this carefully, since so much storage can be saved by leaving out quality scores for bases that align.
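              One quick way to probe that positional effect: compute the mean quality at each cycle across reads and look at the residuals once the positional trend is removed (toy numbers below, not real run data):

```python
# Toy example: per-position mean Phred quality across three 5-cycle reads,
# and the residual signal left after subtracting that positional trend.
quals = [
    [38, 37, 35, 30, 22],
    [39, 36, 34, 28, 20],
    [37, 38, 33, 29, 21],
]
n = len(quals)
pos_mean = [sum(q[i] for q in quals) / n for i in range(len(quals[0]))]
residuals = [[q[i] - pos_mean[i] for i in range(len(q))] for q in quals]
print(pos_mean)  # -> [38.0, 37.0, 34.0, 29.0, 21.0]
```

              Here the residuals all sit in [-1, +1] while the raw values span 20-39; the narrower the residual range after detrending, the less per-base signal (and the less entropy) the quality string really carries.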



              • #37
                There is still room for shrinking BAM size too, even with the existing compression used. It's pretty trivial to reduce files by 30% or more (sometimes up to 50%, depending on the data) without any complex tricks or slow algorithms, even without using references. However, it's not the huge step change we need for the future, just a technique to make use of whenever BAM 2.0 comes along.

                The quality budget concept listed in the referenced paper makes a lot of sense. We know quality values are large and difficult to compress as they contain a lot of variability. However downgrading their precision (do we need 40 discrete values? how about 20, or 5?) and also restricting qualities to only interesting areas (bases that differ and known SNP locations) seems like a logical step towards losing the data we least care about.
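                Downgrading quality precision is simple to prototype: map each Phred value into a coarse bin and keep one representative value per bin. The bin edges below are made up for illustration (not any vendor's or paper's actual scheme):

```python
def bin_quality(q, bins=((0, 2, 1), (3, 14, 8), (15, 30, 22), (31, 41, 37))):
    """Map a Phred score into a coarse bin, returning the bin's
    representative value. Bin edges here are purely illustrative."""
    for lo, hi, rep in bins:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"quality {q} outside binning range")

quals = [2, 11, 25, 38, 40]
print([bin_quality(q) for q in quals])  # -> [1, 8, 22, 37, 37]
```

                Collapsing ~40 distinct values to 4 means long runs of identical bytes, which generic compressors like gzip then squeeze far more effectively, at the cost of per-base precision in exactly the way the quality-budget idea proposes.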
                Last edited by jkbonfield; 02-18-2011, 01:31 AM.



                • #38
                  Does anyone here have a good estimate on the storage footprint and bandwidth numbers for the SRA?



                  • #39
                    DAS/2 to the rescue?!

                    Why aren't people hosting and publishing their own data? There's no need to centralize this activity.

                    You are responsible for providing the plasmids, cell lines, etc. used in your papers; why not the genomic data too, in both its raw and processed forms?

                    It's quite easy and useful to do this, provided everyone uses the same communication protocol to enable programmatic access, so one doesn't have to manually download and reprocess the data before using it.

                    DAS (Distributed Annotation System) is one such protocol, designed to do exactly this; it's been in use for more than 10 years, with hundreds of servers worldwide. DAS/2 is a modification of the original DAS/1 protocol, optimized for large-scale genomic data distribution using any file format (bam, bar, gff, bed, etc.).
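                    For the curious, a DAS features request is just an HTTP GET along the lines of `http://server/das/source/features?segment=chr1:1000,2000` (server and source names hypothetical), and the XML reply can be handled with nothing beyond a standard library. A minimal sketch parsing an abbreviated DASGFF-style response (the feature itself is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Abbreviated DASGFF-style response; element names follow the DAS
# convention, but the feature content is made up for this example.
xml_doc = """<DASGFF><GFF version="1.0"><SEGMENT id="chr1" start="1000" stop="2000">
  <FEATURE id="peak1"><START>1200</START><END>1450</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

root = ET.fromstring(xml_doc)
features = [
    (f.get("id"), int(f.find("START").text), int(f.find("END").text))
    for f in root.iter("FEATURE")
]
print(features)  # -> [('peak1', 1200, 1450)]
```

                    Because the query and response are plain HTTP and XML, any analysis tool can slice features by coordinate from a remote server without downloading the whole dataset first, which is the programmatic access argued for above.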

                    Check out http://www.biodas.org and http://www.biodas.org/wiki/DAS/2 and feel free to play around with our DAS/2 server http://bioserver.hci.utah.edu/BioInf.../Software:DAS2 or install your own http://bioserver.hci.utah.edu/BioInf...GenoPubInstall .

                    We've written up some of these tools in a recent paper if folks want to take a look: http://www.biomedcentral.com/1471-2105/11/455

                    Why wait for the government to fix our problems when we can do it ourselves?



                    • #40
                      We could definitely DIY the process, the limitation being the bandwidth demands on any lab that does so. Again, this could be solved by FedEx-ing hard disks, but who wants to take the time in a lab to do that? And who has the expertise to set such a thing up? If not set up properly, backup would be an issue, as would the general performance of the server, since it would be used solely for fetching data all the time.



                      • #41
                        Originally posted by Nix:
                        Why wait for the government to fix our problems when we can do it ourselves?
                        Not everyone has the resources or expertise to set up (and maintain!) a DAS server or another type of repository. This is going to be exacerbated as costs go down and every lab is going to be doing NGS.

                        Having a central repository makes it a lot easier to make sure that the data is consistent and has enough details to be useful.



                        • #42
                          If your group can run a web site it can run a DAS/2 server. It really is rather easy to set up, just mysql and tomcat. Then have the biologists use the point and click web interface to load the data.

                          It is probably overkill to have every lab run a DAS/2 server, although they can. It is best if your department/ institute/ organization maintains one alongside their web server.

                          Forcing folks to properly annotate their data is another issue. Best to have journals hold the stick and require that datasets for publication be MINSEQE compliant.



                          • #43
                            Originally posted by Nix:
                            If your group can run a web site it can run a DAS/2 server. It really is rather easy to set up, just mysql and tomcat. Then have the biologists use the point and click web interface to load the data.

                            It is probably overkill to have every lab run a DAS/2 server, although they can. It is best if your department/ institute/ organization maintains one alongside their web server.
                            I've seen my share of wet labs that have students bring in their own personal laptops and not even a backup system, and the department wasn't much savvier either.

                              It's doable, but I know I'd be happier having a centrally managed repository than depending on every group/department to keep a server running properly that holds the data in a reasonable format.



                            • #44
                              In general, hosting data with HTTP/FTP is much more convenient to most researchers. If we want to look at data in small regions, we can use IGV/UCSC to view remote bam/bigbed/bigwig files. IGV also supports tabix indexing and thus VCF files.



                              • #45
                                Originally posted by lh3:
                                In general, hosting data with HTTP/FTP is much more convenient to most researchers. If we want to look at data in small regions, we can use IGV/UCSC to view remote bam/bigbed/bigwig files. IGV also supports tabix indexing and thus VCF files.
                                The problem with both of these approaches is the lack of a formal language for associating annotation with the data. Species, genome version, and file format are often anyone's guess; every group encodes this information differently. Thus, making use of such data requires manual download and interpretation.

                                Don't believe me? Try intersecting ChIP-seq peak calls from 20 different datasets hosted by UCSC, Ensembl, and GMOD. That's at minimum a day of work for a skilled bioinformatician. Now do it again and again with each new release. Analysis applications that speak DAS can do this in minutes with no human error.

                                DAS provides the "glue language" (in XML) for brokering queries and returning slices of data based on coordinates or feature searches. It is defined by a community vetted specification with several server and client implementations. Among other things these define how you represent each species, their genome, and their coordinate system.

                                Another key advantage of DAS is that it allows one to separate data distribution from data use. I love the UCSC data tracks but can't always use their browser to analyze them.

