  • NGS data growth information

I'm preparing a talk on bioinformatics for a group of physicists. I want to show them that the main driver of bioinformatics development has been the enormous growth of data over the last decade or so. NGS is a very good example of this. I managed to find a few plots on the web, but most of them are Excel graphs on a linear scale, for example this:



    I think Sanger's cumulative yield would be a very good example. Can anyone provide me with (up-to-date) data for such a graph: number of bases sequenced as a function of time? It would be great to have both traditional sequencing and NGS, to show a quantitative change in data growth. I would make a nice logarithmic graph myself.
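The graph itself is the easy part; for reference, a minimal matplotlib sketch on a log axis (the values below are invented placeholders just to show the shape, not real archive statistics):

Code:
# Illustrative only: cumulative bases vs. time on a log axis.
# The numbers are invented placeholders, not real data.
import matplotlib.pyplot as plt

years = [2000, 2002, 2004, 2006, 2008, 2010, 2012]
bases = [1e9, 5e9, 2e10, 6e10, 1e12, 2e13, 4e14]  # hypothetical

plt.semilogy(years, bases, marker="o")
plt.xlabel("year")
plt.ylabel("cumulative bases sequenced")
plt.title("Illustrative sequence data growth (log scale)")
plt.show()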

  • #2
I suspect that most NGS data is no longer deposited directly in centralized databases, so it will be hard to get a worldwide count of bases sequenced. It is probably more feasible to get a cost-per-Gbase graph, although that has problems of its own: "cost" is highly variable depending on the sequencing center and on the type of machine. For example, our Illumina HiSeq is cheaper to run than our newer MiSeq, but I doubt anyone would say the graph should go up and up thanks to the HiSeq and then back down with the introduction of the MiSeq (or other "benchtop" sequencers).


Whatever you do, please do not fall into the awful fallacy of comparing Moore's law (the number of transistors on a chip) to a cost-per-base curve. I know people do this all the time, but it really is akin to comparing apples to oranges.

    Comment


    • #3
      There is an effort to graph the size of the NCBI capillary archive at http://www1.cse.wustl.edu/~jarpith/blogx/index.php/?e=3. You could try the same technique to get at the raw data. Unfortunately the data set runs out in 2009, but things were getting much quieter by then anyway and it's dwarfed by the amount sent to the SRA.

      A log scale graph for the SRA is available at http://www.ncbi.nlm.nih.gov/Traces/s...w=announcement. The data doesn't appear to be available in table form, but you could always try contacting them. The growth rate actually looks pretty constant since about 2009.

      Information on sequencing costs is available at http://www.genome.gov/sequencingcosts/. It looks like things have calmed down a lot here since the very rapid drop around 2007-2008.

You might like to compare the cost of storing the sequence data with that of generating it. This ratio has been getting worse for quite a while, although it still appears to be favourable to store rather than resequence (for now...). Another good comparison is hard drive capacity: given that sequence volumes double faster than hard drive capacity does, we need an exponentially growing number of hard drives to store all of this data.
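To make that last point concrete, here is a back-of-envelope sketch; both doubling times are assumptions for illustration, not measured values:

Code:
# If cumulative sequence doubles every T_SEQ months while drive capacity
# doubles every T_DISK months, the number of drives needed grows as
# 2**(t * (1/T_SEQ - 1/T_DISK)). Both doubling times are assumed.
T_SEQ = 6.0    # months per doubling of sequence volume (assumption)
T_DISK = 14.0  # months per doubling of drive capacity (assumption)

for t in range(0, 61, 12):  # five years, in months
    drives = 2 ** (t * (1.0 / T_SEQ - 1.0 / T_DISK))
    print("after %2d months: %5.1fx as many drives needed" % (t, drives))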

      Comment


      • #4
Speaking of data storage costs, there are some graphs from Lincoln Stein floating around showing the inflection point (comparing sequencing costs versus storage costs). A simplified version is at:



        And a more complex example at:



The 14-month halving time for storage costs is known as "Kryder's law". Unfortunately the charts, like so many, end with data from 2009-2010.

Of course the chart really does not mean much without any context. It obviously shows that sequencing costs are declining faster than storage costs, but if storage costs are cheap to begin with, does it matter? For example, at Purdue I can get high-capacity (i.e., fast and well-supported) storage for $0.15/GB/year ($150/TB/year). The unaligned reads -- all that is needed for further downstream analysis -- from our most recent MiSeq run total 915 MB, while those from our most recent HiSeq run total 267 GB. Given that the costs of sequencing are at least $1,000 and $10,000 respectively, I don't think the $0.15/year or $40/year storage bills are a big factor.
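Spelling out that arithmetic:

Code:
# Yearly storage cost vs. (lower-bound) sequencing cost, figures as above.
storage_rate = 0.15  # $/GB/year

for name, gb, run_cost in (("MiSeq", 0.915, 1000), ("HiSeq", 267.0, 10000)):
    per_year = gb * storage_rate
    print("%s: $%.2f/year to store vs >= $%d to generate (%.3f%%)"
          % (name, per_year, run_cost, 100.0 * per_year / run_cost))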

Of course holding on to many sequencing runs for many years, as we do as a sequencing center, can add up, but overall it is still chump change.

        Comment


        • #5
          This would solve a lot of problems ...

          Last edited by Richard Finney; 07-17-2012, 10:55 AM. Reason: darn alpha channel

          Comment


          • #6
            Thanks everybody.

I don't want to talk about sequencing costs at this point, as I come back to them at the end of the talk and already have a graph for that. I want to show the exponential increase in data volume, and the sudden jump in that increase due to the NGS revolution.

The SRA graph is OK, but it doesn't go back far enough in time and doesn't compare pre- and post-NGS data. The only graph spanning many years is the one on Wikipedia. However, that one doesn't show any sudden NGS increase, even after incorporating a 2012 data point (see the note under the graph).

            Comment


            • #7
              Rick is right; most of the data are not deposited in a central repository, so the information is not available. Also, you may not see a dramatic inflection. Total output will be the number of bases per instrument times the number of instruments. The transition from Sanger to NGS machines was not instantaneous, and the genome centers (which produced the bulk of the data) had a large investment in the older instrumentation.

              As a proxy, I would suggest plotting the increasing output of the Sanger vs. NGS instruments over time; that will illustrate the same point.

              Comment


              • #8
                Well, I decided to use the SRA growth plot from EBI:

                Comment


                • #9
                  Originally posted by rmdavies View Post

                  Information on sequencing costs is available at http://www.genome.gov/sequencingcosts/. It looks like things have calmed down a lot here since the very rapid drop around 2007-2008.
Yeah. I think that particular chart is misleading. If you look at the equivalent chart from the Broad, you don't see the discontinuity in 2007-2008. Instead, the slope you see in the 2008-2012 segment of the "cost per megabase" plot begins in 2005.

Why? I guess the source of the data was the NHGRI core facility, and they may never have had a big rollout of 454 sequencers. Hence their "next-gen" experience jumps from Sanger sequencing straight to fairly modern Illumina sequencers, and so misses the intermediately priced data from the 454s.

                  So, to a first approximation you have:

Generation   bases/$ doubling time
1            1.5 years
2            0.5 years
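Taken at face value, those doubling times work out to roughly:

Code:
# Fold-improvement in bases/$ per year implied by each doubling time.
for gen, t_years in ((1, 1.5), (2, 0.5)):
    print("generation %d: %.1fx more bases/$ per year"
          % (gen, 2 ** (1.0 / t_years)))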


Possibly we are in for a plateau in the price-per-base decrease now that Illumina has largely crushed all competitors. Or possibly not, if the Ion Proton becomes a serious competitor.

                  --
                  Phillip

                  Comment


                  • #10
                    Originally posted by westerman View Post
Speaking of data storage costs, there are some graphs from Lincoln Stein floating around showing the inflection point (comparing sequencing costs versus storage costs). A simplified version is at:



And a more complex example at:



The 14-month halving time for storage costs is known as "Kryder's law". Unfortunately the charts, like so many, end with data from 2009-2010.

Of course the chart really does not mean much without any context. It obviously shows that sequencing costs are declining faster than storage costs, but if storage costs are cheap to begin with, does it matter? For example, at Purdue I can get high-capacity (i.e., fast and well-supported) storage for $0.15/GB/year ($150/TB/year). The unaligned reads -- all that is needed for further downstream analysis -- from our most recent MiSeq run total 915 MB, while those from our most recent HiSeq run total 267 GB. Given that the costs of sequencing are at least $1,000 and $10,000 respectively, I don't think the $0.15/year or $40/year storage bills are a big factor.

Of course holding on to many sequencing runs for many years, as we do as a sequencing center, can add up, but overall it is still chump change.
HDD storage cost might be dropping, but its read/write speed has stagnated at about 150 MB/s.

SSD cost is dropping dramatically, but it still has a long way to go.

Therefore, I don't think storage technology is up to the task of the $1,000 genome yet. I suppose that's why 23andme chose to develop an exome sequencing service first.
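For a feel for what that throughput ceiling means in practice, just streaming whole-genome files on or off a single disk takes a while (file sizes here are rough assumptions for a 50x human genome):

Code:
# Sequential transfer time for typical whole-genome files at ~150 MB/s.
hdd_mb_per_s = 150.0
for label, gb in (("gzipped fastq (~75 GB)", 75), ("BAM (~400 GB)", 400)):
    minutes = gb * 1024 / hdd_mb_per_s / 60
    print("%s: ~%.0f minutes to read or write" % (label, minutes))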

                    Comment


                    • #11
                      Originally posted by ymc View Post
HDD storage cost might be dropping, but its read/write speed has stagnated at about 150 MB/s.

SSD cost is dropping dramatically, but it still has a long way to go.

Therefore, I don't think storage technology is up to the task of the $1,000 genome yet. I suppose that's why 23andme chose to develop an exome sequencing service first.
What?! A full, uncompressed genome would be 6 GB. You can source a 1000 GB drive for less than $100, and 500 GB SSDs are now available for less than $400.

                      I don't think storage costs are an impediment to a $1000 genome.

                      --
                      Phillip

                      Comment


                      • #12
                        Originally posted by pmiguel View Post
What?! A full, uncompressed genome would be 6 GB. You can source a 1000 GB drive for less than $100, and 500 GB SSDs are now available for less than $400.

                        I don't think storage costs are an impediment to a $1000 genome.

                        --
                        Phillip
Hmm... A 50x human genome fastq is about 75 GB in gzipped form, and the BAM it generates is about 400 GB. You might also want to back up your original fastq, so you need at least 550 GB per sample.

Since there is an advantage to calling genotypes jointly with other samples, I don't think people will store just a single de novo assembled genome with current sequencing technology.
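Putting numbers on it, with the $0.15/GB/year figure quoted earlier in the thread:

Code:
# Per-sample footprint for a 50x human genome, plus yearly storage cost.
fastq_gz_gb = 75     # gzipped fastq
bam_gb = 400         # aligned BAM
backup_gb = 75       # second copy of the original fastq
storage_rate = 0.15  # $/GB/year, Purdue figure quoted above

total_gb = fastq_gz_gb + bam_gb + backup_gb
print("~%d GB per sample, ~$%.2f/year to keep"
      % (total_gb, total_gb * storage_rate))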

                        Comment


                        • #13
                          Originally posted by ymc View Post
Hmm... A 50x human genome fastq is about 75 GB in gzipped form, and the BAM it generates is about 400 GB. You might also want to back up your original fastq, so you need at least 550 GB per sample.

Since there is an advantage to calling genotypes jointly with other samples, I don't think people will store just a single de novo assembled genome with current sequencing technology.
I am sure more efficient ways to store data of this sort could easily be found. But even if you don't want to bother, you are looking at $50 worth of storage. How does that serve as a stumbling block to the $1000 genome?

                          --
                          Phillip

                          Comment


                          • #14
                            Originally posted by pmiguel View Post
I am sure more efficient ways to store data of this sort could easily be found. But even if you don't want to bother, you are looking at $50 worth of storage. How does that serve as a stumbling block to the $1000 genome?

                            --
                            Phillip
Well, if you are only dealing with a few genomes, you should be fine. But if you are 23andme or a big genome center, it can be a problem.

Currently, you can install four HDDs in a typical PC. Suppose each of them is 3 TB; at 550 GB per sample, that is at most about 21 genomes per PC. But when you have thousands of genomes, things get complicated, as the sketch below illustrates.

Obviously, this problem will get better as storage costs drop. I was just saying that if the $1,000 genome existed today, storage could be a problem. The genome is not $1,000 yet, and by the time it is, storage costs should be less of an issue.
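As a rough capacity-planning sketch (sizes as above; the collection size is just an example):

Code:
# How far a 4-drive PC goes at 550 GB per sample, and what a large
# collection would take. Illustrative arithmetic only.
drives, tb_each = 4, 3
gb_per_sample = 550
capacity_gb = drives * tb_each * 1000

print("%d genomes per PC" % (capacity_gb // gb_per_sample))
n = 10000  # e.g., a genome-center-sized collection (assumption)
pcs = -(-n * gb_per_sample // capacity_gb)  # ceiling division
print("%d such PCs for %d genomes" % (pcs, n))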

                            Comment


                            • #15
                              Originally posted by ymc View Post
Well, if you are only dealing with a few genomes, you should be fine. But if you are 23andme or a big genome center, it can be a problem.

Currently, you can install four HDDs in a typical PC. Suppose each of them is 3 TB; at 550 GB per sample, that is at most about 21 genomes per PC. But when you have thousands of genomes, things get complicated.

Obviously, this problem will get better as storage costs drop. I was just saying that if the $1,000 genome existed today, storage could be a problem. The genome is not $1,000 yet, and by the time it is, storage costs should be less of an issue.
                              I don't see a reason for 23andme to store all the read data for a genome sequence. It isn't like their typical user will fire up IGV and peruse the BAM file. [Note: I just noticed that their new "Golden Helix" browser is for that purpose. Pretty crazy...] Storing the data as a diff against some reference sequence would be plenty.
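A toy sketch of that diff idea (hypothetical data structures; nothing to do with 23andme's actual formats):

Code:
# Store only where a sample differs from the reference, and rebuild
# the full sequence on demand. Toy example, not a real variant format.
reference = "ACGTACGTACGT"
variants = {3: "A", 7: "C"}  # position -> alternate base (hypothetical)

def reconstruct(ref, diffs):
    """Rebuild a sample's sequence from the reference plus its diff."""
    return "".join(diffs.get(i, base) for i, base in enumerate(ref))

print(reconstruct(reference, variants))  # ACGAACGCACGT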

Anyway, the core of my argument is that storage costs are not likely to be what drove 23andme to offer exome sequencing instead of full genomes. Sequence still costs orders of magnitude more to obtain than to store. Even when the total cost of obtaining a human genome sequence drops to $1000, storage will still be a minor component of that total.

                              However, rather than churn through a few more cycles of this dispute, maybe we could look for the price at which we would both agree that storage becomes a substantial part of the cost of obtaining a genome sequence.

Around $100/genome, the cost of transiently storing the read data will begin to consume enough of the budget that it will be considered a major factor, I think. Of course other computational costs will also be felt -- CPU time, bandwidth, etc.
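Running the per-sample numbers from earlier in the thread against a falling genome price shows why:

Code:
# Share of the per-genome budget eaten by one year of raw-read storage
# (550 GB/sample at $0.15/GB/year, figures quoted earlier in the thread).
gb, rate = 550, 0.15
for genome_cost in (10000, 1000, 100):
    print("$%5d/genome: storage is %.1f%% of the cost per year"
          % (genome_cost, 100.0 * gb * rate / genome_cost))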

                              The thing is, below a certain cost per genome, it may no longer make sense to store anything other than the processed data. That is, DNA itself serves as the storage medium if additional information needs to be gleaned. Or, in other words, the information is already stored in the primary structure of the DNA. We are really just paying to "down convert" it from molecular to digital storage.

Note that as long as the price of obtaining sequence keeps falling faster than the cost of storing it, the problem of storing the digital form only gets harder. That is why, even though I disagree about storage being a major factor in the cost of a $1000 genome, I agree that it becomes a major factor at $100/genome.

Of course that leaves us 10-fold apart. But that is, nevertheless, a quantitative disagreement, not a qualitative one.

                              --
                              Phillip
                              Last edited by pmiguel; 07-23-2012, 04:52 AM. Reason: Added note.

                              Comment
