  • NGS data growth information

I'm preparing a talk on bioinformatics for a group of physicists. I want to show them that the main driver of bioinformatics development has been the enormous growth of data over the last decade or so. NGS is a very good example of this. I managed to find a few plots on the web, but most of them are Excel graphs on a linear scale, for example this:



    I think Sanger's cumulative yield would be a very good example. Can anyone provide me with (up-to-date) data for such a graph: number of bases sequenced as a function of time? It would be great to have both traditional sequencing and NGS, to show a quantitative change in data growth. I would make a nice logarithmic graph myself.
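The graph itself is the easy part; for reference, a minimal matplotlib sketch on a log axis (the values below are invented placeholders just to show the shape, not real archive statistics):

Code:
# Illustrative only: cumulative bases vs. time on a log axis.
# The numbers are invented placeholders, not real data.
import matplotlib.pyplot as plt

years = [2000, 2002, 2004, 2006, 2008, 2010, 2012]
bases = [1e9, 5e9, 2e10, 6e10, 1e12, 2e13, 4e14]  # hypothetical

plt.semilogy(years, bases, marker="o")
plt.xlabel("year")
plt.ylabel("cumulative bases sequenced")
plt.title("Illustrative sequence data growth (log scale)")
plt.show()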

  • #2
I suspect that most NGS data is no longer deposited directly in centralized databases, so it will be hard to get a worldwide count of bases sequenced. It is probably more feasible to get a cost-per-Gbase graph, although that has problems of its own: "cost" is highly variable depending on the sequencing center and on the type of machine. For example, our Illumina HiSeq is cheaper to run than our newer MiSeq, but I doubt anyone would say the graph should go up and up thanks to the HiSeq and then back down with the introduction of the MiSeq (or other "benchtop" sequencers).


Whatever you do, please do not fall into the awful fallacy of comparing Moore's law (the number of transistors on a chip) to a cost-per-base curve. I know people do this all the time, but it really is akin to comparing apples to oranges.

    Comment


    • #3
      There is an effort to graph the size of the NCBI capillary archive at http://www1.cse.wustl.edu/~jarpith/blogx/index.php/?e=3. You could try the same technique to get at the raw data. Unfortunately the data set runs out in 2009, but things were getting much quieter by then anyway and it's dwarfed by the amount sent to the SRA.

      A log scale graph for the SRA is available at http://www.ncbi.nlm.nih.gov/Traces/s...w=announcement. The data doesn't appear to be available in table form, but you could always try contacting them. The growth rate actually looks pretty constant since about 2009.

      Information on sequencing costs is available at http://www.genome.gov/sequencingcosts/. It looks like things have calmed down a lot here since the very rapid drop around 2007-2008.

You might like to compare the cost of storing the sequence data with that of generating it. This ratio has been getting worse for quite a while, although it still appears to be favourable to store rather than resequence (for now...). Another good comparison is hard drive capacity: given that sequence volumes double faster than hard drive capacity does, we need an exponentially growing number of hard drives to store all of this data.
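To make that last point concrete, here is a back-of-envelope sketch; both doubling times are assumptions for illustration, not measured values:

Code:
# If cumulative sequence doubles every T_SEQ months while drive capacity
# doubles every T_DISK months, the number of drives needed grows as
# 2**(t * (1/T_SEQ - 1/T_DISK)). Both doubling times are assumed.
T_SEQ = 6.0    # months per doubling of sequence volume (assumption)
T_DISK = 14.0  # months per doubling of drive capacity (assumption)

for t in range(0, 61, 12):  # five years, in months
    drives = 2 ** (t * (1.0 / T_SEQ - 1.0 / T_DISK))
    print("after %2d months: %5.1fx as many drives needed" % (t, drives))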

      Comment


      • #4
Speaking of data storage costs, there are some graphs from Lincoln Stein floating around showing the inflection point (comparing sequencing costs versus storage costs). A simplified version is at:



        And a more complex example at:



The 14-month halving time for storage costs is known as "Kryder's law". Unfortunately the charts, like so many, end with data from 2009-2010.

Of course the chart really does not mean much without any context. It obviously shows that sequencing costs are declining faster than storage costs, but if storage costs are cheap to begin with, does it matter? For example, at Purdue I can get high-capacity (i.e., fast and well-supported) storage for $0.15/GB/year ($150/TB/year). The unaligned reads -- all that is needed for further downstream analysis -- from our most recent MiSeq run total 915 MB, while those from our most recent HiSeq run total 267 GB. Given that the costs of sequencing are at least $1,000 and $10,000 respectively, I don't think the $0.15/year or $40/year storage bills are a big factor.
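Spelling out that arithmetic:

Code:
# Yearly storage cost vs. (lower-bound) sequencing cost, figures as above.
storage_rate = 0.15  # $/GB/year

for name, gb, run_cost in (("MiSeq", 0.915, 1000), ("HiSeq", 267.0, 10000)):
    per_year = gb * storage_rate
    print("%s: $%.2f/year to store vs >= $%d to generate (%.3f%%)"
          % (name, per_year, run_cost, 100.0 * per_year / run_cost))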

Of course holding on to many sequencing runs for many years, as we do as a sequencing center, can add up, but overall it is still chump change.

        Comment


        • #5
          This would solve a lot of problems ...

          Last edited by Richard Finney; 07-17-2012, 10:55 AM. Reason: darn alpha channel

          Comment


          • #6
            Thanks everybody.

I don't want to talk about sequencing costs at this point, as I come back to them at the end of the talk and already have a graph for that. I want to show the exponential increase in data volume, and the sudden jump in that increase due to the NGS revolution.

The SRA graph is OK, but it doesn't go back far enough in time and doesn't compare pre- and post-NGS data. The only graph spanning many years is the one on Wikipedia. However, that one doesn't show any sudden NGS increase, even after incorporating a 2012 data point (see the note under the graph).

            Comment


            • #7
              Rick is right; most of the data are not deposited in a central repository, so the information is not available. Also, you may not see a dramatic inflection. Total output will be the number of bases per instrument times the number of instruments. The transition from Sanger to NGS machines was not instantaneous, and the genome centers (which produced the bulk of the data) had a large investment in the older instrumentation.

              As a proxy, I would suggest plotting the increasing output of the Sanger vs. NGS instruments over time; that will illustrate the same point.

              Comment


              • #8
                Well, I decided to use the SRA growth plot from EBI:

                Comment


                • #9
                  Originally posted by rmdavies View Post

                  Information on sequencing costs is available at http://www.genome.gov/sequencingcosts/. It looks like things have calmed down a lot here since the very rapid drop around 2007-2008.
Yeah. I think that particular chart is misleading. If you look at the equivalent chart from the Broad, you don't see the discontinuity in 2007-2008. Instead, the slope you see in the 2008-2012 segment of the "cost per megabase" plot begins in 2005.

Why? I guess the source of the data was the NHGRI core facility, and they may never have had a big rollout of 454 sequencers. Hence their "next-gen" experience jumps from Sanger sequencing straight to fairly modern Illumina sequencers, and so misses the intermediately priced data from the 454s.

                  So, to a first approximation you have:

Generation   bases/$ doubling time
1            1.5 years
2            0.5 years
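Taken at face value, those doubling times work out to roughly:

Code:
# Fold-improvement in bases/$ per year implied by each doubling time.
for gen, t_years in ((1, 1.5), (2, 0.5)):
    print("generation %d: %.1fx more bases/$ per year"
          % (gen, 2 ** (1.0 / t_years)))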


Possibly we are in for a plateau in the price-per-base decrease now that Illumina has largely crushed all competitors. Or possibly not, if the Ion Proton becomes a serious competitor.

                  --
                  Phillip

                  Comment


                  • #10
                    Originally posted by westerman View Post
Speaking of data storage costs, there are some graphs from Lincoln Stein floating around showing the inflection point (comparing sequencing costs versus storage costs). A simplified version is at:



And a more complex example at:



The 14-month halving time for storage costs is known as "Kryder's law". Unfortunately the charts, like so many, end with data from 2009-2010.

Of course the chart really does not mean much without any context. It obviously shows that sequencing costs are declining faster than storage costs, but if storage costs are cheap to begin with, does it matter? For example, at Purdue I can get high-capacity (i.e., fast and well-supported) storage for $0.15/GB/year ($150/TB/year). The unaligned reads -- all that is needed for further downstream analysis -- from our most recent MiSeq run total 915 MB, while those from our most recent HiSeq run total 267 GB. Given that the costs of sequencing are at least $1,000 and $10,000 respectively, I don't think the $0.15/year or $40/year storage bills are a big factor.

Of course holding on to many sequencing runs for many years, as we do as a sequencing center, can add up, but overall it is still chump change.
HDD storage cost might be dropping, but its read/write speed has stagnated at about 150 MB/s.

SSD cost is dropping dramatically, but it still has a long way to go.

Therefore, I don't think storage technology is up to the task of the $1,000 genome yet. I suppose that's why 23andme chose to develop an exome sequencing service first.
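For a feel for what that throughput ceiling means in practice, just streaming whole-genome files on or off a single disk takes a while (file sizes here are rough assumptions for a 50x human genome):

Code:
# Sequential transfer time for typical whole-genome files at ~150 MB/s.
hdd_mb_per_s = 150.0
for label, gb in (("gzipped fastq (~75 GB)", 75), ("BAM (~400 GB)", 400)):
    minutes = gb * 1024 / hdd_mb_per_s / 60
    print("%s: ~%.0f minutes to read or write" % (label, minutes))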

                    Comment


                    • #11
                      Originally posted by ymc View Post
HDD storage cost might be dropping, but its read/write speed has stagnated at about 150 MB/s.

SSD cost is dropping dramatically, but it still has a long way to go.

Therefore, I don't think storage technology is up to the task of the $1,000 genome yet. I suppose that's why 23andme chose to develop an exome sequencing service first.
What?! A full, uncompressed genome would be 6 GB. You can source a 1000 GB drive for less than $100, and 500 GB SSDs are now available for less than $400.

                      I don't think storage costs are an impediment to a $1000 genome.

                      --
                      Phillip

                      Comment


                      • #12
                        Originally posted by pmiguel View Post
What?! A full, uncompressed genome would be 6 GB. You can source a 1000 GB drive for less than $100, and 500 GB SSDs are now available for less than $400.

                        I don't think storage costs are an impediment to a $1000 genome.

                        --
                        Phillip
Hmm... A 50x human genome fastq is about 75 GB in gzipped form, and the BAM it generates is about 400 GB. You might also want to back up your original fastq, so you need at least 550 GB per sample.

Since there is an advantage to calling genotypes jointly with other samples, I don't think people will store just a single de novo assembled genome with current sequencing technology.
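Putting numbers on it, with the $0.15/GB/year figure quoted earlier in the thread:

Code:
# Per-sample footprint for a 50x human genome, plus yearly storage cost.
fastq_gz_gb = 75     # gzipped fastq
bam_gb = 400         # aligned BAM
backup_gb = 75       # second copy of the original fastq
storage_rate = 0.15  # $/GB/year, Purdue figure quoted above

total_gb = fastq_gz_gb + bam_gb + backup_gb
print("~%d GB per sample, ~$%.2f/year to keep"
      % (total_gb, total_gb * storage_rate))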

                        Comment


                        • #13
                          Originally posted by ymc View Post
Hmm... A 50x human genome fastq is about 75 GB in gzipped form, and the BAM it generates is about 400 GB. You might also want to back up your original fastq, so you need at least 550 GB per sample.

Since there is an advantage to calling genotypes jointly with other samples, I don't think people will store just a single de novo assembled genome with current sequencing technology.
I am sure more efficient ways to store data of this sort could easily be found. But even if you don't want to bother, you are looking at $50 worth of storage. How does that serve as a stumbling block to the $1000 genome?

                          --
                          Phillip

                          Comment


                          • #14
                            Originally posted by pmiguel View Post
I am sure more efficient ways to store data of this sort could easily be found. But even if you don't want to bother, you are looking at $50 worth of storage. How does that serve as a stumbling block to the $1000 genome?

                            --
                            Phillip
Well, if you are only dealing with a few genomes, you should be fine. But if you are 23andme or a big genome center, it can be a problem.

Currently, you can install four HDDs in a typical PC. Suppose each of them is 3 TB; at 550 GB per sample, that is at most about 21 genomes per PC. But when you have thousands of genomes, things get complicated, as the sketch below illustrates.

Obviously, this problem will get better as storage costs drop. I was just saying that if the $1,000 genome existed today, storage could be a problem. The genome is not $1,000 yet, and by the time it is, storage costs should be less of an issue.
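As a rough capacity-planning sketch (sizes as above; the collection size is just an example):

Code:
# How far a 4-drive PC goes at 550 GB per sample, and what a large
# collection would take. Illustrative arithmetic only.
drives, tb_each = 4, 3
gb_per_sample = 550
capacity_gb = drives * tb_each * 1000

print("%d genomes per PC" % (capacity_gb // gb_per_sample))
n = 10000  # e.g., a genome-center-sized collection (assumption)
pcs = -(-n * gb_per_sample // capacity_gb)  # ceiling division
print("%d such PCs for %d genomes" % (pcs, n))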

                            Comment


                            • #15
                              Originally posted by ymc View Post
Well, if you are only dealing with a few genomes, you should be fine. But if you are 23andme or a big genome center, it can be a problem.

Currently, you can install four HDDs in a typical PC. Suppose each of them is 3 TB; at 550 GB per sample, that is at most about 21 genomes per PC. But when you have thousands of genomes, things get complicated.

Obviously, this problem will get better as storage costs drop. I was just saying that if the $1,000 genome existed today, storage could be a problem. The genome is not $1,000 yet, and by the time it is, storage costs should be less of an issue.
                              I don't see a reason for 23andme to store all the read data for a genome sequence. It isn't like their typical user will fire up IGV and peruse the BAM file. [Note: I just noticed that their new "Golden Helix" browser is for that purpose. Pretty crazy...] Storing the data as a diff against some reference sequence would be plenty.
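A toy sketch of that diff idea (hypothetical data structures; nothing to do with 23andme's actual formats):

Code:
# Store only where a sample differs from the reference, and rebuild
# the full sequence on demand. Toy example, not a real variant format.
reference = "ACGTACGTACGT"
variants = {3: "A", 7: "C"}  # position -> alternate base (hypothetical)

def reconstruct(ref, diffs):
    """Rebuild a sample's sequence from the reference plus its diff."""
    return "".join(diffs.get(i, base) for i, base in enumerate(ref))

print(reconstruct(reference, variants))  # ACGAACGCACGT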

Anyway, the core of my argument is that storage costs are not likely to be what drove 23andme to offer exome sequencing instead of full genomes. Sequence still costs orders of magnitude more to obtain than to store. Even when the total cost of obtaining a human genome sequence drops to $1000, storage will still be a minor component of that total.

                              However, rather than churn through a few more cycles of this dispute, maybe we could look for the price at which we would both agree that storage becomes a substantial part of the cost of obtaining a genome sequence.

Around $100/genome, the cost of transiently storing the read data will begin to consume enough of the budget that it will be considered a major factor, I think. Of course other computational costs will also be felt -- CPU time, bandwidth, etc.
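Running the per-sample numbers from earlier in the thread against a falling genome price shows why:

Code:
# Share of the per-genome budget eaten by one year of raw-read storage
# (550 GB/sample at $0.15/GB/year, figures quoted earlier in the thread).
gb, rate = 550, 0.15
for genome_cost in (10000, 1000, 100):
    print("$%5d/genome: storage is %.1f%% of the cost per year"
          % (genome_cost, 100.0 * gb * rate / genome_cost))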

                              The thing is, below a certain cost per genome, it may no longer make sense to store anything other than the processed data. That is, DNA itself serves as the storage medium if additional information needs to be gleaned. Or, in other words, the information is already stored in the primary structure of the DNA. We are really just paying to "down convert" it from molecular to digital storage.

Note that as long as the price of obtaining sequence keeps falling faster than the cost of storing it, the problem of storing the digital form only gets harder. That is why, even though I disagree about storage being a major factor in the cost of a $1000 genome, I agree that it becomes a major factor at $100/genome.

Of course that leaves us 10-fold apart. But that is, nevertheless, a quantitative disagreement, not a qualitative one.

                              --
                              Phillip
                              Last edited by pmiguel; 07-23-2012, 04:52 AM. Reason: Added note.

                              Comment
