  • kmcarr
    replied
    Official news release from NCBI:



    They are also discontinuing the Peptidome repository, not that we NGSers care about that.



  • jkbonfield
    replied
    Originally posted by NGSfan View Post
    I never liked the SRA. It was incredibly difficult to get data out of it - not to mention to know what datasets you are getting!

    Another thing - why this bloated SRF format?

    Why aren't we just uploading bam files?
    Firstly, 100% agreed on the difficulty of extracting data!

    SRF was really a temporary thing, and I think even before this announcement NCBI were preferring BAM submissions.

    Basically, view SRF as analogous to what AB1/SCF/ZTR were for capillary trace data. Without stored trace data, people wouldn't have been able to obtain old data, re-call the bases with Phred (which was far better than the original ABI software) and reassemble. This did happen using the old public trace archives, albeit very rarely and probably not at all towards the end.

    So it was felt in the very early days of the "next gen" sequencing technologies that storing the trace data would allow third-party applications to be developed that improved on the instruments' own software. This also did happen - e.g. Swift and AYB - but once again very few people attempted to apply these newer tools to old published data sets.

    Hence SRF's days were numbered and I'm pleased to see it retired. (I'd dispute the "bloated" bit - it's heavily compressed and sometimes even comparable in size to the more extreme bloated BAMs with recalibrated confidence values and secondary calls plus confidences. It's just that it contains a LOT of data which we no longer deem valuable.) Of course, given that it was nigh on impossible to actually get the raw traces back out of NCBI, which only offered easy access to FASTQ, it was rather pointless for them ever to offer to store traces in the first place.

    I'd also be interested to know the access patterns of these data sets, though. I suspect they show a severe drop-off with age: recent data sets may get accessed a lot, but after a year there's very little access - maybe none at all. This suggests a staged data-aging policy would work, perhaps ending with totally offline storage for old, dormant data sets. Attempting to keep everything online forever just isn't going to work when "everything" is an exponentially growing quantity. It was just a matter of time before people realised the SRA couldn't be viable long term without some major rethinking of data-aging policies.
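    A staged policy could be as simple as mapping time since last access to a storage tier. A minimal sketch in Python, with purely illustrative tier names and thresholds:

    ```python
    from datetime import datetime, timedelta

    # Illustrative thresholds only; real values would come from measured
    # access patterns and storage costs.
    TIERS = [
        (timedelta(days=90),   "online"),        # fast disk, immediate download
        (timedelta(days=365),  "nearline"),      # cheaper disk / object store
        (timedelta(days=1825), "offline-tape"),  # restore on request
    ]

    def storage_tier(last_access, now=None):
        """Map a data set's last access time to a storage tier."""
        now = now or datetime.utcnow()
        age = now - last_access
        for threshold, tier in TIERS:
            if age <= threshold:
                return tier
        return "dormant-archive"  # candidate for fully offline storage

    # Example: a run last touched 400 days ago would land on offline tape.
    print(storage_tier(datetime.utcnow() - timedelta(days=400)))
    ```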



  • mwatson
    replied
    Originally posted by NGSfan View Post
    Why aren't we just uploading bam files?
    I think there is a problem with archiving derived data rather than raw data.

    Specifically, what ends up in SAM/BAM can vary between aligners and/or options; for instance, if hard clipping is enabled, you would not be able to get the full FASTQ back from the BAM, nor would you if the sequences had been trimmed before aligning.
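    As a rough illustration of the hard-clipping caveat (a pysam sketch; the input file name is hypothetical), you can spot reads whose stored sequence is no longer the complete original read by looking for hard-clip CIGAR operations:

    ```python
    import pysam

    HARD_CLIP = 5  # CIGAR operation code for 'H' in the SAM spec

    with pysam.AlignmentFile("example.bam", "rb") as bam:
        clipped = 0
        for read in bam:
            # A hard-clipped alignment stores only part of the original read,
            # so the full FASTQ record cannot be rebuilt from this BAM alone.
            cigar = read.cigartuples or []
            if any(op == HARD_CLIP for op, _ in cigar):
                clipped += 1
        print(f"{clipped} alignments have hard-clipped (unrecoverable) bases")
    ```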



  • NGSfan
    replied
    I never liked the SRA. It was incredibly difficult to get data out of it - not to mention to know what datasets you are getting!

    Another thing - why this bloated SRF format?

    Why aren't we just uploading bam files?

    BAM files already come with the read quality scores, aligned and compressed. You can then load one into a viewer and easily see what the authors saw in their results.

    And if you like, you can extract the sequences (BAM to FASTQ) and realign them yourself with your favorite aligner.
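    For example, a minimal extraction sketch with pysam (file names hypothetical; it naively assumes primary, non-hard-clipped alignments, per the clipping/trimming caveat mentioned elsewhere in this thread):

    ```python
    import pysam

    with pysam.AlignmentFile("example.bam", "rb") as bam, \
         open("reads.fastq", "w") as out:
        for read in bam:
            if read.is_secondary or read.is_supplementary:
                continue  # keep one record per original read
            seq = read.query_sequence
            quals = read.query_qualities
            if seq is None or quals is None:
                continue  # e.g. records with no stored sequence
            qual = "".join(chr(q + 33) for q in quals)
            if read.is_reverse:
                # BAM stores the reference-strand sequence; flip back to the
                # original read orientation before writing FASTQ.
                seq = seq[::-1].translate(str.maketrans("ACGTN", "TGCAN"))
                qual = qual[::-1]
            out.write(f"@{read.query_name}\n{seq}\n+\n{qual}\n")
    ```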

    Saving variants is a good idea - but not now when the methodology for variant detection is so volatile.



  • mwatson
    replied
    Originally posted by Michael.James.Clark View Post
    I'm also not convinced about simply sharing variants. While it's true that it would save a lot of storage space, variants are not inherently comparable. Sequencing platform plays a role, but even more important are the significant improvements in alignment and variant detection over the past few years. Realign and re-call variants on the Watson genome and I bet you'll end up with vastly different numbers from what was reported, for example. But if you just have the variants, you can't realign and re-call, and therefore you can't really use that data for a true comparison.
    Isn't the proposal to store variants in such a way that the original reads can be reconstructed?



  • mwatson
    replied
    Hmmm, I wonder how this sits with the following article though?

    President Obama Proposes Budget Increases for NIH, CDC, NSF, and FDA

    For me this is very worrying as it represents a big change in the way in which biodata is managed. NCBI, EBI and DDBJ have *always* managed public, biological data. That's what they do and we love them for it. If the NCBI pull out of it now, even if it is just the SRA (just? just the largest collection of data from one of the most exciting technologies on the planet right now...), it's a worrying development.



  • flxlex
    replied
    Originally posted by nickloman View Post
    Where will you submit your data now?
    The European Nucleotide Archive?



  • GERALD
    replied
    Are we sure this is real? I hope that one or more private companies have the foresight to step up to the plate on this. The commercial potential would be enormous; they would just have to be big enough to cover the huge overhead of hosting the data. The ad revenue alone would be incentive enough. Can we make a collective appeal to, say... Google?



  • Michael.James.Clark
    replied
    It's kind of funny how the Science articles about the data deluge basically precipitated this announcement. There's been a lot of blogosphere buzz about the data deluge, and more than a couple of those posts mention the SRA and its attempts to handle it in passing.

    So far this is a rumor. It happens to be a very believable rumor given the funding issue and ever-increasing need for storage, but let's not say it's canned before we're sure.

    I think while the intent of SRA was good, the execution was not. Anyone who's dealt with it can tell you how much extra work getting data into their formats and uploading it was, not to mention the effort involved in retrieving data from it.

    It's also just not a very sustainable thing for the government to sponsor this way. Transferring giant data sets over the net is time- and bandwidth-consuming, not to mention the upkeep of an ever-expanding storage space.

    All that said, I don't like the whole "cloud" solution very much either. The major reason is the lack of control over privacy. At the very least, SRA did a good job protecting privacy (although their mechanism for doing so was quite clunky). Storing personal genetic data on a computer system owned by a third party simply does not sit well with me. It's kind of a funny idea to be "sharing" personal genetic data anyway, but at the very least, attempts to protect privacy need to be made and it's hard to envision how that's accomplished when the data itself is on a third party computer.

    Perhaps a Biotorrent type solution is the best way to share this type of data. Something that can be reasonably secure while not consuming massive bandwidth on both ends.

    I'm also not convinced about simply sharing variants. While it's true that it would save a lot of storage space, variants are not inherently comparable. Sequencing platform plays a role, but even more important are the significant improvements in alignment and variant detection over the past few years. Realign and re-call variants on the Watson genome and I bet you'll end up with vastly different numbers from what was reported, for example. But if you just have the variants, you can't realign and re-call, and therefore you can't really use that data for a true comparison.

    I proposed in a recent blog article that someone should try to create a project where all the world's public sequence data is kept continually updated to modern standards. Would it be expensive? You betcha. But it would also be a very powerful resource while also avoiding the whole shoe-horning problem that SRA ran into with its formatting issues.



  • Richard Finney
    replied
    Big Bams and Bit Torrents

    Perhaps using a subset of the BitTorrent protocol might be an answer. I guess there would have to be a "you must have served up at least half as much as you've downloaded" rule, or something similar, to prevent getting without giving.
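    A toy version of that share-ratio rule (thresholds purely hypothetical):

    ```python
    def may_download(bytes_uploaded, bytes_downloaded, min_ratio=0.5):
        """Allow further downloading only if the peer has served back at
        least `min_ratio` of what it has already pulled."""
        if bytes_downloaded == 0:
            return True  # new peers start with a clean slate
        return bytes_uploaded / bytes_downloaded >= min_ratio

    # A peer that uploaded 40 GB after downloading 100 GB would be throttled.
    print(may_download(40 * 2**30, 100 * 2**30))  # False
    ```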

    Security's a beach.
    Last edited by Richard Finney; 02-14-2011, 12:26 PM.



  • pmiguel
    replied
    Originally posted by nickloman View Post
    If you think about it rationally, there's no way you can have a centralised single resource for sequence data volumes which are doubling every year or so.
    Doubling would be okay. That is close enough to Moore's law that investments of the same amount of money per year in storage would suffice. The problem is that next gen sequencing is expanding at hyper-Moore's law rates. See:

    A decade after the human-genome project, writes Geoffrey Carr (interviewed here), biological science is poised on the edge of something wonderful


    (Figure 1 of the linked article)



    Around 2005-2006 you see an inflection point. Before that point, Moore's law roughly kept pace with sequencing cost. But since then (at least at the Broad) the semi-log slope for sequencing tips downward more steeply. That means that if you plan to spend the same amount on sequencing each year, you need to increase your expenditure on sequence storage exponentially. Alternatively, you can come up with specialized storage solutions, etc.
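    To make that concrete, a quick back-of-the-envelope comparison in Python, with made-up but plausible doubling times:

    ```python
    # Assumed, illustrative doubling times: storage capacity per dollar doubles
    # every ~24 months (Moore's-law-like), while bases sequenced per dollar
    # double every ~6 months (hyper-Moore).
    STORAGE_DOUBLING_MONTHS = 24.0
    SEQUENCING_DOUBLING_MONTHS = 6.0

    def fold_change(months, doubling_months):
        """How many-fold a quantity grows after `months` at a given doubling time."""
        return 2 ** (months / doubling_months)

    for years in (1, 2, 5):
        months = 12 * years
        bases_per_dollar = fold_change(months, SEQUENCING_DOUBLING_MONTHS)
        bytes_per_dollar = fold_change(months, STORAGE_DOUBLING_MONTHS)
        # With a constant sequencing budget, the storage bill for the resulting
        # data grows by the ratio of the two curves.
        ratio = bases_per_dollar / bytes_per_dollar
        print(f"after {years} yr: storage spend must grow ~{ratio:.0f}x to keep pace")
    ```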

    But, ultimately one of two things happens:

    (1) Front-end computational cost de facto limits the drop in sequencing costs -- at which point sequencing costs lock at Moore's law rates.

    (2) "Sequencing" reaches fruition -- reading DNA sequences costs no more than storing them. Congratulations your new storage medium is DNA.

    --
    Phillip
    Last edited by pmiguel; 02-14-2011, 11:32 AM. Reason: typo



  • GW_OK
    replied
    PacBio should fold it into their mega New Biology thingy.



  • csoong
    replied
    Another possibility to consider would be to share only certain variation files. But that depends on how variants are defined and characterized, and it is sort of confined to DNA applications. For expression-level data, perhaps some standardized format could come along as well.



  • ECO
    replied
    Right. If all the data is IN Amazon... the worldwide bandwidth requirements are much lower if you're using Amazon's tools.



  • nickloman
    replied
    OK, it's *possible*. But it's going to be very expensive.

    There are the networking costs/limits to think about, as well as storage.

    Amazon might be a good choice to step in! A great way of attracting people to their cloud computing services.

    Of course there needs to be some degree of replication so we are not dependent on a single organisation.

