  • #61
    To me SRA made absolutely no sense whatsoever.

    Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.

    The only people who ever align and look at raw sequencing data are the ones who *publish* on that dataset. How many publications do you usually get from one HT dataset / experiment? One.

    If you do some high-throughput experiment and publish, my guess is that if your paper gets about 10,000 retrievals per year, which is not bad, maybe one single person out of those 10,000 will bother to take a look at your *aligned* sequences. No one will ever look at the raw sequences.

    Then there are the very few projects where many scientists will actually want to do their own analysis, like ENCODE. But serving that data is the responsibility of those mega-projects themselves, and they are up to it.



    • #62
      Originally posted by Azazel View Post
      Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.
      Because current analytical tools can extract additional information from the same primary data.

      Because federal funding doesn't necessarily provide for hosting massive amounts of data but does require sharing said data openly.

      Because there actually are numerous queries for said primary data and, again, no funding provided for hosting or sharing it.



      • #63
        Originally posted by Azazel View Post
        To me SRA made absolutely no sense whatsoever.

        Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.

        The only people who ever align and look at raw sequencing data are the ones who *publish* on that dataset. How many publications do you usually get from one HT dataset / experiment? One.

        If you do some high-throughput experiment and publish, my guess is that if your paper gets about 10,000 retrievals per year, which is not bad, maybe one single person out of those 10,000 will bother to take a look at your *aligned* sequences. No one will ever look at the raw sequences.
        Allow me to disagree. We do a lot of reanalysis of existing datasets, and we *always* go back to the raw reads rather than alignments or derived data. If a repository were to limit the deposited data, the raw reads are the one thing you need to keep; everything else is optional, since you can always re-derive the aligned and analysed data by reproducing the original analysis (or at least you should be able to).

        In simple cases you find a lot of data that was aligned against older genome assemblies, so it's easier and better to work against the latest assembly. There are also variations between the results produced by different aligners, so it's more consistent to use the same aligner for each data set. It also helps to be able to QC and reprocess the original data, since many older studies just seemed to skip this step altogether.
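        As a toy illustration of the sort of QC you can only redo when the raw reads are available, here is a minimal sketch; the FASTQ records and the quality threshold are invented for the example, and a real pipeline would use a dedicated QC tool rather than this hand-rolled filter:

```python
# Minimal sketch: recompute per-read mean Phred quality from raw FASTQ
# records, the kind of QC step some older studies skipped.

def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Sanger (Phred+33) encoding."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def filter_fastq(records, min_mean_q=20):
    """Keep only reads whose mean quality passes the threshold."""
    return [r for r in records if mean_phred(r[3]) >= min_mean_q]

# records: (header, sequence, '+', quality) tuples, as parsed from a FASTQ file
records = [
    ("@read1", "ACGT", "+", "IIII"),   # 'I' encodes Phred 40 at every base
    ("@read2", "ACGT", "+", "!!!!"),   # '!' encodes Phred 0 at every base
]
print([r[0] for r in filter_fastq(records)])  # -> ['@read1']
```

        Because this starts from the quality strings rather than an alignment, it can be re-run with any threshold, which is exactly what derived data doesn't allow.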

        The biggest advantage to having raw data is that you can do things not envisaged by the original study authors. Our most interesting results come from using sequence data for purposes the original study never envisaged and often these wouldn't be possible if you didn't have the original data.

        PS - To stay on topic for the original post in this thread, it appears that NCBI's SRA has been reprieved. Hopefully it will still get the major overhaul it so desperately needs, but I'm glad to see that its funding will continue.



        • #64
          Originally posted by Azazel View Post
          Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.
          Primarily you're missing some historical context.

          Traditionally with the capillary data sets, people published sequencing chromatogram files, aka traces, i.e. the SCF and later ZTR files. Although not the most raw form, these offered a way for users to visually inspect base-calling errors and opened up the potential for new algorithms to reprocess the trace files and come up with better base calls.

          Indeed, this happened. Phred was by far the most widely used such tool and was routinely applied to data sets downloaded from the trace archive. Later on there were a few more choices too, but the technology didn't have long left then anyway.

          Roll forward to Solexa instruments, soon to become Illumina, and you can see the same questions being asked. Should we just store base calls and confidence values, or some form of signal intensity (either before or after background correction and dye separation)? It was clear people wouldn't be visually inspecting errors, but we knew from previous experience that people would use this raw data to produce newer, and importantly *better*, base callers. If the raw data was available, then people could re-call existing data sets when it was appropriate.

          Clearly at the time it was a reasonable decision too, as a whole plethora of new base-calling algorithms arrived. With hindsight, though, it seems that the amount of re-base-calling of old data wasn't high, nowhere near as much as in the capillary world. Partly it could be that as sequencing became cheaper people were less inclined to wring the very last ounce of accuracy out of their precious data sets, and partly it's simply a matter of sheer scale.

          So I'd say it's now largely pointless to keep trace data, except occasionally as example data sets for use by software developers. While no doubt still of use to a few people, it's hard to justify the cost. I don't think it's fair to say they were of no use early on, though.



          • #65
            Originally posted by jkbonfield View Post
            Partly it could be that as sequencing became cheaper people were less inclined to wring the very last ounce of accuracy out of their precious data sets, and partly it's simply a matter of sheer scale.

            So I'd say it's now largely pointless to keep trace data, except occasionally as example data sets for use by software developers.

            Yes, I think we've hit a point of diminishing returns when it comes to redoing basecalling on older data sets.

            I would opt to drop the raw trace data and keep the compressed FASTQ files.
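            To put a rough number on how well FASTQ text compresses, a quick sketch using Python's stdlib gzip on a synthetic, highly repetitive record; the record and repeat count are invented for the example, real reads compress less dramatically, and specialized sequence compressors do better still:

```python
import gzip

# Synthetic FASTQ record repeated many times; real data is less repetitive,
# so treat the resulting ratio as an upper bound, not a typical figure.
record = b"@read1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n" * 1000
compressed = gzip.compress(record)
ratio = len(record) / len(compressed)
print(len(record), len(compressed))  # the gzipped copy is far smaller
```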

            My main beef with SRA is that they need to require better annotation of the submissions.

            It really irks me when a publication puts its SRA number in the methods section for 10 sequenced samples, and when I go to look it up and download it, it is not clear which sample is which. Does SRA not provide fields for people to fill in this information, or are authors just being lazy and neglecting to label their files? Or am I just too stupid to use the SRA? Is it too much to ask that it take no more than 2 minutes to figure out the data? Do I have to write to every author asking what the difference is between SRA2130032 and SRA2204224?



            • #66
              Request for Information (RFI) by NIH NINDS

              Closing date: May 30, 2011

              NIH Funding Opportunities and Notices in the NIH Guide for Grants and Contracts: Request for Information: Whole Genome Sequencing, Data Analysis, Storage and Annotation NOT-NS-11-015. NINDS



              "This RFI is meant to solicit information from extramural research investigators regarding the type and availability of projects that can be advanced through whole genome sequencing services. The RFI also solicits information on institutional capabilities for sequence storage, data analysis and annotation. Responses to this RFI will be reviewed by NINDS staff and will help inform and complement their assessment of current and future whole genome sequencing needs."

              There are general questions here that can be addressed by members of this forum. The information provided would be very helpful for future planning at (US) NIH.



              • #67
                Interesting comments.

                I wonder, does anyone know if/where I can get data on how often a dataset is downloaded from SRA? Basically, I mean usage statistics: the total number of datasets/experiments, and how often each is downloaded per month (or week, or day)?



                • #68
                  I think next year SRA will become impossible to sustain once the MiSeq and PGMs kick into full gear, and I think the idea of maintaining a single repository is outdated. Why not use a "Science Torrent"? SRA could be the "Pirate Bay", with the additional task of reviewing and standardizing formats, and maybe seeding the files, but the throughput would be shared. This would allow multiple centers to simply provide raw seeding space. Besides, dataset downloads should be really "peaky", meaning there would be multiple downloads at the same time (usually right after publication); this would be a perfect scenario for a torrent.



                  • #69
                    Torrents don't work when, most of the time, there are fewer than 50 people who want the data sets.



                    • #70
                      Interim reprieve for SRA

                      NCBI announces SRA is open for new data until October. Will post info links ASAP.

                      Last edited by Joann; 07-14-2011, 07:03 AM. Reason: link to news article dated 7/14/11



                      • #71
                        Torrents don't work..

                        Well, torrents don't work for SRA? I'm not so sure. Maybe somebody from IT at SRA could answer this, but I think that when a paper gets out, people want to download the data and look at it, creating a massive download peak. If that is the case, then a torrent would minimize the overload. So you could have a single server with a not-so-great download speed for random access and archiving, and then a torrent community to spread the data around and remove the peaks. The question is: do you get peaks of downloads of the same datasets? You could also have mirrors deployed at different institutions. I mean, a 50TB server is less than 50k; mount a torrent mirror at 100 universities and you've got yourself a nice redundant service. I think concentrating all the data in one spot becomes exponentially expensive, whereas you can get a much more scalable distributed system for a fraction of the cost. MiSeq and Ion Torrent are coming next year, and we need a way to distribute this stuff. Look at an NGS paper and you will often find amazing statistical creativity in the handling of the data; we need SRA to keep these "creative people" honest.
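                        A back-of-the-envelope model of why peer-assisted distribution helps with a post-publication peak; all the numbers below are invented for illustration, and the swarm model is idealized:

```python
import math

# One server at 1 GB/s serving N simultaneous requests for a 100 GB dataset
# must push N full copies itself. In an idealized peer-assisted swarm the
# number of complete copies roughly doubles each "round" of full transfers,
# so only about log2(N) rounds are needed before everyone has the data.
dataset_gb, server_gbps, n_peers = 100, 1, 1000

server_only_s = n_peers * dataset_gb / server_gbps     # 100,000 s of server upload
rounds = math.ceil(math.log2(n_peers + 1))             # 10 doubling rounds
peer_assisted_s = rounds * dataset_gb / server_gbps    # 1,000 s in the ideal case
print(server_only_s, peer_assisted_s)
```

                        The gap only matters when many peers arrive at once, which is exactly the post-publication peak scenario; with a handful of downloaders the server-only numbers are already fine.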



                        • #72
                          Setting up torrents for projects as they are released? Think about the IT overhead on that one.

                          Torrents don't work for this sort of data because there are very few people willing and able to seed them, unlike with legal software torrents or illegal film/TV/music torrents.

                          Compression solutions are being actively looked into and are likely to be the best route to long-term sustainability.



                          • #73
                            Originally posted by laura View Post
                            Setting up torrents for projects as they are released? Think about the IT overhead on that one.

                            Torrents don't work for this sort of data because there are very few people willing and able to seed them, unlike with legal software torrents or illegal film/tv/music torrents.

                            Compression solutions are being actively looked into and are likely to be the best route to long-term sustainability.
                            Well, first, I think that can be made automatic quite easily, so I really don't see the IT overhead. Second, this has never been tested, so you don't know for sure that it doesn't work. And here's the question: sequencer output is growing exponentially, while, last I checked, compression efficiency barely improves linearly. I still don't think compression is the way to go.
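                            The exponential-vs-constant point can be made concrete with a little arithmetic; the doubling time and compression ratio below are invented purely for illustration:

```python
import math

# If data volume doubles every 12 months, a one-off 10x compression ratio
# only buys a fixed delay of log2(10) doubling periods before any given
# storage ceiling is hit again, no matter how large the archive already is.
doubling_months = 12
compression_ratio = 10
delay_months = doubling_months * math.log2(compression_ratio)
print(round(delay_months, 1))  # roughly 40 months of breathing room, once
```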



                            • #74
                              I still like the concept of NCBI being the "Pirate Bay". RRRrrr, mateys! Given how locked down the really interesting data is ... might as well treat it like "illegal music".



                              • #75
                                Well, I'm just not sure a centralized system is the way to go; you need a centralized indexer, but the storage should be decentralized. Storing 100PB of data is easy and "cheap"; distributing 100PB of data from a single location is extremely difficult. In particular, I'm guessing the data is not randomly accessed: data from the last six months is probably accessed more frequently than data that is 10 years old, so if you can distribute the "hottest, most downloaded" data, you can relieve the download load on the main repository quite easily.

