  • aleferna
    replied
I think next year SRA will become impossible to sustain once the MiSeq and PGM kick into full gear, but I also think the idea of maintaining a single repository is outdated. Why not a "Science Torrent"? SRA could be the "Pirate Bay", with the additional tasks of reviewing and standardizing formats and maybe seeding the files, but the throughput would be shared. This would let multiple centers simply provide raw seeding space. Besides, dataset downloads should be really "peaky", meaning there would be many downloads at the same time (usually right after publication); that is a perfect scenario for a torrent.
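    A back-of-the-envelope sketch of why peaky demand favours peer-to-peer distribution: every downloader also uploads, so swarm capacity grows with the size of the spike instead of being capped by one archive's link. All numbers below are illustrative assumptions, not real SRA figures.

        # Toy model: time for N simultaneous peers to fetch one dataset.
        # All constants are illustrative assumptions, not real SRA figures.
        DATASET_GB = 50      # hypothetical dataset size
        SERVER_GBPS = 10     # central archive's total upload link
        PEER_GBPS = 0.1      # upload contributed by each seeding peer

        def hours_central(n):
            """Central server: one fixed link split across all downloaders."""
            per_peer_gbps = SERVER_GBPS / n
            return DATASET_GB * 8 / per_peer_gbps / 3600

        def hours_swarm(n):
            """Torrent-style swarm: every downloader also uploads, so
            aggregate capacity grows with the number of peers."""
            per_peer_gbps = (SERVER_GBPS + n * PEER_GBPS) / n
            return DATASET_GB * 8 / per_peer_gbps / 3600

        for n in (10, 100, 1000):
            print(f"{n:5d} peers: central {hours_central(n):7.2f} h, "
                  f"swarm {hours_swarm(n):5.2f} h")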



  • Azazel
    replied
    Interesting comments.

    I wonder, does anyone know if/where I can get data on how often datasets are downloaded from SRA? Basically, I mean usage statistics: the total number of datasets/experiments, and how often each is downloaded per month (or week, or day)?



  • Joann
    replied
    Request for Information (RFI) by NIH NINDS

    Closing date: May 30, 2011

    NIH Guide for Grants and Contracts: Request for Information: Whole Genome Sequencing, Data Analysis, Storage and Annotation (NOT-NS-11-015), NINDS.



    "This RFI is meant to solicit information from extramural research investigators regarding the type and availability of projects that can be advanced through whole genome sequencing services. The RFI also solicits information on institutional capabilities for sequence storage, data analysis and annotation. Responses to this RFI will be reviewed by NINDS staff and will help inform and complement their assessment of current and future whole genome sequencing needs."

    There are general questions here that can be addressed by members of this forum. The information provided would be very helpful for future planning at (US) NIH.



  • NGSfan
    replied
    Originally posted by jkbonfield:
    Partly it could be that as sequencing became cheaper people were less inclined to wring the very last ounce of accuracy out of their precious data sets, and partly it's just one of sheer scale.

    So I'd say it's largely pointless now keeping trace data except occasionally as example data sets for use by software developers.

    Yes, I think we've hit a point of diminishing returns when it comes to redoing basecalling on older data sets.

    I would opt to drop the raw trace data, and keep the compressed FASTQ files.

    My main beef with SRA is that they need to require better annotation of the submissions.

    It really irks me when a publication puts its SRA number in the methods section for ten sequenced samples, and when I go to look it up and download it, it is not clear which sample is which. Does SRA not provide fields for people to fill this information in, or are authors just being lazy and neglecting to label their files? Or am I just too stupid to use the SRA? Is it too much to ask that figuring out a dataset take no more than two minutes? Do I have to write to every author asking what the difference is between SRA2130032 and SRA2204224?



  • jkbonfield
    replied
    Originally posted by Azazel:
    Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.
    Primarily you're missing some historical context.

    Traditionally, with capillary data, people published sequencing chromatogram files, aka traces, i.e. the SCF and later ZTR files. Although not the most raw form, these offered a way for users to visually inspect base-calling errors and left open the possibility of reprocessing the traces with newer, better base-callers.

    Indeed, this happened. Phred was by far the most widely used such tool and was routinely applied to data sets downloaded from the trace archive. Later on there were a few more choices too, but the technology didn't have long left then anyway.

    Roll forward to Solexa instruments, soon to become Illumina, and you can see the same questions being asked. Should we store just base calls and confidence values, or some form of signal intensity (either before or after background correction and dye separation)? It was clear people wouldn't be visually inspecting errors, but we knew from previous experience that people would use this raw data to produce newer, and importantly *better*, base callers. If the raw data was available, people could re-call existing data sets when appropriate.

    Clearly at the time it was a reasonable decision too, as a whole plethora of new base-calling algorithms arrived. With hindsight, though, it seems the amount of re-base-calling of old data wasn't high, nowhere near as much as in the capillary world. Partly it could be that as sequencing became cheaper people were less inclined to wring the very last ounce of accuracy out of their precious data sets, and partly it's just one of sheer scale.

    So I'd say it's largely pointless now keeping trace data except occasionally as example data sets for use by software developers. While no doubt still of use to a few people, it's hard to justify their cost. I don't think it's fair to say they were of no use early on though.



  • simonandrews
    replied
    Originally posted by Azazel:
    To me SRA made absolutely no sense whatsoever.

    Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.

    The only people ever aligning and looking at raw sequencing data are the ones who *publish* on that dataset. How many publications do you usually get from one HT dataset/experiment? One.

    If you do some high-throughput experiment and publish, my guess is that if your paper gets about 10,000 retrievals per year, which is not bad, maybe one single person out of the 10,000 will bother to take a look at your *aligned* sequences. No one will ever look at the raw sequences.
    Allow me to disagree. We do a lot of reanalysis of existing datasets and we *always* go back to the raw reads rather than alignments or derived data. If a repository had to limit what it deposits, the raw reads are the one thing you need to keep; everything else is optional, since you can always re-derive the aligned and analysed data by reproducing the original analysis (or at least you should be able to).

    In simple cases you find a lot of data that was aligned against older genome assemblies, so it's easier and better to work against the latest assembly. There are also variations between the results produced by different aligners, so it's more consistent to use the same aligner for every data set. It also helps to be able to QC and reprocess the original data, since many older studies seem to have skipped this step altogether.

    The biggest advantage of having raw data is that you can do things the original study's authors never envisaged. Our most interesting results come from using sequence data for purposes the original study never anticipated, and these often wouldn't be possible without the original data.

    PS - To stay on topic for the original post in this thread, it appears that NCBI's SRA has been reprieved. Hopefully it will still get the major overhaul it so desperately needs, but I'm glad to see that its funding will continue.



  • Michael.James.Clark
    replied
    Originally posted by Azazel:
    Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.
    Because current analytical tools can extract additional information from the same primary data.

    Because federal funding doesn't necessarily provide for hosting massive amounts of data but does require sharing said data openly.

    Because there actually are numerous queries for said primary data and, again, no funding provided for hosting or sharing it.



  • Azazel
    replied
    To me SRA made absolutely no sense whatsoever.

    Why would anyone want to store *raw* reads in a centralized database? It's not like there are thousands of queries every day requesting a certain dataset, and thousands of scientists re-aligning & analyzing whatever someone uploaded to SRA. I may be missing the point of SRA, but to me it sounds just ridiculous.

    The only people ever aligning and looking at raw sequencing data are the ones who *publish* on that dataset. How many publications do you usually get from one HT dataset/experiment? One.

    If you do some high-throughput experiment and publish, my guess is that if your paper gets about 10,000 retrievals per year, which is not bad, maybe one single person out of the 10,000 will bother to take a look at your *aligned* sequences. No one will ever look at the raw sequences.

    Then there are the very few projects whose data many scientists will actually want to analyze themselves, like ENCODE. But serving that data is the responsibility of these mega-projects, and they are up to it.



  • nickloman
    replied
    Short Read Archive reprieve!


    Sequence Read Archive (SRA) is still in service.
    Recently, NCBI announced that due to budget constraints, it would be discontinuing its Sequence Read Archive (SRA) and Trace Archive repositories for high-throughput sequence data. However, NIH has since committed interim funding for SRA in its current form until October 1, 2011. In addition, NCBI has been working with staff from other NIH Institutes and NIH grantees to develop an approach to continue archiving a widely used subset of next generation sequencing data after October 1, 2011.

    We now plan to continue handling sequencing data associated with:

    • RNA-Seq, ChIP-Seq, and epigenomic data that are submitted to GEO
    • Genomic and transcriptomic assemblies that are submitted to GenBank
    • Genomic assemblies submitted to GenBank/WGS
    • 16S ribosomal RNA data associated with metagenomics that are submitted to GenBank

    In addition, NCBI will continue to provide access to existing SRA and Trace Archive data for the foreseeable future. NCBI is also continuing to discuss with NIH Institutes approaches for handling other next-generation sequencing data associated with specific large-scale studies.



  • Joann
    replied
    More Discussion

    Editorial at Genome Biology

    "Closure of the NCBI SRA and implications for the long-term future of genomics data storage".

    doi:10.1186/gb-2011-12-3-402



  • lh3
    replied
    My preference is to store raw reads (without trimming, alignment-based recalibration or duplicate removal) in the BAM format, in the order in which the reads come off the sequencing machines. Reads may optionally be mapped. I do see the advantages of keeping sorted alignments, but even if no information is lost, unsorted reads are more convenient if we want to redo alignment. It would be good to keep both unsorted and sorted data, but that leads to duplication and may deviate from the intention of SRA.

    Anyway, I agree with James that the most important goal of SRA is to store the primary data. Alignment without loss of the primary data, though preferable, comes second. I do not think SNPs and other annotations should go into SRA; those are the business of third-party databases, not SRA.
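    As a minimal sketch of why unsorted, machine-order BAM is convenient, here is how the original FASTQ could be recovered for realignment using pysam (file names are hypothetical; the BAM is assumed to keep one primary record per read):

        import pysam

        # Reverse-complement table for reads stored on the minus strand.
        COMP = str.maketrans("ACGTN", "TACGN")

        # check_sq=False allows BAMs whose header lists no reference sequences.
        with pysam.AlignmentFile("run1.unsorted.bam", "rb", check_sq=False) as bam, \
             open("run1.fastq", "w") as out:
            for read in bam.fetch(until_eof=True):
                if read.is_secondary or read.is_supplementary:
                    continue  # keep exactly one record per original read
                seq = read.query_sequence
                qual = pysam.qualities_to_qualitystring(read.query_qualities)
                if read.is_reverse:
                    # Undo the strand flip applied at alignment time.
                    seq = seq.translate(COMP)[::-1]
                    qual = qual[::-1]
                out.write(f"@{read.query_name}\n{seq}\n+\n{qual}\n")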
    Last edited by lh3; 02-21-2011, 10:42 AM.



  • jkbonfield
    replied
    People seem to be arguing different points here, although possibly it indicates the problem SRA tried to solve wasn't the primary issue the community faces. I'm not sure.

    Anyway SRA was designed to store the primary data. That was originally trace files, but later just the raw calls and confidence values. The purpose was to allow any analysis to be rerun on the input data so we can reproduce results or "upgrade" results by using a newer set of analysis tools.

    More recently the discussions here seem to be centred around storage and retrieval of analysis results: aligned BAM files, SNP VCF files, etc. Heng has been involved in a variety of formats designed to tackle such things (BAM, tab-delimited "tabix"-indexed files, etc.). Samtools also has existing code to download portions of BAM files on the fly via HTTP or FTP for any specific region, so it works neatly with existing web protocols. I'm unsure of security and SSL concerns, but either way it's a solid start.
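    To illustrate that on-the-fly remote access, a short sketch using pysam, which wraps the same htslib code; the URL is hypothetical, and a .bai index is assumed to sit alongside the BAM:

        import pysam

        # Hypothetical remote alignment file; htslib fetches only the BGZF
        # blocks covering the requested region, not the whole BAM.
        URL = "http://example.org/data/NA12878.bam"

        # The index is retrieved from URL + ".bai" automatically.
        with pysam.AlignmentFile(URL, "rb") as bam:
            for read in bam.fetch("20", 1_000_000, 1_001_000):
                print(read.query_name, read.reference_start, read.cigarstring)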

    Obviously the two scenarios aren't quite the same, but with careful consideration perhaps they can be merged. For example, if the aligned BAM files contain all qualities (there have been discussions about only storing qualities for sites that differ from the reference), mark duplicates via flags rather than removing them, keep the original qualities instead of recalibrated ones, and store all unmapped reads, then and only then can we extract the primary data back from the aligned BAM.
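    A rough sketch of sanity-checking those conditions on a candidate archive BAM with pysam; the sampling depth is an illustrative assumption, and a full pass (or samtools flagstat) would be more reliable:

        import pysam

        def looks_lossless(path, sample_size=100_000):
            """Heuristic check that a BAM still carries its primary data:
            per-base qualities present on every record, duplicates flagged
            rather than removed, and unmapped reads retained."""
            seen_unmapped = seen_dup_flag = False
            with pysam.AlignmentFile(path, "rb", check_sq=False) as bam:
                for i, read in enumerate(bam.fetch(until_eof=True)):
                    if i >= sample_size:
                        break
                    if read.query_qualities is None:
                        return False  # qualities were stripped
                    seen_unmapped |= read.is_unmapped
                    seen_dup_flag |= read.is_duplicate
            # Seeing neither flag in the sample suggests duplicates and
            # unmapped reads were discarded, i.e. primary data was lost.
            return seen_unmapped and seen_dup_flag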

    Is it worth it? Perhaps not. Sorted BAM is an OK format for storing primary data (relatively compact, though not the best, and well understood), but some groups will want to do so much processing of their BAMs that they'll need a second copy anyway.



  • lh3
    replied
    To me, the easiest way to access data is a hierarchical FTP/HTTP directory containing all the fastq/sra/bam files, with a top-level TAB-delimited file briefly describing each file: the batch, species if applicable, type of data (metagenomics, RNA-seq, ChIP-seq, targeted, exome, or whole-genome sequencing), sample name, number of sequences, average read length, barcode length, and possibly also the submission date. Something similar to what the 1000 Genomes Project provides, but a little more comprehensive. A single XML file would also be fine, though I am happier with a TAB-delimited file.
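    A minimal sketch of writing such a top-level manifest with Python's csv module; the column names and values are illustrative assumptions, not an existing SRA schema:

        import csv

        # Hypothetical per-file records for the top-level manifest.
        FIELDS = ["file", "batch", "species", "data_type", "sample",
                  "n_reads", "avg_read_len", "barcode_len", "submitted"]

        records = [
            {"file": "run1/sample_A.fastq.gz", "batch": "B001",
             "species": "Homo sapiens", "data_type": "RNA-seq",
             "sample": "A", "n_reads": 21004332, "avg_read_len": 76,
             "barcode_len": 6, "submitted": "2011-02-21"},
        ]

        with open("MANIFEST.tsv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
            writer.writeheader()
            writer.writerows(records)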



  • Nix
    replied
    Originally posted by Richard Finney:
    Can DAS or another off-the-shelf system address the concern for security?
    DAS servers are web apps and as such can take advantage of the same security protocols worked out for banks and hospitals (SSL/HTTPS, digest authentication, VPNs). With our GenoPub server, the visibility of each dataset is set to the owner, the lab, particular defined collaborators of the lab, the institute, or the public.
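    A sketch of what such a per-dataset visibility check might look like; the tiers follow the post above, but the data model is a hypothetical illustration, not GenoPub's actual code:

        from dataclasses import dataclass, field

        @dataclass
        class Dataset:
            owner: str
            lab: str
            institute: str
            visibility: str = "owner"  # owner|lab|collaborators|institute|public
            collaborators: set = field(default_factory=set)

        def can_view(user, user_lab, user_institute, ds):
            """True if the user may see this dataset under its visibility tier."""
            if ds.visibility == "public":
                return True
            if ds.visibility == "institute" and user_institute == ds.institute:
                return True
            if ds.visibility == "collaborators" and (user in ds.collaborators
                                                     or user_lab == ds.lab):
                return True
            if ds.visibility == "lab" and user_lab == ds.lab:
                return True
            return user == ds.owner  # the owner can always see the dataset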

    As far as getting NCBI to take care of the problem: good luck. The SRA was swamped from day one. I doubt SRA 2.0 can do any better without a significant increase in resources, which currently are slated to go to the Dept of Defense (a 4.7% budget increase for 2012!).

    If the ENA is willing to host all of the US data, great, but if I'm not mistaken they still don't provide a programmatic way of accessing analyses (BAM files, variant calls, enrichment tracks, etc.). Neither did the SRA, for that matter.

    I believe our scientific community can do better.



  • jkbonfield
    replied
    Originally posted by Nix:
    Why aren't people hosting and publishing their own data? There's no need to centralize this activity.

    You are responsible for providing the plasmids and cell lines etc. used in your papers; why not the genomic data too, in both its raw and processed form?
    I was at a meeting at NCBI a few years back, before SRA got off the ground, to discuss how it should all work and to explain things to the main sequencing centres. I dared to ask why people didn't want a federated service instead, or at least a central store of metadata with redirections to the labs' own stores of data.

    The question was met with pretty much universal dismay and disagreement. I later realised why: it costs money, time and effort to host data. NCBI was promising to do this for everyone, essentially solving all their problems. Why would you agree to take the hard route of storing it yourself (and the harder one still of agreeing with all the other centres to do it in a uniform manner) when NCBI will take the data off your hands for free and do all the hard work for you?

