  • Joann
    replied
    For international database collaborations (INSDC)

    Quoting directly from the ENA web page:

    "The European Nucleotide Archive (ENA) accepts data generated by next-generation sequencing methodologies such as 454, Illumina Genome Analyzer and ABI SOLiD into the Sequence Read Archive (SRA). ENA works in close collaboration with the NCBI and DDBJ as part of the International Nucleotide Sequence Database Collaboration (INSDC). All submitted public data is exchanged between the partners on a daily basis. All three partners use the same data and metadata formats.

    For all questions and enquiries please contact [email protected]."



  • laura
    replied
    The ENA will continue to accept open-access data, and the EGA will continue to accept human data governed by consent agreements and Data Access Committees. Let's hope these changes don't stop people from releasing data into the public domain; certainly all the 1000 Genomes data will remain freely available to everyone.



  • Richard Finney
    replied
    Can DAS or another off-the-shelf system address the security concern? US government-sponsored research carries some weighty patient-privacy restrictions. I'm not sure that dishing up BAMs via FTP by just "setting up something as simple as installing apache" is going to work. I'd like it to, but I'm thinking someone's going to say "no go". I hope there's an easing of the "must lock down the data; only high priests can even think about looking at it" mentality. But... we're not there yet.
    Last edited by Richard Finney; 02-19-2011, 10:47 AM.



  • laura
    replied
    The NCBI is part of the United States National Library of Medicine.



  • Nix
    replied
    Originally posted by Joann View Post
    This is where I wish our institutional libraries would come to mind....
    Funny you should mention this. I've shared many a beer with librarians over a campfire and they are very keen on doing just this sort of thing. They are way into the proper annotation and curation of data too.

    For those "centralists" in the group... In theory one can pull data off one DAS server and host it on your own. Thus a centralized DAS server could be built that continuously updates its repository from other DAS servers. That kind of defeats the purpose of a Distributed Annotation System, though.

    Large groups (institutions, universities, journals, the NIH if they can fund it) would be the best final repositories for genomic data. The SRA was swamped almost from the start. I think the only way to keep up with the deluge is to distribute the data.



  • Joann
    replied
    Central solution

    This is where I wish our institutional libraries would come to mind. They build and maintain long-term academic infrastructure and understand inter-institutional standards (including publishing standards) well. Why can't institutional repositories that host open-access scientific research also include next-gen sequencing data? If a set of deposit and access standards were worked out and agreed upon (as is now being discussed in this thread), perhaps with start-up funding from a collaborative grant proposal, library consortia could be forged and linked up to handle specialized sequence-data deposit and access, large and small. This is truly the kind of non-profit academic research purpose that serves the good and advancement of society and science, and it builds upon our existing, traditional academic resources.



  • Michael.James.Clark
    replied
    Originally posted by Nix View Post
    The problem with both of these approaches is the lack of a formal language to associate annotation with the data. Species, genome version, file format are often anyone's guess. Every group encodes this information differently. Thus making use of such data requires manual download and interpretation.

    Don't believe me? Try intersecting ChIP-seq peak calls from 20 different datasets hosted by UCSC, Ensembl, and GMOD. That's at minimum a day of work for a skilled bioinformatician. Now do it again and again with each new release. Analysis applications that speak DAS can do this in minutes with no human error.

    DAS provides the "glue language" (in XML) for brokering queries and returning slices of data based on coordinates or feature searches. It is defined by a community-vetted specification with several server and client implementations. Among other things, these define how you represent each species, their genome, and their coordinate system.

    Another key advantage of DAS is that it allows one to separate data distribution from data use. I love the UCSC data tracks but can't always use their browser to analyze them.
    Honestly, if I have to choose between setting up and maintaining my own server and uploading my data to a centralized depository, there really is no choice at all.

    I think we'd be lucky if labs would even provide more than a list of variants if they were forced to host data themselves. That's the whole reason for having a centralized host. Most people these days know how to deal with FTP or HTTP, so that's what we'd end up with.

    The SRA was a good idea with a clunky implementation. A group, government or academic, ought to pick up the mantle and host the world's genomic data. Hey, it could be you. Host it all on a DAS/2 server.



  • laura
    replied
    The trouble with any non-centralised solution is consistency. Databases like Ensembl and UCSC have been around for a long time, as have the sequence archives (and long may they continue). How many labs have the resources to put up their data pretty much forever? I suspect the answer is very few.



  • Nix
    replied
    Originally posted by lh3 View Post
    In general, hosting data over HTTP/FTP is much more convenient for most researchers. If we want to look at data in small regions, we can use IGV or the UCSC browser to view remote BAM/BigBed/BigWig files. IGV also supports tabix indexing and thus VCF files.
    The problem with both of these approaches is the lack of a formal language to associate annotation with the data. Species, genome version, file format are often anyone's guess. Every group encodes this information differently. Thus making use of such data requires manual download and interpretation.

    Don't believe me? Try intersecting ChIP-seq peak calls from 20 different datasets hosted by UCSC, Ensembl, and GMOD. That's at minimum a day of work for a skilled bioinformatician. Now do it again and again with each new release. Analysis applications that speak DAS can do this in minutes with no human error.

    DAS provides the "glue language" (in XML) for brokering queries and returning slices of data based on coordinates or feature searches. It is defined by a community-vetted specification with several server and client implementations. Among other things, these define how you represent each species, their genome, and their coordinate system.

    Another key advantage of DAS is that it allows one to separate data distribution from data use. I love the UCSC data tracks but can't always use their browser to analyze them.
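
    To make the intersection point concrete, here is a minimal sketch of what a DAS-aware tool does under the hood: query several DAS/1 sources for features over the same segment and intersect the intervals. The server URLs are placeholders, and each source's intervals are assumed sorted and non-overlapping; the features command and the DASGFF FEATURE/START/END layout are from the DAS 1.x spec.

        # Hedged sketch: intersect feature calls from several DAS/1 sources.
        import urllib.request
        import xml.etree.ElementTree as ET

        def fetch_intervals(source_url, segment):
            """Return sorted (start, end) tuples for every feature the
            source reports on a segment such as 'chr1:1,1000000'."""
            url = f"{source_url}/features?segment={segment}"
            with urllib.request.urlopen(url) as resp:
                tree = ET.parse(resp)
            return sorted((int(f.findtext("START")), int(f.findtext("END")))
                          for f in tree.iter("FEATURE"))

        def intersect(a, b):
            """Sweep two sorted, non-overlapping interval lists, keeping
            the stretches present in both."""
            out, i, j = [], 0, 0
            while i < len(a) and j < len(b):
                lo = max(a[i][0], b[j][0])
                hi = min(a[i][1], b[j][1])
                if lo <= hi:
                    out.append((lo, hi))
                if a[i][1] < b[j][1]:   # advance whichever ends first
                    i += 1
                else:
                    j += 1
            return out

        sources = ["http://das.example.org/das/chipseq1",   # placeholders,
                   "http://das.example.org/das/chipseq2"]   # not real hosts
        segment = "chr1:1,1000000"
        shared = fetch_intervals(sources[0], segment)
        for src in sources[1:]:
            shared = intersect(shared, fetch_intervals(src, segment))
        print(f"{len(shared)} regions shared by all {len(sources)} datasets")

    A third or a twentieth source is just another entry in the list, which is the point: the per-dataset manual download and format wrangling disappears.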



  • lh3
    replied
    In general, hosting data over HTTP/FTP is much more convenient for most researchers. If we want to look at data in small regions, we can use IGV or the UCSC browser to view remote BAM/BigBed/BigWig files. IGV also supports tabix indexing and thus VCF files.
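
    The same remote slicing works programmatically, not just in IGV. A minimal sketch with pysam, assuming a coordinate-sorted BAM with its .bai index sitting next to it at a placeholder URL, on any web server that honours byte-range requests:

        # Slice a region out of a remote indexed BAM over plain HTTP --
        # the same trick IGV uses. The URL is a placeholder.
        import pysam

        bam_url = "http://data.example.org/sample1.bam"   # .bai alongside
        with pysam.AlignmentFile(bam_url, "rb") as bam:
            # Only the index and the needed BGZF blocks are transferred,
            # not the whole file.
            for read in bam.fetch("chr1", 10000, 20000):
                print(read.query_name, read.reference_start)

    The tabix case is analogous: pysam.TabixFile pointed at a remote .vcf.gz (with its .tbi) fetches records the same way.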



  • fpepin
    replied
    Originally posted by Nix View Post
    If your group can run a web site it can run a DAS/2 server. It really is rather easy to set up: just MySQL and Tomcat. Then have the biologists use the point-and-click web interface to load the data.

    It is probably overkill to have every lab run a DAS/2 server, although they can. It is best if your department/institute/organization maintains one alongside its web server.
    I've seen my share of wet labs where students bring in their own personal laptops and there isn't even a backup system, and the department wasn't much savvier either.

    It's doable, but I know I'd be happier having a centrally managed repository than depending on every group/department to keep a server running properly that holds the data in a reasonable format.



  • Nix
    replied
    If your group can run a web site it can run a DAS/2 server. It really is rather easy to set up: just MySQL and Tomcat. Then have the biologists use the point-and-click web interface to load the data.

    It is probably overkill to have every lab run a DAS/2 server, although they can. It is best if your department/institute/organization maintains one alongside its web server.

    Forcing folks to properly annotate their data is another issue. Best to have journals hold the stick and require that datasets be MINSEQE-compliant for publication.
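
    For the curious, MINSEQE amounts to a short checklist: describe the samples and experimental variables, the raw reads, the final processed data, and the protocols. A toy sketch of what journal-side validation might look like; the field names below are my own shorthand, not an official MINSEQE schema:

        # Illustrative only: a paraphrase of the information MINSEQE asks
        # for, as a record a submission pipeline could validate.
        experiment = {
            "description": "ChIP-seq of factor X in K562 cells",
            "samples": [{"id": "K562_rep1", "organism": "Homo sapiens",
                         "variables": {"antibody": "anti-X", "replicate": 1}}],
            "raw_reads": ["run1.fastq.gz"],      # per-assay read data
            "processed_data": ["peaks.bed"],     # 'final' processed files
            "genome_assembly": "hg19",           # needed to interpret coords
            "protocols": "1% formaldehyde crosslink; Illumina GAII, 36 bp",
        }

        required = ["description", "samples", "raw_reads",
                    "processed_data", "genome_assembly", "protocols"]
        missing = [k for k in required if not experiment.get(k)]
        if missing:
            raise ValueError(f"submission incomplete, missing: {missing}")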



  • fpepin
    replied
    Originally posted by Nix View Post
    Why wait for the government to fix our problems when we can do it ourselves?
    Not everyone has the resources or expertise to set up (and maintain!) a DAS server or another type of repository. This will only be exacerbated as costs drop and every lab starts doing NGS.

    Having a central repository makes it a lot easier to make sure that the data is consistent and has enough details to be useful.



  • csoong
    replied
    We could definitely DIY the process, the limitation being the bandwidth demand on any lab that does so. Again, this could be solved by FedEx-ing hard disks, but who in a lab wants to take the time to do that? And who has the expertise to set such a thing up? If not properly set up, backup would be an issue, as would the general performance of the server, since it would be used solely for fetching data all the time.
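
    The bandwidth worry is easy to put numbers on. A back-of-envelope sketch, where the dataset size, link speed, and courier time are all illustrative assumptions:

        # When does FedEx-ing a disk beat the wire? Rough numbers only.
        dataset_tb = 2.0       # a batch of raw runs (assumption)
        link_mbps = 100        # saturated campus uplink (assumption)
        courier_hours = 24     # overnight shipping (assumption)

        # 1 TB = 8e6 megabits; divide by link speed, convert to hours.
        transfer_hours = dataset_tb * 8e6 / link_mbps / 3600
        print(f"network: {transfer_hours:.0f} h, courier: {courier_hours} h")
        # ~44 h over the wire vs. 24 h in a box, and that assumes the
        # lab's uplink is both fast and otherwise idle.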



  • Nix
    replied
    DAS/2 to the rescue?!

    Why aren't people hosting and publishing their own data? There's no need to centralize this activity.

    You are responsible for providing the plasmids, cell lines, etc. used in your papers; why not the genomic data too, in both raw and processed form?

    It's quite easy and useful to do this, provided everyone uses the same communication protocol to enable programmatic access, so no one has to manually download and reprocess the data before using it.

    DAS (Distributed Annotation System) is one such protocol designed to do exactly this; it has been in use for more than 10 years, with hundreds of servers worldwide. DAS/2 is a revision of the original DAS/1 protocol optimized for large-scale genomic data distribution using any file format (bam, bar, gff, bed, etc.).

    Check out http://www.biodas.org and http://www.biodas.org/wiki/DAS/2 and feel free to play around with our DAS/2 server http://bioserver.hci.utah.edu/BioInf.../Software:DAS2 or install your own http://bioserver.hci.utah.edu/BioInf...GenoPubInstall .

    We've written up some of these tools in a recent paper if folks want to take a look: http://www.biomedcentral.com/1471-2105/11/455

    Why wait for the government to fix our problems when we can do it ourselves?
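
    If you want a feel for the protocol before installing anything: a DAS/1 server is just HTTP plus small XML documents. Here's a hedged sketch that asks a server which data sources it hosts; the host is a placeholder, while the dsn command and the DASDSN reply are from the DAS 1.x spec.

        # Discover the data sources a DAS/1 server hosts.
        import urllib.request
        import xml.etree.ElementTree as ET

        server = "http://das.example.org/das"   # placeholder host
        with urllib.request.urlopen(f"{server}/dsn") as resp:
            tree = ET.parse(resp)

        for dsn in tree.iter("DSN"):
            source = dsn.find("SOURCE")          # id + human-readable name
            desc = dsn.findtext("DESCRIPTION", default="")
            print(source.get("id"), "-", desc)

    From there, feature queries against each source are one more GET, and any DAS-aware browser or analysis tool can use the data without manual downloads.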

