NGS data directory structures

ErikFas

Member

Join Date: Jun 2014

Posts: 86
#1


10-22-2015, 02:54 AM

After a recent influx of new data, I am beginning to doubt the way I have been storing my data (both raw and processed), so I figured I might as well ask here what people's directory structures look like. Maybe there is some kind of ideal structure that people tend to use? I've taught myself most of what I know about bioinformatics, and while I now feel quite fluent in Python, R and some basic bash scripting, I think I'm lacking quite a bit in the basics that are taught in computer science courses/programmes.

This is what the first few levels of my directory structure look like at the moment:

Code:

~/data/rna/fastq/
           bam
           counts
           fpkm
           de_analyses

... and each directory there has subdirectories as follows:

Code:

.../cell_type_1/<appropriate data files>
    cell_type_2
    cell_type_3

So, it's kind of cell type-oriented at the moment. I have some scripts (for differential expression, for example) that automatically find the files in the structure based on a command-line argument, such as "de_analysis.R --samples cell_type_1,cell_type_2" (which runs DESeq2 on cell_type_1 vs. cell_type_2).
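As a rough illustration of that kind of lookup (in Python rather than R, and not the actual de_analysis.R logic), here is a minimal sketch. The function names, the `pattern` parameter and the error handling are assumptions made for the example; only the `<data type>/<cell type>/` layout and the comma-separated `--samples` value come from the post.

```python
from pathlib import Path


def parse_samples(arg):
    """Split a comma-separated --samples value, e.g. 'cell_type_1,cell_type_2'."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def find_sample_files(root, data_type, samples, pattern="*"):
    """Map each sample name to the sorted files under root/data_type/sample/.

    `root`, `data_type` and `pattern` are hypothetical parameters: for the
    layout described above they could be '~/data/rna', 'counts' and a
    file-name glob of your choosing.
    """
    found = {}
    for sample in samples:
        sample_dir = Path(root).expanduser() / data_type / sample
        if not sample_dir.is_dir():
            # Fail early rather than silently running an analysis on one group.
            raise FileNotFoundError(f"no directory for sample: {sample_dir}")
        found[sample] = sorted(p for p in sample_dir.glob(pattern) if p.is_file())
    return found
```

Keeping the path logic in one function like this means that a later change of layout only has to be made in one place instead of in every analysis script.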

I started doubting this structure now that I have more data, some of which comes from new sequencing runs of the same cell types but under different culture conditions. I could just keep my current structure and name the new data something like "cell_type_1_b", but I don't want to do anything quite yet without hearing what other people tend to do.

I thought about making it more experiment-centric (i.e. using dates higher up in the hierarchy and appropriate subfolders below), but then I'd have to rewrite a lot of my scripts. If that would be generally better (especially for the future, with even more data), then I'd do it, but it'd be nice to hear other people's thoughts first =P
dpryan

Devon Ryan

Join Date: Jul 2011

Posts: 3478
#2

10-22-2015, 03:10 AM

For whatever it's worth, I generally group everything by project, since I've rarely needed to analyze samples across projects. So it's something like Project/[fastq|bam|counts|DE_analysis]. I've never used subdirectories for groups; I just have a sample table holding all of the group information within a project, which I can parse in R (or Python) when needed. My mapping scripts are kept separate from this, since mapping involves moving files to/from a cluster. For DE and similar analyses, the scripts often get tweaked for each project anyway, so each project just has its own (normally an Rmd file, in the case of R-based analyses).
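A minimal sketch of the sample-table idea in Python, under stated assumptions: the table is tab-separated with hypothetical `sample` and `group` columns, and count files are assumed to be named `<sample>.txt` inside `Project/counts`. None of these names come from the post; they only illustrate keeping group information in a table instead of in the directory tree.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample table; columns and values are invented for illustration.
SAMPLE_TABLE = (
    "sample\tgroup\n"
    "s1\tcell_type_1\n"
    "s2\tcell_type_1\n"
    "s3\tcell_type_2\n"
)


def read_groups(handle):
    """Return {group: [sample, ...]} from a tab-separated sample table."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["group"]].append(row["sample"])
    return dict(groups)


def count_files(project, samples):
    """Build flat per-sample paths; the '.txt' suffix is an assumption."""
    return [f"{project}/counts/{sample}.txt" for sample in samples]


# In practice the handle would come from open(...) on the project's table.
groups = read_groups(io.StringIO(SAMPLE_TABLE))
files_for_group = count_files("Project", groups["cell_type_1"])
```

With this arrangement, a new condition for an existing cell type becomes a new row (or an extra column) in the table rather than a new directory such as "cell_type_1_b".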