After a recent influx of new data, I am beginning to doubt the way I have been storing my data (both raw and processed), so I figured I might as well ask here what people's directory structures look like. Maybe there is some kind of ideal structure that people tend to use? I've taught myself most of what I know about bioinformatics, and while I now feel quite fluent in Python, R and some basic bash scripting, I think I'm lacking quite a bit in the basics that are taught in computer science courses/programmes.
This is what the first few levels of my directory structure look like at the moment:
... and each directory there has subdirectories as follows:
So, it's kind of cell type-oriented at the moment. I have some scripts (for differential expression, for example) that automatically find the files in this structure from the samples given as a command-line argument, such as "de_analysis.R --samples cell_type_1,cell_type_2" (which runs DESeq2 on cell_type_1 vs. cell_type_2).
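To make the lookup concrete, here is a minimal Python sketch of the idea. My real scripts are in R, so this is only an illustration: the function names `find_sample_files` and `parse_samples`, the `counts` default and the assumed layout `<root>/<data_type>/<cell_type>/` are all hypothetical, not the actual code.

```python
from pathlib import Path


def parse_samples(arg):
    """Split a '--samples' value such as 'cell_type_1,cell_type_2' into names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def find_sample_files(root, cell_types, data_type="counts"):
    """Map each cell type to the sorted files in <root>/<data_type>/<cell_type>/.

    The layout is an assumption based on the cell type-oriented structure
    described in this post.
    """
    found = {}
    for cell_type in cell_types:
        sample_dir = Path(root) / data_type / cell_type
        if not sample_dir.is_dir():
            raise FileNotFoundError(f"no directory for {cell_type}: {sample_dir}")
        # Only regular files count as data; nested directories are ignored.
        found[cell_type] = sorted(p for p in sample_dir.iterdir() if p.is_file())
    return found
```

The point is that the scripts only ever receive cell type names, so anything that changes where a cell type's files live (such as a second run of the same cell type) breaks the lookup.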
I started doubting this structure now that I've got more data, some of which comes from new sequencing runs of the same cell types, but under different culture conditions. I could just keep my current structure and name the new data something like "cell_type_1_b", but I don't want to do anything quite yet without hearing what other people tend to do.
I thought about making it more experiment-centric (i.e. putting dates high up in the hierarchy, with appropriate subfolders below), but then I'd have to rewrite a lot of my scripts. If that would be generally better (especially for the future, with even more data) then I'd do it, but it'd be nice to hear other people's thoughts first =P
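One thing I'm wondering is whether keeping the path logic in a single function would limit how much rewriting is needed. A rough Python sketch of what I mean, where the date-based layout `<root>/<run_date>/<data_type>/<cell_type>/`, the function name `sample_dir` and the `run_date` parameter are all hypothetical:

```python
from pathlib import Path


def sample_dir(root, data_type, cell_type, run_date=None):
    """Return the directory for one sample under either layout.

    run_date=None  -> current layout:       <root>/<data_type>/<cell_type>
    run_date given -> hypothetical layout:  <root>/<run_date>/<data_type>/<cell_type>
    """
    base = Path(root)
    if run_date is not None:
        # Experiment-centric variant: the run date sits high up in the hierarchy.
        base = base / run_date
    return base / data_type / cell_type
```

If every script asked a helper like this for its paths, switching layouts would (I assume) mean changing one function rather than each script.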
This is what the first few levels of my directory structure look like at the moment:
Code:
~/data/rna/
    fastq/
    bam/
    counts/
    fpkm/
    de_analyses/
Code:
.../cell_type_1/<appropriate data files>
.../cell_type_2/
.../cell_type_3/