Hi Folks,
I am interested in doing population genomic analysis using Stacks for a species with a reference genome. My plan is to use a lane of HiSeq 4000 to multiplex around 300 individuals.
Since a lane of HiSeq 4000 generates roughly 320-390M reads, a 300-plex works out to about 1.17M reads per individual (assuming 350M reads). I then assumed the following parameters: 1 SNP every 1,000 bp, 30X coverage, and a combined read length of 120 bp (paired-end, 60 bp per side). With those parameters, I expect ~4,666 SNPs.
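In case it helps to check my arithmetic, here is a minimal sketch of how I got that estimate. The function and parameter names are just my own labels, and it assumes each 2 x 60 bp pair counts as a single 120 bp read and that reads are spread evenly across individuals and loci, which will not hold exactly in practice.

```python
def expected_snps(total_reads, n_individuals, read_length_bp,
                  coverage, snp_spacing_bp):
    """Back-of-envelope SNP yield per individual for a multiplexed lane.

    Assumes an even split of reads across individuals and uniform
    coverage across loci (both are simplifications).
    """
    reads_per_individual = total_reads / n_individuals
    # Total bases sequenced for one individual.
    bases_per_individual = reads_per_individual * read_length_bp
    # Distinct genomic bases that can be covered at the target depth.
    covered_bp = bases_per_individual / coverage
    # One SNP expected every `snp_spacing_bp` covered bases.
    return reads_per_individual, covered_bp / snp_spacing_bp


if __name__ == "__main__":
    # Low, middle, and high lane yields from the 320-390M range.
    for lane_reads in (320e6, 350e6, 390e6):
        reads, snps = expected_snps(lane_reads, 300, 120, 30, 1000)
        print(f"{lane_reads / 1e6:.0f}M reads/lane: "
              f"{reads / 1e6:.2f}M reads/individual, ~{snps:.0f} SNPs")
```

The 350M case reproduces the ~4,666 figure; the other two lines show how much the estimate moves across the quoted lane yield range.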
I used the R package SimRAD to select the restriction enzyme combination that gets me closest to that number of SNPs (I am very lucky that my organism has a sequenced genome). EcoRI and HindIII should give me ~4K SNPs with a size selection range of 150-300 bp.
Generating around 100-120 GB of data raises a serious concern for me: how much computing power do I need for a dataset of this size?
In the protocol "Deriving genotypes from RAD-seq short-read data using Stacks", the Hardware and software section says: "Access to a computing cluster running under Linux, preferably with at least 8–16 cores and 64-Gb of memory...". This is why I am concerned that I may not have enough resources to run the analysis.
I would appreciate any suggestions or tips you may have for my project. It seems that RAD-seq projects depend heavily on computing resources, so I am looking for any advice that would maximize the probability of success and help me avoid missing important aspects that I am not aware of.
Thank you very much!