Hi friends,
I'm trying to align 50 bp paired-end Illumina reads to the mm10 genome/transcriptome with tophat 2.0.8. I've done a few runs on our local desktop Mac Pros to get an idea of how the software works, and now I'm starting to migrate this onto our local high-performance computing cluster in the hope of running the alignments faster (or at least in parallel rather than sequentially), as these are large data sets. Does anyone have advice on how many cores and how much memory I should request per data set so that the runs finish quickly without hogging the system, given that not every step in the tophat pipeline is multi-threaded?
From my pilot alignments and from reading these forums, I understand that the segment_juncs step is single-threaded and time-consuming. Will this step run more quickly if more memory is available to it? In other words, is it "fair" to request more cores from the system just to have their memory? Does the speed of this step scale at all? In my pilots, the run time has been quite variable, and I haven't been able to correlate it with anything obvious.
Empirically, I've also seen that setting the -p value to less than the actual number of cores available is necessary to avoid problems when the tophat script tries to invoke samtools or other processes during the alignment and output steps, but it isn't clear to me what a good value would be. Is there a good rule of thumb here, like -p = "number of cores available" minus "some particular constant that I don't know"?
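To make that last question concrete, here is a minimal sketch of the kind of rule I have in mind. The `reserved` value of 1 is only a placeholder for the constant I don't know, not a recommendation, and the index name, read files, and output directory are hypothetical stand-ins for my real paths.

```python
def tophat_threads(cores_available, reserved=1):
    """Return a -p value that leaves `reserved` cores free, never going below 1.

    `reserved` is the unknown constant from my question; 1 is a placeholder.
    """
    return max(1, cores_available - reserved)


def tophat_command(cores_available, reserved=1):
    """Build a tophat command line as a list of arguments.

    The index name, read files, and output directory are hypothetical.
    """
    p = tophat_threads(cores_available, reserved)
    return [
        "tophat",
        "-p", str(p),           # threads for the multi-threaded steps
        "-o", "tophat_out",     # output directory (hypothetical)
        "mm10_index",           # bowtie index basename (hypothetical)
        "reads_1.fastq",        # first mates (hypothetical)
        "reads_2.fastq",        # second mates (hypothetical)
    ]


if __name__ == "__main__":
    # Example: a job that has been granted 8 cores.
    print(" ".join(tophat_command(8)))
```

So on an 8-core allocation this would pass -p 7, and what I'm really asking is whether 1 is the right number to hold back or whether it should be larger.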
Thanks in advance for any advice you can provide. The computational side of this is pretty intimidating to a bench biologist, and I've tried to RTFM as best I can, I swear I have!