We're still waiting for the reads, but we were planning on using the Trinity package from the Broad: http://trinityrnaseq.sourceforge.net/
Basically, we figure if it's good enough for the Broad, it's good enough for us. But we're new to this and trying to assemble a vertebrate transcriptome, so I'm certainly open to suggestions. Can anyone compare runtimes, processing requirements, and the like for ABySS and other programs? The Broad suggests 2GB of memory per million reads, for example. We expect roughly 100M paired-end 100bp reads. Do we really need 200GB of memory? We have access to a cluster that would make that possible, but it sounds like ABySS may run on less; the breast cancer paper says they used 20 nodes with 2GB each for 194M reads of 36bp.
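For what it's worth, here's the back-of-envelope arithmetic I'm doing. The 2GB-per-million figure is just the Broad's stated rule of thumb, not something I've verified, and real usage will depend on read length, error rate, and k-mer complexity:

```python
# Rough memory estimates for de novo transcriptome assembly.
# The 2 GB per million reads figure is Trinity's suggested rule of
# thumb; treat it as an upper-bound planning number, not a measurement.

def trinity_memory_gb(reads_millions, gb_per_million=2.0):
    """Estimated peak memory (GB) for a single-machine Trinity run."""
    return reads_millions * gb_per_million

# Our dataset: ~100M paired-end 100bp reads
print(trinity_memory_gb(100))  # -> 200.0 GB on one machine

# The ABySS breast cancer paper: 194M x 36bp reads,
# distributed across 20 nodes with 2 GB each
abyss_total_gb = 20 * 2
print(abyss_total_gb)  # -> 40 GB total, spread over the cluster
```

The contrast is striking: ABySS's distributed design apparently got by on a fifth of the total memory, albeit on much shorter reads.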
Also, for assessing quality, I'd guess the best way would be to simply compare against a related but more fully annotated species. I don't expect that to be easy, however; it would require a large batch-BLAST-type analysis while accounting for sequence divergence and gene duplication/deletion issues. Beyond that, I just don't know how telling these k-mer analyses really are. So you got X contigs bigger than 100bp, or a maximum of 10kb; who cares, exactly? Especially when you look through RNA-seq data aligned to a reference genome and see all kinds of regions lighting up outside annotated genes, even in a well-annotated species like mouse. How much of that is just genomic contamination, or some kind of "phantom" or random transcription of regions that do nothing? Basically, I just want to know how well you covered the ~20K genes in a vertebrate genome. After you show me that, I can start caring about micro-RNAs, or your k-mers.
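To make "how well did you cover the ~20K genes" concrete, this is the kind of tally I have in mind, assuming you've BLASTed the assembled contigs against the annotated proteome of a related species with tabular output (`-outfmt 6`). The file name and e-value cutoff here are just placeholders, not recommendations:

```python
def count_covered_genes(blast_tab_path, evalue_cutoff=1e-10):
    """Count distinct reference genes hit by at least one contig.

    Expects NCBI BLAST tabular output (-outfmt 6), where column 2
    (index 1) is the subject ID and column 11 (index 10) the e-value.
    """
    hit_genes = set()
    with open(blast_tab_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if float(fields[10]) <= evalue_cutoff:
                hit_genes.add(fields[1])
    return len(hit_genes)

# e.g. covered = count_covered_genes("contigs_vs_mouse.tsv")
# print(covered, "of ~20000 reference genes hit")
```

This obviously ignores the divergence and duplication problems mentioned above (a paralog can satisfy the cutoff without the true ortholog being assembled), but as a first-pass coverage number it seems more meaningful to me than contig-length histograms.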