Hi all.
I'm trying to do a de novo transcriptome assembly using ABI SOLiD data. I'm trying to use Velvet/Oases at the moment, and I've found that PCR duplicates seem to be a serious problem during the postprocessing step when the double-encoded contigs are converted back into colour-space reads prior to the final assembly. This step takes at least 72 hours, which is an order of magnitude greater than the time required by the Velvet/Oases assemblers themselves. The postprocessing output file just keeps swelling in size because there are so many PCR duplicates.
So the question is: is there an efficient program out there I can use to remove duplicate reads from my .csfasta (and preferably the corresponding _QV.qual) file prior to assembly? I know there's an option to do this filtering on the SOLiD machine itself, but the person who did the sequencing didn't enable it.
Thanks.
I'm trying to do a de novo transcriptome assembly using ABI SOLiD data. I'm trying to use Velvet/Oases at the moment, and I've found that PCR duplicates seem to be a serious problem during the postprocessing step when the double-encoded contigs are converted back into colour-space reads prior to the final assembly. This step takes at least 72 hours, which is an order of magnitude greater than the time required by the Velvet/Oases assemblers themselves. The postprocessing output file just keeps swelling in size because there are so many PCR duplicates.
So the question is: is there an efficient program out there I can use to remove duplicate reads from my .csfasta (and preferably the corresponding _QV.qual) file prior to assembly? I know there's an option to do this filtering on the SOLiD machine itself, but the person who did the sequencing didn't enable it.
Thanks.
Comment