If you want to deduplicate based on mapping, I recommend converting to bam, sorting, and using some other program (Picard, for example). Alternatively, you can use dedupebymapping.sh, which DOES accept sam files. It's much faster than Picard, but it does not take full advantage of sorting or write temp files, so it uses a lot of memory (though no more than Dedupe).
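For example, the Picard route would look something like this (filenames are placeholders, and you may need to adjust paths and memory for your setup):

samtools sort -o mapped_sorted.bam mapped.sam
java -jar picard.jar MarkDuplicates I=mapped_sorted.bam O=deduped.bam M=dup_metrics.txt REMOVE_DUPLICATES=true

The dedupebymapping.sh route would be roughly this (again, filenames are placeholders; -Xmx sets the Java heap, which you should size to your machine):

dedupebymapping.sh in=mapped.sam out=deduped.sam -Xmx20g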
Dedupe is for deduplicating by sequence. So whether you deduplicate the fastq and then map, or map and then deduplicate the sam, depends on your goal - the results will differ slightly. In practice it won't usually matter much for paired reads, but single-end reads are affected much more by the deduplication methodology.
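If you do want to deduplicate by sequence before mapping, a minimal Dedupe command would look something like this (filenames are placeholders; check the script's help output for the exact flags in your version):

dedupe.sh in=reads.fq out=deduped.fq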
What kind of data are you using, and what's your experiment?