It's been a while since I got my hands on shotgun Illumina metagenomic data. I've found it's quite important to dereplicate before doing any downstream analysis, to avoid problems with assembly and inaccurate quantification. Last time around I used usearch -derep_fulllength on a subset of the data to filter out artificial replicate reads, but it's choking on the larger datasets I have now.

My approach was to identify a high-quality subsection of R1, dereplicate that, and then filter the matching reads out of the raw data. The rationale: a single cycle can have high error, and error is always higher toward the end of the read, so some true replicates would be missed if the whole read were compared.
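For clarity, here's roughly what I mean by prefix-based dereplication, as a minimal Python sketch (not what usearch actually does internally; the prefix length of 50 and the 4-line-FASTQ assumption are just placeholders for illustration):

```python
def read_fastq(path):
    """Yield (header, seq, qual) tuples from a plain 4-line FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()  # '+' separator line, ignored
            qual = fh.readline().rstrip()
            yield header, seq, qual

def derep_by_prefix(records, prefix_len=50):
    """Keep the first read seen for each distinct R1 prefix.

    Later reads whose first prefix_len bases match an earlier read
    are dropped as presumed artificial replicates. Comparing only a
    high-quality prefix avoids missing replicates due to errors at
    the noisy 3' end of the read.
    """
    seen = set()
    kept = []
    for header, seq, qual in records:
        key = seq[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append((header, seq, qual))
    return kept
```

In practice you'd keep the set of retained headers and use it to pull the full-length mates out of the raw R1/R2 files, but a set of 20-30 million prefixes is exactly where the memory starts to hurt, hence the question.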
Can anyone recommend a good current tool for dereplicating Illumina reads? My datasets are about 20-30 million reads each. I came across Fulcrum with google search--any experiences with that? (paper)