Hi,
I am trying to remove "redundant" sequences from a Trinity assembly. I used dedupe.sh to remove duplicates from the Illumina source files and got a very good result.
After assembling with Trinity I get 85497 output sequences. If I cluster these with CD-HIT at 95% identity I get 69413 clusters (mostly with >99% identity).
How can I extract a single sequence from each cluster (the longest, I assume)? I'm not sure how to go from the CD-HIT .clstr file to pulling the longest sequence of each cluster out of my assembled FASTA file.
I tried running dedupe.sh on the assembled file, but it only removed 2 sequences (which I assume were identical). What flag would I set to remove duplicates at the 99% identity level?
Thank you in advance.
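
To make the .clstr-to-FASTA step concrete, here is a minimal Python sketch of what I have in mind. It assumes the usual .clstr layout (a ">Cluster N" line followed by member lines such as "0	1523nt, >ID... *") and that the IDs in the .clstr file are not truncated (I believe this means running CD-HIT with -d 0, otherwise the names may be cut short). The function names are hypothetical, just for illustration. Note also that the FASTA file CD-HIT writes to its -o path should already hold one representative per cluster, so this may only be needed to double-check that output.

```python
import re

# Matches a member line of a .clstr file, for example:
#   "0\t1523nt, >TRINITY_DN1_c0_g1_i1... *"
# Group 1 is the sequence length, group 2 is the sequence ID.
MEMBER = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.")


def longest_per_cluster(clstr_lines):
    """Return the ID of the longest member of each cluster, in file order."""
    best = []  # one (length, seq_id) pair per cluster
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            best.append((0, None))
            continue
        m = MEMBER.match(line)
        if m and best:
            length, seq_id = int(m.group(1)), m.group(2)
            # Strictly greater, so the first member wins on ties.
            if length > best[-1][0]:
                best[-1] = (length, seq_id)
    return [seq_id for _, seq_id in best if seq_id is not None]


def filter_fasta(fasta_lines, keep_ids):
    """Yield the lines of FASTA records whose ID (first word of the header) is kept."""
    keep = set(keep_ids)
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            writing = bool(header) and header[0] in keep
        if writing:
            yield line


def write_representatives(clstr_path, fasta_path, out_path):
    """Write the longest sequence of every cluster to out_path (paths are placeholders)."""
    with open(clstr_path) as fh:
        ids = longest_per_cluster(fh)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, ids))
```

Calling write_representatives with the .clstr file, the Trinity FASTA, and an output path would then give one sequence per cluster. Matching is done on the first word of each FASTA header, so the headers need to start with the same ID that appears in the .clstr file.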