Greetings,
I have a large list of fasta sequences (not paired end) from which I want to isolate the longest version of each "sequence" (obviously the longest one has a different sequence). For example:
input.fa:
>Seq_1
AAACCCGGGTTT
Seq_2
CCCGGG
output_keep.fa:
>Seq_1
AAACCCGGGTTT
output_ditch.fa:
>Seq_2
CCCGGG
I tried a number of approaches using a mock dataset containing mismatches, exact copies, and known truncations from either end of the long form - "test_dups.fa":
(1) dedupe.sh in SAMtools
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa
this didn't work: its output didn't make sense, and an exact duplicate pair was removed completely (didn't leave behind one of them, which is bad news).
(2) with the ac=f option
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa ac=f
that only removed the exact copy pair, again not leaving one copy in file.
(3) clustering by overlap
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa ac=f mo=50 c pc csf=stats.txt outbest=test_best.fa mcs=1
didn't work (retained were not longest examples).
Then I moved on to VSEARCH, to sort by size, then form clusters and output the "seed":
(4) vsearch clustering
--cluster_fast dups.fa --centroids test_unique_1.0.fa --id 1.0
oddly, this kept a truncation of the longest sequence, but was close...
(5) using the iddef = 0, to ignore end gaps
vsearch --cluster_fast dups.fa --centroids test_unique_1.0.fa --id 1.0 --iddef 0
this did not ignore the end gaps penalties as expected.
(6) Align two seqs and see what the perceived idenity is
vsearch --allpairs_global dups_9_10.fa --acceptall --alnout test_aln.fa --iddef 0
the resulting alignment shows 100% identity, as expected with the iddef 0 option.
Why are those sequences not in the same cluster then?
Searching for help on "remove duplicates" is a disaster here, so I hope I can get some help. One consideration is to merge the sequences with 100% identity for the "overlap", but those tools merge from one end of the read, as is used to combine paired end reads. The other consideration is spending a day writing a script. I reasoned I can not be the first person to do this.
I have a large list of fasta sequences (not paired end) from which I want to isolate the longest version of each "sequence" (obviously the longest one has a different sequence). For example:
input.fa:
>Seq_1
AAACCCGGGTTT
Seq_2
CCCGGG
output_keep.fa:
>Seq_1
AAACCCGGGTTT
output_ditch.fa:
>Seq_2
CCCGGG
I tried a number of approaches using a mock dataset containing mismatches, exact copies, and known truncations from either end of the long form - "test_dups.fa":
(1) dedupe.sh in SAMtools
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa
this didn't work: its output didn't make sense, and an exact duplicate pair was removed completely (didn't leave behind one of them, which is bad news).
(2) with the ac=f option
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa ac=f
that only removed the exact copy pair, again not leaving one copy in file.
(3) clustering by overlap
dedupe.sh test_dups.fa test_dups_dedupe.fa outd=test_dups_removed.fa ac=f mo=50 c pc csf=stats.txt outbest=test_best.fa mcs=1
didn't work (retained were not longest examples).
Then I moved on to VSEARCH, to sort by size, then form clusters and output the "seed":
(4) vsearch clustering
--cluster_fast dups.fa --centroids test_unique_1.0.fa --id 1.0
oddly, this kept a truncation of the longest sequence, but was close...
(5) using the iddef = 0, to ignore end gaps
vsearch --cluster_fast dups.fa --centroids test_unique_1.0.fa --id 1.0 --iddef 0
this did not ignore the end gaps penalties as expected.
(6) Align two seqs and see what the perceived idenity is
vsearch --allpairs_global dups_9_10.fa --acceptall --alnout test_aln.fa --iddef 0
the resulting alignment shows 100% identity, as expected with the iddef 0 option.
Why are those sequences not in the same cluster then?
Searching for help on "remove duplicates" is a disaster here, so I hope I can get some help. One consideration is to merge the sequences with 100% identity for the "overlap", but those tools merge from one end of the read, as is used to combine paired end reads. The other consideration is spending a day writing a script. I reasoned I can not be the first person to do this.
Comment