Hi all,
Recently I've been looking at a draft genome assembly that a collaborator sent me. The first thing I noticed is that the assembly is very large: 3-4X the estimated genome size.
I filtered out some contaminant scaffolds, but this doesn't account for much of the size difference.
I then ran an all-vs-all BLAST and found that there are thousands of smaller scaffolds (6-14 kb) that match larger scaffolds at ~100% identity over >95% of their length. For near-perfect matches (99-100% identity) it seems reasonable to assume these are 'duplicate scaffolds' that I can set aside.
I have about 7 Mb of such scaffolds at 100% identity.
If I lower the stringency to 99% identity or better, I have ~30 Mb.
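To make the filter concrete, here is a rough sketch of the kind of thing I mean, not a tested pipeline. It assumes the all-vs-all hits are in standard 12-column BLAST tabular format (-outfmt 6) and that scaffold lengths are already in a dict. The names flag_duplicates and merged_length and the default cutoffs are just placeholders for illustration. Note that it applies the identity cutoff per HSP, which is a simplification.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by (start, end) intervals (1-based, inclusive)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous run and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def flag_duplicates(blast_lines, lengths, min_identity=99.0, min_coverage=0.95):
    """Return {scaffold: (larger_scaffold, coverage)} for candidate duplicates.

    blast_lines: iterable of tab-separated -outfmt 6 lines.
    lengths:     dict of scaffold name -> length in bp (assumed precomputed).
    """
    hsps = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        if query == subject or pident < min_identity:
            continue
        # Only flag the shorter scaffold of a pair (name breaks length ties).
        if (lengths[query], query) > (lengths[subject], subject):
            continue
        hsps[(query, subject)].append((min(qstart, qend), max(qstart, qend)))

    flagged = {}
    for (query, subject), intervals in hsps.items():
        coverage = merged_length(intervals) / lengths[query]
        best_so_far = flagged.get(query, (None, 0.0))[1]
        if coverage >= min_coverage and coverage > best_so_far:
            flagged[query] = (subject, coverage)
    return flagged
```

Running this over a range of min_identity and min_coverage values and summing the lengths of the flagged scaffolds would give the same kind of totals as the 7 Mb and ~30 Mb figures above, which is how I would look at how sensitive the result is to the cutoff.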
I am wondering if anyone has insight into a good way to determine a cutoff for percent identity and percent length for these 'duplicate' scaffolds.
I don't want to throw out recent gene duplications, but at the same time I don't want to confound gene family expansion analysis with repeat scaffolds and alleles.
Anyone have any insight?
cheers,
t