**Brian Bushnell** · 03-13-2014, 04:22 PM

I have a neat tool for reducing redundancy in nucleotide space, which is often present in metagenome assemblies, depending on your methodology.

dedupe.sh -Xmx31g in=assembly.fa out=unique.fa

By default, if you run it like that, it will remove all but one copy of each duplicate scaffold, and remove any scaffolds that are substrings of other scaffolds. You can additionally specify other parameters, such as a minimum percent overlap (so that a transcript doesn't absorb another fully-contained transcript, for example) and a variable edit distance or percent identity.
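To make the default behaviour concrete, here is a toy Python sketch of exact-duplicate and containment removal. It is not how Dedupe is implemented: it is a naive quadratic scan, it ignores reverse-complement matches, and the function name `remove_contained` is made up for illustration.

```python
def remove_contained(seqs):
    """Keep one copy of each duplicate and drop sequences contained in a longer one."""
    # set() collapses exact duplicates; sorting longest-first means any
    # container is examined before the sequences it might contain.
    ordered = sorted(set(seqs), key=lambda s: (-len(s), s))
    kept = []
    for s in ordered:
        # Drop s if it is a substring of something already kept.
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


scaffolds = ["ACGTACGTTT", "ACGTACGTTT", "GTACG", "TTTTGGGG"]
print(remove_contained(scaffolds))  # ['ACGTACGTTT', 'TTTTGGGG']
```

The second copy of the first scaffold and the fully contained `GTACG` are dropped; the unrelated scaffold survives.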

For example:
dedupe.sh -Xmx31g in=assembly.fa out=unique.fa maxedits=20 minidentity=98 minoverlappercent=80

...will remove anything that is a substring of another string with identity of at least 98% (up to a maximum of 20 edits, which sets the bandwidth of the banded alignment), as long as the shorter one is at least 80% of the length of the longer one.
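The three thresholds combine roughly as in the Python sketch below. This is only a reading of the description above, not Dedupe's code: `would_absorb` is a hypothetical helper, and computing identity as 1 minus edits divided by the shorter length is an assumption made for the example.

```python
def would_absorb(short_len, long_len, edits,
                 max_edits=20, min_identity=98.0, min_overlap_percent=80.0):
    """Return True if the shorter sequence would be removed under these thresholds."""
    if short_len > long_len:
        short_len, long_len = long_len, short_len
    # Assumed definition: percent identity over the shorter sequence.
    identity = 100.0 * (1 - edits / short_len)
    # Length of the shorter sequence as a percentage of the longer one.
    length_percent = 100.0 * short_len / long_len
    return (edits <= max_edits
            and identity >= min_identity
            and length_percent >= min_overlap_percent)


print(would_absorb(900, 1000, 10))   # passes all three checks
print(would_absorb(700, 1000, 0))    # only 70% of the longer length
print(would_absorb(2000, 2100, 25))  # identity is fine, but more than 20 edits
```

Note that for long sequences the edit cap, not the identity, is usually the limiting factor: 98% identity over 2000 bases would allow 40 edits, but maxedits=20 stops at 20.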

It's incredibly fast with no edits allowed, and still very fast with edits allowed. The -Xmx flag, by the way, should be set to around 85% of your system's physical memory. The whole assembly is stored in memory, at about 1 byte per base plus a few hundred bytes per scaffold.
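As a rough sanity check before running, the memory rule of thumb can be turned into a quick estimate. This Python sketch assumes 300 bytes per scaffold as a stand-in for "a few hundred"; both function names are invented for the example, and real usage will vary.

```python
def estimate_dedupe_memory_bytes(total_bases, num_scaffolds, bytes_per_scaffold=300):
    """About 1 byte per base plus a per-scaffold overhead (300 is an assumed value)."""
    return total_bases + num_scaffolds * bytes_per_scaffold


def suggested_xmx_gb(physical_gb, fraction=0.85):
    """Roughly 85% of physical memory, rounded down to whole gigabytes."""
    return int(physical_gb * fraction)


# A 2 Gbp assembly in 1 million scaffolds:
need = estimate_dedupe_memory_bytes(2_000_000_000, 1_000_000)
print(need)                  # 2300000000 bytes, about 2.3 GB
print(suggested_xmx_gb(64))  # 54, i.e. -Xmx54g on a 64 GB machine
```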

Available here:

BBMap

http://sourceforge.net/projects/bbmap/

Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).

**dongilbert** · 03-15-2014, 09:37 AM

Consider using EvidentialGene at http://arthropods.eugenes.org/EvidentialGene/

The EvidentialGene_trassembly pipeline software does a good job of picking the best non-redundant mRNA loci and alternates from a large collection of partly redundant mRNA assemblies. It uses coding sequence metrics, and so avoids the problem of selecting errors by picking the longest transcripts. It works well with large collections (10 million) of transcripts produced by several assemblers on the same data, and gives the best results that way, since each assembler recovers only some of the complete genes.

Using CD-HIT-EST, as you are considering, is not the right approach, as that will select for errors (gene joins and misassemblies) and lose some of your valuable best ortholog genes.

**Apexy** · 03-16-2014, 11:06 PM

Hi,
To add to what dongilbert suggested, I would say that care has to be taken when performing post-assembly clustering or re-assembly (across multiple k-mers or assemblies from different assemblers). Haznedaroglu et al. BMC Bioinformatics 2012, 13:170 suggested an optimization procedure for such post-assembly processing, and as you have noted, there is a risk of losing potentially unique transfrags. Thus there is generally no consensus on what identity threshold to use, as this will also vary with the transfrag diversity in your assembly.

redundancy in de novo transcriptome assembly
