  • redundancy in de novo transcriptome assembly

    Hi everyone,

    I'm currently looking at ways to reduce redundancy in de novo transcriptomes of some closely related species with the goal of searching for orthologs for phylogenetics afterwards.

    I'm interested in using CD-HIT-EST, but I'm unsure what similarity threshold is best to use. 95% and 90% aren't that different in terms of how many clusters are formed. Is 90% too aggressive a threshold, potentially collapsing and losing important genes? Since I'm looking for useful orthologs downstream, reducing redundancy as much as possible is important, but I don't want to sacrifice unique genes...

  • #2
    I have a neat tool for reducing redundancy in nucleotide space, which is often present in metagenome assemblies, depending on your methodology.

    dedupe.sh -Xmx31g in=assembly.fa out=unique.fa

    By default, if you run it like that, it will remove all but one copy of any duplicate scaffold, and remove any scaffolds that are substrings of other scaffolds. But you can additionally specify other parameters, such as a minimum percent overlap (so that a transcript doesn't absorb another, much shorter, fully-contained transcript, for example) and a variable edit distance or percent identity.
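To make the default behaviour concrete, here is a minimal Python sketch of the idea described above: keep one copy of each duplicate and drop anything that is an exact substring of a longer sequence. It is a toy illustration only, not how Dedupe is implemented; the function name `remove_contained` is made up for this example, it is quadratic in the number of sequences, and it ignores reverse complements.

```python
def remove_contained(scaffolds):
    """Keep one copy of each duplicate and drop sequences that are exact
    substrings of a longer kept sequence (toy, quadratic-time version)."""
    # Deduplicate, then sort longest first so that a containing sequence
    # is always examined before anything it could contain.
    ordered = sorted(set(scaffolds), key=lambda s: (-len(s), s))
    kept = []
    for seq in ordered:
        # Python's `in` on strings is an exact substring test.
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    example = ["ACGTACGT", "ACGTACGT", "GTAC", "TTTT"]
    print(remove_contained(example))
```

In the example, the second `ACGTACGT` is a duplicate and `GTAC` is contained in `ACGTACGT`, so only `ACGTACGT` and `TTTT` survive.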

    For example:
    dedupe.sh -Xmx31g in=assembly.fa out=unique.fa maxedits=20 minidentity=98 minoverlappercent=80

    ...will remove anything that is a substring of another string with identity of at least 98% (up to a maximum of 20 edits, which sets the bandwidth of the banded alignment), as long as the shorter one is at least 80% of the length of the longer one.
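The three thresholds in that command combine roughly as in the following Python sketch. This is only an illustration of the rule as described in this post: the function `would_absorb` is hypothetical, and the identity formula (matching bases over the length of the shorter sequence) is an assumption, since the exact calculation used by Dedupe is not given here.

```python
def would_absorb(longer_len, shorter_len, edits,
                 max_edits=20, min_identity=98.0, min_overlap_percent=80.0):
    """Return True if the shorter sequence would be removed as redundant.

    Assumption: identity is computed over the shorter sequence as
    100 * (shorter_len - edits) / shorter_len.
    """
    if shorter_len <= 0 or shorter_len > longer_len:
        raise ValueError("expected 0 < shorter_len <= longer_len")
    identity = 100.0 * (shorter_len - edits) / shorter_len
    length_percent = 100.0 * shorter_len / longer_len
    return (edits <= max_edits
            and identity >= min_identity
            and length_percent >= min_overlap_percent)


if __name__ == "__main__":
    print(would_absorb(1000, 900, 10))   # within all three limits
    print(would_absorb(1000, 900, 20))   # identity falls below 98%
    print(would_absorb(1000, 700, 0))    # shorter is only 70% of the longer
    print(would_absorb(3000, 2900, 25))  # more than 20 edits
```

Only the first case passes all three checks. The last case shows why both limits matter: 25 edits in 2900 bases is still above 98% identity, but it exceeds the edit cap.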

    It's incredibly fast with no edits allowed, and still very fast with edits allowed. The -Xmx flag, by the way, should be set to around 85% of your system's physical memory. The whole assembly is stored in memory, at about 1 byte per base plus a few hundred bytes per scaffold.
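As a rough back-of-the-envelope check on those memory figures, a sketch like the following can help decide whether an assembly will fit. The helper names are made up, and the 300 bytes per scaffold is an assumed stand-in for "a few hundred bytes".

```python
def estimate_dedupe_memory_bytes(total_bases, num_scaffolds,
                                 bytes_per_scaffold=300):
    """About 1 byte per base plus a per-scaffold overhead.

    bytes_per_scaffold=300 is an assumed value for 'a few hundred bytes'.
    """
    return total_bases + num_scaffolds * bytes_per_scaffold


def suggested_xmx_gb(physical_memory_gb, fraction=0.85):
    """Whole gigabytes to pass to -Xmx, at about 85% of physical memory."""
    return int(physical_memory_gb * fraction)


if __name__ == "__main__":
    needed = estimate_dedupe_memory_bytes(200_000_000, 300_000)
    print(f"estimated assembly footprint: {needed / 1e9:.2f} GB")
    print(f"-Xmx{suggested_xmx_gb(64)}g")
```

With these assumed numbers, a 200 Mbp assembly in 300,000 scaffolds needs on the order of 0.3 GB for the sequences themselves, and a 64 GB machine would get `-Xmx54g`.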

    Available here:
    BBMap: this package includes BBMap, a short read aligner, as well as various other bioinformatics tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).
    Last edited by Brian Bushnell; 03-13-2014, 04:40 PM.


    • #3
      Consider using EvidentialGene at http://arthropods.eugenes.org/EvidentialGene/

      The EvidentialGene_trassembly pipeline software does a good job of picking the best non-redundant mRNA loci and alternates from a large collection of partly redundant mRNA assemblies. It uses coding sequence metrics, and so avoids the problem of selecting errors by simply picking the longest transcripts. It works well with large collections (around 10 million) of transcripts produced by several assemblers on the same data, and gives the best results that way, since each assembler recovers only some of the complete genes.

      Using CD-HIT-EST as you are considering is not the right approach, as it will select for errors (gene joins and misassemblies) and lose some of your most valuable ortholog genes.


      • #4
        Hi,
        To add to what dongilbert suggested, I would say that care has to be taken when performing post-assembly clustering or re-assembly (across multiple k-mers or assemblies from different assemblers). Haznedaroglu et al. (BMC Bioinformatics 2012, 13:170) suggested an optimization procedure for such post-assembly processing and, as you have noted, there is a risk of losing potentially unique transfrags. There is therefore no general consensus on which identity threshold to use, as this will also vary with the transfrag diversity in your assembly.
