  • DrYak
    Member
    • Sep 2013
    • 13

    Dedupe on assembled RNA-Seq?

    Hi,

    I am trying to remove "redundant" sequences from a Trinity assembly. I used dedupe.sh to remove duplicates from the Illumina source files and got a very good result.

    After assembling with Trinity I get 85497 output sequences. If I cluster these with cd-hit at 95% identity I get 69413 clusters (mostly with >99% identity).

    How can I extract a single sequence from each cluster (the longest I assume)? I'm not sure how to go from the cd-hit clstr file to getting the largest sequence of each cluster out of my assembled fasta file...

    I tried to use dedupe on the assembled file but it only removed 2 sequences (which I assume were identical). What flag would I set to remove duplicates at the 99% identity level?

    Thank you in advance.
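For the question above about going from the .clstr file to one sequence per cluster, here is a minimal sketch rather than a definitive tool. It assumes the usual cd-hit .clstr layout (a ">Cluster N" line followed by member lines such as "0	2799nt, >name... *") and that the IDs in the .clstr file match the first word of each FASTA header, which should hold when cd-hit is run with -d 0. The function names are made up for illustration. Note also that, as far as I know, the FASTA file written by -o already holds one representative per cluster (the member marked with *), so this is mainly useful if you want to apply your own selection rule.

```python
import re

# Member line, e.g. "0\t2799nt, >comp0_c0_seq1... *" (assumed .clstr layout)
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.")


def longest_per_cluster(clstr_lines):
    """Return the ID of the longest member of each cluster, in cluster order."""
    best_ids = []
    current = None  # (length, seq_id) of the best member seen in this cluster
    for line in clstr_lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            if current is not None:
                best_ids.append(current[1])
            current = None
            continue
        match = MEMBER_RE.match(line)
        if match is None:
            continue  # skip anything that does not look like a member line
        length, seq_id = int(match.group(1)), match.group(2)
        # Strict ">" keeps the first member on ties.
        if current is None or length > current[0]:
            current = (length, seq_id)
    if current is not None:
        best_ids.append(current[1])
    return best_ids


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines (header without '>')."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract(clstr_lines, fasta_lines):
    """Yield FASTA records whose ID is the longest member of some cluster."""
    keep = set(longest_per_cluster(clstr_lines))
    for header, seq in read_fasta(fasta_lines):
        seq_id = (header.split() or [""])[0]
        if seq_id in keep:
            yield header, seq


# Toy data standing in for a real .clstr file and the assembled FASTA.
toy_clstr = [
    ">Cluster 0",
    "0\t300nt, >seqA... *",
    "1\t250nt, >seqB... at +/99.00%",
    ">Cluster 1",
    "0\t120nt, >seqC... *",
]
toy_fasta = [">seqA len=300", "ACGT", ">seqB", "GG", ">seqC", "TTT"]

for header, seq in extract(toy_clstr, toy_fasta):
    print(f">{header}\n{seq}")
```

With real data you would pass open file handles instead of the toy lists, for example the .clstr file that cd-hit writes next to its output and the original assembled FASTA.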
  • DrYak
    Member
    • Sep 2013
    • 13

    #2
    Hi,

    Well, I found (to my chagrin) that cd-hit has an aux tools package containing the cd-hit-dup tool.

    I do not, however, get the same results using cd-hit-est and cd-hit-dup.

    If I use cd-hit-est with the following parameters:

    cd-hit-est -i in.fasta -o out -c 0.95 -n 10 -d 0 -T 20

    I get 85497 finished 69413 clusters

    i.e. 69413 clusters from 85497 starting sequences.

    If I use cd-hit-dup with the following parameters:

    cd-hit-dup -i in.fasta -o out-nodupes.fasta -m false -e 0.05 -f true

    which, as far as I know, should use the same similarity cut-off (95%) and remove smaller sequences (-m false) and chimeras, I get:

    Number of reads: 85497
    Number of clusters found: 82927
    Number of chimeric clusters found: 6

    i.e. 82921 clusters (82927 minus the 6 chimeric ones) from 85497 starting sequences.

    Can someone suggest an explanation for such a huge difference?

    Thanks in advance.
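To put the two runs side by side, here is a small sketch that uses only the counts reported in this post. The variable and function names are illustrative, and subtracting the 6 chimeric clusters follows the arithmetic given above.

```python
# Counts copied from the two runs reported in the post.
total_sequences = 85497
cd_hit_est_clusters = 69413
cd_hit_dup_clusters = 82927 - 6  # clusters found minus chimeric clusters


def removed(total, kept):
    """Return (number removed, percentage removed) for one run."""
    n_removed = total - kept
    return n_removed, 100.0 * n_removed / total


for name, kept in (("cd-hit-est", cd_hit_est_clusters),
                   ("cd-hit-dup", cd_hit_dup_clusters)):
    n_removed, pct = removed(total_sequences, kept)
    print(f"{name}: kept {kept}, removed {n_removed} ({pct:.1f}%)")
```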


    • mastal
      Senior Member
      • Mar 2009
      • 666

      #3
      I think what you want is software that calls a consensus sequence from each cluster, rather than dedupe.
