  • Yevaud
    Junior Member
    • Jul 2011
    • 3

    How to collapse sequences excluding sequencing artifacts? (454)

    Hi all,
    I'm a beginner in next-generation sequencing and recently got my first data. I have some issues with it because only one study so far has used a similar approach. Briefly, I'm working on the phylogeny of plants, which includes polyploidy events. We have a few markers, and in some species (e.g. tetraploids) there may be a few copies of one gene (marker). To obtain all possible copies we used 454 and got around 100 sequences per species. I tried to use BAPS (a Bayesian clustering program used mostly for population studies) to decide whether there are one or two copies of each gene. Although it is quite laborious, it basically works, and I haven't figured out anything better so far. But there is one thing I would like to automate a little and make more objective:
    Even when I know that there are one or two copies in my 100 reads, there are still sequences that differ by only 1 or 2 bp. Usually it's easy to see that such a sequence is an artifact, especially when it is present only once in the dataset. I wonder whether there is any software that could collapse sequences for me and exclude these artifacts? Let's say that I would like to keep sequences that make up at least 20% of the dataset. The truth is that in most cases I end up needing just 2 sequences out of 100...
    Most programs with a collapsing option just collapse everything that is identical into one haplotype and don't include any information about frequency in the data or about similarities. And they leave some sequences with more than one difference, which makes things even more complicated.
    Is there any program that I could use for my analysis, or do I have to do this by hand? If anybody has any ideas I would be grateful!
    Thanks a lot in advance for any help!
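The collapsing asked about above can be sketched in a few lines of Python. This is only a minimal sketch, not an existing tool: it assumes the reads have already been aligned and trimmed to the same length, so that differences can be counted position by position. The function names (`hamming`, `collapse_reads`) and the `max_mismatches` parameter are made up for illustration; the 20% cut-off is the one suggested in the post. Keep in mind that 454 errors are often homopolymer indels, which a plain mismatch count on unaligned reads would not handle.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def collapse_reads(reads, min_fraction=0.20, max_mismatches=2):
    """Collapse rare variants onto abundant sequences, then drop rare groups.

    Returns a dict mapping each kept sequence to the number of reads
    assigned to it. Parameter names and defaults are illustrative.
    """
    counts = Counter(reads)
    total = len(reads)
    # Most abundant sequences first; ties broken alphabetically so the
    # result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    centres = {}
    for seq, n in ordered:
        # Look for the closest already-accepted sequence of the same length.
        best = None
        for centre in centres:
            if len(centre) != len(seq):
                continue
            d = hamming(seq, centre)
            if d <= max_mismatches and (best is None or d < best[0]):
                best = (d, centre)
        if best is not None:
            centres[best[1]] += n  # treat as an artifact of that sequence
        else:
            centres[seq] = n       # start a new group

    # Keep only groups that reach the minimum share of the dataset.
    return {s: n for s, n in centres.items() if n / total >= min_fraction}


if __name__ == "__main__":
    reads = (["ACGTACGT"] * 55 + ["ACGAACGT"] * 1 +
             ["TTGTACCA"] * 40 + ["TTGTACCT"] * 2 + ["GGGGGGGG"] * 2)
    print(collapse_reads(reads))  # {'ACGTACGT': 56, 'TTGTACCA': 42}
```

In the toy example the two singletons and doubletons that differ by one base are folded into their abundant neighbours, and the unrelated rare sequence is discarded because it stays below 20%.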
  • parasitehunter
    Junior Member
    • Jun 2011
    • 3

    #2
    Hi all -
    I have basically the same question. Amplicon Illumina data that's been quality filtered (fastx), duplicate reads removed (fastx), aligned to the reference (~1.2kb) with bwa. Used ShoRAH to predict haplotypes - returned 13 haps with a frequency >1%. Most of these are low-frequency haplotypes and are due to a single SNP in a single read. Is there a way to collapse such sequences into their nearest relatives? Have done some searching, but no luck yet.
    Thanks!
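For predicted haplotypes that already come with frequency estimates, the same idea can be applied to the frequencies directly. The sketch below is hypothetical and does not read any ShoRAH output format: it assumes you have built a plain dict of haplotype sequence to frequency yourself, that all haplotypes have the same length, and that the 5% cut-off and one-SNP limit (`min_freq`, `max_snps`) are values you would choose for your own data.

```python
def merge_rare_haplotypes(freqs, min_freq=0.05, max_snps=1):
    """Fold low-frequency haplotypes into their nearest major haplotype.

    freqs: dict mapping haplotype sequence -> estimated frequency.
    Returns (merged, unassigned): merged holds the major haplotypes with
    the absorbed frequencies added; unassigned holds rare haplotypes that
    are too different from every major one to be merged.
    """
    major = {h: f for h, f in freqs.items() if f >= min_freq}
    merged = dict(major)
    unassigned = {}

    for hap, f in freqs.items():
        if hap in major:
            continue
        # (distance, -frequency, sequence): nearest first, and among equally
        # near candidates prefer the more frequent major haplotype.
        candidates = [
            (sum(a != b for a, b in zip(hap, m)), -major[m], m)
            for m in major
        ]
        best = min(candidates, default=None)
        if best is not None and best[0] <= max_snps:
            merged[best[2]] += f
        else:
            unassigned[hap] = f

    return merged, unassigned


if __name__ == "__main__":
    freqs = {"ACGT": 0.60, "ACGA": 0.02, "TTGT": 0.35, "GGCC": 0.03}
    merged, unassigned = merge_rare_haplotypes(freqs)
    print(merged)      # ACGT absorbs ACGA; TTGT is unchanged
    print(unassigned)  # GGCC is too far from both major haplotypes
```

Reporting the unassigned haplotypes separately, rather than silently dropping them, leaves room to check by eye whether any of them is a real variant.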

    • Kennels
      Senior Member
      • Feb 2011
      • 149

      #3
      Hi,
      You can probably use cd-hit-est, a program in the cd-hit suite: http://weizhong-lab.ucsd.edu/cd-hit/
      You can set an identity threshold (e.g. 90%, 95%), and it clusters each shorter sequence under a longer representative one when their identity is above this threshold, much like generating a unigene set.
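To make the clustering idea described above concrete, here is a toy Python sketch of greedy, longest-first clustering by identity. It is not cd-hit-est and makes strong simplifying assumptions: identity is measured without gaps, as the best fraction of matching positions when the shorter sequence is slid along the longer one, and sequences are assumed to be non-empty. The function names are invented for this example.

```python
def ungapped_identity(short, long_):
    """Best fraction of matching positions when sliding `short` along
    `long_` without gaps (a crude stand-in for a real alignment)."""
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs, threshold=0.95):
    """Return a dict mapping each representative to its member sequences
    (the representative itself is the first member)."""
    clusters = {}
    # Longest first, so each representative is the longest in its cluster.
    for seq in sorted(seqs, key=lambda s: (-len(s), s)):
        for rep in clusters:
            if ungapped_identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    seqs = [
        "ACGTACGTACGTACGTACGT",
        "ACGTACGTACGTACGTACGA",  # one mismatch to the first sequence
        "GTACGTACGT",            # exact fragment of the first sequence
        "TTTTTTTTTTTTTTTTTTTT",  # unrelated
    ]
    for rep, members in greedy_cluster(seqs, threshold=0.90).items():
        print(rep, len(members))
```

Note that a sketch like this only groups sequences by similarity. It does not use read counts, so on its own it would not tell you which cluster members are artifacts and which are real low-copy variants.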

      • parasitehunter
        Junior Member
        • Jun 2011
        • 3

        #4
        Kennels -
        Thanks for the idea - looks promising. I've been trying to get cd-hit-est to work (both locally and on their servers), but it keeps returning an error. Probably something I'm doing wrong. Or perhaps it's because all my predicted haplotypes are the same length. However, their cd_454 clusterer seems to work with my data. Hope that's legit to use ...
