  • Retro
    Member
    • Apr 2011
    • 27

    clustering algorithm for single reads from transposon integrations

    We have Ion Torrent reads from retrovirus (transposon) integration sites in an unsequenced genome, and we need to cluster them by sequence identity. The first fifty bases of each read are always the transposon end, and the rest is an essentially random piece of genomic DNA that flanks the insertion. We need to collapse or cluster the reads from each unique integration site together. Currently we use de novo assembly algorithms, but those perform poorly. We need to relax the stringency of alignment because of the sequencing errors, and then de novo assembly artificially joins clusters together. Our clusters should have the length of only one read.

    Would anybody know of a suitable algorithm to create these single-read clusters?
  • SES
    Senior Member
    • Mar 2010
    • 275

    #2
    As I was preparing a response, it became less clear exactly what you are trying to achieve. When you say that you want to relax the stringency of alignment associated with assembly and use a clustering approach, that makes sense. When you say that clusters should contain one read, that seems completely in conflict with the previous statement. Could you clarify your post?

    • Retro
      Member
      • Apr 2011
      • 27

      #3
      Thanks for your response. Each cluster should span the length of only one read. A cluster can contain, for example, 50 reads, but all of them start at position 1 (the "left side" of the aligned cluster). The reads in a cluster may differ in length because of the initial fragmentation.

      To make it more difficult, our reads come from a pool of animals, so in addition to sequencing errors we also see SNPs. That is why we cannot use assembly based on, let's say, 99% homology. The de novo algorithm then starts adding reads to our clusters that extend the cluster in length, mostly based on random inverted repeats in the genomic tags.
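      To make the requirement concrete, here is a minimal Python sketch of left-anchored greedy clustering. It is only an illustration of the idea described above, not the method of any particular tool. The function names, the 90% identity threshold, and the choice of the longest read as cluster seed are all assumptions made for the example. Each read is trimmed of its transposon end, then compared to existing cluster seeds from position 1 only, so a read can never extend a cluster beyond the seed's length.

```python
def prefix_edit_distance(a, b):
    """Edit distance between the shorter sequence and the best-matching
    prefix of the longer one (both sequences are anchored at position 1)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    prev = list(range(len(long_) + 1))  # distance from empty string to each prefix
    for i, ch in enumerate(short, start=1):
        cur = [i]
        for j, lch in enumerate(long_, start=1):
            cur.append(min(prev[j] + 1,                # gap in the longer sequence
                           cur[j - 1] + 1,             # gap in the shorter sequence
                           prev[j - 1] + (ch != lch))) # match or mismatch
        prev = cur
    # The unmatched tail of the longer sequence is free: take the best prefix.
    return min(prev)


def cluster_flanks(reads, prefix_len=50, min_identity=0.9):
    """Greedy clustering of genomic flanks (hypothetical helper).

    prefix_len:   number of leading bases that are transposon end (assumed 50).
    min_identity: assumed threshold, relaxed to tolerate errors and SNPs.
    Returns a list of (seed, members) tuples.
    """
    # Longest flanks first, so every later read is no longer than its seed.
    flanks = sorted((r[prefix_len:] for r in reads), key=len, reverse=True)
    clusters = []
    for flank in flanks:
        if not flank:
            continue  # read contained only the transposon end
        allowed = (1 - min_identity) * len(flank)
        for seed, members in clusters:
            if prefix_edit_distance(flank, seed) <= allowed:
                members.append(flank)
                break
        else:
            clusters.append((flank, [flank]))  # start a new cluster
    return clusters
```

      This compares every read against every seed, so it is quadratic and suitable only for small data sets; it is meant to show the anchoring constraint, not to replace a dedicated clustering program.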

      • Retro
        Member
        • Apr 2011
        • 27

        #4
        OK, I finally found a great program, USEARCH (http://www.drive5.com/usearch/usearch_docs.html), that does exactly that.
