  • donquijotes
    Junior Member
    • Jul 2015
    • 7

    Help with UMI (unique molecular identifiers) data processing

    I've been browsing different papers and publications and trying to figure out what's the best way to analyze data with UMIs.
    So far I have used GATK for analysis a couple of times, but otherwise I have mostly worked on alternative splicing analysis, so I'm rather new to CNV calling with UMIs.
    What I would like to do is use the following design:

    adaptor-UMI-DNAlibraryINSERT-UMI-adaptor

    The UMIs will be 5 random bases on each side.

    I get the whole UMI thinking and analysis, but what I haven't found yet is the software to do such analysis. I've seen a few tools to mark/find UMIs and put them in the header of the fastq sequence, but then what? How do you bin the reads and get rid of the true PCR duplicates? Does Picard have a function for it? If I have to write my own code then I'm out of luck lol.

    I know Agilent supports UMIs with their HaloPlex HS kits and their SureCall software, which (from what I've heard) is mostly a nice GUI for GATK.

    Any help and guidance would be much appreciated. Newbies have the right to learn too, right?

    Thank you in advance
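
A minimal sketch of the first step asked about above (moving the UMIs into the read header) could look like the following. It assumes paired-end sequencing in which each mate starts with its own 5-base UMI, which is one plausible reading of the adaptor-UMI-insert-UMI-adaptor design; the function name, the record layout and the underscore separator are hypothetical choices for illustration, not the behaviour of any particular tool.

```python
from typing import Tuple

UMI_LEN = 5  # assumption: 5 random bases at the start of each mate

# A FASTQ record reduced to (header, sequence, quality); a hypothetical layout.
FastqRecord = Tuple[str, str, str]


def tag_pair_with_umis(
    r1: FastqRecord, r2: FastqRecord, umi_len: int = UMI_LEN
) -> Tuple[FastqRecord, FastqRecord]:
    """Clip the UMI from the start of each mate and append both to the read names."""
    for _, seq, qual in (r1, r2):
        if len(seq) <= umi_len or len(seq) != len(qual):
            raise ValueError("read too short or sequence/quality length mismatch")

    # Combined UMI: read 1 UMI followed by read 2 UMI.
    umi = r1[1][:umi_len] + r2[1][:umi_len]

    def retag(rec: FastqRecord) -> FastqRecord:
        header, seq, qual = rec
        # Keep any comment after the first space; tag only the read name.
        name, _, comment = header.partition(" ")
        new_header = f"{name}_{umi}" + (f" {comment}" if comment else "")
        return new_header, seq[umi_len:], qual[umi_len:]

    return retag(r1), retag(r2)


if __name__ == "__main__":
    mate1 = ("@read1 1:N:0", "ACGTATTTTGGGG", "IIIIIJJJJJJJJ")
    mate2 = ("@read1 2:N:0", "TTGCACCCCAAAA", "FFFFFGGGGGGGG")
    for record in tag_pair_with_umis(mate1, mate2):
        print(record)
```

Once the UMI travels with the read name, it survives alignment, and duplicates can later be binned by mapping position plus UMI.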
  • nucacidhunter
    Jafar Jabbari
    • Jan 2013
    • 1250

    #2
    The product described on the web page below uses molecular indexing, and the sequences are given in the product manual.


    They describe the analysis steps in a link on the page below:


    • charlescoldroom
      Junior Member
      • Apr 2012
      • 8

      #3
      I am also interested in how to handle UMIs and remove duplicate reads based on them.

      I am using modified primers to have amplicon pools.

      Which tools are there to mark/find UMIs and put them in the header of the fastq sequence? How could I then process the reads?

      I have tried looking around, but I could not find any good step-by-step explanation; even papers just mention that they do the analysis but do not explain how.

      Thanks!


      • danwiththeplan
        Member
        • Sep 2011
        • 72

        #4
        Molecular indexing

        Hi, you could try the script mentioned here:



        It's currently not working for me, but I'm in communication with the maintainer so I'll repost if I get everything working.


        • luc
          Senior Member
          • Dec 2010
          • 469

          #5
          A very simple approach would be to do a general deduplication of the reads with BBTools (I have not used it for this purpose, but it should be better than our in-house script), which will likely require considerable memory. Then you should trim the 5 random bases.
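
The idea in this post can be sketched in a few lines of plain Python. This is only an illustration of "deduplicate on the full read, then trim the UMI", not what BBTools does internally; the function name and the (name, sequence) tuple layout are hypothetical, and it assumes each read starts with its 5-base UMI.

```python
from typing import Iterable, List, Tuple


def dedup_then_trim(
    reads: Iterable[Tuple[str, str]], umi_len: int = 5
) -> List[Tuple[str, str]]:
    """Keep the first copy of each exact sequence, then drop the leading UMI bases."""
    seen = set()  # every distinct untrimmed sequence is held in memory
    kept = []
    for name, seq in reads:
        if seq in seen:
            continue  # identical UMI + insert: treated as a PCR duplicate
        seen.add(seq)
        kept.append((name, seq[umi_len:]))
    return kept


if __name__ == "__main__":
    example = [
        ("r1", "AAAAACCCGGG"),
        ("r2", "AAAAACCCGGG"),  # exact copy of r1
        ("r3", "TTTTTCCCGGG"),  # same insert, different UMI
    ]
    print(dedup_then_trim(example))
```

Because the UMI is still attached when duplicates are removed, two reads with the same insert but different UMIs are both kept. The set of seen sequences is what drives the memory use mentioned above, and exact matching means a single sequencing error makes a duplicate look unique.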


          • charlescoldroom
            Junior Member
            • Apr 2012
            • 8

            #6
            Thanks guys, I will check out the suggestions!


            • IanSudbery
              Junior Member
              • May 2011
              • 1

              #7
              I know this post is a few months old now, but you might like to try our UMI-tools package, which offers a range of different algorithms for deduplicating UMI sequences.

              https://github.com/CGATOxford/UMI-tools
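
For readers wondering what deduplicating on UMIs means after alignment, the simplest possible strategy is sketched below: reads that share the same mapping coordinate and exactly the same UMI are treated as copies of one molecule, and only the best-mapped one is kept. This is an illustrative sketch and not code from UMI-tools; the `AlignedRead` type and its fields are hypothetical, and no attempt is made to merge UMIs that differ only by a sequencing error.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    """Hypothetical minimal view of an aligned read with its UMI already extracted."""
    name: str
    chrom: str
    pos: int      # alignment start position
    strand: str   # "+" or "-"
    umi: str
    mapq: int     # mapping quality


def dedup_by_position_and_umi(reads: Iterable[AlignedRead]) -> List[AlignedRead]:
    """Keep one read (highest mapping quality) per (chrom, pos, strand, UMI) group."""
    best: Dict[Tuple[str, int, str, str], AlignedRead] = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())


if __name__ == "__main__":
    reads = [
        AlignedRead("a", "chr1", 100, "+", "ACGTA", 30),
        AlignedRead("b", "chr1", 100, "+", "ACGTA", 60),  # duplicate of "a"
        AlignedRead("c", "chr1", 100, "+", "TTTTT", 60),  # same position, new UMI
        AlignedRead("d", "chr1", 250, "+", "ACGTA", 60),  # same UMI, new position
    ]
    for read in dedup_by_position_and_umi(reads):
        print(read.name)
```

With only 5 + 5 random bases, exact matching will both split true duplicates that carry a UMI error and occasionally merge distinct molecules that drew the same UMI, which is the kind of problem more elaborate algorithms are designed to address.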


              • danwiththeplan
                Member
                • Sep 2011
                • 72

                #8
                Originally posted by IanSudbery
                I know this post is a few months old now, but you might like to try our UMI-tools package, which offers a range of different algorithms for deduplicating UMI sequences.

                https://github.com/CGATOxford/UMI-tools

                Hi, thanks for this contribution.

                I'm reading the code, and this is what it looks like to me, but am I correct in saying that this script would correctly deduplicate splice-aware mappings, i.e. that reads that jump across splice boundaries are handled correctly?


                • sudders
                  Member
                  • Dec 2011
                  • 32

                  #9
                  Originally posted by danwiththeplan
                  Hi, thanks for this contribution.

                  I'm reading the code, and this is what it looks like to me, but am I correct in saying that this script would correctly deduplicate splice-aware mappings, i.e. that reads that jump across splice boundaries are handled correctly?

                  You've probably worked this out already, but yes, it handles splice-aware mappings.


                  • medalofhonour
                    Member
                    • Jul 2011
                    • 18

                    #10
                    This group recently published a paper with a pipeline for analyzing UMI datasets. The software can be found here:

                    MAGERI - Assemble, align and call variants for targeted genome re-sequencing with unique molecular identifiers - mikessh/mageri


                    • cement_head
                      Senior Member
                      • Mar 2012
                      • 264

                      #11
                      If you are using CLC Genomics Workbench:


                      • Strandlife
                        strandlife
                        • May 2013
                        • 67

                        #12
                        You should try Strand NGS for UMI protocols.
                        Strand NGS is the only software to provide comprehensive, end-to-end support for multiple Unique Molecular Identifier protocols.

                        A few features include:

                        1. Protocol diversity. Strand NGS supports data analysis from these UMI protocols:
                        i. Qiagen GeneRead®
                        ii. Archer VariantPlex®
                        iii. Rubicon Thruplex®
                        iv. Bioo Scientific NextFlex®
                        v. A robust interface to specify custom UMIs

                        2. End-to-end or point-to-point. Users can go from reads to variants, can start at aligned BAMs containing the BC tag, or start/end at any reasonable point in the alignment/analysis workflow.

                        3. Workflow diversity. Strand NGS supports UMI protocols in DNA-, RNA- and small RNA-Seq workflows

                        4. Somatic- and UMI-ready visualizations. The genome browser visualizes consensus read lists. Each read contains UMI-related metadata, such as family size, UMI and mate UMI. A filter allows the easy exclusion of wild-type reads. This is useful at high sequencing depths and low allele frequencies, typical of data from somatic/tumor samples.

                        You can get a 20-day free trial by registering here with your organization email id:
                        Strand NGS is a next-generation sequencing data analysis tool. It supports DNA-Seq, RNA-Seq, ChIP-Seq, Methyl-Seq, MeDIP-Seq, small RNA-Seq, pathway analysis and downstream analysis.


                        • chen@haplox.com
                          Member
                          • Aug 2015
                          • 16

                          #13
                          You can use fastp to preprocess UMIs from FASTQ files.
                          OpenGene (libraries and tools for NGS data analysis), AfterQC (FASTQ filtering and QC)
                          FusionDirect.jl (gene fusion detection), SeqMaker.jl (next-generation sequencing simulation)
