  • Diginorm Algorithm

    Hello,

    I am having trouble understanding a point made in the Diginorm paper:


    They say that Diginorm discards some terminal kmer and low-abundance isoform information but I am wondering why this is?

    According to the description of the algorithm, Diginorm estimates read coverage by using the median abundance of kmers for each read and discards the read if the median abundance is above some cutoff level. This should mean that any low abundance reads would be retained. If this is true, under what situations would it discard reads pertaining to terminal kmers and low-abundance isoforms?

    I suspect I am missing something here and it would be very helpful to get some outside views to get me out of this mind trap.

    Thank you!

  • #2
    Hi,

    You are correct that diginorm retains low-abundance reads, where a read's abundance is estimated as the median abundance of all k-mers in the read. If you rank all the k-mers in a read by their observed abundance in the dataset, the read's abundance is the median of those values. Thus, a read is discarded based on the median of its k-mer abundance distribution (not necessarily on its terminal k-mers). The k-mer length and the read length affect how sensitive the median estimate is (as described in the paper) to, e.g., the sequencing errors typically found at the ends of Illumina reads.
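To make the median rule concrete, here is a minimal sketch of the keep/discard decision as described in this thread. It is not the khmer implementation: an exact `Counter` stands in for khmer's probabilistic k-mer counting, and the names `kmers`, `median_kmer_abundance`, `normalize`, and the tiny `k` and `cutoff` defaults are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_kmer_abundance(read, counts, k):
    """Estimate a read's coverage as the median count of its k-mers."""
    return median(counts[km] for km in kmers(read, k))


def normalize(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is below the cutoff C.

    Assumption: exact counting in a Counter replaces the approximate
    counting structure a real implementation would use.
    """
    counts = Counter()
    kept = []
    for read in reads:
        if len(read) < k:
            continue  # no k-mers, so no coverage estimate is possible
        if median_kmer_abundance(read, counts, k) < cutoff:
            kept.append(read)
            counts.update(kmers(read, k))  # only kept reads add to the counts
    return kept


# Five copies of an abundant read followed by one rare read.
abundant = "ACGTACGGTC"
rare = "TTGACCAGTA"
kept_reads = normalize([abundant] * 5 + [rare])
```

With these toy inputs, the abundant read is kept only until its median k-mer count reaches the cutoff, while the rare read, whose k-mers have not been seen before, is retained.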

    Diginorm would discard reads pertaining to terminal k-mers if, for example, the read contained a repetitive region that was observed at high abundance in the dataset. In this case, the distribution of k-mer abundances across the entire read is likely even (due to the repeats), and the abundance of the terminal k-mer is more likely to be the median abundance of the read.

    Hope this helps!

    • #3
      Let's see if I can give some intuition too...

      Suppose you have an undersampled region (like the terminal end of a contig, or a low-abundance splice variant) next to a bunch of highly sampled regions. If a completely correct read crosses both the highly sampled and the undersampled region, but contains more of the highly sampled region, its median will be high and the read will be discarded. So it really has to do with high sampling right next to low sampling -- basically what Adina said about repeats.
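A toy calculation illustrates the boundary case described above. All of the sequences and counts here are made up for illustration: k-mers lying entirely in the highly sampled region are assumed to have been seen 100 times, every other k-mer in the read (including those spanning the junction) is assumed to have been seen once, and the cutoff `C = 20` is just one of the values mentioned in this thread.

```python
from statistics import median

K = 4  # toy k-mer size, far smaller than anything used in practice


def kmers(seq, k=K):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical read: mostly highly sampled sequence, ending in a rare region.
high_region = "ACGTTGCAAGGC"
low_region = "TATCCG"
read = high_region + low_region

# Assumed abundances: 100 inside the high region, 1 for everything else.
counts = {km: 100 for km in kmers(high_region)}
abundances = [counts.get(km, 1) for km in kmers(read)]

read_median = median(abundances)
C = 20
discarded = read_median >= C

print(abundances)
print("median:", read_median, "discarded:", discarded)
```

Nine of the fifteen k-mers fall inside the highly sampled region, so the median is set by them, and the read is dropped even though six of its k-mers are rare. The information carried by those six k-mers is what gets lost.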

      We know how to deal with this properly and have a prototype implementation, but it isn't really ready for use yet.

      • #4
        Thank you Adina and Titus. That does make a lot of sense now. Can I ask if the new implementation makes use of the phred scores? And do you have an estimate of when it will be released?

        • #5
          No short-term plans to make use of phred scores; no short-term plans on releasing the new approaches. The end-trimming problems are fairly easily solved by using a high C, like C=20 or C=50, so it's not a blocker for anyone; and for now we're trying to focus on getting the next version out. Plus pubs.
