  • Identical fragments, different chromosomes, picard MarkDuplicates

    Hi All,

    I have RNA-seq data from ~20 samples (2x72, Solexa), about 20-25 million fragments per sample.

    When I tried to run Picard's MarkDuplicates, I got this error:

    Exception in thread "main" java.lang.RuntimeException: SAM validation error: ERROR: Record 2278214, Read name WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0, Mate Alignment start (195002931) must be <= reference sequence length (181748087) on reference chr2

    Here is the read pair that caused the error:
    grep WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 accepted_hits.sam
    WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 113 chr1 195002931 255 72M = 3420320 0 AGAAAAAAATCCACCACCACCACCACCACCAAAAGGAACTACCCCACTGTGATGTAGGGCTGTAGAGGGGGG ###?BBB??'>=/=>2>A/AA7BB9BBBDBEGFEDEDBEDBEEFFCFDEEEEFFEDGGFGGGGGGGGGGGGG NM:i:1
    WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 177 chr2 3420320 255 72M = 195002931 0 TTTTTTTTTTCTTTGAGACAGGGTTTCTCTGTGTAGCCTTGGCTGTCCTGGAACTCACTCTGTAGACCAAGC GDEEEEDEEDGFEFGGGEGGGGGEGFGGGGGGGGGG?GGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGG NM:i:2

    The problem is that I have fragments whose two ends map to different chromosomes. Here that causes an error because the first end maps to position 195002931 on chromosome 1, while chromosome 2, to which the second end maps, is only 181748087 bases long.
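
The check that fails can be reproduced outside Picard. The following is a minimal sketch, not Picard's actual implementation: it resolves an RNEXT of '=' to the record's own RNAME and compares PNEXT with that reference's length. The function name `mate_out_of_bounds` and the `ref_lengths` dictionary are hypothetical, the chr2 length is taken from the error message, and SEQ and QUAL are abbreviated to '*' for brevity.

```python
def mate_out_of_bounds(sam_line, ref_lengths):
    """Return True if PNEXT exceeds the length of the reference RNEXT points to.

    An RNEXT of '=' is resolved to the record's own RNAME. References that
    are missing from ref_lengths (including '*') are skipped.
    """
    fields = sam_line.rstrip("\n").split("\t")
    rname, rnext, pnext = fields[2], fields[6], int(fields[7])
    mate_ref = rname if rnext == "=" else rnext
    length = ref_lengths.get(mate_ref)
    if length is None:
        return False
    return pnext > length


# Second record of the pair from the post (SEQ and QUAL replaced by '*').
record = "\t".join([
    "WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0",
    "177", "chr2", "3420320", "255", "72M", "=", "195002931", "0",
    "*", "*", "NM:i:2",
])

# chr2 length as reported in the error message.
ref_lengths = {"chr2": 181748087}
print(mate_out_of_bounds(record, ref_lengths))
```

Because RNEXT is '=', the mate position 195002931 is interpreted as a chr2 coordinate, which is why the record is rejected.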

    Is there a way to tell Picard to accept these alignments? It would be good if the SAM format included the chromosome of the mate as well. Picard does not disregard other non-proper pairs.

    Or should I simply exclude fragments whose ends map to different chromosomes? How do you usually handle this?

    Thank you,
    Boel

  • #2
    Hi Boel,

    I ran into this as well. I think Picard can handle these pairs correctly, but a bug in TopHat seems to cause them to be reported incorrectly.

    What I have noticed is that TopHat always writes the '=' symbol for the mate's reference name, so even if the mate maps to a different chromosome, TopHat's output marks it as being on the same chromosome. Many of these records can go unnoticed by Picard as long as the mate position is smaller than the chromosome length. However, Picard complains when it (inevitably) encounters a mate position that lies beyond the end of the chromosome.

    Am I correct in observing this?
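
One way to check this observation on a given file is to list the read names whose records carry '=' in RNEXT although the two mates align to different references. This is a rough sketch that assumes each pair is reported once (no multi-hits); the function name `find_mislabeled_pairs` is hypothetical.

```python
from collections import defaultdict


def find_mislabeled_pairs(sam_lines):
    """Return sorted read names that use RNEXT '=' while the mates
    align to different references."""
    refs_by_read = defaultdict(set)
    uses_equals = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, rnext = fields[0], int(fields[1]), fields[2], fields[6]
        if not flag & 0x1 or flag & 0x4:
            continue  # skip unpaired or unmapped records
        refs_by_read[qname].add(rname)
        if rnext == "=":
            uses_equals.add(qname)
    return sorted(
        qname for qname, refs in refs_by_read.items()
        if len(refs) > 1 and qname in uses_equals
    )
```

It could be applied to an open file handle, for example `find_mislabeled_pairs(open("accepted_hits.sam"))`; the whole file is read in one pass and only read names and reference names are kept in memory.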

    Currently I just throw these reads away. Is there a better way to handle it? I suppose it would be possible to sort by read name and repair the mate chromosome for these alignments.
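
The repair suggested above could look roughly like the following sketch. It assumes the SAM input is already sorted by read name and that each pair has exactly two alignment records; every other group (header lines, singletons, multi-hits) is passed through unchanged. The function name `repair_mate_reference` is hypothetical.

```python
from itertools import groupby


def repair_mate_reference(sam_lines):
    """Yield SAM lines with RNEXT and PNEXT rewritten from the mate's record.

    Input must be sorted by read name. Only groups of exactly two paired,
    mapped records are modified.
    """
    records = (line.rstrip("\n").split("\t") for line in sam_lines)
    for _, group in groupby(records, key=lambda fields: fields[0]):
        group = list(group)
        is_pair = (
            len(group) == 2
            and not group[0][0].startswith("@")  # header lines are untouched
            and all(int(r[1]) & 0x1 and not int(r[1]) & 0x4 for r in group)
        )
        if is_pair:
            first, second = group
            for rec, mate in ((first, second), (second, first)):
                # '=' only when both mates are on the same reference
                rec[6] = "=" if mate[2] == rec[2] else mate[2]
                rec[7] = mate[3]
        for rec in group:
            yield "\t".join(rec)
```

For the pair in the first post, this would set RNEXT to chr2 on the chr1 record and to chr1 on the chr2 record, so the mate position is checked against the correct chromosome.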

    Overall, it would be great to see better SAM compatibility in TopHat.

    • #3
      Hi choy, and thanks for your reply. Your observation seems to be correct ('=' given by TopHat despite the mates mapping to different chromosomes). I'll try to correct these errors in my files. It would definitely be great to have TopHat write correct SAM records.
