  • Identical fragments, different chromosomes, picard MarkDuplicates

    Hi All,

    I have RNA-seq data from ~20 samples, 2x72, Solexa, about 20-25 million fragments per sample.

    When trying to run picard's MarkDuplicates I got this error back:

    Exception in thread "main" java.lang.RuntimeException: SAM validation error: ERROR: Record 2278214, Read name WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0, Mate Alignment start (195002931) must be <= reference sequence length (181748087) on reference chr2

    Looking at the read pair that caused this error:
    grep WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 accepted_hits.sam
    WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 113 chr1 195002931 255 72M = 3420320 0 AGAAAAAAATCCACCACCACCACCACCACCAAAAGGAACTACCCCACTGTGATGTAGGGCTGTAGAGGGGGG ###?BBB??'>=/=>2>A/AA7BB9BBBDBEGFEDEDBEDBEEFFCFDEEEEFFEDGGFGGGGGGGGGGGGG NM:i:1
    WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 177 chr2 3420320 255 72M = 195002931 0 TTTTTTTTTTCTTTGAGACAGGGTTTCTCTGTGTAGCCTTGGCTGTCCTGGAACTCACTCTGTAGACCAAGC GDEEEEDEEDGFEFGGGEGGGGGEGFGGGGGGGGGG?GGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGG NM:i:2

    The problem is that I have fragments whose two ends map to different chromosomes. Here this causes an error because the first end maps to position 195002931 on chromosome 1, and chromosome 2, which the second end maps to, is not that long.

    Is there a way to tell Picard to accept these alignments? It would be good if the SAM format included the chromosome of the mate as well. Picard does not disregard other non-proper pairs.

    Or should I just not use fragments whose ends map to different chromosomes? How do you usually treat this?

    Thank you,
    Boel

  • #2
    Hi Boel,

    I stumbled over this as well. I think Picard handles these correctly, but there seems to be a bug in TopHat that causes them to be reported incorrectly.

    What I have noticed is that TopHat always uses the '=' symbol for the mate's reference ID. So even if the mate maps to a different chromosome, TopHat still marks it as being on the same chromosome. A lot of these could go unnoticed by Picard as long as the mate's position is less than the chromosome size. However, Picard complains when it (inevitably) encounters a mate position that violates chromosome size boundaries.

    Am I correct in observing this?
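
    To make the observation concrete, here is a minimal sketch of the kind of check the error message implies. It is not Picard's actual code. The function name `mate_position_error` is made up for illustration, the chr2 length comes from the error message in the first post, and the chr1 length is only a lower bound implied by the 72M alignment at 195002931. SEQ and QUAL are replaced by `*` for brevity.

```python
def mate_position_error(sam_line, ref_lengths):
    """Return a message if the mate start lies beyond the mate's reference, else None."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, rnext, pnext = fields[2], fields[6], int(fields[7])
    # '=' in the mate reference field means "same reference as this record".
    mate_ref = rname if rnext == "=" else rnext
    length = ref_lengths.get(mate_ref)
    if length is not None and pnext > length:
        return (f"Mate Alignment start ({pnext}) must be <= reference "
                f"sequence length ({length}) on reference {mate_ref}")
    return None


# chr2: from the error message. chr1: placeholder lower bound, NOT the true length.
REF_LENGTHS = {"chr1": 195003002, "chr2": 181748087}

NAME = "WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0"
# The two records from the first post, tab-separated, SEQ/QUAL shortened to '*'.
end_on_chr1 = "\t".join(
    [NAME, "113", "chr1", "195002931", "255", "72M", "=", "3420320", "0", "*", "*", "NM:i:1"])
end_on_chr2 = "\t".join(
    [NAME, "177", "chr2", "3420320", "255", "72M", "=", "195002931", "0", "*", "*", "NM:i:2"])

for record in (end_on_chr1, end_on_chr2):
    print(mate_position_error(record, REF_LENGTHS))
```

    Under these assumptions the chr1 record passes, because 3420320 happens to fit on chr1, while the chr2 record fails, because '=' sends the mate position 195002931 to chr2.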

    Currently I just throw these reads away. Is there a better way to handle it? I suppose it would be possible to sort by read name and repair the mate chromosome for these alignments.
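
    The repair idea could look roughly like the sketch below. It assumes the SAM file is already sorted by read name and that each end has exactly one alignment, so every pair is a group of exactly two records. Groups of any other size are passed through untouched. The function name `repair_mate_fields` is hypothetical, and this is plain text processing, not a TopHat or Picard feature.

```python
from itertools import groupby


def repair_mate_fields(sam_lines):
    """Yield SAM lines with the mate reference/position copied from the real mate.

    Assumes name-sorted input with header lines ('@...') first.
    """
    stripped = (line.rstrip("\n") for line in sam_lines)
    for is_header, block in groupby(stripped, key=lambda l: l.startswith("@")):
        if is_header:
            yield from block
            continue
        records = (l.split("\t") for l in block if l)
        # Consecutive records sharing a read name form one group.
        for _, group in groupby(records, key=lambda f: f[0]):
            pair = list(group)
            if len(pair) == 2:
                first, second = pair
                for rec, mate in ((first, second), (second, first)):
                    same_ref = rec[2] == mate[2]
                    rec[6] = "=" if same_ref else mate[2]  # mate reference name
                    rec[7] = mate[3]                       # mate start position
                    if not same_ref:
                        rec[8] = "0"  # insert size is undefined across chromosomes
            for rec in pair:
                yield "\t".join(rec)


if __name__ == "__main__":
    import sys
    # Usage sketch: samtools-style name-sorted SAM on stdin, repaired SAM on stdout.
    for out_line in repair_mate_fields(sys.stdin):
        print(out_line)
```

    For the pair in the first post this would write chr2 as the mate reference of the chr1 record and chr1 as the mate reference of the chr2 record, leaving the positions as they are. Reads with several reported alignments per end would need a more careful pairing rule than this.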

    Overall, it would be great to see better SAM compatibility in TopHat.


    • #3
      Hi choy, and thanks for your reply. Your observation seems to be true ('=' given by TopHat even when the ends map to different chromosomes). I'll try to correct these errors in my files. It would definitely be great to have TopHat write correct SAM records.
