  • Bio.X2Y
    Member
    • Apr 2010
    • 46

    TopHat SAM - Expressing Paired End Multi-reads

    Hi,

    I've been struggling to understand exactly how TopHat represents multi-reads in its output SAM file, especially in the context of paired-end reads. I've done some background reading, but I haven't been able to clear things up - hopefully someone can help.

    Let's say TopHat is considering a pair with ends A and B, and it finds two alignment combinations that make sense (i.e. where both ends map to opposite strands of the same chromosome, at an expected distance apart):

    A->P1
    B->P2

    A->P3
    B->P4

    How are these represented in the SAM? The way I understand it at the moment, there will be 4 lines in the SAM, one for each alignment. However, I can't see how the knowledge that A->P1 and B->P2 are associated with each other in an important way is represented.

    Put another way, I can't see how you could take the SAM record A->P1 and "find" the corresponding sibling B->P2 while recognizing that B->P4 is not the correct sibling.

    If this information is in fact lost, does this mean that SAM is not expressive enough to capture ambiguous alignments for paired end reads? And won't it mean that downstream processors, e.g. CuffLinks, will not have access to important information that was originally available?

    Apologies for the long-winded question!

    thanks for your time,
  • Thomas Doktor
    Senior Member
    • Apr 2009
    • 105

    #2
    I believe the position of the mate is stored in field 8 and the distance between mates in field 9 of the SAM format, so SAM should carry enough information to correctly match P1 with P2 and P3 with P4. Looking briefly at my own SAM files produced by TopHat, it seems TopHat does not use field 9, but the mate position is reported in field 8 (if the mate is mapped).
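
    As a rough illustration of that matching, here is a minimal Python sketch. It assumes tab-separated SAM lines in which field 7 holds the mate's reference name ("=" meaning the same reference) and field 8 the mate's position, and it uses the first-in-pair and second-in-pair flag bits (0x40 and 0x80) to tell the two ends apart. The read name, positions, and helper names (parse_sam_line, mate_chrom, pair_alignments) are made up for the example and are not taken from TopHat or any library.

```python
from collections import defaultdict

# Hypothetical records for one read pair with two valid placements:
# A->P1 (100) with B->P2 (300), and A->P3 (5000) with B->P4 (5200).
# Field 9 is left at 0, as it appears unused in the TopHat output described above.
EXAMPLE_SAM = [
    "pair1\t99\tchr1\t100\t3\t50M\t=\t300\t0\t*\t*",
    "pair1\t147\tchr1\t300\t3\t50M\t=\t100\t0\t*\t*",
    "pair1\t99\tchr1\t5000\t3\t50M\t=\t5200\t0\t*\t*",
    "pair1\t147\tchr1\t5200\t3\t50M\t=\t5000\t0\t*\t*",
]


def parse_sam_line(line):
    """Pull out only the fields needed for mate matching."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0],
        "flag": int(f[1]),
        "rname": f[2],
        "pos": int(f[3]),    # field 4: leftmost position of this alignment
        "rnext": f[6],       # field 7: reference of the mate ("=" means same)
        "pnext": int(f[7]),  # field 8: position of the mate
    }


def mate_chrom(rec):
    """Resolve the '=' shorthand in field 7."""
    return rec["rname"] if rec["rnext"] == "=" else rec["rnext"]


def pair_alignments(records):
    """Return (first_end, second_end) tuples whose mate fields point at each other."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["qname"]].append(rec)

    pairs = []
    for recs in by_name.values():
        firsts = [r for r in recs if r["flag"] & 0x40]
        seconds = [r for r in recs if r["flag"] & 0x80]
        for a in firsts:
            for b in seconds:
                a_points_to_b = (mate_chrom(a), a["pnext"]) == (b["rname"], b["pos"])
                b_points_to_a = (mate_chrom(b), b["pnext"]) == (a["rname"], a["pos"])
                if a_points_to_b and b_points_to_a:
                    pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    for a, b in pair_alignments([parse_sam_line(l) for l in EXAMPLE_SAM]):
        print(a["qname"], a["pos"], "<->", b["pos"])
```

    With these example records, the record at 100 is matched with the one at 300 and the record at 5000 with the one at 5200, using nothing beyond fields 1, 2, 3, 4, 7 and 8.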


    • Bio.X2Y
      Member
      • Apr 2010
      • 46

      #3
      Thanks Thomas.

      Perhaps there is still a certain amount of ambiguity in some cases, e.g.:

      A->P1
      B->P2

      A->P3
      B->P2 (B is the same in both)

      In this scenario, presumably TopHat would output two B->P2 SAM records, each with a different field 8?

      If this is the case, field 8 of the A->P1 SAM record is now ambiguous (since it refers equally well to both B->P2s)?
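
      To make the scenario concrete, here is a small Python sketch. It assumes the layout presumed above (two B->P2 records that differ only in field 8), a single chromosome, and invented positions (P1 = 100, P3 = 180, P2 = 300); the Aln type and both helper functions are hypothetical names for this example. It compares a one-way lookup from the A record with a lookup that also checks field 8 of each candidate B record.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    qname: str
    end: str    # "A" or "B" (which end of the pair)
    pos: int    # SAM field 4: position of this alignment
    pnext: int  # SAM field 8: position of the mate


# Assumed output for the scenario: B->P2 appears twice, once per A alignment.
records = [
    Aln("pair1", "A", 100, 300),  # A->P1, mate at P2
    Aln("pair1", "B", 300, 100),  # B->P2, mate at P1
    Aln("pair1", "A", 180, 300),  # A->P3, mate at P2
    Aln("pair1", "B", 300, 180),  # B->P2, mate at P3
]


def one_way_candidates(a, recs):
    """Records of the other end located where a's field 8 points."""
    return [r for r in recs
            if r.qname == a.qname and r.end != a.end and r.pos == a.pnext]


def reciprocal_candidates(a, recs):
    """As above, but the candidate's field 8 must also point back at a."""
    return [r for r in one_way_candidates(a, recs) if r.pnext == a.pos]


if __name__ == "__main__":
    a_p1 = records[0]
    print(len(one_way_candidates(a_p1, records)), "candidate(s) from field 8 of A->P1 alone")
    print(len(reciprocal_candidates(a_p1, records)), "candidate(s) after checking field 8 of B")
```

      Under these assumptions, field 8 of A->P1 alone does match both B->P2 records, but only one of them points back at P1. Whether TopHat actually writes the records this way is exactly the open question here.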

      I wonder how CuffLinks etc. processes these kinds of scenarios - perhaps it doesn't need to deconvolute pairs?

      Thanks
