
TopHat SAM - Expressing Paired End Multi-reads


  • TopHat SAM - Expressing Paired End Multi-reads


    I've been struggling to understand exactly how TopHat represents multi-reads in its output SAM file, especially in the context of paired-end reads. I've done some background reading, but I haven't been able to clear things up. Hopefully someone can help.

    Let's say TopHat is considering a pair with ends A and B, and it finds two alignment combinations that make sense (i.e. where both ends map to opposite strands of the same chromosome, at an expected distance apart):

    A->P1, B->P2
    A->P3, B->P4

    How are these represented in the SAM? The way I understand it at the moment, there will be 4 lines in the SAM, one for each alignment. However, I can't see how the knowledge that A->P1 and B->P2 are associated with each other in an important way is represented.

    Put another way, I can't see how you could take the SAM record A->P1 and "find" the corresponding sibling B->P2 while recognizing that B->P4 is not the correct sibling.

    If this information is in fact lost, does this mean that SAM is not expressive enough to capture ambiguous alignments for paired-end reads? And would that mean that downstream processors, e.g. Cufflinks, do not have access to important information that was originally available?

    Apologies for the long-winded question!

    thanks for your time,

  • #2
    I believe the position of the mate is stored in field 8 (PNEXT) and the distance between mates in field 9 (TLEN), so the SAM format should carry enough information to correctly match P1 with P2 and P3 with P4. Looking briefly at my own SAM files produced by TopHat, it seems TopHat does not use field 9, but it does report the mate position in field 8 (if the mate is mapped).
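
    As a rough illustration of that matching, here is a minimal sketch in plain Python. The SAM lines are made up for the example (read name `pair1`, positions 100/300/5000/5200 standing in for P1 to P4, flags 99 and 147 for a properly paired first and second end), and `parse_sam_line` and `find_mates` are hypothetical helper names, not part of TopHat or any library. Field 9 is left at 0 to mirror what I see in my own files.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into the fields needed for mate matching."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0],        # field 1: read (pair) name
        "flag": int(f[1]),    # field 2: bitwise flag
        "rname": f[2],        # field 3: reference name
        "pos": int(f[3]),     # field 4: 1-based alignment start
        "rnext": f[6],        # field 7: mate reference ("=" means same as rname)
        "pnext": int(f[7]),   # field 8: mate position
        "tlen": int(f[8]),    # field 9: template length
    }


def is_first_in_pair(rec):
    # Flag bit 0x40 marks the first end of a pair.
    return bool(rec["flag"] & 0x40)


def find_mates(rec, records):
    """Return records that sit where rec's field 8 says its mate is."""
    mate_chrom = rec["rname"] if rec["rnext"] == "=" else rec["rnext"]
    return [
        r for r in records
        if r["qname"] == rec["qname"]
        and is_first_in_pair(r) != is_first_in_pair(rec)
        and r["rname"] == mate_chrom
        and r["pos"] == rec["pnext"]
    ]


# Made-up records for the two combinations A->P1/B->P2 and A->P3/B->P4.
SAM_LINES = [
    "pair1\t99\tchr1\t100\t3\t50M\t=\t300\t0\t*\t*",    # A->P1, mate at P2
    "pair1\t147\tchr1\t300\t3\t50M\t=\t100\t0\t*\t*",   # B->P2, mate at P1
    "pair1\t99\tchr1\t5000\t3\t50M\t=\t5200\t0\t*\t*",  # A->P3, mate at P4
    "pair1\t147\tchr1\t5200\t3\t50M\t=\t5000\t0\t*\t*", # B->P4, mate at P3
]

records = [parse_sam_line(line) for line in SAM_LINES]
for rec in records:
    mates = find_mates(rec, records)
    print(rec["pos"], "->", [m["pos"] for m in mates])
```

    With these four records, each alignment finds exactly one sibling, so A->P1 is matched with B->P2 and not with B->P4.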


    • #3
      Thanks Thomas.

      Perhaps there is still a certain amount of ambiguity in some cases, e.g.:


      B->P2 (B is the same in both)

      In this scenario, presumably TopHat would output two B->P2 SAM records, each with a different field 8?

      If this is the case, field 8 of the A->P1 SAM record is now ambiguous (since it refers equally well to both B->P2s)?
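
      To make the scenario concrete, here is a small Python sketch. It assumes (as I am presuming above, not as something I have verified) that TopHat writes two B->P2 records that differ only in field 8. The `Aln` tuple, the positions, and the function names are all invented for illustration. Looking up the mate from A->P1 by field 8 alone returns both B records; additionally requiring that the B record's field 8 points back at A's position narrows it to one.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    qname: str   # read (pair) name
    end: str     # "A" or "B"
    pos: int     # SAM field 4, alignment start
    pnext: int   # SAM field 8, mate position


def forward_candidates(rec, records):
    """Other-end records located where rec's field 8 points."""
    return [
        r for r in records
        if r.qname == rec.qname and r.end != rec.end and r.pos == rec.pnext
    ]


def reciprocal_mates(rec, records):
    """Forward candidates whose own field 8 points back at rec."""
    return [r for r in forward_candidates(rec, records) if r.pnext == rec.pos]


# Hypothetical positions: P1 = 100, P2 = 300, P3 = 5000; B maps to P2 both times.
records = [
    Aln("pair1", "A", 100, 300),    # A->P1, mate at P2
    Aln("pair1", "B", 300, 100),    # B->P2, mate at P1
    Aln("pair1", "A", 5000, 300),   # A->P3, mate at P2
    Aln("pair1", "B", 300, 5000),   # B->P2, mate at P3
]

a_p1 = records[0]
print(len(forward_candidates(a_p1, records)))  # 2: field 8 alone is ambiguous
print(reciprocal_mates(a_p1, records))         # only the B record with pnext=100
```

      So under that assumption the pairing would still be recoverable, but only by checking field 8 in both directions.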

      I wonder how Cufflinks and similar tools process these kinds of scenarios. Perhaps they don't need to deconvolute pairs?