  • Strange secondary alignment

    Hi all,

    I did an RNA-seq alignment using TopHat (provided with a GFF annotation file) and got quite a lot of reads with multiple alignments. I checked the alignment output and found a strange result for a randomly picked read pair. The following is from the TopHat accepted_hits.sam file:

    SRR.372451 97 chr11 18416183 3 1M2182N49M = 18421086 6241 GATTCCTTTTGGTTCCAAGTCCAATATGGCAACTCTAAAGGATCAGCTGA HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:+ NH:i:2 CC:Z:= CP:i:18418157 HI:i:0
    SRR.372451 353 chr11 18418157 3 1M208N49M = 18421086 4267 GATTCCTTTTGGTTCCAAGTCCAATATGGCAACTCTAAAGGATCAGCTGA HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:+ NH:i:2 HI:i:1
    SRR.372451 145 chr11 18421086 3 10M1288N40M = 18416183 -6241 TCTGGCAAAGACTATAATGTAACTGCAAACTCCAAGCTGGTCATTATCAC EHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:+ NH:i:2 CC:Z:= CP:i:18421086 HI:i:0
    SRR.372451 401 chr11 18421086 3 10M1288N40M = 18418157 -4267 TCTGGCAAAGACTATAATGTAACTGCAAACTCCAAGCTGGTCATTATCAC EHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:+ NH:i:2 HI:i:1

    TopHat reported two alignments for the read pair, both mapped to chromosome 11 at gene transcript NM_005566. Read 2 is aligned identically in both. However, the alignment for read 1 is a bit strange.
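For readers following along, the second column of each SAM record above is a bitwise flag. A minimal Python sketch for decoding it is shown here; the bit values come from the SAM specification, while the short names in `FLAG_BITS` and the helper `decode_flag` are illustrative choices, not part of any tool.

```python
# SAM flag bits as defined in the SAM specification.
# The short names are illustrative labels, not official identifiers.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# The four flags from the records quoted above.
for flag in (97, 353, 145, 401):
    print(flag, decode_flag(flag))
```

Flags 353 and 401 carry the 0x100 bit, which is what marks the second pair of records as the secondary alignment.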

    The first alignment for read 1 is
    18416183 3 1M2182N49M
    where, according to the GFF file used here, the first base mapped to coordinate 18416183, which happens to be the last base of the first exon of transcript NM_005566; after a 2182-base skipped region (intron), the remaining 49 bases mapped to the second exon, from coordinate 18418366 (the first base of exon 2) to 18418414.

    The secondary alignment for read 1 is
    18418157 3 1M208N49M
    where the first base mapped to coordinate 18418157, which is supposed to be inside the intron; the remaining 49 bases mapped to exon 2, as in the first alignment.
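The coordinate arithmetic for both alignments can be checked with a small Python sketch. It assumes the standard SAM conventions (1-based leftmost position; M, =, X, D and N consume reference bases; I, S, H and P do not). The function name `aligned_blocks` is made up for this illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks for each M/=/X run."""
    blocks = []
    ref = pos  # next reference coordinate to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            blocks.append((ref, ref + length - 1))
            ref += length
        elif op in "DN":
            ref += length  # deletion or skipped region (intron)
        # I, S, H and P do not consume reference bases
    return blocks


print(aligned_blocks(18416183, "1M2182N49M"))
print(aligned_blocks(18418157, "1M208N49M"))
```

Both calls give the same second block, 18418366 to 18418414; the two alignments differ only in where the single leading base is placed.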

    The first alignment is perfect, but the secondary alignment makes no sense to me.

    I don't know how to explain this result. Has anybody encountered a similar situation?

    P.S. I used TopHat v2.0.6.

    Thanks,
    Alex
    Last edited by webappl; 01-31-2013, 10:29 AM.

  • #2
    18418157 is the last base of an exon too, at least in the current Ensembl annotation. I'd suggest having another look at your GTF file. I wouldn't be surprised to see this sort of situation when there's only 1 or 2 bp hanging over the edge of an exon in a gene with numerous exons and splice forms (one could argue that in those cases it's better to just soft-clip the base and output a unique alignment, but that's probably application dependent).
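One way to run that check is to list which transcripts have an exon ending at a given coordinate. The Python sketch below assumes GTF-style attributes (`transcript_id "..."`) in column 9. The toy records are invented for illustration: only the coordinates 18416183, 18418157 and 18418366 come from this thread, and the transcript names `TX_A` and `TX_B`, the exon starts, and the exon 2 end are placeholders.

```python
import re
from collections import defaultdict

# Toy GTF records (tab-separated). Hypothetical except for the three
# coordinates taken from the thread; see the note above.
TOY_GTF = "\n".join([
    'chr11\ttoy\texon\t18416000\t18416183\t.\t+\t.\tgene_id "G"; transcript_id "TX_A";',
    'chr11\ttoy\texon\t18418366\t18418500\t.\t+\t.\tgene_id "G"; transcript_id "TX_A";',
    'chr11\ttoy\texon\t18418100\t18418157\t.\t+\t.\tgene_id "G"; transcript_id "TX_B";',
    'chr11\ttoy\texon\t18418366\t18418500\t.\t+\t.\tgene_id "G"; transcript_id "TX_B";',
])

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_ends_by_transcript(gtf_text):
    """Map each transcript_id to the set of its exon end coordinates."""
    ends = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ends[match.group(1)].add(int(fields[4]))
    return ends


def transcripts_with_exon_end(gtf_text, position):
    """Return the sorted transcript ids that have an exon ending at position."""
    by_tx = exon_ends_by_transcript(gtf_text)
    return sorted(tx for tx, ends in by_tx.items() if position in ends)


print(transcripts_with_exon_end(TOY_GTF, 18418157))
print(transcripts_with_exon_end(TOY_GTF, 18416183))
```

If a coordinate that looks intronic in one transcript shows up as an exon end in another, the secondary alignment is most likely explained by an alternative splice form.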

    • #3
      Originally posted by dpryan
      18418157 is the last base of an exon too, at least in the current Ensembl annotation. I'd suggest having another look at your GTF file. I wouldn't be surprised to see this sort of situation when there's only 1 or 2 bp hanging over the edge of an exon in a gene with numerous exons and splice forms (one could argue that in those cases it's better to just soft-clip the base and output a unique alignment, but that's probably application dependent).
      Thanks, dpryan. You are right.
      I realized that there was a mistake in my script for extracting transcript information from the GFF annotation, and as a consequence the alternative splice forms were missed.

      It is annoying that alignment ambiguity caused by 1 or 2 bases hanging over the edge of an exon complicates the estimation of transcript abundance.
