Header Leaderboard Ad

Collapse

TopHat: Aligning SOLiD strand specific RNAseq

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • TopHat: Aligning SOLiD strand specific RNAseq

    Hey all,

    I'm interested in realigning some single-end strand specific RNAseq data I got from SOLiD, using TopHat.

    However, I've been completely unable to map the reads to only one strand at a time. In the current manual (version 1.2.0) I can not find any flag for forcing a no-reverse complement search. All I found is the confusing --library-type fr-firststrand/fr-secondstrand/fr-unstranded flag, and all of them align reads to both strands anyway.

    I know that TopHat is built on top of Bowtie, so I tried aligning the reads by directly using Bowtie, and then, by using the --norc and --nofw flags I am actually able to map to one strand at a time.

    Has anyone found a way to make it work properly with TopHat?
    Thanks!

  • #2
    No one?

    I have these imput files:
    forward.csfasta, forward.qual
    reverse.csfasta, reverse.qual

    I proceed to align only the forward files, using the typical TopHat flags,
    Code:
    tophat --color --quals -p 7 -g 1 --coverage-search
    plus any of the three options that seem to be strand-related,
    Code:
    --library-type fr-unstranded
    --library-type fr-firststrand
    --library-type fr-secondstrand
    and I look at the reads that map on a gene on the reverse strand. With any of the three options, there are still some reads mapping to that gene.

    I was expecting zero hits, since I'm aligning forward reads to a reverse gene. However none of the methods gives me that. The other option that occured to me would be that if the protocol is not 100% reliable, the 'forward' labeled reads would contain some real 'reverse' reads. If this were the case, then maybe TopHat was still mapping them somehow.

    Looking a little bit deeper into that, I searched the sam file for the strand information of the aligned reads that fell in (a region of) that gene, to discover that

    - For the fr-unstranded flag: there are 3580 hits, only 1 with the XS:A:- (correct) strand and 15 with the XS:A:+ (wrong) strand (the rest simply do not have strand information)
    - For the fr-firststrand: there are 3579 hits, 3564 with the XS:A:- (correct) strand and 15 with the XS:A:+ (wrong) strand
    - For the fr-secondstrand: there are 3580 hits, 2 with the XS:A:- (correct) strand and 3564 with the XS:A:+ (wrong) strand

    A first glimpse, it would seem that the fr-firststrand flag does exactly that, since the majority of the aligned reads report negative (correct) strand. However I got a little bit suspicious to see that exactly the same number of reads were reported to map to the positive (wrong) strand by using the fr-secondstrand flag. A look at the names of the reads revealed that indeed, it is the very same reads that are being mapped to the region with the two methods, and that only the reported strand is changed, like if TopHat was only adding the strand information based on which flag did I use, but not producing it himself by seeing if he could map the read to the expected strand, or if he had to make the reverse compliment in order to map it.

    I'm still at a loss here, since as I said in the OP, bowtie is able to correctly map those very same reads to one strand or the other by the use of the --norc and --nofw flags.

    Any thoughts?
    Last edited by Ender985; 04-21-2011, 08:35 AM.

    Comment


    • #3
      Well, just for future reference, I managed to do it. But it is not as straightforward as one would whish.

      One of the biggest problems I had is that I was under the assumption that one file contained only the 'originally forward reads' (as in, reads coming from forward-stranded genes) and another with the 'originally reversed reads'. But this turned out to be completely false. What I really had was two files (since it was a PE experiment) with a number of reads coming from both strands, that were guaranteed NOT to be complimentary to the original RNA orientation, as this is what the strand-specific protocol does for you.

      Once that was cleared out, I still had to battle against the poorly explained library flags of TopHat. After several tries, it turns out that for the SOLiD data, the flag that finally gave me correct results was --fr-secondstranded , but I think there is no easy way to know which is the flag you need to use short of trying both of them and seeing which one produces correctly stranded results.

      The third problem was that TopHat silently does the strand specific alignment, but then dumps both forward and reverse reads in the same .bam file. It turns out you have to manually extract the reads that mapped to each strand. I did using samtools and grep:

      samtools view -h accepted_hits.sorted.bam | grep -e XS:A:+ -e '^@' | samtools view -bS -o forward/accepted_hits.bam -
      samtools view -h accepted_hits.sorted.bam | grep -e XS:A:- -e '^@' | samtools view -bS -o reverse/accepted_hits.bam -


      Ony then, if you produce pileups from the two resulting .bam files, you will have a pileup with the forward RNA information, and another one with the reverse one.

      I hope this post will save some people the few weeks of frustration I had to undergo to finally solve this.

      Comment


      • #4
        Originally posted by Ender985 View Post
        Well, just for future reference, I managed to do it. But it is not as straightforward as one would whish.

        One of the biggest problems I had is that I was under the assumption that one file contained only the 'originally forward reads' (as in, reads coming from forward-stranded genes) and another with the 'originally reversed reads'. But this turned out to be completely false. What I really had was two files (since it was a PE experiment) with a number of reads coming from both strands, that were guaranteed NOT to be complimentary to the original RNA orientation, as this is what the strand-specific protocol does for you.

        Once that was cleared out, I still had to battle against the poorly explained library flags of TopHat. After several tries, it turns out that for the SOLiD data, the flag that finally gave me correct results was --fr-secondstranded , but I think there is no easy way to know which is the flag you need to use short of trying both of them and seeing which one produces correctly stranded results.

        The third problem was that TopHat silently does the strand specific alignment, but then dumps both forward and reverse reads in the same .bam file. It turns out you have to manually extract the reads that mapped to each strand. I did using samtools and grep:

        samtools view -h accepted_hits.sorted.bam | grep -e XS:A:+ -e '^@' | samtools view -bS -o forward/accepted_hits.bam -
        samtools view -h accepted_hits.sorted.bam | grep -e XS:A:- -e '^@' | samtools view -bS -o reverse/accepted_hits.bam -


        Ony then, if you produce pileups from the two resulting .bam files, you will have a pileup with the forward RNA information, and another one with the reverse one.

        I hope this post will save some people the few weeks of frustration I had to undergo to finally solve this.
        An useful thread to me.
        I wanna say that it seems not appropriate to align RNA-Seq reads with Bowtie cuz it doesn't support the split reads mapping.
        And the definition of XS:A which produced by Tophat is telling you the strand of the transcript that produced the read, not the read itself.

        I also have some questions Ender, could you give me some help?
        1. I have the SOLiD RNA-Seq PE data, as far as I concerned, the F3 and F5 are in opposite orientation and both of them mapping to the different strands separately, am I right? Because I saw a Post (the last one) http://seqanswers.com/forums/showthread.php?t=6317 illustrated the F3 and F5 are actually on the same strand, which confused me very much.
        2. If I'm not wrong, so if I wanna see the reads on the positive strand, is that correct (like you do) to get the reads with XS:A:+ tag, not just simply sum up the reads mapping to the positive strand cuz the F5 read in the same pair which maps to negative strands are actually from the positive strand too?

        Thank you.
        Conan.
        Last edited by dadada4ever; 07-09-2012, 10:38 PM. Reason: the code command in my reply cause reading difficulty

        Comment

        Working...
        X