  • Optimizing tophat mapping for mixed RNA-Seq data

    Hi all,

    I’m currently using Tophat and bowtie2 to map 100bp PE RNA-Seq reads from a mixed human/bacterial sample. We’re more interested in the bacterial side of things, but there's plenty that we can learn from the human reads too. We originally used bowtie2 to map human reads to hg19, and then a second bowtie2 run to map bacterial reads. However, we then switched to tophat for the human step, since it can align spliced reads, and redid the processing; as expected, a much larger number of human reads mapped. But when we repeated the bowtie2 run for bacterial reads, significantly fewer reads mapped.

    We’ve also repeated tophat with a few different settings to try to find what's optimal. The no-discordant option in tophat changes the results quite a lot, both for the number of human reads mapped and for the number of bacterial reads mapped. I haven’t looked into the biological outcomes of this yet, but the differences in read counts have me concerned, as does how different the bacterial reads are between the file preprocessed with tophat on the default settings and the no-discordant run.
    I’ve compared the bacterial reads mapped by bowtie2 after tophat was run with default settings against those mapped after tophat was run with the no-discordant option, and the two sets share only about 0.0007% of the bacterial reads, which is very odd.
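One way to reproduce a comparison like the one described above is to collect the names of mapped reads from each bowtie2 output and measure how many appear in both. This is only a minimal sketch: it assumes the outputs are available as SAM text, and it assumes "shared" means the intersection as a percentage of the union, which the post does not specify. The function names `read_ids` and `overlap_percent` are hypothetical.

```python
def read_ids(sam_lines):
    """Collect the names of mapped reads from SAM-formatted lines."""
    ids = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag = int(fields[1])
        if flag & 0x4:  # SAM flag bit 0x4 marks an unmapped read
            continue
        ids.add(fields[0])
    return ids


def overlap_percent(ids_a, ids_b):
    """Shared read names as a percentage of all names seen in either set.

    Assumption: the denominator is the union of both sets.
    """
    union = ids_a | ids_b
    if not union:
        return 0.0
    return 100.0 * len(ids_a & ids_b) / len(union)
```

Read names in paired-end output appear once per mate, so a set of names counts fragments rather than individual mates.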

    Basically, I’m wondering if anyone could shed light on why the different tophat parameters have such a huge impact on the number of reads that bowtie2 later identifies as bacterial?

    Also, any general advice would be appreciated.
    Thanks

  • #2
    Are you mapping your reads first to one and then the other, or to both at the same time? Ideally it shouldn't make a difference. The way you described it, where you map with tophat to human and then get fewer reads with bowtie2 mapping to bacterial genomes, makes me wonder whether some of the bacterial reads are being mapped to the human genome. It's similar to the problem of mapping reads to only part of the genome rather than the whole genome. Tophat, bowtie2, or any tool will try to map the read no matter what. Maybe a read is genuinely from one genome, but if that genome is absent, the aligner will settle for the best it can get from the reference you give it. Maybe combine your two references, map to both simultaneously, and see what results.
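The suggestion to combine the two references could be prepared along these lines. This is a hedged sketch, not a tested pipeline: it concatenates FASTA records and prepends a source tag to each sequence name so that human and bacterial hits can be told apart after mapping. The prefixes `human_` and `bact_` and the function names are hypothetical choices for illustration.

```python
def tag_fasta(lines, prefix):
    """Yield FASTA lines with `prefix` prepended to every sequence name."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


def combine_references(sources):
    """Merge several FASTA inputs into one list of lines.

    `sources` is an iterable of (prefix, lines) pairs; the prefixes are
    hypothetical labels such as "human_" and "bact_".
    """
    combined = []
    for prefix, lines in sources:
        for line in tag_fasta(lines, prefix):
            # Make sure each input ends with a newline so records do not fuse.
            combined.append(line if line.endswith("\n") else line + "\n")
    return combined
```

The combined lines would then be written to a single FASTA file and indexed with `bowtie2-build`. Whether an index of that size can actually be built is exactly the open question raised later in the thread.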

    • #3
      Originally posted by chadn737
      Are you mapping your reads first to one and then the other, or to both at the same time? Ideally it shouldn't make a difference. The way you described it, where you map with tophat to human and then get fewer reads with bowtie2 mapping to bacterial genomes, makes me wonder whether some of the bacterial reads are being mapped to the human genome. It's similar to the problem of mapping reads to only part of the genome rather than the whole genome. Tophat, bowtie2, or any tool will try to map the read no matter what. Maybe a read is genuinely from one genome, but if that genome is absent, the aligner will settle for the best it can get from the reference you give it. Maybe combine your two references, map to both simultaneously, and see what results.
      First to one, then to the other. I had thought about this before, but when building the bacterial database we hit the maximum size of reference database or index that bowtie2 can build (well, that's what I've been told; it was built just before I started this project). This is definitely something to look into though, thanks!

      If we were going to map both human and bacterial reads simultaneously, we'd have to use tophat in order to map the human reads efficiently (human reads make up a large share of the reads in our samples). Do you (or anyone else who sees this post) know how using tophat to map bacterial reads would work out, since tophat was designed to look for spliced reads?
      Last edited by bob-loblaw; 02-14-2013, 09:38 AM.

      • #4
        The size limit on the index is a problem. You could go ahead and combine them and see if what they told you is true. If it is, the worst that happens is that you get an error message.

        As for using Tophat on bacterial reads: Tophat will try to align reads to the genome first, before looking for splicing. Ideally, all the bacterial reads will align to the bacterial genome in this first round and not be spliced. I won't say that won't happen, because inevitably some will have some sort of mismatch and show up as spliced.

        Have you tried aligning reads to the bacterial genome and then to the human? Or has it only been human then bacterial?
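To see how often bacterial reads do come out spliced, one could count alignments whose CIGAR string contains an `N` operation, which the SAM format uses for a skipped region of the reference. The sketch below assumes SAM text input and assumes that bacterial reference names carry a `bact_` prefix; both the prefix and the function name `count_spliced` are hypothetical.

```python
def count_spliced(sam_lines, ref_prefix="bact_"):
    """Return (total, spliced) counts for alignments on matching references.

    An alignment counts as spliced when its CIGAR contains an 'N' operation.
    `ref_prefix` is a hypothetical naming convention for bacterial sequences.
    """
    total = 0
    spliced = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        flag, rname, cigar = int(fields[1]), fields[2], fields[5]
        if flag & 0x4 or not rname.startswith(ref_prefix):
            continue  # unmapped, or mapped to a non-bacterial reference
        total += 1
        if "N" in cigar:
            spliced += 1
    return total, spliced
```

A high spliced fraction on bacterial references would suggest that those alignments deserve a closer look before they are counted.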

        • #5
          Originally posted by chadn737
          The size limit on the index is a problem. You could go ahead and combine them and see if what they told you is true. If it is, the worst that happens is that you get an error message.

          As for using Tophat on bacterial reads: Tophat will try to align reads to the genome first, before looking for splicing. Ideally, all the bacterial reads will align to the bacterial genome in this first round and not be spliced. I won't say that won't happen, because inevitably some will have some sort of mismatch and show up as spliced.

          Have you tried aligning reads to the bacterial genome and then to the human? Or has it only been human then bacterial?
          I haven't tried aligning reads to the bacterial genome and then to human, but originally we were using bowtie2 to map human reads (which mapped only a few thousand reads per file, compared to the tens of millions that tophat mapped for the same file). Then, when we ran bowtie2 to map bacterial reads, we got about 5 or 10 times as many bacterial reads mapped as we did after using tophat to align human reads. (Since so few human reads were aligned by bowtie2, this gives me an indication of what running bowtie2 for bacterial reads before tophat for human would produce.) Basically, I think we'll have the same problem no matter which alignment we do first: if bacterial goes first we'll get a lot of false positives, and vice versa if human goes first. Thanks for all your help here! I'll definitely be trying a tophat run with a database of both human and bacterial as soon as I can!

          Finally, if I could ask you one more question: what about the no-discordant option that I mentioned in the OP? Do you think I should use that parameter when running tophat, or should I just go with the default settings?
          Last edited by bob-loblaw; 02-15-2013, 03:21 AM.

          • #6
            Another problem just popped into my head: if tophat tries to align all reads first without looking for splicing, won't I have the same problem as before, where a lot of human reads are falsely identified as bacterial? Or do you know whether Tophat will first try to align everything without splicing, then with splicing, and only return the best hit?
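Whatever Tophat does internally, a "keep only the best hit" rule can also be applied after the fact by comparing each read's score from the human run with its score from the bacterial run. The sketch below is not a description of Tophat's behaviour. It assumes a per-read score where higher is better (for example the `AS:i` tag that bowtie2 reports) and assumes, loudly, that scores from the two runs are comparable, which may not hold across different aligners or settings. The function name `assign_reads` and the `min_margin` parameter are hypothetical.

```python
def assign_reads(human_scores, bacterial_scores, min_margin=1):
    """Assign each read to the reference where it scored better.

    Both inputs map read name -> alignment score (higher is better).
    Reads whose scores differ by less than `min_margin` are 'ambiguous'.
    Assumption: scores from the two runs are on a comparable scale.
    """
    assignment = {}
    for read_id in set(human_scores) | set(bacterial_scores):
        h = human_scores.get(read_id)
        b = bacterial_scores.get(read_id)
        if h is None:
            assignment[read_id] = "bacterial"  # only aligned to bacteria
        elif b is None:
            assignment[read_id] = "human"  # only aligned to human
        elif h - b >= min_margin:
            assignment[read_id] = "human"
        elif b - h >= min_margin:
            assignment[read_id] = "bacterial"
        else:
            assignment[read_id] = "ambiguous"
    return assignment
```

Because every read is scored against both references before being assigned, the result does not depend on which alignment was run first.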
