Mapping Human RNA Seq: Transcriptome vs. Genome

  • Mapping Human RNA Seq: Transcriptome vs. Genome

    Would anyone out there like to share their opinions about the relative merits and pitfalls of using the human transcriptome vs. the human genome as a reference for mapping some SOLiD RNA-Seq runs? I am guessing that this comes down to two questions: the relative quality of the transcriptome sequence vs. the genome sequence (in other words, how complete the transcriptome build is relative to the genome build), and the role of splice-aware aligners (e.g. TopHat) and their effect on read mapping. Any thoughts out there? To be honest, I don't know a whole lot about how "complete" the human transcriptome is supposed to be (# of tissues, life stages, etc.). I'm just looking for the "best" way to do this. I could run both, but thought I'd start from first principles and go from there, as these BAM files are huge and a pain to store.

    Thanks

  • #2
    Sorry to bump an old question, but I'm also wondering about this at the moment and I can't seem to find an answer anywhere.

    What are the merits of using the human transcriptome vs human genome for RNA-Seq mapping?

    • #3
      Transcriptome:
      + better specificity, easier to resolve isoforms, need less seq depth (probably)
      - restricted to known transcripts

      Genome:
      + can find new things
      - need to sequence more to do accurate isoform assignment, will miss more known splice junctions

      • #4
        Originally posted by Derek-C:
        Sorry to bump an old question, but I'm also wondering about this at the moment and I can't seem to find an answer anywhere.

        What are the merits of using the human transcriptome vs human genome for RNA-Seq mapping?

        I am of the opinion that it is better to align to the genome. With STAR it can be done very quickly.

        The question is, do you believe the transcriptome annotation is really complete? We know from the ENCODE project that something like 80% of the genome is transcribed. If you only align reads to the transcriptome, you could be forcing some reads to align to known transcripts, some of which could have been better placed on an unannotated region of the genome, thus reducing ambiguity.

        Keep in mind that hardly any genome is really complete... in fact, you should align not only to the chromosomes, but to all available random contigs and "decoy" sequences. So if genomes are never really complete - how can we expect the transcriptome to be anything close to complete?

        The only advantage to transcriptome alignment is speed and memory savings... but I think with STAR this is not so much an issue anymore.

        • #5
          Originally posted by kopi-o:
          Transcriptome:
          + better specificity, easier to resolve isoforms, need less seq depth (probably)
          - restricted to known transcripts

          Genome:
          + can find new things
          - need to sequence more to do accurate isoform assignment, will miss more known splice junctions

          If you input a GTF file into STAR, you can have it index the known splice junctions for you...
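To make the GTF suggestion above concrete, here is a minimal sketch that assembles a STAR index-building command with annotated splice junctions included. The helper name `star_index_command` and all file paths are hypothetical; the flags themselves (`--runMode genomeGenerate`, `--sjdbGTFfile`, `--sjdbOverhang`) are standard STAR options, and read length minus one is the overhang value the STAR manual suggests. The function only builds the argument list, so it can be inspected before anything is run.

```python
from typing import List, Optional


def star_index_command(
    genome_dir: str,
    fasta: str,
    gtf: Optional[str] = None,
    read_length: int = 100,
    threads: int = 8,
) -> List[str]:
    """Build the argument list for a STAR genome index (hypothetical helper).

    If a GTF is given, STAR inserts the annotated splice junctions into the
    index, so known junctions do not have to be rediscovered from the reads.
    """
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--runThreadN", str(threads),
    ]
    if gtf is not None:
        # Overhang of read length minus one is the value the manual recommends.
        cmd += ["--sjdbGTFfile", gtf, "--sjdbOverhang", str(read_length - 1)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; pass the list to subprocess.run(cmd, check=True) to execute.
    print(" ".join(star_index_command("star_index", "genome.fa", gtf="genes.gtf")))
```

Reads that span unannotated junctions can still be aligned against such an index; the annotation only gives STAR a head start on the known ones.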

          • #6
            One problem with mapping to the transcriptome is that reads from paralogous genes can be misattributed; see Schrider et al.'s PLoS One paper critiquing Cheung's Science paper on RNA editing. Since ~70% of the human genome is transcribed, you may miss a lot of information by mapping only to the transcriptome.

            • #7
              Not to bump an old thread, but this still seems to be an open question. I think Cufflinks, for example, can use both the transcriptome annotation and the genome to resolve certain problems with pseudogenes and homologous genes, which seems like it should be a better approach. Still, I am partial to mapping to the transcriptome, at least for differential expression. "Is there evidence for a transcript that hasn't been seen before?" seems like a different question, and one that can be verified with lab work. There is also a theory that transcripts should be assembled before mapping, which should remove most of the dominant allele bias, though I don't think the assemblers are quite up to it yet.

              • #8
                So finally, is it good or bad to use transcriptome references for a differential gene expression study?

                • #9
                  I think now you can do both at the same time. HISAT2 builds suffix indexes with annotations built in, so whichever mapping best explains the data is chosen.

                  • #10
                    I kinda take issue with both approaches. With alignment to the genome I always miss some alignments, because aligning RNA-Seq to the genome is relatively difficult. STAR misses some alignments that GSNAP picks up and, on occasion, even Bowtie 2 picks up alignments STAR misses (not spliced ones, of course). Furthermore, when I take reads that failed to align to the genome and map them directly to the transcriptome, many of those reads align. And this is true even at low error rates.

                    If I go the other way and map to the transcriptome first, I run some risk of mapping reads to genes when those reads would map more ambiguously to the genome. I have no idea how much of a problem that is, in part because I'm not confident in any aligner's ability to find all possible alignments of a read to the genome. With some data I may map to the genome first, throw out reads with MAPQ==0, and then map the remaining aligned and unaligned reads to the transcriptome.

                    In the end, the probabilistic transcriptome methods (RSEM, eXpress, kallisto, Salmon) have been shown to produce more accurate gene expression estimates than genome approaches (Cufflinks, StringTie, etc.). Whether accurate expression is necessary to detect accurate differential expression is up for debate; I'd guess it's not as big of a deal. However, when it comes to publication we like to report TPM values for genes, since that's the closest thing to a standard we have in RNA-Seq. To get accurate TPM you have to use some type of probabilistic isoform-level expression estimation, and it's the direct-to-transcriptome methods that seem to work the best.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */
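Since TPM comes up above as the value people report, here is a minimal sketch of how TPM is computed from per-transcript read counts and effective lengths, and how transcript-level TPM is summed to gene level. It assumes the counts have already been estimated by one of the quantifiers named above; the function names, transcript IDs, and numbers are made up for illustration and are not taken from any of those tools.

```python
from typing import Dict


def tpm(counts: Dict[str, float], eff_lengths: Dict[str, float]) -> Dict[str, float]:
    """Convert estimated read counts to transcripts per million.

    Each count is divided by the transcript's effective length, and the
    resulting rates are rescaled so that they sum to one million.
    """
    rates = {tx: counts[tx] / eff_lengths[tx] for tx in counts}
    total = sum(rates.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


def gene_tpm(tx_tpm: Dict[str, float], tx2gene: Dict[str, str]) -> Dict[str, float]:
    """Sum transcript-level TPM over the isoforms of each gene."""
    out: Dict[str, float] = {}
    for tx, value in tx_tpm.items():
        gene = tx2gene[tx]
        out[gene] = out.get(gene, 0.0) + value
    return out


if __name__ == "__main__":
    # Illustrative values only: two isoforms of geneA and one of geneB.
    counts = {"tx1": 100.0, "tx2": 100.0, "tx3": 300.0}
    lengths = {"tx1": 1000.0, "tx2": 2000.0, "tx3": 3000.0}
    tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
    print(gene_tpm(tpm(counts, lengths), tx2gene))
```

Because the length normalization happens per transcript before summing, gene-level TPM depends on how reads are split among isoforms, which is why isoform-level estimation matters even when only gene-level values are reported.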
