  • FPKM and mapping efficiency

    Cufflinks addresses some biases in the calculation. I don't know enough about it to say much, but it looks like perhaps the most advanced user-friendly method of FPKM calculation at this time.

    My concern is it doesn't address mapping efficiency. Thus, your parameters and software used for read mapping could have a large effect on the calculated FPKM values. Has anyone addressed this?

    It seems like you could estimate read mapping efficiency for single-end reads by generating a file of every possible read in the genome, mapping those reads back, and then dividing the number that map correctly to each gene by the gene length. Maybe this is a little too simplistic.

    Does anyone have any thoughts on this?
    --------------
    Ethan
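The "every possible read" idea from the first post can be sketched in a few lines. This is a minimal illustration, not a method from the thread: it assumes exact-match alignment on the forward strand only, treats a read as mappable when its sequence occurs exactly once in the genome, and uses a hypothetical helper name (`unique_kmer_fraction`) and a toy sequence. A real aligner tolerates mismatches and handles both strands and spliced reads, so the numbers would differ.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, start: int, end: int, k: int) -> float:
    """Fraction of length-k reads starting in genome[start:end] that occur
    exactly once in the whole genome (exact match, forward strand only)."""
    # Count every possible read of length k across the genome.
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    # Reads must fit inside the genome, so clip the last start position.
    last_start = min(end, len(genome) - k + 1)
    starts = range(start, last_start)
    if len(starts) == 0:
        return 0.0
    unique = sum(1 for i in starts if counts[genome[i:i + k]] == 1)
    return unique / len(starts)


# Toy genome (made up): "ACGT" occurs twice, so reads from the first
# "gene" (positions 0-3) are only partly unique.
toy_genome = "ACGTACGTTTGCA"
fraction = unique_kmer_fraction(toy_genome, start=0, end=4, k=4)
print(f"uniquely mappable fraction: {fraction:.2f}")
```

Multiplying this fraction by the gene length gives one possible "effective length" to divide by instead of the full gene length.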

  • #2
    I rarely work with RNA-seq data and I do not use cufflinks, but I am not sure how much mapping efficiency matters. The differences between mapping algorithms/settings are mostly caused by differences in sensitivity. After normalization, FPKM should largely stay the same except in a few regions with high diversity.

    I do not think a mask is useful in general, either, unless you are comparing data of very different read lengths or using a mapper without a proper mapping quality. This is at least true for variant calling.


    • #3
      I don't work with Cufflinks either, but it seems like a reasonable tool to compute FPKM.

      This is the example that concerned me. Nanog has several pseudogenes. If you throw away reads that map to more than one location, I was told that nothing maps to Nanog. Thus, even though Nanog transcription is activated during the transformation from differentiated cells to iPS cells, you do not see it. If this is true, which I was told it is (I've never looked myself), then in this case a gene that is highly expressed appears to be undetectable.

      Still, the bottom line is I don't really know, but I would like to hear others' opinions.
      --------------
      Ethan
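The effect described in post #3 is easy to reproduce on toy data. The sketch below is an illustration under stated assumptions, not how any particular tool counts reads: the read list and locus names are made up, and splitting each multi-mapped read equally among its loci is just one simple alternative to discarding it.

```python
from collections import defaultdict


def count_reads(alignments, keep_multi):
    """Count reads per locus.

    alignments: one list of locus names per read (all places it maps to).
    keep_multi: if False, discard reads that map to more than one locus;
                if True, give each locus an equal share of the read.
    """
    counts = defaultdict(float)
    for loci in alignments:
        if len(loci) == 1:
            counts[loci[0]] += 1.0
        elif keep_multi and loci:
            share = 1.0 / len(loci)
            for locus in loci:
                counts[locus] += share
    return dict(counts)


# Hypothetical reads: every read from the gene also hits a pseudogene.
reads = [
    ["NANOG", "pseudogene_A"],
    ["NANOG", "pseudogene_A"],
    ["NANOG", "pseudogene_B"],
    ["GAPDH"],
]

unique_only = count_reads(reads, keep_multi=False)
fractional = count_reads(reads, keep_multi=True)
print("unique only:", unique_only.get("NANOG", 0.0))
print("fractional: ", fractional["NANOG"])
```

With unique-only counting the gene gets no reads at all, even though three reads plausibly came from it; the fractional scheme recovers some signal at the cost of also crediting the pseudogenes.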


      • #4
        Hi Ethan,

        I do not think it is possible to calculate mapping efficiency for RNA-seq data, since reads are spliced and can span hundreds of kilobases. In principle, we could do that just for the transcriptome, but then, of course, we would be blind to anything except annotations.

        Alignments do have a big effect on the transcript assembly. We actually looked at precisely the Nanog locus in ENCODE H1ES data. The attached figures show the Cufflinks assembly with Tophat or STAR alignment. In this case, Tophat misses one of the junctions because it maps the reads contiguously, with mismatches, to a pseudogene, so Cufflinks cannot assemble the full-length transcript. However, there are still reads mapping to this locus, so it will return a non-zero FPKM. STAR recovers this junction and allows Cufflinks to reconstruct the whole transcript. Note that these are pretty old results, from Fall 2010, and Tophat may have improved since then.

        In any case, it is probably prudent to try a few different aligners for problematic genes.
        Attached Files


        • #5
          I have had the exact same experience with Nanog in RNA-seq!

          I do think "mapping efficiency" (which is often referred to as "mappability") matters in RNA-seq; I have read a manuscript (not published yet) which argued pretty convincingly that it should be corrected for (and showed a nice way to do it). Methods like NEUMA and some others attempt to do this. The manuscript I mentioned showed that Cufflinks does have a certain systematic bias due to mappability effects.


          • #6
            Probably I misunderstood "mapping efficiency" (I took it as sort of sensitivity). Anyway, I was talking about a global effect. For the vast majority of genes, changing mappers/settings would not lead to a big effect. Nonetheless, if you look at a particular gene having multiple paralogs, the mapping algorithm and the way to compute FPKM may matter a lot. I know a few groups still prefer their in-house pipelines so that they can fully understand and fix potential artifacts.


            • #7
              cufflinks is fine but keep in mind it's also giving you a more "processed" result than a simple read counter like htseq-count. if you want to use it i recommend the -b option (providing your genome's FASTA source), because without it i've seen cufflinks give some very odd expression levels to genes that are not justified by the actual reads aligned to those genes. the -b option seems to fix the over-estimates. there are still under-estimates, but at least those seem to be justified in some way, for example when a gene has coverage at only 80% of its exons. if i count reads aligning to that gene and compute its RPKM manually, i get a higher value than what cufflinks produces, while 90+ percent of the rest of the genes have roughly equal expression between my own calculation and theirs. so cufflinks is counting the fact that the gene doesn't have balanced and complete coverage against its FPKM value.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */
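For reference, the manual calculation mentioned in post #7 is the standard RPKM formula: reads on the gene, per kilobase of gene length, per million mapped reads. A minimal sketch follows; the function name and the example numbers are made up for illustration. For paired-end data the same formula applied to fragments rather than reads gives FPKM.

```python
def rpkm(read_count: float, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2 kb gene, 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(f"RPKM = {value:.1f}")
```

A plain count-based value like this ignores coverage uniformity and sequence bias, which is one reason it can disagree with a model-based estimate for individual genes.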


              • #8
                Yes, "mappability" was the word I was looking for and not "mapping efficiency". Oops.

                Anyway, it appears to be a more complex issue than I have the time or skills to take on. But it appears some more computationally oriented people are working on it. Until then, cufflinks or just dividing by transcript length should be good enough for my purposes. Thanks everyone for the insight!
                --------------
                Ethan
