  • ETHANol
    Senior Member
    • Feb 2010
    • 308

    FPKM and mapping efficiency

    Cufflinks addresses some biases in the calculation. I don't know enough about it to say much, but it looks like perhaps the most advanced user-friendly method of FPKM calculation at this time.

    My concern is it doesn't address mapping efficiency. Thus, your parameters and software used for read mapping could have a large effect on the calculated FPKM values. Has anyone addressed this?

    It seems like you could figure out read mapping efficiency with single-end reads by generating a file of every possible read in the genome, mapping those reads back, and then dividing the number that map to each gene by the gene length. Maybe this is a little too simplistic.

    Does anyone have any thoughts on this?
    --------------
    Ethan
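
The idea in the post above (generate every possible read, map it back, and normalize per gene) can be sketched in a few lines. This is only a toy illustration under loud assumptions: exact forward-strand string matching stands in for a real aligner, the genome is invented, and `uniquely_mappable_starts` and `gene_mappability` are hypothetical helper names, not part of any tool.

```python
from collections import Counter


def uniquely_mappable_starts(genome: str, read_len: int) -> list[bool]:
    """Mark each read start position whose read_len-mer occurs exactly once.

    Assumption: exact, forward-strand matching stands in for a real aligner.
    """
    n_starts = len(genome) - read_len + 1
    reads = [genome[i:i + read_len] for i in range(n_starts)]
    occurrences = Counter(reads)
    return [occurrences[read] == 1 for read in reads]


def gene_mappability(unique: list[bool], start: int, end: int, read_len: int) -> float:
    """Fraction of reads lying fully inside [start, end) that are unique."""
    last_start = end - read_len + 1  # exclusive upper bound on start positions
    window = unique[start:last_start] if last_start > start else []
    return sum(window) / len(window) if window else 0.0


# Toy genome (invented): a "gene", a spacer, an exact copy of the gene
# (standing in for a pseudogene), and an unrelated second gene.
genome = "GATTACA" + "CCGT" + "GATTACA" + "TGCAGG"
unique = uniquely_mappable_starts(genome, read_len=4)

duplicated_gene = gene_mappability(unique, 0, 7, read_len=4)
unrelated_gene = gene_mappability(unique, 18, 24, read_len=4)
print(duplicated_gene, unrelated_gene)
```

In this toy case no read from the duplicated gene is unique, while every read from the unrelated gene is. One could scale a gene's length by this fraction before normalizing, though whether that is adequate for spliced RNA-seq reads is questioned later in the thread.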
  • lh3
    Senior Member
    • Feb 2008
    • 686

    #2
    I rarely work with RNA-seq data and I do not use cufflinks, but I am not sure how much mapping efficiency matters. The difference between mapping algorithms/settings is mostly caused by differences in sensitivity. After normalization, FPKM should largely stay the same except in a few regions with high diversity.

    I do not think a mask is useful in general, either, unless you are comparing data of very different read lengths or using a mapper without a proper mapping quality. This is at least true for variant calling.

    • ETHANol
      Senior Member
      • Feb 2010
      • 308

      #3
      I don't work with Cufflinks either, but it seems like a reasonable tool to compute FPKM.

      This is the example that concerned me. Nanog has several pseudogenes. If you throw away reads that map to more than one location, I was told that nothing maps to Nanog. Thus, even though Nanog transcription is activated during the transformation from differentiated cells to iPS cells, you do not see it. If this is true, which I was told it is (I've never looked myself), in this case a gene that is highly expressed appears to be undetectable.

      Still, bottom line is I don't really know, but would like to hear others' opinions.
      --------------
      Ethan
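
The Nanog concern above can be illustrated with a toy count. This is a sketch with invented data: each read is represented by the list of loci it aligns to equally well, the locus names `pseudo_A` and `pseudo_B` are made up, and the 1/n fractional split is just one simple alternative to discarding multi-mapped reads, not a claim about what any particular tool does.

```python
from collections import defaultdict


def count_unique_only(alignments: dict[str, list[str]]) -> dict[str, float]:
    """Count only reads that align to exactly one locus."""
    counts: dict[str, float] = defaultdict(float)
    for loci in alignments.values():
        if len(loci) == 1:
            counts[loci[0]] += 1.0
    return dict(counts)


def count_fractional(alignments: dict[str, list[str]]) -> dict[str, float]:
    """Give each of a read's n loci a 1/n share of that read."""
    counts: dict[str, float] = defaultdict(float)
    for loci in alignments.values():
        if not loci:  # unaligned read, nothing to assign
            continue
        share = 1.0 / len(loci)
        for locus in loci:
            counts[locus] += share
    return dict(counts)


# Invented example: every read from the gene of interest also matches
# at least one pseudogene copy.
alignments = {
    "read1": ["NANOG", "pseudo_A"],
    "read2": ["NANOG", "pseudo_A"],
    "read3": ["NANOG", "pseudo_A", "pseudo_B"],
    "read4": ["ACTB"],
}

unique_counts = count_unique_only(alignments)
fractional_counts = count_fractional(alignments)
print(unique_counts.get("NANOG", 0.0), fractional_counts["NANOG"])
```

With unique-only counting the gene gets no reads at all, even though three reads originate from it or its copies; the fractional scheme at least keeps it visible, at the cost of crediting the pseudogenes as well.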

      • alexdobin
        Senior Member
        • Feb 2009
        • 161

        #4
        Hi Ethan,

        I do not think it is possible to calculate mapping efficiency for RNA-seq data, since reads are spliced and can span hundreds of kilo-bases. In principle, we could do that just for the transcriptome, but then, of course, we would be blind to anything except annotations.

        Alignments do have a big effect on the transcript assembly. We actually looked at precisely the Nanog locus on ENCODE H1ES data. The attached figures show the Cufflinks assembly with Tophat or STAR alignment. In this case, Tophat misses one of the junctions because it maps the reads contiguously with mismatches to a pseudogene, so Cufflinks cannot assemble the full-length transcript. However, there are still reads mapping to this locus so it will return non-zero FPKM. STAR recovers this junction and allows Cufflinks to reconstruct the whole transcript. Note that these are pretty old results, from Fall 2010, and Tophat may have improved since then.

        In any case, it is probably prudent to try a few different aligners for problematic genes.
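
One way to act on this advice is to compare per-gene FPKM from two aligner runs and flag the genes that disagree. The sketch below assumes the FPKM values have already been loaded into plain dictionaries; the function name `flag_discordant_genes`, the pseudocount, the fold-change threshold, and the numbers are all illustrative assumptions.

```python
import math


def flag_discordant_genes(
    fpkm_a: dict[str, float],
    fpkm_b: dict[str, float],
    min_abs_log2_fc: float = 1.0,
    pseudocount: float = 1.0,
) -> dict[str, float]:
    """Return genes whose FPKM differs between two runs by at least the threshold.

    The value is log2((b + pseudocount) / (a + pseudocount)); genes missing
    from one run are treated as FPKM 0 in that run.
    """
    flagged = {}
    for gene in sorted(set(fpkm_a) | set(fpkm_b)):
        a = fpkm_a.get(gene, 0.0)
        b = fpkm_b.get(gene, 0.0)
        log2_fc = math.log2((b + pseudocount) / (a + pseudocount))
        if abs(log2_fc) >= min_abs_log2_fc:
            flagged[gene] = log2_fc
    return flagged


# Invented values: only geneY changes substantially between the two runs.
aligner_a = {"geneX": 50.0, "geneY": 3.0, "geneZ": 120.0}
aligner_b = {"geneX": 52.0, "geneY": 31.0, "geneZ": 118.0}

flagged = flag_discordant_genes(aligner_a, aligner_b)
print(flagged)
```

Genes that show up in such a list are the ones worth inspecting by eye in a genome browser, as was done for the Nanog locus here.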

        • kopi-o
          Senior Member
          • Feb 2008
          • 319

          #5
          I have had the exact same experience with Nanog in RNA-seq!

          I do think "mapping efficiency" (which is often referred to as "mappability") matters in RNA-seq; I have read a manuscript (not published yet) which argued pretty convincingly that it should be corrected for (and showed a nice way to do it). Methods like NEUMA and some others attempt to do this. The manuscript I mentioned showed that Cufflinks does have a certain systematic bias due to mappability effects.

          • lh3
            Senior Member
            • Feb 2008
            • 686

            #6
            Probably I misunderstood "mapping efficiency" (I took it as sort of sensitivity). Anyway, I was talking about a global effect. For the vast majority of genes, changing mappers/settings would not lead to a big effect. Nonetheless, if you look at a particular gene having multiple paralogs, the mapping algorithm and the way to compute FPKM may matter a lot. I know a few groups still prefer their in-house pipelines so that they can fully understand and fix potential artifacts.

            • sdriscoll
              I like code
              • Sep 2009
              • 436

              #7
              cufflinks is fine but keep in mind it's also giving you a more "processed" result than a simple read counter like htseq-count. if you want to use it i recommend using the -b option (providing your genome's FASTA source) because without it i've seen cufflinks give some very odd expression levels to genes that are not justified based on the actual reads aligned to those genes. the -b option seems to fix the over-estimates. there are still under-estimates but at least those seem to be justified in some way, for example if a gene has coverage at only 80% of its exons. if I count reads aligning to that gene and compute its RPKM manually i get a higher value than what cufflinks produces, while 90+ percent of the rest of the genes have roughly equal expression between my own calculation and theirs. so cufflinks is counting the fact that the gene doesn't have balanced and complete coverage against its FPKM value.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */
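
For reference, the manual RPKM calculation mentioned above is just reads per kilobase of gene per million mapped reads. A minimal sketch, with a hypothetical function name and made-up numbers (FPKM is the same formula with fragments in place of reads for paired-end data):

```python
def rpkm(read_count: float, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)
print(example_rpkm)
```

Note that this simple version treats every base of the gene as equally countable, which is exactly why it can come out higher than a model-based estimate for a gene with uneven coverage.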

              • ETHANol
                Senior Member
                • Feb 2010
                • 308

                #8
                Yes, "mappability" was the word I was looking for and not "mapping efficiency". Oops.

                Anyway, it appears to be a more complex issue than I have the time or skills to undertake. But it appears some more computationally oriented people are on the issue. Until then, cufflinks or just dividing by transcript length should be good enough for my purposes. Thanks everyone for the insight!
                --------------
                Ethan
