  • ETHANol
    Senior Member
    • Feb 2010
    • 308

    FPKM and mapping efficiency

    Cufflinks addresses some biases in the calculation. I don't know enough about it to say much, but it looks like perhaps the most advanced user-friendly method of FPKM calculation at this time.

    My concern is it doesn't address mapping efficiency. Thus, your parameters and software used for read mapping could have a large effect on the calculated FPKM values. Has anyone addressed this?

    It seems like you could figure out read mapping efficiency with single-end reads by generating a file of every possible read in the genome, mapping that, and then dividing by the gene length. Maybe this is a little too simplistic.
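
The idea above can be sketched on a toy scale. This is a minimal illustration, not a real mappability pipeline: it assumes exact matching on the forward strand only, uses a k-mer count as a stand-in for an aligner, and the function name `unique_kmer_fraction` is hypothetical. It enumerates every possible read of length k and reports the fraction of reads starting inside a gene that occur exactly once in the genome.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, gene_start: int, gene_end: int, k: int) -> float:
    """Fraction of length-k reads lying inside [gene_start, gene_end)
    whose sequence occurs exactly once in the whole genome.

    Assumption: exact, forward-strand matching only (no mismatches,
    no reverse complement, no splicing).
    """
    # Count every possible read (k-mer) in the whole genome.
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    # Start positions of reads that fit entirely inside the gene.
    starts = range(gene_start, gene_end - k + 1)
    if len(starts) == 0:
        return 0.0
    unique = sum(1 for i in starts if counts[genome[i:i + k]] == 1)
    return unique / len(starts)


# Toy genome: the read "AAAA" occurs twice, every other 4-mer occurs once.
toy_genome = "AAAACCCCAAAAGGGG"
print(unique_kmer_fraction(toy_genome, 0, 8, 4))  # 0.8
```

A real estimate would run the simulated reads through the same aligner and settings used for the actual data, since mismatches and spliced reads change which positions are uniquely mappable.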

    Does anyone have any thoughts on this?
    --------------
    Ethan
  • lh3
    Senior Member
    • Feb 2008
    • 686

    #2
    I rarely work with RNA-seq data and I do not use cufflinks, but I am not sure how much mapping efficiency matters. The difference between mapping algorithms/settings is mostly caused by differences in sensitivity. After normalization, FPKM should largely stay the same except in a few regions with high diversity.

    I do not think a mask is useful in general, either, unless you are comparing data of very different read lengths or using a mapper without a proper mapping quality. This is at least true for variant calling.


    • ETHANol
      Senior Member
      • Feb 2010
      • 308

      #3
      I don't work with Cufflinks either, but it seems like a reasonable tool to compute FPKM.

      This is the example that concerned me. Nanog has several pseudogenes. If you throw away reads that map to more than one location, I was told that nothing maps to Nanog. Thus, even though Nanog transcription is activated during the transformation from differentiated cells to iPS cells, you do not see it. If this is true, which I was told it is (I've never looked myself), in this case a gene that is highly expressed appears to be undetectable.
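
A toy sketch of the effect described above, under stated assumptions: each read is represented only by the list of loci it aligns to, the locus names are made up for illustration, and the equal-split rule for multi-mapped reads is just one simple strategy, not what Cufflinks or any particular tool does. It shows how a gene whose reads all also hit a pseudogene drops to zero when multi-mappers are discarded.

```python
from collections import defaultdict


def count_reads(alignments, keep_multimappers):
    """Count reads per locus.

    alignments: list of lists; each inner list holds the loci one read maps to.
    keep_multimappers: if False, reads with more than one locus are discarded;
                       if True, each such read is split equally among its loci
                       (an illustrative rule, assumed here for simplicity).
    """
    counts = defaultdict(float)
    for loci in alignments:
        if len(loci) == 1:
            counts[loci[0]] += 1.0
        elif keep_multimappers and loci:
            share = 1.0 / len(loci)
            for locus in loci:
                counts[locus] += share
    return dict(counts)


# Hypothetical data: four reads are shared between a gene and its pseudogene.
reads = [["NANOG", "NANOG_pseudogene"]] * 4 + [["GAPDH"]] * 2

unique_only = count_reads(reads, keep_multimappers=False)
with_multi = count_reads(reads, keep_multimappers=True)
print(unique_only.get("NANOG", 0.0))  # 0.0
print(with_multi.get("NANOG", 0.0))   # 2.0
```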

      Still, the bottom line is I don't really know, but I would like to hear others' opinions.
      --------------
      Ethan


      • alexdobin
        Senior Member
        • Feb 2009
        • 161

        #4
        Hi Ethan,

        I do not think it is possible to calculate mapping efficiency for RNA-seq data, since reads are spliced and can span hundreds of kilo-bases. In principle, we could do that just for the transcriptome, but then, of course, we would be blind to anything except annotations.

        Alignments do have a big effect on the transcript assembly. We actually looked at precisely the Nanog locus on ENCODE H1ES data. The attached figures show the Cufflinks assembly with Tophat or STAR alignment. In this case, Tophat misses one of the junctions because it maps the reads contiguously with mismatches to a pseudogene, so Cufflinks cannot assemble the full-length transcript. However, there are still reads mapping to this locus so it will return non-zero FPKM. STAR recovers this junction and allows Cufflinks to reconstruct the whole transcript. Note that these are pretty old results, from Fall 2010, and Tophat may have improved since then.

        In any case, it is probably prudent to try a few different aligners for problematic genes.
        Attached Files


        • kopi-o
          Senior Member
          • Feb 2008
          • 319

          #5
          I have had the exact same experience with Nanog in RNA-seq!

          I do think "mapping efficiency" (which is often referred to as "mappability") matters in RNA-seq; I have read a manuscript (not published yet) which argued pretty convincingly that it should be corrected for (and showed a nice way to do it). Methods like NEUMA and some others attempt to do this. The manuscript I mentioned showed that Cufflinks does have a certain systematic bias due to mappability effects.


          • lh3
            Senior Member
            • Feb 2008
            • 686

            #6
            Probably I misunderstood "mapping efficiency" (I took it as sort of sensitivity). Anyway, I was talking about a global effect. For the vast majority of genes, changing mappers/settings would not lead to a big effect. Nonetheless, if you look at a particular gene having multiple paralogs, the mapping algorithm and the way to compute FPKM may matter a lot. I know a few groups still prefer their in-house pipelines so that they can fully understand and fix potential artifacts.


            • sdriscoll
              I like code
              • Sep 2009
              • 436

              #7
              cufflinks is fine but keep in mind it's also giving you a more "processed" result than a simple read counter like htseq-count. if you want to use it i recommend using the -b option (providing your genome's FASTA source) because without it i've seen cufflinks give some very odd expression levels to genes that are not justified based on the actual reads aligned to those genes. the -b option seems to fix the over-estimates. there are still under-estimates but at least those seem to be justified in some way, for example if a gene has coverage at only 80% of its exons. if i count reads aligning to that gene and compute its RPKM manually i get a higher value than what cufflinks produces, while 90+ percent of the rest of the genes have roughly equal expression between my own calculation and theirs. so cufflinks is counting the fact that the gene doesn't have balanced and complete coverage against its FPKM value.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */
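
For reference, the manual calculation mentioned above can be sketched as follows. This is a minimal version of the standard RPKM definition (reads per kilobase of gene per million mapped reads); the function name and the example numbers are hypothetical, and it deliberately ignores the coverage-uniformity and bias corrections that Cufflinks applies.

```python
def rpkm(gene_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads.

    RPKM = 1e9 * reads_on_gene / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Because this simple count does not penalize uneven coverage, it can come out higher than a model-based estimate for a gene where only part of the exons are covered.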


              • ETHANol
                Senior Member
                • Feb 2010
                • 308

                #8
                Yes, "mappability" was the word I was looking for and not "mapping efficiency". Oops.

                Anyway, it appears to be a more complex issue than I have the time or skills to undertake. But it appears some more computationally oriented people are on the issue. Until then, cufflinks or just dividing by transcript length should be good enough for my purposes. Thanks everyone for the insight!
                --------------
                Ethan

