
**lh3** · 05-01-2012, 04:24 AM

I rarely work with RNA-seq data and I do not use cufflinks, but I am not sure how much mapping efficiency matters. The difference between mapping algorithms/settings is mostly caused by difference in sensitivity. After normalization, FPKM should largely stay the same except a few regions with high diversity.

I do not think a mask is useful in general, either, unless you are comparing data of very different read lengths or using a mapper without a proper mapping quality. This is at least true for variant calling.

**ETHANol** · 05-01-2012, 06:08 AM

I don't work with Cufflinks either, but it seems like a reasonable tool to compute FPKM.

This is the example that concerned me. Nanog has several pseudogenes. If you throw away reads that map to more than one location, I was told that nothing maps to Nanog. Thus, even though Nanog transcription is activated during the transformation from differentiated cells to iPS cells, you do not see it. If this is true, which I was told it is (I've never looked myself), then in this case a gene that is highly expressed appears to be undetectable.

Still, the bottom line is that I don't really know, but I would like to hear others' opinions.
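To make the concern concrete, here is a minimal toy sketch of how a "unique reads only" policy can zero out a gene whose reads also align to a pseudogene. The read names, locus names (`GENE_A`, `PSEUDO_A1`, `GENE_B`) and the fractional-split alternative are all hypothetical illustrations; this is not how Cufflinks or any particular tool assigns multi-mapped reads.

```python
from collections import Counter

def count_unique_only(alignments):
    """Count a read only if it aligns to exactly one locus."""
    counts = Counter()
    for loci in alignments.values():
        if len(loci) == 1:
            counts[loci[0]] += 1
    return counts

def count_fractional(alignments):
    """Split each read evenly across every locus it aligns to."""
    counts = Counter()
    for loci in alignments.values():
        weight = 1.0 / len(loci)
        for locus in loci:
            counts[locus] += weight
    return counts

# Hypothetical data: every GENE_A read also matches its pseudogene.
toy_alignments = {
    "read1": ["GENE_A", "PSEUDO_A1"],
    "read2": ["GENE_A", "PSEUDO_A1"],
    "read3": ["GENE_A", "PSEUDO_A1"],
    "read4": ["GENE_B"],
}

unique_counts = count_unique_only(toy_alignments)
fractional_counts = count_fractional(toy_alignments)

print("unique only:", dict(unique_counts))
print("fractional :", dict(fractional_counts))
```

Under the unique-only policy `GENE_A` receives no reads at all, even though three reads are consistent with it, which is the kind of artifact described above for Nanog.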

**alexdobin** · 05-01-2012, 09:57 AM

Hi Ethan,

I do not think it is possible to calculate mapping efficiency for RNA-seq data, since reads are spliced and can span hundreds of kilo-bases. In principle, we could do that just for the transcriptome, but then, of course, we would be blind to anything except annotations.

Alignments do have a big effect on the transcript assembly. We actually looked at precisely the Nanog locus in ENCODE H1ES data. The attached figures show the Cufflinks assembly with Tophat or STAR alignments. In this case, Tophat misses one of the junctions because it maps the reads contiguously, with mismatches, to a pseudogene, so Cufflinks cannot assemble the full-length transcript. However, there are still reads mapping to this locus, so it will return a non-zero FPKM. STAR recovers this junction and allows Cufflinks to reconstruct the whole transcript. Note that these are pretty old results, from Fall 2010, and Tophat may have improved since then.

In any case, it is probably prudent to try a few different aligners for problematic genes.


**kopi-o** · 05-01-2012, 02:06 PM

I have had the exact same experience with Nanog in RNA-seq!

I do think "mapping efficiency" (which is often referred to as "mappability") matters in RNA-seq; I have read a manuscript (not published yet) which argued pretty convincingly that it should be corrected for (and showed a nice way to do it). Methods like NEUMA and some others attempt to do this. The manuscript I mentioned showed that Cufflinks does have a certain systematic bias due to mappability effects.

**lh3** · 05-01-2012, 02:33 PM

Probably I misunderstood "mapping efficiency" (I took it as sort of sensitivity). Anyway, I was talking about a global effect. For the vast majority of genes, changing mappers/settings would not lead to a big effect. Nonetheless, if you look at a particular gene having multiple paralogs, the mapping algorithm and the way to compute FPKM may matter a lot. I know a few groups still prefer their in-house pipelines so that they can fully understand and fix potential artifacts.

**sdriscoll** · 05-01-2012, 04:39 PM

cufflinks is fine but keep in mind it's also giving you a more "processed" result than a simple read counter like htseq-count. if you want to use it i recommend using the -b option (providing your genome's FASTA source), because without it i've seen cufflinks give some genes very odd expression levels that are not justified by the reads actually aligned to those genes. the -b option seems to fix the over-estimates. there are still under-estimates, but at least those seem to be justified in some way. for example, take a gene that has coverage at only 80% of its exons: if i count reads aligning to that gene and compute its RPKM manually, i get a higher value than what cufflinks produces, while 90+ percent of the rest of the genes have roughly equal expression between my own calculation and theirs. so cufflinks is counting the fact that the gene doesn't have balanced and complete coverage against its FPKM value.
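For reference, a manual RPKM calculation of the kind described above can be sketched as follows. This uses the standard definition (reads per kilobase of gene length per million mapped reads); the function name and the example numbers are made up for illustration and are not taken from this thread.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical numbers: 500 reads on a 2,000 bp gene, 20 million mapped reads.
example_value = rpkm(500, 2000, 20_000_000)
print(example_value)
```

Because this simple count ignores how evenly the reads cover the gene, it can come out higher than a model-based estimate for a gene with patchy coverage.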

**ETHANol** · 05-02-2012, 02:37 AM

Yes, "mappability" was the word I was looking for and not "mapping efficiency". Oops.

Anyway, it appears to be a more complex issue than I have the time or skills to tackle. But it appears some more computationally oriented people are working on it. Until then, cufflinks or just dividing by transcript length should be good enough for my purposes. Thanks everyone for the insight!


FPKM and mapping efficiency
