**savova** · 04-25-2012, 08:51 AM

I need an answer to this too...

**sdriscoll** · 04-25-2012, 10:09 AM

FPKMs are simply rate measurements. You could have a gene with an FPKM of 100 that only got 20 reads. It all depends on that last part of the normalization: per million mapped reads.
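To make the "FPKM of 100 from only 20 reads" point concrete, here is a minimal sketch of the standard FPKM arithmetic. The function name and the example numbers (a 200 bp transcript, 1 million mapped fragments) are illustrative assumptions, and Cufflinks itself uses an effective length and a statistical model rather than this plain division.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical short gene: 20 fragments on 200 bp, 1 million mapped fragments.
print(fpkm(20, 200, 1_000_000))
# The same 20 fragments on a 2 kb transcript give a tenth of that rate.
print(fpkm(20, 2_000, 1_000_000))
```

A short transcript in a shallow library can therefore show a high FPKM from very few reads.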

There is no logical bottom end cutoff for FPKM where you can say "these genes are not expressed", other than 0 of course.

If you mean that most of the genes in your results seem right but a subset of them have higher FPKMs than others with similar coverage, then you're probably seeing an artifact of the Cufflinks pipeline. I have seen that many times myself for small genes, such as single-exon ones. It doesn't make much sense. I recommend trying the -b option on cufflinks and/or cuffdiff. That turns on the bias correction within Cufflinks, and it seems to fix those erroneous FPKMs.

**savova** · 04-25-2012, 12:37 PM

I have a different problem - all my RPKM values in one dataset are shifted by 10,000 with respect to another! Both are described as having been prepared the same way. I am pasting my message to the Cufflinks developers:

I wanted to compare this dataset

http://0-www.ncbi.nlm.nih.gov.ilsprod.lib.neu.edu/geo/query/acc.cgi?acc=GSE29119

to available Encode datasets on other cell lines:

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeCaltechRnaSeq

My expression analysis with Cufflinks is weird. In particular, it seems that the
whole RPKM distribution is shifted up for the first dataset samples (HMEC and
HCC1954) . For example, the minimum of both HMEC and H1HESC is 0, but the maximum
is 3*10^9 and 3*10^4 respectively. So in log space, the average RPKM for
the other cell lines is around 2-3, while for HMEC and HCC1954 it's 10-12. At this
point I went all the way back to fastq, realigned to Hg19 with bowtie,
and used cufflinks to compute RPKM - the difference remains. Any ideas why?

It is true that one library may have more reads. But isn't FPKM supposed to normalize for the number of total reads in the library and if so how can the entire distribution be shifted?

2) On another note, I also do not understand how I am getting some really small non-zero values from both datasets when the total number of reads would not seem to permit this:

total reads HMEC_expression:
2.2983e+10

min HMEC_expression >0
3.0939e-312

I would really appreciate your help.
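One observation on the 3.0939e-312 value above: it is smaller than the smallest normal double-precision number, so it is a subnormal float. That hints at numerical underflow somewhere in the computation rather than a real measurement, although this is only an inference from the number itself. A minimal check (the helper name is hypothetical):

```python
import sys


def is_subnormal(x):
    """True if x is nonzero but below the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


print(sys.float_info.min)          # smallest normal double, about 2.2e-308
print(is_subnormal(3.0939e-312))   # the minimum nonzero value reported above
```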

**sdriscoll** · 04-25-2012, 12:52 PM

I've seen cuffdiff blow the read count normalizations but not cufflinks. In my case I saw a 10-fold increase in the baseline of one group's mean expression versus the other, causing almost all genes to be tagged as significantly misexpressed.

Have you tried testing the different normalization options that Cufflinks provides? Have you tried the --compatible-hits-norm option or the -N option for upper quartile normalization?

You can also look in the isoforms.fpkm_tracking files and check the "length" and "coverage" columns. You can roughly estimate the number of bases aligned to each transcript (which is proportional to the raw read count) by multiplying those columns together. Sum the column of products to get a rough "total bases aligned to genes" count and divide the column by that number to roughly normalize the counts. Try that for each sample and see if you still have that massive offset between samples.
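The manual check described above could be sketched like this. It assumes a tab-delimited tracking file with "tracking_id", "length" and "coverage" columns; the function name is hypothetical, and rows whose values are not numeric are simply skipped.

```python
import csv
import io


def normalized_base_shares(tracking_text):
    """Map each tracking_id to its share of the total aligned bases."""
    reader = csv.DictReader(io.StringIO(tracking_text), delimiter="\t")
    bases = {}
    for row in reader:
        try:
            # length * coverage is a rough count of bases aligned to the transcript
            bases[row["tracking_id"]] = float(row["length"]) * float(row["coverage"])
        except ValueError:
            continue  # skip rows with non-numeric length or coverage
    total = sum(bases.values())
    if total == 0:
        return {}
    return {tid: b / total for tid, b in bases.items()}


# Tiny made-up example in place of a real isoforms.fpkm_tracking file.
example = "tracking_id\tlength\tcoverage\nT1\t1000\t3\nT2\t500\t2\n"
print(normalized_base_shares(example))
```

Running this on each sample's file and comparing the shares for the same transcripts should show whether the offset survives a normalization done by hand.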

**savova** · 04-25-2012, 01:27 PM

Thanks, I will try this. But I am now worried this software works erratically. Do you have any idea why such blowing of the normalization occurs? Can I trust results from other people computed with this software?

**sdriscoll** · 04-25-2012, 01:45 PM

I don't use it as my primary quantification tool or as my primary differential expression tool. I've never seen DESeq or edgeR blow the normalization step. We are only talking about a division step, so it doesn't make sense for any software to mess it up. To me Cufflinks is very desirable, but I don't trust it, so I don't use it. I have explored it quite a lot though, because I very much want to be able to use it.

In your case it COULD be a result of the normalization being based on total reads aligned instead of the more robust upper quartile method. But you should check the coverages to make sure. If your manual normalizations give you the same result, then you've got some small population of highly expressed genes biasing the normalization. The -N option should fix that, as should normalizing by the upper quartile of the genes' read counts yourself. I'd also try the -b option because it seems to help fix some other things that Cufflinks does that make me not trust it. I still don't trust it though. Maybe I'm just not smart enough to understand it.
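As a rough illustration of why a few highly expressed genes can bias total-count normalization while an upper quartile scale factor stays stable, here is a minimal sketch. The counts are made up, the function names are hypothetical, and this is not Cufflinks' own implementation of -N.

```python
import statistics


def total_count_normalize(counts):
    """Scale counts by the library total."""
    total = sum(counts)
    return [c / total for c in counts]


def upper_quartile_normalize(counts):
    """Scale counts by the 75th percentile of the nonzero counts."""
    nonzero = [c for c in counts if c > 0]
    uq = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / uq for c in counts]


# Two made-up samples that differ only in one extremely abundant gene.
sample_a = [10, 20, 30, 40, 100]
sample_b = [10, 20, 30, 40, 10000]

# Total-count scaling pushes every other gene in sample_b down ...
print(total_count_normalize(sample_a)[0], total_count_normalize(sample_b)[0])
# ... while upper quartile scaling leaves the first gene comparable.
print(upper_quartile_normalize(sample_a)[0], upper_quartile_normalize(sample_b)[0])
```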

**caballien** · 01-04-2013, 01:33 PM

very low fpkm?

sdriscoll-

Nearly all of my FPKM values are very low. The median across all of my replicates is ~0.1, and I have between 50 and 60 million mapped reads per sample. Very few genes are above 10. See the attached graphs boxplot2.pdf and testdensity.pdf. Are these values too low, or are they, as you said, caused by a larger denominator and thus okay? Also, I've attached a .pdf of a volcano plot, which is strange because I have ~870 significantly differentially expressed genes, but they all show up at the top of the graph, where they don't belong (the p-values are not that small). Perhaps cummeRbund is just doing something improperly.

The sequencing is RNA-seq of ribosomal-depleted RNA. Could this lower the FPKMs? I did mask all repetitive regions when using cuffdiff.

The sequencing was performed on a HiSeq. The data was generated with the Tuxedo package: TopHat 2, cufflinks, cuffmerge, cummeRbund.

Cufflinks FPKM range