Unconfigured Ad

**KAP** · 04-11-2013, 10:17 PM

Does anyone know the reason for this? I've been wondering about this too...

**rboettcher** · 04-11-2013, 11:08 PM

I experienced the same behaviour (higher p-values for increasing group sizes) and therefore switched to edgeR for my differential expression analysis.
You should check out the DEXSeq paper, where Simon Anders et al. compare their method to cuffdiff (pre v2) and conclude that cuffdiff is not able to handle biological variability well.
This pretty much explains the whole problem, since more biological replicates also mean a higher variance within one group that can not be explained by Poisson.

Cheers

**KAP** · 04-17-2013, 04:08 AM

Thanks a lot for your reply, rboettcher! I'll check out the paper...

**rboettcher** · 05-01-2013, 04:01 AM

I just saw that Trapnell et al. released a new Cufflinks/Cuffdiff version (2.1 http://cufflinks.cbcb.umd.edu/index.html) and address exactly the issues we discussed here in their description. I have not tested the new release, so if anyone has some results to share I'd be happy to see them.

Best

**rnaEni** · 07-10-2013, 06:16 AM

Dear all,
I am using cufflinks version 2.1.1 and cummeRbund, and I still have the same problem with p-value(and q-value as well).
I tried to analyze promoter, splicing, tss, genes,cds, relcds of two different datasets but the p-value remains the same for every ids.
I post here the output of most signicant splicing of one dataset:
> sig_splicing
TSS_group_id gene_id sample_1 sample_2 status value_1 value_2
1368 TSS11228 XLOC_005502 A B OK 0 0
1556 TSS11398 XLOC_005592 A B OK 0 0
5025 TSS1452 XLOC_000719 A B OK 0 0
7141 TSS16424 XLOC_008057 A B OK 0 0
10187 TSS19166 XLOC_009832 A B OK 0 0
11285 TSS20153 XLOC_010503 A B OK 0 0
15014 TSS2351 XLOC_001163 A B OK 0 0
21310 TSS29177 XLOC_015217 A B OK 0 0
23911 TSS31517 XLOC_016311 A B OK 0 0
29587 TSS36626 XLOC_019160 A B OK 0 0
34385 TSS40944 XLOC_020984 A B OK 0 0
38979 TSS45079 XLOC_023079 A B OK 0 0
39335 TSS454 XLOC_000234 A B OK 0 0
41484 TSS47334 XLOC_024205 A B OK 0 0
41494 TSS47343 XLOC_024207 A B OK 0 0
48942 TSS54046 XLOC_027685 A B OK 0 0
50520 TSS55467 XLOC_028292 A B OK 0 0
50918 TSS55825 XLOC_028432 A B OK 0 0
60068 TSS6406 XLOC_003242 A B OK 0 0
62395 TSS66154 XLOC_033418 A B OK 0 0
JS_dist test_stat p_value q_value significant
1368 0.832555 0 5e-05 0.029265 yes
1556 0.689652 0 5e-05 0.029265 yes
5025 0.460169 0 5e-05 0.029265 yes
7141 0.224018 0 5e-05 0.029265 yes
10187 0.684703 0 5e-05 0.029265 yes
11285 0.661421 0 5e-05 0.029265 yes
15014 0.627367 0 5e-05 0.029265 yes
21310 0.458814 0 5e-05 0.029265 yes
23911 0.404639 0 5e-05 0.029265 yes
29587 0.535728 0 5e-05 0.029265 yes
34385 0.660841 0 5e-05 0.029265 yes
38979 0.531889 0 5e-05 0.029265 yes
39335 0.634366 0 5e-05 0.029265 yes
41484 0.204522 0 5e-05 0.029265 yes
41494 0.268069 0 5e-05 0.029265 yes
48942 0.567654 0 5e-05 0.029265 yes
50520 0.692903 0 5e-05 0.029265 yes
50918 0.603009 0 5e-05 0.029265 yes
60068 0.199754 0 5e-05 0.029265 yes
62395 0.603494 0 5e-05 0.029265 yes

If you already fixed this bug please let me know.
Regards

**haggardd** · 03-12-2014, 10:18 AM

^^bump for rnaEni

I have a similar problem (?) about CuffDiff output.

I have 50bp single-end RNA-seq data from a control and treatment. I have 3 replicates per treatment. I ran Cufflinks>Cuffmerge>CuffDiff and I get a similar result where the only significant genes, a total of about 42, have the most significant P-value CuffDiff allows (5E-05) and the q-value is 0.043 or something. Every other gene (even if it has a p-value of 0.0001) does not have a significant q-value (in this specific case, I get a q-value of 0.07). Does this make sense? I would expect to find more significant genes, but maybe I'm just trying to be too optimistic for this treatment.

On another note, if I add samples (with replicates) with old RNA-seq data with the same study design, to the cuffdiff run (with -L option to do all pairwise comparisons) the number of significant genes jumps to about 210. Does this also make sense? I would not expect the number to jump up that much because I would think the multiple test correction would be larger since CuffDiff is running way more DE tests with 4 samples instead of just 2.

Thoughts?

Thanks

**dpryan** · 03-12-2014, 03:51 PM

Originally posted by haggardd View Post

...the only significant genes, a total of about 42, have the most significant P-value CuffDiff allows (5E-05) and the q-value is 0.043 or something. Every other gene (even if it has a p-value of 0.0001) does not have a significant q-value (in this specific case, I get a q-value of 0.07). Does this make sense?

Sure, a p-value of 0.0001 isn't very likely to be significant when you consider that you're running 10's of thousands of tests at the same time. That's kind of the point of trying to account for multiple testing.

I would expect to find more significant genes, but maybe I'm just trying to be too optimistic for this treatment.

There are a lot of reason to not find as many changes as you expect, the most likely simply being that naive expectations aren't very reliable. There's also the fact that we don't know anything about your experimental design, which may have its own issues.

On another note, if I add samples (with replicates) with old RNA-seq data with the same study design, to the cuffdiff run (with -L option to do all pairwise comparisons) the number of significant genes jumps to about 210. Does this also make sense?

Sure, you not only have group but also batch differences. The latter can be pretty large.

**haggardd** · 03-12-2014, 08:19 PM

Thanks for the speedy reply dpryan.

I understand how multiple testing corrections work but I just found it odd that genes with a p-value that small would not have a significant q-value. This was primarily based on my previous experience with other RNA-seq data sets where a p-value of even 0.01 had a significant q-value (<0.05). But every dataset is different...

Also, what I didn't mention is that I keep seeing many genes with the same exact p- and q-value. Here's some examples (sorry it looks messy):

gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_042470 8:35520190-35520547 A B OK 835.767 0 #NAME? #NAME? 5.00E-05 0.0413262 yes
XLOC_045591 Zv9_NA634:7034-12375 A B OK 17.5896 0 #NAME? #NAME? 5.00E-05 0.0413262 yes
XLOC_003075 10:9797294-9798871 A B OK 2.29584 3.71833 0.695633 2.65798 0.0001 0.0723208 no
XLOC_004091 11:13161030-13168864 A B OK 1.80821 0.342578 -2.40005 -4.05064 0.0001 0.0723208 no
XLOC_008401 14:24980494-25008061 A B OK 1.51264 2.56794 0.763543 2.27828 0.0001 0.0723208 no
XLOC_039714 7:75191965-75197308 A B OK 17.7657 23.718 0.416886 2.45736 0.0001 0.0723208 no
XLOC_042443 8:32215711-32217186 A B OK 21.0697 28.2631 0.423748 2.59296 0.0001 0.0723208 no
XLOC_044102 9:23214760-23215741 A B OK 156.312 118.895 -0.394742 -2.33721 0.0001 0.0723208 no
XLOC_008940 14:31506583-31512448 A B OK 1.99747 1.38924 -0.523873 -2.09315 0.00015 0.0982472 no
XLOC_021634 20:44509173-44509769 A B OK 1.96707 4.6672 1.24651 3.83803 0.00015 0.0982472 no
XLOC_024274 22:7561581-7578214 A B OK 1.88902 3.07738 0.704064 2.46213 0.00015 0.0982472 no
XLOC_034952 5:66144043-66144594 A B OK 0 5.13361 inf #NAME? 0.00015 0.0982472 no
XLOC_041687 8:25134454-25148890 A B OK 3.58671 5.59972 0.642694 1.98236 0.00015 0.0982472 no
XLOC_019070 2:36641516-36653086 A B OK 1.98706 3.04699 0.616751 1.95934 0.0002 0.119703 no
XLOC_024666 22:37648352-37682613 A B OK 4.41966 6.45222 0.545861 2.50274 0.0002 0.119703 no
XLOC_025299 23:17954114-17955768 A B OK 9.64916 13.021 0.432364 1.79451 0.0002 0.119703 no
XLOC_026936 24:37059294-37061674 A B OK 14.8685 11.1599 -0.413935 -2.099 0.0002 0.119703 no
XLOC_037155 6:8161171-8185521 A B OK 3.49529 4.95005 0.502031 2.12399 0.0002 0.119703 no
XLOC_000264 1:24035505-24073635 A B OK 4.04379 6.0967 0.592321 1.9311 0.00025 0.139976 no
XLOC_010116 15:454559-541027 A B OK 662.216 163.827 -2.01512 -33.7345 0.00025 0.139976 no
XLOC_026603 23:46192824-46193823 A B OK 3.84477 6.04824 0.653617 2.71933 0.00025 0.139976 no
XLOC_042098 8:3896110-3969017 A B OK 4.34482 7.91808 0.865855 2.36129 0.00025 0.139976 no
XLOC_043936 9:4736133-4862966 A B OK 3.07409 4.28154 0.477969 1.98661 0.00035 0.192856 no

Now I know things have changed for Cufflinks 2.1.1 where it estimates the p-values differently than previous versions, but it just seems weird to see so many genes (this isn't all of them) with the same p- and q-values. It might just be a result of this change in the new version of Cufflinks. If it is please let me know. Again, past experience dictates this feeling since I've never seen that many duplicate p- or q-values before...

I might not have given an adequate account of the third comment so I'll try and explain myself better.

In my original post I said that running CuffDiff with just the control and treatment samples (lets say A1, A2, A3 and B1, B2, B3) gave ~40 genes with a q-value < 0.05. If I just add more data into CuffDiff by using older RNA-seq data (let's call these C1, C2, C3 and D1, D2, D3) and using the -L option to do all the possible pairwise comparisons (mainly out of curiosity) and again only look at the pairwise comparison of the samples A and B my significance gene lists jumps to ~200. And now genes with a p-value of even 0.01 now have significant q-values (albeit, they're ~0.04). It just seems odd.

I guess what I am trying to say is I would expect increasing the total number of significance tests (jumping from ~32,000 to +100,000) would lead to a much more stringent FDR when testing for differential expression. Unless I am completely off kilter (which is definitely likely) with how CuffDiff would treat FDR calculations when it does a bunch of pairwise comparisons. Could it be that what is driving the change is only the fact that I've added more data to the point that now CuffDiff estimates the pooled gene dispersion with higher accuracy so the variance between samples is decreased?

Maybe I'm just talking myself in circles. I dunno

Again, thanks for all the help and for keeping it constructive.

**dpryan** · 03-13-2014, 05:33 AM

I agree that it seems odd to have so many identical p-value. You'd have to look into the internals of cufflinks to see if this seems reasonable or not (or just use something else, like DESeq2 or edgeR, that's easier to debug).

Regarding the FDR adjustment, remember that the BH method is dependent upon the distribution of p-values. So the more DE genes you have the lower the bar ends up being for a raw p-value to still be significant after adjustment. Consequently, a raw p-value of 0.01 might be significant after adjustment in one experiment (perhaps you're looking at cancer) and not another (almost anything I work on at the moment). Similarly, batch effects will often introduce a good number of differences, which will change the distribution and thereby the threshold for what still ends up being significant after correction.

Wipe memories of Bonferroni (where what you expected would have been the case) from your mind

Regarding the dispersion calculations, how cuffdiff will end up doing this will depend on how you had things setup. By default, it uses a pooled estimate approach, so the estimate will change with added groups. If you used "per-condition" then this probably wouldn't be an issue (of course you then need more replicates to get a good estimate, so...).

**wanfahmi** · 08-24-2015, 04:04 AM

Hello

I faced similar problem but how to solve the problem anyway?

I just ran cuffdiff across 30 human tissues to detect differentially expressed across tissues. The file output from splicing.diff is a file telling you that a differential splice variant between 2 tissues for all 30 tissues, is that right? But why in my case, I filtered the 'yes' for significantly expressed, the p-value most of the genes is 5e-5 (0.00005)? Is that the output from the cuffdiff? How can I get the real number of p-value so that I can rank and get the gene list from it?

my command line was: $ cuffdiff -o diffout -b ../genome.fa -p 8 -L <a list of 30 human tissues> -u merged_asm/merged.gtf <list of bam files>

Topics	Statistics	Last Post
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, Today, 06:09 AM	0 responses 15 views 0 reactions	Last Post by SEQadmin2 Today, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 39 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 47 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM

Unconfigured Ad

Cuffdiff p-value distributions and replicates

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News