Hi, has anyone looked at the distribution of Cuffdiff's p-values?
As I understand it, p-values should be evenly distributed (ie flat histogram) for the null hypothesis, which should be the case for most genomics data (as only a very small number of genes change between control/treatment).
Here are p-values from some of my cuffdiff gene_exp.diff files:
I noticed that there seems to be a pattern if you group by how many replicates you pass in:
I also re-ran a 3 rep sample with just 1 rep per condition and ended up with a similar graph to the 1-rep ones above.
Looking at cuffdiff's differential.cpp:
It seems that more variance -> higher p-values, so if more reps = higher variance, this is shifting the values away from 0...
Q: Is there something wrong here? (with my samples or cuffdiff's treatment of p-values?)
I'm using Cuffdiff 2.0.1
I'm only graphing genes where status == OK, and remove those with p-value=1, which comes from a different path in differential.cpp and doesn't use the t-tests. The graph code is basically:
As I understand it, p-values should be evenly distributed (ie flat histogram) for the null hypothesis, which should be the case for most genomics data (as only a very small number of genes change between control/treatment).
Here are p-values from some of my cuffdiff gene_exp.diff files:
I noticed that there seems to be a pattern if you group by how many replicates you pass in:
I also re-ran a 3 rep sample with just 1 rep per condition and ended up with a similar graph to the 1-rep ones above.
Looking at cuffdiff's differential.cpp:
Code:
double curr_log_fpkm_var = (curr.FPKM_variance) / (curr.FPKM * curr.FPKM); double prev_log_fpkm_var = (prev.FPKM_variance) / (prev.FPKM * prev.FPKM); double numerator = log(prev.FPKM / curr.FPKM); double denominator = sqrt(prev_log_fpkm_var + curr_log_fpkm_var); stat = numerator / denominator;
Q: Is there something wrong here? (with my samples or cuffdiff's treatment of p-values?)
I'm using Cuffdiff 2.0.1
I'm only graphing genes where status == OK, and remove those with p-value=1, which comes from a different path in differential.cpp and doesn't use the t-tests. The graph code is basically:
Code:
gene_exp_df = pd.DataFrame.from_csv(gene_exp, sep='\t') status_ok_df = gene_exp_df[gene_exp_df['status'] == 'OK'] (histogram, _) = np.histogram(p_values['p_value'], HISTOGRAM_BINS, density=True) histogram = histogram[:-1] # Last column is massive and distorts graph bins = (np.arange(HISTOGRAM_BINS) / float(HISTOGRAM_BINS))[:-1] # 0->1 pyplot.plot(bins, histogram, '--', label=label)
Comment