**billstevens** · 07-23-2012, 02:40 PM

bump?

Anyone else getting strange q values??

**mbblack** · 07-24-2012, 05:09 AM

Without replicates, I don't see how Cuffdiff could possibly compute any sort of rational variance model for your three conditions. I would say your p and q values are meaningless, as the computations behind them cannot possibly produce rational significance statistics with no replicates at all.

**billstevens** · 07-25-2012, 04:09 PM

Well, without replicates, cuffdiff uses the other conditions AS replicates. It's a neat trick to have a measure of variability without replicates.

But here are my statistics with 2 replicates of each condition.

Between my WT and knockout, only 295 genes have a q value less than 0.99.
Between my Control and my knockout, only 105 genes have a q value less than 0.99.
Between my WT and my control, only 7 genes have a q value less than 0.99.

Has no one else seen this???

**billstevens** · 07-25-2012, 04:12 PM

I'm using Tophat 1.4.1, Bowtie 1, and Cufflinks 2.0.2.

**mbblack** · 07-26-2012, 06:54 AM

Originally posted by billstevens

Well, without replicates, cuffdiff uses the other conditions AS replicates. It's a neat trick to have a measure of variability without replicates.

I know about this, but I've never bought into it. I simply do not believe that this approach can even remotely compensate for a lack of true biological replicates in estimating variance within a treatment group.

I have not explored Cuffdiff in a while, but I can say that in the past when I have (and we always use a minimum of 4-5 true biological replicates), I always saw far fewer significant genes with Cuffdiff than with other tools. edgeR has been my preference recently, since its GLM features work well with our experiments, which typically use multiple treatments, each with multiple replicates. But DESeq also always produced more significant genes than Cuffdiff, as has ANOVA with various normalizations (RPKM for one, and KDMM is a recent one I've been exploring, either log2 transformed or fitted to a neg. binomial).

But I repeat that in the absence of any biological replicates at all, I would not trust any statistical results from any differential expression analysis. I know of no algorithmic trick that can compensate for that lack of appropriate data (or at least I have not seen any yet that I believe actually accomplishes the effect). With a sample size of N = 1, you simply do not have the statistical discriminating power to compute meaningful or reliable q-values.

**billstevens** · 07-26-2012, 12:28 PM

Originally posted by mbblack

But I repeat that in the absence of any biological replicates at all, I would not trust any statistical results from any differential expression analysis. I know of no algorithmic trick that can compensate for that lack of appropriate data (or at least I have not seen any yet that I believe actually accomplishes the effect). With a sample size of N = 1, you simply do not have the statistical discriminating power to compute meaningful or reliable q-values.

OK, I agree. But what I posted above was data WITH replicates. It was data in triplicate and I still got normal p values and crazy q values.

I've posted a picture of my first 500 gene tests. The p values are normal, but all the q values are at 1.

This can't be real, right?

Attached file: P & Q values.jpg (70.7 KB)

**aeveland** · 12-19-2012, 08:51 AM

Hi billstevens, have you resolved your issue?? if so can you offer any advice? I am seeing similar results using Cufflinks 2.0.2 for the first time. I have a control vs. a treatment with 3 and 4 biol. reps, respectively. I have very good read coverage and also am using Bowtie 1 and Tophat 1.4.1.
Any advice is appreciated!!

**jp.** · 07-24-2013, 05:12 PM

Hi there
I read all the posts here and have some questions.
1. If there are no replicates, then the p_value and q_value have no meaning and should therefore be ignored. However, a p_value of 0.05 with a q_value of 0.9 would normally be okay. Am I on the right track?
2. I have no replicates, so these values are not of much use to me. Am I on the right track?
3. My sig_diff_genes have low FPKM values with large error (SD) bars. Which of the genes in the attachment do you think are worth considering, or all of them?

**mbblack** · 07-25-2013, 03:58 AM

Originally posted by billstevens

OK, I agree. But what I posted above was data WITH replicates. It was data in triplicate and I still got normal p values and crazy q values.

I've posted a picture of my first 500 gene tests. The p values are normal, but all the q values are at 1.

This can't be real, right?

Actually, it is entirely possible to have all q-values very large or equal to one. The FDR correction is related to both the number of simultaneous tests being performed, and also to the actual observed distribution of p-values. What that data is saying is that despite the p-values, when adjusted for multiple testing, there is actually no evidence of differential expression. That can happen, for example, when your minimum p-value is no smaller than 1/n where n is the number of genes you have p-values for (so if you test 2500 genes and your smallest p-value is no smaller than 0.0004, it is possible for all FDR corrected q-values to be 1). My understanding is that other oddities of p-value distribution can create the same effect - in that case, despite the individual p-values, the FDR correction is telling you that taking the entire series of simultaneous tests together, you have no evidence of significant differences.

The details of all the conditions under which that can happen are unclear to me, as I do not have the specific statistics background to follow it, but it is known to happen. And, it is one of the reasons one applies a multiplicity correction in the first place, as the problem is that individual p-values in situations of large numbers of simultaneous tests can be very misleading.
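The 1/n example can be checked numerically. The following is a minimal sketch, assuming the q-values come from a Benjamini-Hochberg style FDR adjustment (whether a given tool or version uses exactly this procedure is an assumption here). The function name `bh_adjust` and the p-values are made up for illustration; they are not the data from this thread.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as p decreases (and never exceed 1).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


n_genes = 2500

# Hypothetical case 1: p-values spread evenly over (0, 1], as expected when
# nothing is differentially expressed. The smallest p is 1/2500 = 0.0004.
uniform_p = [k / n_genes for k in range(1, n_genes + 1)]
q_uniform = bh_adjust(uniform_p)
print(min(uniform_p), min(q_uniform))

# Hypothetical case 2: the same list, but one gene has a very small p-value.
mixed_p = [1e-6] + uniform_p[1:]
q_mixed = bh_adjust(mixed_p)
print(q_mixed[0], q_mixed[1])
```

In the first case every p-value times n divided by its rank is about 1, so every q-value comes out at essentially 1 even though many raw p-values are below 0.05. In the second case the single very small p-value keeps a small q-value, while the rest stay near 1.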

**mbblack** · 07-25-2013, 04:11 AM

Originally posted by jp.

Hi there
I read all the posts here and have some questions.
1. If there are no replicates, then the p_value and q_value have no meaning and should therefore be ignored. However, a p_value of 0.05 with a q_value of 0.9 would normally be okay. Am I on the right track?
2. I have no replicates, so these values are not of much use to me. Am I on the right track?
3. My sig_diff_genes have low FPKM values with large error (SD) bars. Which of the genes in the attachment do you think are worth considering, or all of them?
[ATTACH]2409[/ATTACH]

Yes, in the absence of replication, statistical assessment of differential expression of individual genes is meaningless.

In the presence of replication, one would ordinarily focus on the q-values and ignore the p-values. The issue is controlling for potential false positives given the large number of simultaneous tests being evaluated. Only the q-value gives you information about that. As to what statistical cutoff you use, that is purely arbitrary and you have to decide (and be prepared to defend) your choice. A common practice is to use an FDR limit of < 0.05 for statistical significance. But it will depend entirely on you, your data, and your specific study (e.g. an exploratory study may be comfortable with an FDR < 0.1, while a clinical study may insist on limiting results to FDR < 0.01). You have to decide what level of potential false positives you are willing to accept in your designation of "significant" data results.
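To make the effect of the cutoff concrete, here is a small sketch that counts how many tests pass at the three FDR limits mentioned above. The q-values are invented for illustration, and `count_significant` is a hypothetical helper, not part of any tool discussed in this thread.

```python
def count_significant(qvalues, fdr_cutoff):
    """Count tests whose q-value is strictly below the chosen FDR cutoff."""
    return sum(1 for q in qvalues if q < fdr_cutoff)


# Made-up q-values for six genes (illustration only).
example_q = [0.004, 0.03, 0.08, 0.2, 0.99, 1.0]

for cutoff in (0.01, 0.05, 0.1):
    print(f"FDR < {cutoff}: {count_significant(example_q, cutoff)} genes")
```

The same list gives a different set of "significant" genes at each cutoff, which is why the choice has to be made and justified before looking at the results.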

In a situation where you had to revert to p-values as your measure of significance (say, if comparing a subset of your data to a subset of another, where you did not have the full distribution of p-values or q-values for the other dataset), one would ordinarily try to adopt a more stringent p-value cutoff, not a more relaxed one. In that sort of situation, one might go with a p-value < 0.01, since you are unable to effectively control for false positives (so want to be more stringent in what you consider "significant").

3. In the absence of replicates, how does one even compute a standard error? Again, a standard error requires you to have an estimate of the variance about the mean of the value you are measuring. You do not have a mean FPKM, you just have a single raw number. So you have no idea at all about what the population variance is about that number (i.e. you have no idea whether that number is a meaningful representation of the population mean FPKM as you have no estimate of the mean).

Just because an algorithm or program can throw error bars on a graph does not mean they are real or meaningful. Excel, for example, will toss SD bars on any set of numbers, but they are meaningless if you do not actually have the data to compute a mean, a variance, and hence an actual standard error.

**jp.** · 07-25-2013, 04:35 AM

Dear mbblack,
May I please ask for your expert reply to my post above?

Originally posted by mbblack

Yes, in the absence of replication, statistical assessment of differential expression of individual genes is meaningless.

**mbblack** · 07-25-2013, 04:48 AM

Originally posted by jp.

Dear mbblack,
May I please ask for your expert reply to my post above?

I already did reply

As to what you quoted in this post:

In order to determine a measure of statistical significance, you must have two things - a mean and a variance. Without them, you simply cannot compute statistical significance for any pairwise differential gene test.

The only way to get a mean and a variance is to have replicated measures of your population sample(s). So if you want to test for differential gene expression between two samples or populations (classic case would be untreated and treated groups), you have to have a very bare minimum of two replicates to be able to compute a mean and a variance about that mean (and that would be a barely acceptable number - you would have much more meaningful and reliable results with more - 3-5 at least).
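A short sketch of why a single measurement is not enough: with one value the sample variance has a zero in its denominator (n - 1), so neither a variance nor a standard error can be computed. The helper name `summarize` and the FPKM numbers are made up for illustration.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample variance, standard error) for a list of values.

    Variance and standard error are None when there are fewer than two
    values, because the sample variance divides by n - 1.
    """
    n = len(values)
    mean = statistics.mean(values)
    if n < 2:
        return mean, None, None
    variance = statistics.variance(values)  # sample variance, n - 1 denominator
    standard_error = math.sqrt(variance / n)
    return mean, variance, standard_error


# One hypothetical FPKM value for a gene: no variance, no standard error.
print(summarize([12.0]))

# Two hypothetical replicates: the bare minimum for a variance estimate.
print(summarize([10.0, 14.0]))

# Four hypothetical replicates: same idea, with a more stable estimate.
print(summarize([10.0, 14.0, 11.0, 13.0]))
```

With two replicates the variance exists but rests on a single degree of freedom, which is why two is described above as barely acceptable.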

When you perform RNAseq on a single organism, all you get for any given gene is a single number - a count of transcripts (and a normalized count if you perform a normalization). You have no idea how well (or not) that number represents the frequency of that gene in your population, and therefore you cannot assign any statistical significance to any observed difference between that single number and any other single number.

Statistical significance is really all about adequately capturing or accounting for variance. If your study does not include an appropriate sampling strategy to estimate that variance, then there is really nothing that statistics can offer you.

There is absolutely nothing about sequence data that changes the fact that your study has to include appropriate sampling of the groups you wish to assess statistically. In fact, if anything, I would argue that RNAseq actually requires even more thorough population sampling than other genomic survey techniques. Biological variation in sequence count data can be huge, especially for rare transcripts (especially relative to the technical variation of sequencing). If you want to robustly test across that variation, then proper sampling is essential, not optional.

**jp.** · 07-25-2013, 01:39 PM

Thank you mbblack for your kind and very valuable reply.
As I understand it, replicates in RNA-seq are essential. Apart from RNA-seq, however, may I ask about replicates in whole genome sequencing of human samples? I am planning to use 3 technical replicates in WGS along with 2 biological replicates. Do you think that is too much for WGS?
Sincerely
jp.

Originally posted by mbblack

I already did reply

As to what you quoted in this post:
In order to determine a measure of statistical significance, you must have two things - a mean and a variance. So if you want to ........... is essential, not optional.

**rskr** · 07-25-2013, 04:10 PM

Originally posted by mbblack

I already did reply

As to what you quoted in this post:

In order to determine a measure of statistical significance, you must have two things - a mean and a variance. Without them, you simply cannot compute statistical significance for any pairwise differential gene test.

The only way to get a mean and a variance is to have replicated measures of your population sample(s). So if you want to test for differential gene expression between two samples or populations (classic case would be untreated and treated groups), you have to have a very bare minimum of two replicates to be able to compute a mean and a variance about that mean (and that would be a barely acceptable number - you would have much more meaningful and reliable results with more - 3-5 at least).

When you perform RNAseq on a single organism, all you get for any given gene is a single number - a count of transcripts (and a normalized count if you perform a normalization). You have no idea how well (or not) that number represents the frequency of that gene in your population, and therefore you cannot assign any statistical significance to any observed difference between that single number and any other single number.

Statistical significance is really all about adequately capturing or accounting for variance. If your study does not include an appropriate sampling strategy to estimate that variance, then there is really nothing that statistics can offer you.

There is absolutely nothing about sequence data that changes the fact that your study has to include appropriate sampling of the groups you wish to assess statistically. In fact, if anything, I would argue that RNAseq actually requires even more thorough population sampling than other genomic survey techniques. Biological variation in sequence count data can be huge, especially for rare transcripts (especially relative to the technical variation of sequencing). If you want to robustly test across that variation, then proper sampling is essential, not optional.

Is it correct to call the between-sample differences in measurement "variance"? To my understanding, variance is supposed to be "location-invariant".


P and Q values with Cufflinks 2.0?
