I have samples from three conditions, with no replicates. Which program should I use for differential expression analysis? Is there any sample code to share? Thanks.
-
Without replicates you have no statistical power to differentiate expression levels. You can use any one of several programs to run an analysis but the statistics you would get back from them will be meaningless without replicates.
All you can do is look at raw differences between your conditions, rank them by magnitude and then start work confirming them by some other independent experimental means.
If you take "differential gene expression analysis" to mean distinguishing the relative significance of differences in observed expression by statistical ranking then replicates are a requirement, not an option. Otherwise, you are just comparing the raw difference between two numbers, with no way of assessing the significance of one difference versus any other.
Honestly, at this point I wish there were a sticky in this forum to make that clear. I'm sure the MAQC-III (aka SEQC) papers, which will be submitted by year's end, will make the point clear.
Last edited by mbblack; 07-24-2012, 04:10 AM.
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
-
When you say "any programs", do you mean edgeR, DESeq, or baySeq?
My impression is that edgeR and DESeq only compare two conditions, and I have three. Does that mean I have to use baySeq?
I am really new to RNA-seq analysis. Please bear with me if the question sounds too simple.
-
Originally posted by capricy:
> When you say "any programs", do you mean edgeR, DESeq, or baySeq?
> My impression is that edgeR and DESeq only compare two conditions, and I have three. Does that mean I have to use baySeq?
> I am really new to RNA-seq analysis. Please bear with me if the question sounds too simple.
edgeR can handle multiple pairwise comparisons using its generalized linear model (GLM) functionality; see the documentation for the latest version. DESeq will only handle pairwise comparisons, so you would have to analyze each pair of conditions separately. I have not used baySeq in a very long time, so I do not know what its current version can do.
With your data you could just compute RPKM values, log2-transform them (or fit a negative binomial), treat them as normally distributed and run t-tests; you will get the same senseless statistics out.
As I said, the main problem is that without replicates you are pursuing a pointless exercise, and your statistics will be meaningless. With no replicates you have zero degrees of freedom for any of your pairwise contrasts, so no statistical power at all. None of the tools you list can possibly compute a meaningful variance model with no replicates (no algorithm could), and thus your statistics cannot be trusted at all.
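To make the degrees-of-freedom point concrete, here is a minimal Python sketch. The function names are made up for illustration, and it assumes a simple one-factor design in which one mean is estimated per condition, so the residual degrees of freedom are the number of samples minus the number of conditions.

```python
import statistics


def residual_df(n_samples: int, n_conditions: int) -> int:
    """Residual degrees of freedom for a one-factor design.

    One mean is estimated per condition, so whatever is left over
    is all that remains for estimating within-condition variance.
    """
    return n_samples - n_conditions


def can_estimate_variance(values) -> bool:
    """Return True if a sample variance can be computed from `values`."""
    try:
        statistics.variance(values)
        return True
    except statistics.StatisticsError:
        # Raised when fewer than two data points are supplied.
        return False


# Three conditions, one library each: nothing left to estimate variance.
print(residual_df(n_samples=3, n_conditions=3))
# Three conditions, three biological replicates each.
print(residual_df(n_samples=9, n_conditions=3))
# A single (made-up) expression value has no sample variance.
print(can_estimate_variance([812.0]))
```

With one library per condition the residual degrees of freedom are zero, which is why any variance a tool reports in that setting has to come from an assumption rather than from the data.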
Yes, you can compute p-values, but the fact that an algorithm will take your numbers, run to completion and spit out a result does not mean the results are accurate or reliable (it just means the algorithm works).
Sorry to come off sounding harsh, but the truth is you NEED replicates to compute statistically meaningful estimates of differential expression - there is no way around that fact.
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
-
Originally posted by capricy:
> But the edgeR manual does have a section specifically about the no-replicate situation. Are they purposely trying to mislead people?
> No offense, I just want to get a better understanding...
Well, there are so many people with experiments without replicates that we felt the need to offer some way of salvaging such botched experiments and giving people a tool to get at least a few results out of them. Read carefully what I wrote in the DESeq vignette about it.
As for comparisons: the point of a differential expression analysis is to find out whether a gene goes up or down under some treatment. That formulation makes no sense if you have more than two conditions. There, you may want to know either whether any of the treatments causes an effect, or which of the treatments causes an effect. The former is done with an ANODEV (analysis of deviance) test, which compares a GLM fitted with the condition factor against one fitted without it; the latter with several pairwise analyses.
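As a rough illustration of the "with and without condition factor" idea, here is a minimal Python sketch for a single gene. It is not what DESeq or edgeR do internally: it swaps in a plain Poisson model (no overdispersion), assumes equal library sizes so no offset is needed, and uses made-up replicated counts. The function names are hypothetical. Because a Poisson model ignores biological variability, the p-value it gives is far too optimistic for real data; the sketch only shows the mechanics of comparing the two fits.

```python
import math


def poisson_anodev(groups):
    """Likelihood-ratio (analysis of deviance) statistic for one gene.

    `groups` is a list of lists of raw counts, one inner list per condition.
    Full model: one Poisson mean per condition.
    Reduced model: a single Poisson mean shared by all conditions.
    Assumes equal library sizes (no offset) and NO overdispersion.
    Returns (statistic, degrees_of_freedom).
    """
    all_counts = [y for g in groups for y in g]
    mu_reduced = sum(all_counts) / len(all_counts)
    stat = 0.0
    for g in groups:
        mu_full = sum(g) / len(g)
        for y in g:
            if y > 0:  # terms with y == 0 contribute nothing
                stat += 2.0 * y * math.log(mu_full / mu_reduced)
    return stat, len(groups) - 1


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2.0)


# Made-up counts: three conditions, two replicates each.
counts = [[10, 12], [11, 9], [40, 44]]
stat, df = poisson_anodev(counts)
p_value = chi2_sf_df2(stat)  # valid here because df == 2
print(f"deviance difference = {stat:.2f}, df = {df}, p = {p_value:.3g}")
```

The closed-form tail probability only holds for 2 degrees of freedom, which is what three conditions give; other designs would need a general chi-square routine.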
Simon
-
Originally posted by capricy:
> But the edgeR manual does have a section specifically about the no-replicate situation. Are they purposely trying to mislead people?
> No offense, I just want to get a better understanding...
Taken directly from the edgeR manual (section 2.7):
> In these cases there are no replicate libraries from which to estimate biological variability. In this situation, the data analyst is faced with the following choices, none of which are ideal. We do not recommend any of these choices as a satisfactory alternative for biological replication. Rather, they are the best that can be done at the analysis stage, and options 2-4 may be better than assuming that biological variability is absent.
That's a pretty strong qualification for proceeding with the analysis in the absence of replicates. I also know from my own trial analyses, in which I took data with replicates and analyzed randomly chosen non-replicated sample sets drawn from them, that one gets very different results from those obtained with the replicates included. (I did this with edgeR only, out of curiosity to see how different things would be; I have never bothered to try it with any other tool.)
So, in my opinion at least, there is no algorithmic substitute for biological replication, and I would not trust any differential gene expression results presented from an experiment that did not include it. I work in toxicogenomics (an awkward term, but it's the buzzword in use these days), and I know of no publication or regulatory agency that would give a moment's notice to any such analysis presented without replication.
There really is no substitute for doing the experiment correctly in the first place.
Last edited by mbblack; 07-31-2012, 04:13 AM.
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
-
No, why would I do that? It is a statistical issue as far as I am concerned. As I have said, no algorithmic solution can make up for the lack of proper biological replicates to give a reliable measure of variation. Out of curiosity I wanted to see how wrong an analysis would be without them, and my curiosity is satisfied: no replicates means, IMO, no reliable or valid results.
What is going on is the simple fact that math and stats alone cannot compensate for inadequate data. Your statistical results are limited by the data they have to work with; the old "garbage in, garbage out" rule still applies. Collect inadequate data and you get inadequate results.
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
-
The point is, in order to say a gene is differentially expressed, you need to know how variable its expression is within your experimental and your reference populations. Then you can compare the variation in expression between your treatment (experimental) group and your control (reference) group against the variation within each group, and assess statistically whether, in aggregate, the level of expression in one group differs from the other. What you actually want to know is not merely whether the groups differ, but whether the difference, relative to the variation within each group, is unlikely to have occurred by chance.
The only way to do that is to know what the variation in expression is within each group. And the only way to know that is to sample it by collecting data from biological replicates. That is the only way to actually estimate the real biological variability within a population - sample it directly. No algorithm or simulation can ever compensate for real world sampling of that variation.
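A small worked example may help here. The Python sketch below computes Welch's t statistic for two made-up data sets that have exactly the same difference in group means but different within-group spread. The numbers and the function name are invented for illustration; this is only meant to show why the within-group variation matters, not to recommend t-tests for count data.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means scaled by within-group noise."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / se


# Made-up log2 expression values; both comparisons have the same
# difference in group means (2.0), only the within-group spread differs.
tight_control, tight_treated = [5.0, 5.1, 4.9], [7.0, 7.1, 6.9]
noisy_control, noisy_treated = [3.0, 5.0, 7.0], [5.0, 7.0, 9.0]

print(welch_t(tight_treated, tight_control))  # large: difference dwarfs the noise
print(welch_t(noisy_treated, noisy_control))  # small: same difference, buried in noise
```

With a single sample per group, `statistics.variance` raises an error and the statistic cannot be computed at all, which is the no-replicate situation in miniature.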
This is an age-old issue - I saw it back in the 1980s with allozymes, RFLPs and population genetic studies, in the early 2000s with array data, and now we're debating it all over again with sequence data.
Biology is all about variation, and you cannot compensate for inadequate sampling of that variation by mathematical manipulation. It has never worked for other data in the past, and sequence data is no different.
So yes, if I were reviewing a paper with a differential gene expression analysis without biological replicates, I would likely reject it - bad science is worse than no science to my mind (especially avoidable bad science). The only instance where I might relax that would be those extreme cases where replication was, literally, impossible, such as there perhaps only being one sample in existence.
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
-
"The only instance where I might relax that would be those extreme cases where replication was, literally, impossible, such as there perhaps only being one sample in existence."
Unfortunately, I may be working on exactly this kind of case: human samples that are not available in the US and come only from certain countries. That is why I am seeking help here...
-
Originally posted by capricy:
> "The only instance where I might relax that would be those extreme cases where replication was, literally, impossible, such as there perhaps only being one sample in existence."
> Unfortunately, I may be working on exactly this kind of case: human samples that are not available in the US and come only from certain countries. That is why I am seeking help here...
If that is your situation, then you can try running edgeR and/or DESeq and see what you get. You will have to do a lot of experimental verification of any genes of interest, though, to really be sure of anything you see.
Since the stats will be very iffy at best, you may not want to bother with them. You could just normalize by whatever means you prefer, compare the normalized counts gene by gene, side by side across your samples, and pick the genes with the greatest relative differences for qPCR verification - akin to ranking genes by maximum fold change and picking the ones with the greatest differences. Under the circumstances, that might get you to interesting genes more reliably than bothering with FDR or p-values that you know are highly questionable.
In the absence of replicates, or any chance of replicates, I think I might go that route instead and not bother with the stats per se. Focus more on a well-fitting normalization technique and then compare normalized values for your rank ordering to pick genes. You could impose a minimum cutoff as well, say, ignore all genes that had a raw mapped count of less than 10, to make sure you are only looking at well-represented genes to begin with.
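The ranking approach described above could be sketched roughly as follows in Python. Several things here are assumptions made for illustration rather than anything prescribed in this thread: counts-per-million as the normalization, a pseudocount of 1 to avoid dividing by zero, both samples listing the same genes, and the reading of the cutoff as "keep a gene if at least one of the two samples reaches the minimum raw count". The function names and the counts are made up.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million of the library total."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def rank_by_fold_change(counts_a, counts_b, min_count=10, pseudocount=1.0):
    """Rank genes by |log2 fold change| of sample B over sample A.

    Both dicts map gene name -> raw mapped count and must share the same keys.
    A gene is dropped unless at least one sample has a raw count >= min_count
    (an assumed reading of the cutoff).
    """
    cpm_a, cpm_b = cpm(counts_a), cpm(counts_b)
    ranked = []
    for gene in counts_a:
        if max(counts_a[gene], counts_b[gene]) < min_count:
            continue  # poorly represented in both samples
        log2_fc = math.log2(
            (cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount)
        )
        ranked.append((gene, log2_fc))
    # Largest absolute change first, regardless of direction.
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked


# Made-up raw counts for two unreplicated samples.
sample_a = {"g1": 100, "g2": 50, "g3": 2, "g4": 848}
sample_b = {"g1": 400, "g2": 50, "g3": 3, "g4": 547}
for gene, log2_fc in rank_by_fold_change(sample_a, sample_b):
    print(gene, round(log2_fc, 2))
```

With three conditions you would run this once per pair of samples. The output is only a candidate list for qPCR follow-up; it carries no statistical significance.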
I think you just need to be a bit creative about how you pick interesting genes when your data set is inherently non-optimal.
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.