**Simon Anders** · 03-19-2011, 04:31 AM

Lior,

reading again through my post #11 of Tuesday, I realize that it sounds as if I criticize you for not being open in exposing the methods underlying your software. You are perfectly right that it is not only standard practice but also desirable to make software available to the community as soon as possible even if the methods paper with the full explanation is still in preparation. Hence, please accept my apologies for sounding impatient; this was inappropriate.

I still stand by my claim that any test based on the Poisson distribution is unsuitable to assess whether differential expression may be attributed to the experimental treatment of interest, for the reasons I gave, e.g., in post #6 in this thread. In our DESeq paper we follow Robinson and Smyth in arguing that a proper estimation of overdispersion is essential for this. However, from your previous statements in our discussion on SeqAnswers half a year ago, I understood that the new capabilities of cufflinks are not about dispersion estimation but about bias reduction, and that this is because you do not see the need for the former independent of the latter.

Hence, my proper answer to Jeremy's question should have been that, yes, nothing has changed in the state of matters since our previous discussion. I still think that cuffdiff leaves an important issue unaddressed, and, as you just emphasized again, you still disagree and think that my concern was valid neither for the initial nor for the current version of cuffdiff. Judging from his posts, Cole seems to have a somewhat different stance. Your and Cole's descriptions of the new biological replicates functionality seem to differ slightly in precisely the point I consider crucial. I hope I'll now find my answer by reading your brand-new paper.

As a side note: When we made DESeq available, end of 2009, we uploaded a draft of our paper to the Nature Precedings server, allowing users to read about the details of our method already then. I agree that this is not standard practice and that there are good reasons for authors to refrain from distributing preprints as liberally. However, at least for us, it turned out to be very beneficial, because it stimulated discussion about and use of DESeq, so that the rather long review process of our paper did not hurt us too much.

Simon

**townway** · 03-30-2011, 01:38 PM

Hi Simon,

I am working on RNA-seq data and trying to use your scripts to go over it. But my data is from cell lines which were treated with drugs, differing in time. I wonder whether DESeq can handle time series data and, if so, which part I should change?

Thank you

**Simon Anders** · 03-30-2011, 11:29 PM

Originally posted by townway View Post

I am working on RNA-seq data and trying to use your scripts to go over it. But my data is from cell lines which were treated with drugs, differing in time. I wonder whether DESeq can handle time series data and, if so, which part I should change?

DESeq allows you to perform pairwise comparisons, and, to my knowledge, the same is true for all other tools out there. So, you can pick pairs of time points and compare these. Using GLMs, you can also compare differences between pairs of time points for one drug with differences for another drug or for the untreated controls.
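To make the bookkeeping concrete, here is a minimal Python sketch that only enumerates the pairs of time points one might compare. The time-point labels are hypothetical and nothing here calls DESeq itself; each pair would still have to be tested separately.

```python
from itertools import combinations

def pairwise_contrasts(time_points):
    """Return every unordered pair of time points, earlier one first."""
    return list(combinations(time_points, 2))

# Hypothetical labels; substitute the time points of your own experiment.
time_points = ["0h", "2h", "6h", "24h"]
contrasts = pairwise_contrasts(time_points)
for early, late in contrasts:
    print(f"{late} vs {early}")
```

With n time points there are n(n-1)/2 such pairs, so the list grows quickly as time points are added.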

But which comparisons (contrasts) are useful to analyze data? Figuring this out is, in my opinion, the main challenge of time course data.

Of course, all these pairwise comparisons are a bit pedestrian if you have more than three or four time points. You might be more interested in curve fits, and this is a very different statistical task, with which I have little experience. I haven't yet seen any such analysis published for RNA-Seq data, but there are lots of papers on microarray time courses. Maybe the article by Hafemeister et al. in the current issue of Bioinformatics is a good starting point. Translating such methods to the RNA-Seq setting is certainly something that needs to be done now.

S

**ecofriendly** · 05-03-2011, 08:25 AM

I have a question for Simon and others regarding how to use shot noise when interpreting the DESeq output, i.e. differentially expressed genes within a given p-value cutoff.

It seems that Figure 1 of the DESeq vignette is important because it tells you at what expression level the result becomes uninterpretable - i.e. read counts are low and shot noise is high. In my dataset, which compares control and treatment across two biological replicates, shot noise is bigger than biological noise up until mean ~100 counts. Then in Figure 2, I noticed that the red line of estimated variance seems to fit the data, except at the far left of the figure, where the gene counts are lowest. These data say to me that I shouldn't trust gene counts that are lower than about 100, because there is a lot of unexplained variation for some genes at this level.

Yet in the output, I'm finding genes that nbinomTest calls significant, even at very low expression levels. For example, there is one gene in my 5% FDR cutoff DGE list that changes from 5 counts to 22 counts. Yet I don't trust that this gene change is significant, given my observations about shot noise.

So my question is, why does DESeq allow gene changes with large shot noise associated with them to be called significant? And should I exclude gene changes at these low counts/ high shot noise from my downstream analysis?

If this question was answered elsewhere please direct me to that response.

Thanks very much,
Elena

**Simon Anders** · 05-04-2011, 03:44 AM

Hi Elena

first of all: the whole point of DESeq is to take both shot noise and biological noise into account when testing for differential expression. So, there is no need to do any cut-offs on expression strengths.

For weakly expressed genes, shot noise is stronger, and hence, DESeq wants to see stronger fold changes before it calls a change significant. This is why in Fig. 3 of our paper (which is the same as Fig. 4 of the vignette but using data that is a better example), the boundary between significant (red) and non-significant (black) changes rises sharply to the left.
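A rough way to see this is sketched below in Python, assuming the usual negative binomial variance model, in which a count with mean mu has variance mu + alpha * mu^2 (alpha being the dispersion). Dividing by mu^2 gives a squared coefficient of variation of 1/mu + alpha: the first term is shot noise, the second is biological noise. The value alpha = 0.01 is made up for illustration, and treating it as constant across expression levels is a simplification.

```python
def squared_cv(mean_count, dispersion):
    """Split the squared coefficient of variation into shot and biological parts.

    Assumes variance = mean + dispersion * mean**2 (negative binomial).
    """
    shot = 1.0 / mean_count   # Poisson part, shrinks as counts grow
    biological = dispersion   # constant in this simplified model
    return shot, biological, shot + biological

ALPHA = 0.01  # made-up dispersion, for illustration only
for mean_count in (5, 22, 100, 1000):
    shot, bio, total = squared_cv(mean_count, ALPHA)
    print(f"mean={mean_count:5d}  shot={shot:.4f}  biological={bio:.4f}  total={total:.4f}")
```

In this toy model the two contributions are equal at a mean of 1/alpha; below that, shot noise dominates and a larger fold change is needed to stand out from it.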

In your case, going from 5 to 22 is such a strong change (a more than 4-fold increase) that it is significant despite the low counts.
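As a back-of-the-envelope check, and not the test that DESeq performs, one can ask how surprising 5 versus 22 would be under shot noise alone. The Python sketch below assumes equal sequencing depth in both conditions and treats the two numbers as single Poisson counts, so that under the null hypothesis the first count is Binomial(27, 0.5) given the total. It ignores biological noise entirely, so its p value is too optimistic; accounting for that extra variance is exactly what the negative binomial model is for.

```python
from math import comb

def binomial_two_sided_p(count_a, count_b):
    """Exact two-sided p value for an even split, conditioning on the total.

    Shot noise only: assumes Poisson counts and equal library sizes.
    """
    total = count_a + count_b
    smaller = min(count_a, count_b)
    # Probability of a split at least as lopsided in one direction
    tail = sum(comb(total, k) for k in range(smaller + 1)) / 2 ** total
    return min(1.0, 2 * tail)  # double it, by symmetry of Binomial(n, 0.5)

p_value = binomial_two_sided_p(5, 22)
print(f"p = {p_value:.4f}")
```

Even this simplified calculation suggests that such a lopsided split is unlikely to come from counting noise alone, although it says nothing about variation between biological replicates.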

I don't understand, by the way, your sentence "because there is a lot of unexplained variation for some genes at this level." What do you mean by "unexplained variation"?

Simon

**ecofriendly** · 05-04-2011, 08:06 AM

Dear Simon,

First of all, thanks for your response and clarification. I see your point regarding Fig. 3 in the paper, which is consistent with my own MvA plots - that at low expression levels, the fold change has to be bigger for DESeq to call that change significant.

To be clear, I just meant that when making the SCV plot as described in Fig. 1 of the vignette, I get a graph that looks different from the example shown. Basically, my mean curve doesn't follow a bell shape on the left side; instead of tailing off, the SCV values remain high on the left, and it looks as though there is a second, smaller peak. Is this something that other users have seen? How does one explain this?

Thanks again for your help! It is much appreciated.

Elena

**Simon Anders** · 05-04-2011, 08:49 AM

I really need to change this plot; it seems to confuse people. The black curve (labelled "base mean"), which I suppose you are talking about, does not show SCV values (i.e., the y axis does not apply to it). Rather, it is only there to indicate which expression strengths actually occur in your sample, in order to tell you which parts of the colored curves (which do show SCV values) are of interest.

Maybe post your plot here, then I can try to clarify it.

**ecofriendly** · 05-04-2011, 09:38 AM

Hi Simon, Please see the attachment.

**ecofriendly** · 05-04-2011, 09:42 AM

Not sure if the attachment went through...trying again

Attached Files

SCVplot.pdf (56.4 KB, 248 views)

**Simon Anders** · 05-04-2011, 10:08 AM

The SCV plot is perfectly fine. The black curve just shows that you have genes with expression strength ranging from 1 count up to 300 or 1000 counts. For genes with more than around 100 counts, the biological noise is very low. For lowly expressed genes, you not only have strong shot noise but also strong biological noise.

It is a priori surprising that the biological noise should depend so strongly on the gene's expression strength (after all, a coefficient of variation is already normalized for expression strength). However, I've seen this before; it is quite common. We are currently investigating the hypothesis that this may happen whenever the library preparation PCR was started with very low initial cDNA concentration.

In your case, however, you have good replicability for your stronger genes, i.e., you should get good results.

**Cole Trapnell** · 05-05-2011, 09:11 AM

For those of you reading this thread who wish to use DESeq with Cufflinks: please check out http://cufflinks.cbcb.umd.edu for release 1.0.0, which includes a major overhaul of replicate support in Cuffdiff. Cuffdiff now models overdispersion of fragment counts at the transcript level, building on ideas introduced by DESeq and edgeR to greatly improve accuracy in calling differentially expressed genes and transcripts.

Thanks to all the commenters on this thread and elsewhere for helpful feedback!

**ashuchawla** · 07-15-2013, 09:35 AM

Need Help with Time Series analysis of RNA-Seq Data

Dear Simon or anybody with RNA-Seq data analysis expertise,

I wanted to ask you if there have been any updates on DESeq or another tool since this post which could enable the analysis of RNA-Seq data for samples across time (day 0, day 3, day 6, day 9) without having to do pairwise comparisons. I have a total of 10 samples and pairwise comparisons would take me a long time. I need to know the gene regulation pattern across time for these samples and whether this could be done using all samples at one time. Any help would be highly appreciated. I started working on RNA-Seq analysis only a month ago and do not have a lot of experience.

Thanks
Ashu

**Simon Anders** · 07-15-2013, 10:14 AM

Yes, there have been a lot of updates to DESeq, especially the release of DESeq2.

And there have always been plenty of methods to analyse time series data. My post above was not to claim that it cannot be done. Rather, it can only be done once you know what it is, i.e., what you actually want.

You say you "need to know gene regulation pattern across time". What do you mean exactly by "pattern"?

DESeq is a tool to test for statistical significance of differential expression. You ask a specific question and you get a p value, i.e., a yes/no answer (or rather: a yes / can't say answer). Once you can tell me the yes/no question, I can tell you how to use DESeq for it.

The issue here is that people keep asking me about once a week "how do you analyse time course data?" but when I ask back "what is your precise question?" I never get an answer.

Not that I'm surprised: In my experience, analysing time course data is rarely about answering yes/no questions but rather about answering "which?" questions, and hence it is a job not for methods of statistical hypothesis testing but for machine learning.

Simon

**Simon Anders** · 07-15-2013, 10:19 AM

BTW, what did you mean by "pairwise comparisons would take me a long time"? Can't be more than a few minutes calculation time. The issue is rather: what would the result tell you?

**ashuchawla** · 07-15-2013, 10:26 AM

I have been told that there will be genes that are down-regulated for all 9 days, some that are up-regulated for all 9 days, and some that change from down to up or vice versa. I understand that I could get a list of DE genes from pairwise comparisons of my samples across the times d0, d3, d6 and d9. In my previous project, I categorized the genes with negative log2foldchange in a pairwise comparison as down-regulated and the ones with positive values as up-regulated. I wanted a way to do that for this project as well, but if I have to do it pairwise, it will take me a lot of time. I have BAM files for all samples and I also have the HT-Seq counts for all of them. What would be the best move for me next?
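The up/down bookkeeping described above could be sketched in Python roughly as follows. The gene names and log2 fold changes are invented, each value is assumed to come from a separate comparison against d0, and the sign alone says nothing about statistical significance, so in practice the labels would need to be combined with the p values from the actual tests.

```python
def classify_pattern(log2_fold_changes):
    """Label a gene by the signs of its log2 fold changes relative to day 0."""
    if all(lfc < 0 for lfc in log2_fold_changes):
        return "down throughout"
    if all(lfc > 0 for lfc in log2_fold_changes):
        return "up throughout"
    return "mixed"

# Hypothetical log2 fold changes for d3, d6 and d9, each versus d0
genes = {
    "geneA": [-1.2, -0.8, -2.0],
    "geneB": [0.5, 1.1, 1.7],
    "geneC": [-0.9, 0.2, 1.4],
}
patterns = {name: classify_pattern(lfcs) for name, lfcs in genes.items()}
for name, label in patterns.items():
    print(name, label)
```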

Please let me know if you have any further questions...

Thanks a million,
Ashu

