**super0925** · 05-19-2014, 03:42 AM

Originally posted by dpryan:

> Read help(results)

Another question: is it critical if I don't filter out the low counts with the genefilter package? If I keep only the tags that are expressed in at least one condition (as in the edgeR tutorial), is that enough?

**super0925** · 05-23-2014, 03:18 AM

Originally posted by dpryan:

> Read help(results)

Hi D,
A quick question.
So far I have run a DE analysis with 3 replicates in each of 3 conditions, as I told you. The collaborator is interested in the DE genes, not the transcripts. He may verify the DE genes by qPCR.
What I have so far are the DE gene lists from Tuxedo, DESeq2, edgeR, and baySeq separately, plus a Venn plot of the overlap among those lists.
Is there anything else I can do, e.g. decide on the best pipeline, or some other DE analysis beyond just offering a list?
Or should I go on to downstream analysis, i.e. GO, pathways, etc.?
Please give me some suggestions.
Thank you!

**dpryan** · 05-23-2014, 03:39 AM

Base what you do next on the biological goals. Sometimes it makes sense to just offer an annotated list. If the list of DE genes is rather long, it'd be nice to mine out some links in pubmed for each of the genes, so that the researcher him/herself doesn't have to do it. If this experiment is simply fishing for changes, then GO and pathway analysis might be useful. If there's an expected effect, then doing those in a more targeted way (cf. camera() and roast() in limma/edgeR) would make more sense.

**super0925** · 05-23-2014, 04:59 AM

So far I use a q-value < 0.05 as the threshold.
If the overlap looks like this (judging just from the Venn plot), which method or methods do you recommend? If we use only the 145 shared genes, I don't know whether that is too conservative.

Attached file: 123.jpg

**dpryan** · 05-23-2014, 05:03 AM

There's no way to judge accuracy from a Venn diagram. Which version of cufflinks did you use? Lately it tends to be more conservative than the others, so that seems off. What often happens is that the differences (e.g., DESeq2 vs. edgeR) are toward the margins of significance, where you get an adjusted p-value of 0.08 in DESeq2 and 0.11 in edgeR (or vice versa), which isn't surprising. One thing to check is if DESeq2 flagged a number of the edgeR/cuffdiff only genes as having outlier samples. This is a really nice feature and can help avoid false-positive findings.
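A minimal base-R sketch of the "margins of significance" point, assuming two named vectors of adjusted p-values for the same genes. The gene names, values, and the 0.05 threshold are made up for illustration and are not taken from the thread's data.

```r
# Hypothetical adjusted p-values for the same five genes from two tools.
padj_deseq2 <- c(geneA = 0.01, geneB = 0.08, geneC = 0.04, geneD = 0.60, geneE = 0.03)
padj_edger  <- c(geneA = 0.02, geneB = 0.11, geneC = 0.07, geneD = 0.55, geneE = 0.04)

alpha <- 0.05  # assumed significance threshold
sig_deseq2 <- padj_deseq2 < alpha
sig_edger  <- padj_edger  < alpha

# Genes called DE by exactly one of the two tools.
discordant <- names(padj_deseq2)[xor(sig_deseq2, sig_edger)]

# Show both values side by side to see how close the "missing" call
# is to the threshold.
margin_table <- data.frame(
  gene        = discordant,
  padj_deseq2 = unname(padj_deseq2[discordant]),
  padj_edger  = unname(padj_edger[discordant])
)
margin_table
```

A discordant gene whose two adjusted p-values sit just either side of the threshold is the expected kind of disagreement, not evidence that one tool is wrong.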

**super0925** · 05-23-2014, 05:13 AM

Hi D,
(1) I will rerun the analysis with the latest Cufflinks, because the Cuffdiff result is too liberal!
(2) I haven't fully understood what you said:

> if DESeq2 flagged a number of the edgeR/cuffdiff only genes as having outlier samples

What does that mean, and how do I do it?

Before I did the DE analysis, I filtered and kept the tags that are expressed in at least one condition in edgeR (but not in DESeq2, because you said the latest version of DESeq2 can filter the counts automatically). Does that filter out the outliers?

Sorry for the naive question.

**dpryan** · 05-23-2014, 05:23 AM

There are two kinds of filtering that can occur. Previously, I was referring to "independent filtering", wherein filtering is performed to maximize power (see the genefilter package and accompanying paper). The second type of filtering, which I was alluding to just above, uses Cook's distance to find genes where there may be outlier samples. Basically, if, when looking at a given gene, a single sample has too much leverage (in the statistical sense), then the fit is unreliable and we ignore the statistical test. DESeq2 won't normally test these genes (you can disable this behavior, which turns out to be necessary on occasion), so that can also produce a difference in the results between the various packages. If a number of the genes found to be DE with edgeR but not DESeq2 were flagged and excluded in this manner, then that'd be good to know, since then the edgeR results would be less reliable (likewise with cuffdiff).

See section 4.3 in the DESeq2 vignette.
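To make the Cook's distance idea concrete, here is a small base-R illustration. It is only an analogy: it uses an ordinary linear model via `lm()` and `cooks.distance()` from the stats package, whereas DESeq2 computes Cook's distance within its own negative binomial GLM. The counts are invented.

```r
# Made-up counts for one gene: three replicates per condition,
# with one extreme value in the last sample.
counts    <- c(10, 12, 11, 9, 13, 250)
condition <- factor(rep(c("control", "treated"), each = 3))

# Ordinary linear model as a stand-in (assumption: not DESeq2's model).
fit <- lm(counts ~ condition)

# Cook's distance per sample: how strongly each sample pulls the fit.
cooks <- cooks.distance(fit)
round(cooks, 3)

# Index of the sample with the largest influence on the fit.
most_influential <- which.max(cooks)
most_influential
```

In this toy case the single extreme replicate dominates the fit for its condition, which is the situation in which a test on that gene is hard to trust.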

**super0925** · 05-27-2014, 01:42 AM

Hi D, I still have some questions:

1. I haven't fully understood your meaning. Do you mean that I first detect the outliers (as in section 4.3 of the DESeq2 vignette), and if the genes found to be DE with edgeR but not DESeq2 were flagged 'TRUE' as outliers, then the edgeR results would be less reliable, and vice versa?
For example, suppose there are 10 DE genes detected by edgeR but not DESeq2, and 20 DE genes detected by DESeq2 but not edgeR. Of the first 10 genes, 8 are flagged as outliers, and of the second 20 genes, 10 are flagged as outliers. So we could say edgeR (8/10 = 80%) is less reliable than DESeq2 (10/20 = 50%). Am I right?

2. Besides this outlier-finding method, are there any other methods (the more the better) for doing the comparison computationally (not by qPCR)? So far I have done (1) the overlap among the pipelines and (2) outlier detection, as you know. I found a paper that discusses the Spearman correlation among DE methods. Is that also good, or are overlap and outlier detection the best ways?

3. Before the DE analysis, I didn't use your genefilter package but the command recommended in the edgeR manual, which filters and keeps the tags that are expressed in at least one condition (in edgeR but not DESeq2, because you said the latest version of DESeq2 can filter the counts automatically). Is that OK?

4. Are these outlier genes useful or meaningful for the DE analysis? I mean, are they interesting results (is a gene with large counts good?), or do they have a bad effect on the DE analysis so that we must remove them?

**super0925** · 05-27-2014, 02:16 AM

Hi D, I found that after I filter and keep the tags that are expressed in at least one condition in edgeR, only 9000 of the 25000 genes in my dataset remain.
The command:

    keep <- rowSums(cpm(countstable) > 1) >= 3  # my data has 2 conditions with 3 samples each
    countstable <- countstable[keep, ]

(Question 1) Is my command OK?

However, the DESeq2 result still has 25000 genes.
So (question 2): do I need to filter countstable before the DE analysis in DESeq2?
I ask because, for the pipeline comparison, I don't think I can detect the outliers separately on 9000 genes vs. 25000 genes.

**dpryan** · 05-27-2014, 03:04 AM

Originally posted by super0925:

> 1. I haven't fully understood your meaning. Do you mean that I first detect the outliers (as in section 4.3 of the DESeq2 vignette), and if the genes found to be DE with edgeR but not DESeq2 were flagged 'TRUE' as outliers, then the edgeR results would be less reliable, and vice versa?
> For example, suppose there are 10 DE genes detected by edgeR but not DESeq2, and 20 DE genes detected by DESeq2 but not edgeR. Of the first 10 genes, 8 are flagged as outliers, and of the second 20 genes, 10 are flagged as outliers. So we could say edgeR (8/10 = 80%) is less reliable than DESeq2 (10/20 = 50%). Am I right?

More or less, yes. edgeR doesn't do outlier detection (as far as I recall, at least), so you just see whether any of its DE genes were flagged by DESeq2. In your example, you'd just want to know that 8/10 of edgeR's DE genes are likely unreliable.

> 2. Besides this outlier-finding method, are there any other methods (the more the better) for doing the comparison computationally (not by qPCR)? So far I have done (1) the overlap among the pipelines and (2) outlier detection, as you know. I found a paper that discusses the Spearman correlation among DE methods. Is that also good, or are overlap and outlier detection the best ways?

I can't think of anything other than the ways you mentioned off-hand. Using the Spearman correlation is an interesting idea. Presumably one would rank the union of DE genes by their Spearman correlation coefficient, perhaps applying some threshold.

> 3. Before the DE analysis, I didn't use your genefilter package but the command recommended in the edgeR manual, which filters and keeps the tags that are expressed in at least one condition (in edgeR but not DESeq2, because you said the latest version of DESeq2 can filter the counts automatically). Is that OK?

That's fine, you just have less power with those results (BTW, genefilter isn't my package, I just recommend its usage).

> 4. Are these outlier genes useful or meaningful for the DE analysis? I mean, are they interesting results (is a gene with large counts good?), or do they have a bad effect on the DE analysis so that we must remove them?

Sometimes the outliers are still useful, but usually not. It's often good to just have a look at the underlying data to see what's going on.
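One possible reading of the Spearman idea, sketched in base R: compare how two tools rank the same genes by p-value. This is an assumption about how such a comparison could be set up, not a description of the paper mentioned in the question. The gene names and p-values are hypothetical.

```r
# Hypothetical p-values for the same six genes from two tools.
pval_deseq2 <- c(g1 = 0.001, g2 = 0.020, g3 = 0.300, g4 = 0.040, g5 = 0.700, g6 = 0.150)
pval_edger  <- c(g1 = 0.015, g2 = 0.010, g3 = 0.450, g4 = 0.060, g5 = 0.900, g6 = 0.120)

# Restrict to genes reported by both tools and align their order.
shared <- intersect(names(pval_deseq2), names(pval_edger))

# Spearman correlation compares the gene rankings, not the raw values.
rho <- cor(pval_deseq2[shared], pval_edger[shared], method = "spearman")
rho
```

A high coefficient only says that the tools rank genes similarly. It says nothing about which ranking is closer to the truth.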

**dpryan** · 05-27-2014, 03:08 AM

Originally posted by super0925:

> Hi D, I found that after I filter and keep the tags that are expressed in at least one condition in edgeR, only 9000 of the 25000 genes in my dataset remain.
> The command:
> keep <- rowSums(cpm(countstable) > 1) >= 3  # my data has 2 conditions with 3 samples each
> countstable <- countstable[keep, ]
> (Question 1) Is my command OK?

That's the common way to do things in edgeR. Realistically, the results shouldn't change much whether one filters like this before testing or after (using the same threshold). I think DESeq2's method makes more sense, but that's me.

> However, the DESeq2 result still has 25000 genes.
> So (question 2): do I need to filter countstable before the DE analysis in DESeq2?
> I ask because, for the pipeline comparison, I don't think I can detect the outliers separately on 9000 genes vs. 25000 genes.

DESeq2 will just give a bunch of NA p-values for the genes it filters. The results will be there, just not a test statistic. This makes life a bit easier if you ever need to look at multiple experiments together, since then you don't have to deal with genes being in one results file but not the other (plus, you get an idea of what's getting filtered out).
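For readers who want to see what the CPM filter does, here is a self-contained base-R sketch. It replaces edgeR's `cpm()` with a plain counts-per-million calculation from the column sums, which is an assumption made so the example runs without edgeR. The count table is invented, and the row named `other` is a stand-in for the rest of the library.

```r
# Made-up count table: 5 rows x 6 samples (2 conditions, 3 replicates each).
countstable <- matrix(
  c(    0,     0,     1,     0,     0,     0,   # gene1: essentially unexpressed
      150,   200,   180,     0,     1,     0,   # gene2: expressed in one condition
      500,   450,   520,   480,   510,   495,   # gene3: expressed everywhere
        2,     0,     3,     1,     0,     2,   # gene4: very low everywhere
      3e6,   3e6,   3e6,   3e6,   3e6,   3e6),  # other: stand-in for all other genes
  nrow = 5, byrow = TRUE,
  dimnames = list(c(paste0("gene", 1:4), "other"), paste0("s", 1:6))
)

# Base-R stand-in for cpm(): scale each column by its library size.
lib_size <- colSums(countstable)
cpm_tab  <- t(t(countstable) / lib_size) * 1e6

# Keep rows with CPM > 1 in at least 3 samples (the size of one condition).
keep     <- rowSums(cpm_tab > 1) >= 3
filtered <- countstable[keep, , drop = FALSE]
rownames(filtered)
```

Rows dropped here simply disappear from the table, whereas DESeq2 keeps every row and reports NA where no test was made.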

**super0925** · 05-27-2014, 04:35 AM

Let me summarize what I've got:

1. DESeq2 doesn't require filtering before the DE analysis. If I insist, I could still do it.

2. Outlier list comparison steps:
Step 1. After DESeq2 processing, we run the command in section 4.3 and select the genes flagged as 'TRUE' to build the outlier gene list (list 1).
Step 2. We build another list (list 2) of the genes predicted as DE by edgeR or Cuffdiff but not by DESeq2.
Step 3. We check how many genes in list 2 are also in list 1. The bigger the proportion, the less reliable the method (edgeR or Cuffdiff).

3. So far we cannot use the "outlier list" to judge whether DESeq2 is good or not, because only the DESeq2 package gives us a function for detecting the outliers, but we can assess the reliability of edgeR or Cuffdiff.

Am I right?
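The three steps above can be sketched with base-R set operations. All gene IDs are placeholders, and the lists are constructed to reproduce the 8/10 example from earlier in the thread. In a real analysis list 1 would come from the DESeq2 results and the DE lists from each tool's output.

```r
# Hypothetical gene lists; the IDs are placeholders.
de_edger      <- paste0("gene", 1:30)            # DE according to edgeR
de_deseq2     <- paste0("gene", 11:50)           # DE according to DESeq2
flagged_cooks <- paste0("gene", c(1:8, 60:65))   # list 1: flagged as outliers by DESeq2

# Step 2: list 2, genes that are DE in edgeR but not in DESeq2.
edger_only <- setdiff(de_edger, de_deseq2)

# Step 3: how many of the edgeR-only genes were flagged as outliers?
n_flagged    <- length(intersect(edger_only, flagged_cooks))
frac_flagged <- n_flagged / length(edger_only)
frac_flagged
```

The same two lines for step 3 can be reused with a Cuffdiff-only list in place of `edger_only`.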

**super0925** · 05-27-2014, 05:15 AM

strange result in outlier

Hi D,
The strange thing is what I see after running the count outlier detection in DESeq2 (without filtering before the DE analysis).
Why do the outliers have a fold change, while the non-outliers have NA for the fold change and p-value?
And why do the first two genes have a p-value but no adjusted p-value?
The R commands and their output:
    W <- res$stat
    maxCooks <- apply(assays(dds)[["cooks"]], 1, max)
    idx <- !is.na(W)

    head(res[which(idx=="TRUE"),],100)
    DataFrame with 100 rows and 6 columns
                        baseMean log2FoldChange     lfcSE        stat    pvalue      padj
                       <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
    ENSBTAG00000000005 1.0597616      0.6512851 0.4803849   1.3557569 0.1751765        NA
    ENSBTAG00000000008 0.8508167      0.2153765 0.4600089   0.4682007 0.6396411        NA
    ENSBTAG00000000010 2.9979379      0.2802807 0.5104307   0.5491062 0.5829326 0.9626562
    ...                      ...            ...       ...         ...       ...       ...
    ENSBTAG00000000191 1.4469456     0.03588896 0.4931081  0.07278113 0.9419803        NA
    ENSBTAG00000000195 0.4348506    -0.15788350 0.3280363 -0.48129882 0.6303041        NA
    ENSBTAG00000000197 3.0236299    -0.08213013 0.5141906 -0.15972703 0.8730961 0.9892849

    head(res[which(idx=="FALSE"),],100)
    DataFrame with 100 rows and 6 columns
                        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
    ENSBTAG00000000003         0             NA        NA        NA        NA        NA
    ENSBTAG00000000011         0             NA        NA        NA        NA        NA
    ENSBTAG00000000020         0             NA        NA        NA        NA        NA
    ...                      ...            ...       ...       ...       ...       ...
    ENSBTAG00000000437         0             NA        NA        NA        NA        NA
    ENSBTAG00000000438         0             NA        NA        NA        NA        NA
    ENSBTAG00000000441         0             NA        NA        NA        NA        NA

**super0925** · 06-02-2014, 05:12 AM

Hi D,
I did some pipeline comparison analysis last week. All the DE genes come from the default settings.
I have produced (1) the overlap Venn plot, (2) a Jaccard index heatmap among the 5 methods, and (3) a Spearman correlation clustering plot of the top 100 genes from each method.
Do you think that is fine?
What do you suggest in this case? Do the results show that edgeR/DESeq are more reliable?
Thank you so much!

PS: I haven't done the outlier detection validation, because you didn't reply about my strange result in posts #132 and #133.
I can do that as well once the problem is clear to me.

**dpryan** · 06-02-2014, 06:28 AM

Well, you can't derive any information about the reliability of the tools from this; you'd need known DE genes and then see how well the tools find them. For the most part, the images tell you about the similarity of the methods, except for cuffdiff, which has more discordant results than expected (though perhaps it's the correct one; there's only one way to find out). I wouldn't recommend putting any more time into the comparisons, since you won't get anything more informative out of them without performing validations on the findings.
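For reference, the pairwise similarity behind a Jaccard heatmap can be computed in a few lines of base R. The helper `jaccard()` and the three DE lists below are hypothetical. As noted above, the resulting matrix describes only how similar the tools' calls are, not which tool is right.

```r
# Jaccard index of two gene sets: size of intersection over size of union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical DE gene lists from three tools.
de_lists <- list(
  DESeq2   = c("g1", "g2", "g3", "g4"),
  edgeR    = c("g2", "g3", "g4", "g5"),
  cuffdiff = c("g4", "g6", "g7")
)

# Pairwise Jaccard matrix, with tool names as row and column names.
tools <- names(de_lists)
jmat <- sapply(tools, function(x) {
  sapply(tools, function(y) jaccard(de_lists[[x]], de_lists[[y]]))
})
round(jmat, 2)
```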

Regarding post #132, yes, your understanding is correct.

Regarding post #133, note that the baseMean for genes with NA in all of the fields is 0. That should tell you why everything is NA. For genes with a p-value but no adjusted p-value, they were most likely filtered to increase power.
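The NA patterns can be told apart programmatically. The sketch below uses a plain data.frame with the same column names as a DESeq2 results table, so that it runs without DESeq2. The rows are invented, and the interpretation of each pattern follows the DESeq2 vignette's note on p-values set to NA.

```r
# Toy stand-in for a DESeq2 results table (made-up values).
res_df <- data.frame(
  baseMean  = c(0,  1.06,  3.00, 250),
  pvalue    = c(NA, 0.175, 0.583, NA),
  padj      = c(NA, NA,    0.963, NA),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Classify each gene by which fields are NA.
status <- with(res_df, ifelse(
  baseMean == 0, "all zero counts",
  ifelse(is.na(pvalue), "flagged as outlier (Cook's distance)",
  ifelse(is.na(padj), "independent filtering", "tested"))
))
names(status) <- rownames(res_df)
status

# Tally how many genes fall into each class.
table(status)
```

On a real results object, the same logic could be applied after converting it with `as.data.frame()`.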
