  • super0925
    replied
    Originally posted by dpryan View Post
    1. At least with htseq-count, the -m intersection-strict wouldn't count a read that overhangs a feature (i.e., overlaps an exon but continues into an intron). So you could use that.
    2. They'll be mapped, but not counted.
    Thank you D,
    This is very useful. My supervisor asked me for statistics on how many read counts partly or totally overlap introns in each gene, something like:
    gene1: 200 counts, 50 overlapping introns
    ...

    I will try what you suggested and see the result.


    Another question: if I want to know how many reads are not mRNA (e.g. ribosomal RNA), do you have any suggestion for doing that? Thank you!



  • dpryan
    replied
    1. At least with htseq-count, the -m intersection-strict wouldn't count a read that overhangs a feature (i.e., overlaps an exon but continues into an intron). So you could use that.
    2. They'll be mapped, but not counted.
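    To illustrate the intersection-strict idea, here is a toy Python sketch (not the HTSeq implementation, and it treats a read as a single interval rather than spliced alignment blocks): a read is assigned to a gene only if every base of its alignment falls inside the gene's exons, so a read overhanging into an intron goes uncounted.

```python
# Toy sketch of htseq-count's intersection-strict idea (not the HTSeq
# implementation): a read counts for a gene only if its whole alignment
# lies within the gene's exons; reads overhanging into introns do not.

def covered_by_exons(read_start, read_end, exons):
    """True if [read_start, read_end) is fully contained in the exon set."""
    pos = read_start
    for ex_start, ex_end in sorted(exons):
        if pos < ex_start:          # a gap (intron) before the next exon
            break
        if pos < ex_end:
            pos = ex_end            # this exon covers up to ex_end
        if pos >= read_end:
            return True
    return pos >= read_end

exons = [(100, 200), (300, 400)]    # hypothetical single-gene exon model

print(covered_by_exons(120, 180, exons))  # fully exonic -> True
print(covered_by_exons(150, 250, exons))  # overhangs into the intron -> False
```

    The difference between a gene's count under `-m union` and under `-m intersection-strict` then gives a rough handle on how many of its reads touch intronic sequence.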



  • super0925
    replied
    Originally posted by dpryan View Post
    I assume that the reads would only partly overlap an intron, since otherwise they wouldn't normally get counted. I wouldn't recommend removing them. While one could argue that they represent unprocessed RNAs, which you aren't interested in, they may also just represent the difficulty of mapping near splice boundaries and, in any case, would be presumed to be present at similar levels across samples in either case.
    Thank you D.
    1. I will not remove the 'partly overlapping' intron reads, but if I want to know the proportion of these reads, how could I do that?

    2. For the 'fully overlapping' intron reads, are those equivalent to unmapped reads? Am I right?
    Last edited by super0925; 06-19-2014, 12:59 AM.



  • dpryan
    replied
    I assume that the reads would only partly overlap an intron, since otherwise they wouldn't normally get counted. I wouldn't recommend removing them. While one could argue that they represent unprocessed RNAs, which you aren't interested in, they may also just represent the difficulty of mapping near splice boundaries and, in any case, would be presumed to be present at similar levels across samples in either case.



  • super0925
    replied
    Thank you D.
    I got it.
    Another question: suppose a gene has 200 read counts mapped; how many of those 200 reads overlap introns? How do I find that out? Do I need to remove these intron-overlapping reads before doing DE analysis?
    Cheers
    Last edited by super0925; 06-18-2014, 09:23 AM.



  • dpryan
    replied
    Sure, you can use whatever thresholds you want. An FDR of 0.1 is the typical threshold, but of course that still gives you ~10% false positives. If you wanted to use 0.01 or something else then there's nothing innately wrong with that. Using a fold-change threshold is occasionally done. It's certainly the case that a 5% change is unlikely to be biologically meaningful for most genes, whereas a 50% change likely is, so you'll occasionally see 1.5x or 2x thresholds used.
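    Applying a stricter cutoff is just filtering the results table on both columns. A minimal Python sketch with invented gene names and values (the tuple layout is an assumption, not Cuffdiff's actual output format):

```python
# Hypothetical DE results: (gene, log2 fold change, q-value / adjusted p).
results = [
    ("geneA",  2.1, 0.001),
    ("geneB",  0.4, 0.020),   # significant q, but only ~1.3x change
    ("geneC", -1.8, 0.004),
    ("geneD",  3.0, 0.200),   # large change, not significant
]

q_cutoff = 0.01               # stricter than the usual 0.05
lfc_cutoff = 1.0              # |log2FC| >= 1, i.e. at least a 2x change

hits = [g for g, lfc, q in results
        if q < q_cutoff and abs(lfc) >= lfc_cutoff]
print(hits)  # ['geneA', 'geneC']
```

    Note that the two filters cut in different directions: geneB is statistically significant but the effect size is tiny, while geneD changes a lot but not reliably.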



  • super0925
    replied
    Originally posted by dpryan View Post
    There's no way to judge accuracy from a Venn diagram. Which version of cufflinks did you use? Lately it tends to be more conservative than the others, so that seems off. What often happens is that the differences (e.g., DESeq2 vs. edgeR) are toward the margins of significance, where you get an adjusted p-value of 0.08 in DESeq2 and 0.11 in edgeR (or vice versa), which isn't surprising. One thing to check is if DESeq2 flagged a number of the edgeR/cuffdiff only genes as having outlier samples. This is a really nice feature and can help avoid false-positive findings.
    Hi D
    Just a quick question about Cuffdiff.
    We selected the significant DE genes in Cuffdiff by FDR q-value < 0.05, but if I think that is still too liberal, could we use a more conservative threshold? As you know, a p- or q-value of 0.05 is a well-known threshold.
    Could we add log2 fold-change as another threshold as well? Which level do you prefer?
    Cheers



  • dpryan
    replied
    Originally posted by super0925 View Post
    Hi Devon
    Thank you for your explanation.
    (1) My understanding is that I don't need to consider the genes whose p-value or adjusted p-value is set to 'NA'; all of them can be filtered by the package. Am I right?
    No, if only the raw and adjusted p-values are NA, then these would fall into #2 of the section I quoted from the vignette.

    (2) But I am still confused about which 'DE list' I need to compare with edgeR/etc.
    See above.

    res[which(idx=="TRUE"),]
    These are just genes for which there's a count in at least one sample.
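    In other words, idx marks the rows of the count matrix that have a nonzero count somewhere. Assuming idx was built along the lines of rowSums(counts) > 0 (an assumption based on the description above, not code from the thread), the effect of the R one-liner can be sketched in Python with a toy count matrix:

```python
# Keep only genes with a count in at least one sample -- the effect of
# idx <- rowSums(counts) > 0; res[idx, ] in R (toy count matrix below).

counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [5, 0, 2, 1],
    "geneC": [0, 0, 0, 3],
}

kept = [g for g, c in counts.items() if any(v > 0 for v in c)]
print(kept)  # ['geneB', 'geneC']
```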



  • super0925
    replied
    Originally posted by dpryan View Post
    I'll just quote from the vignette, which should be clear enough:


    1. These wouldn't be significant with any of the tests.
    2. If edgeR/etc. find these to be DE, then be cautious believing that.
    3. These are filtered to increase power.
    Hi Devon
    Thank you for your explanation.
    (1) My understanding is that I don't need to consider the genes whose p-value or adjusted p-value is set to 'NA'; all of them can be filtered by the package. Am I right?
    (2) But I am still confused about which 'DE list' I need to compare with edgeR/etc. I mean the "If edgeR/etc. find these to be DE, then be cautious believing that."
    Is that the first list in #133, res[which(idx=="TRUE"),]?
    Or all the genes with p-value or adjusted p-value set to NA?
    Thanks a lot!
    Last edited by super0925; 06-03-2014, 02:47 AM.



  • dpryan
    replied
    I'll just quote from the vignette, which should be clear enough:

    Note that some values in the results table can be set to NA, for either one of the following reasons:
    1. If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
    2. If a row contains a sample with an extreme count outlier then the p value and adjusted p value are set to NA. These outlier counts are detected by Cook's distance. Customization of this outlier filtering and description of functionality for replacement of outlier counts and refitting is described in Section 3.5.
    3. If a row is filtered by automatic independent filtering, based on low mean normalized count, then only the adjusted p value is set to NA. Description and customization of independent filtering is described in Section 3.8.
    1. These wouldn't be significant with any of the tests.
    2. If edgeR/etc. find these to be DE, then be cautious believing that.
    3. These are filtered to increase power.
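    The three NA cases can be told apart mechanically from a results row. A toy Python sketch (the baseMean/pvalue/padj fields mirror DESeq2's results columns; the rows themselves are invented):

```python
# Classify a DESeq2-style results row by which of the vignette's three
# NA cases it falls into (rows here are invented toy data).

def na_reason(base_mean, pvalue, padj):
    if base_mean == 0:                       # case 1: all samples have zero counts
        return "all samples zero"
    if pvalue is None and padj is None:      # case 2: Cook's distance outlier
        return "count outlier (Cook's distance)"
    if padj is None:                         # case 3: independent filtering
        return "filtered for low mean count"
    return "tested"

rows = {
    "geneA": (0.0,  None, None),   # never expressed
    "geneB": (55.2, None, None),   # one sample is a likely outlier
    "geneC": (3.1,  0.40, None),   # low counts, filtered to increase power
    "geneD": (88.0, 0.01, 0.04),   # tested normally
}
for gene, row in rows.items():
    print(gene, "->", na_reason(*row))
```

    This ordering matters: a baseMean of 0 also leaves both p-values NA, so the all-zero check has to come before the outlier check.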



  • super0925
    replied
    Originally posted by dpryan View Post
    Have a read through section 1.4.2 (I think) of the DESeq2 vignette.

    You misunderstood, those genes were already filtered for power, which is why there's no adjusted p-value but there is a raw p-value. You're just comparing the list of DE genes anyway, so that's fine.

    If both the adjusted AND raw p-value are NA, then there was at least one likely outlier sample for that gene, so it was filtered for that reason. If edgeR and the others call those DE then you should look closer at the data to determine if DESeq2 is doing things correctly or not.

    As I mentioned, the baseMean of 0 should tell you something. Look at the raw counts for those, they'll be ignored by all of the tools.
    Sorry Devon, I am confused. Which list do I need to compare with edgeR/Cuffdiff? That is, how many genes in that list are also in the "special DE gene list" predicted only by edgeR/Cuffdiff? The percentage may represent the reliability of that method, as you mentioned.

    Which list in section 4.3 of the DESeq2 vignette, or in post #133:
    res[which(idx=="TRUE"),] or res[which(idx=="FALSE"),]?



  • dpryan
    replied
    Have a read through section 1.4.2 (I think) of the DESeq2 vignette.

    Originally posted by super0925 View Post
    For the outlier list , we found some genes have p-value but without adj-p value, you said I could filter them. I want to ask what list to filter?
    You misunderstood, those genes were already filtered for power, which is why there's no adjusted p-value but there is a raw p-value. You're just comparing the list of DE genes anyway, so that's fine.

    Originally posted by super0925 View Post
    Or is this the list the "outlier list", which would be searched in the DE genes excluded by DESeq2 but within edgeR, and to observe edgeR is reliable or not?
    If both the adjusted AND raw p-value are NA, then there was at least one likely outlier sample for that gene, so it was filtered for that reason. If edgeR and the others call those DE then you should look closer at the data to determine if DESeq2 is doing things correctly or not.

    Originally posted by super0925 View Post
    Another question, the second list, that is the genes are not outlier, all of them baseMean are 0, is this normal?
    As I mentioned, the baseMean of 0 should tell you something. Look at the raw counts for those, they'll be ignored by all of the tools.



  • super0925
    replied
    Originally posted by dpryan View Post
    Well, you can't derive any information about reliability of the tools from this, you'd need to have known-DE genes and then see how well the tools find them. For the most part, the images are telling you about the similarity in methods, except for cuffdiff, which has more discordant than expected results (though perhaps it's the correct one, there's only one way to find out). I wouldn't recommend putting any more time in the comparisons, you won't get anything more informative out without performing validations on the findings.

    Regarding post #132, yes, your understanding is correct.

    Regarding post #133, note that the baseMean for genes with NA in all of the fields is 0. That should tell you why everything is NA. For genes with a p-value but no adjusted p-value, they were most likely filtered to increase power.
    Hi D, thank you! I will not put more effort into the pipeline comparison.

    Regarding post #133: for the outlier list (I think it is res[which(idx=="TRUE"),]), we found some genes with a p-value but no adjusted p-value; you said I could filter them to increase power. In which list should I filter them?
    If it is the DE gene list, that is fine, because I only keep the genes with adjusted p-value < 0.05.
    Or is it the "outlier list", which would be checked against the DE genes excluded by DESeq2 but found by edgeR, to see whether edgeR is reliable or not?
    Or do I need to filter them before doing the DE analysis?

    Another question: for the second list, the genes that are not outliers all have a baseMean of 0; is this normal?

    Thank you!
    Last edited by super0925; 06-02-2014, 07:26 AM.



  • dpryan
    replied
    Well, you can't derive any information about reliability of the tools from this, you'd need to have known-DE genes and then see how well the tools find them. For the most part, the images are telling you about the similarity in methods, except for cuffdiff, which has more discordant than expected results (though perhaps it's the correct one, there's only one way to find out). I wouldn't recommend putting any more time in the comparisons, you won't get anything more informative out without performing validations on the findings.

    Regarding post #132, yes, your understanding is correct.

    Regarding post #133, note that the baseMean for genes with NA in all of the fields is 0. That should tell you why everything is NA. For genes with a p-value but no adjusted p-value, they were most likely filtered to increase power.



  • super0925
    replied
    Originally posted by dpryan View Post
    That's the common way to do things in edgeR. Realistically, the results shouldn't change much if one compares filtering like this before testing or doing so after (and using the same threshold). I think DESeq2's method makes more sense, but that's me.



    DESeq2 will just give a bunch of NA pvalues for those it filters. The results will be there, just not a test statistic. This makes life a bit easier if you ever need to look at multiple experiments together, since then you don't have to deal with genes being in one results file but not the other (plus, you get an idea of what's getting filtered out).
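    The convenience described in the quote above (NA rows instead of dropped rows) can be seen in a toy comparison; both "results files" below are invented:

```python
# Toy illustration: DESeq2-style results keep filtered genes as NA rows,
# so two experiments share the same gene set; edgeR-style pre-filtering
# drops rows, forcing set arithmetic before any comparison.

deseq2_exp1 = {"geneA": 0.01, "geneB": None, "geneC": 0.30}  # padj or NA
deseq2_exp2 = {"geneA": 0.02, "geneB": 0.04, "geneC": None}

edger_exp1 = {"geneA": 0.01, "geneC": 0.25}   # geneB filtered out entirely
edger_exp2 = {"geneA": 0.03, "geneB": 0.05}

# Same gene set everywhere with the NA convention:
print(deseq2_exp1.keys() == deseq2_exp2.keys())   # True

# With dropped rows you must intersect the gene lists first:
shared = edger_exp1.keys() & edger_exp2.keys()
print(sorted(shared))  # ['geneA']
```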

    Hi D
    I did some pipeline comparison analysis last week. All the DE genes were called with default settings.
    I have attached (1) an overlap Venn plot, (2) a Jaccard index heatmap among the 5 methods, and (3) a Spearman correlation clustering plot of the top 100 genes from each method.
    Do you think this is fine?
    What is your suggestion in this case? Are edgeR/DESeq more reliable based on the result?
    Thank you so much!

    PS: I didn't do the outlier-detection validation because you hadn't replied about my strange results in posts #132 and #133.
    I could do that as well once the problem is clear.

