Non-typical p-value distribution when running DESeq

Michael Love replied

07-01-2014, 06:03 AM
Yes, David's right.

Here's a pairs plot of your counts in the log scale

Code:

library(DESeq2)
# log10 of raw counts; +1 avoids log10(0)
y <- log10(counts(dds) + 1)
pairs(y, panel = function(...) smoothScatter(..., nrpoints = 0, add = TRUE), lower.panel = NULL)

For the samples other than 'water', we can see the diagonal line that would specify a log fold change of 0 between the two samples. This is the line that DESeq and edgeR use for defining a scaling factor for normalizing for sequencing depth.

However, for water vs others, a simple scaling factor automatically detected from the data will not work.

For the scatterplots of 1 vs 3 and 1 vs 10, there seems to be a faint line of genes on the diagonal. Maybe you can investigate what is special about these genes. It is possible that nearly all the genes are differentially expressed (upregulated in the treated groups), but then the experiment really should use spike-in controls for normalization.

I wonder if the experimental protocol might have been different for the water samples?

Another option for analysis would be to remove the water samples and use the 'contrast' argument to just compare the treatment groups against each other.
Attached Files

pairs.jpeg (141.1 KB, 6 views)
gringer replied

06-30-2014, 05:16 PM
That plot also doesn't look wonderful, presumably because you've got the p-value on the X axis. An MA plot usually has log fold change on Y, and average log expression on X.

Are you able to do a scatter plot of the raw counts for each experiment, preferably log-transformed or using the VST from DESeq/DESeq2? If you're not getting a line that distributes around y=x with those plots, it's probably not a good idea trying to shoehorn in a differential expression analysis.
Michael Love replied

06-30-2014, 04:55 PM
The last plot you posted, hm_norm_counts, was not an MA plot. It was logFC ~ adjusted p-value.

An MA plot is logFC ~ log of mean counts, or mean of log counts.
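
As a rough illustration of that definition, the base R sketch below computes M and A for two samples from a small made-up count matrix. The counts, sample names and the pseudocount of 1 are assumptions for illustration only; DESeq2's own plotMA() works from the fitted log2 fold changes and the mean of normalized counts rather than from two raw columns.

```r
# Toy counts (made up): rows are genes, columns are two samples
cnt <- matrix(c(10, 20, 100, 110, 1000, 900, 0, 50),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB")))

# log2 counts with a pseudocount of 1 to avoid log2(0)
lg <- log2(cnt + 1)

M <- lg[, "sampleB"] - lg[, "sampleA"]   # log fold change (y axis)
A <- rowMeans(lg)                        # mean of log counts (x axis)

plot(A, M, xlab = "A: mean log2 count", ylab = "M: log2 fold change")
abline(h = 0, lty = 2)                   # reference line for no change
```

If most points sit far from the dashed line across the whole range of A, that is the pattern discussed in this thread: the samples differ for most genes, not just a subset.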

Also, note the scale of the y axis is much larger than the previous plots.

Regarding the goal of reducing the number of sig genes: I don't know if you've reduced the number of significant genes for better or for worse. We can easily reduce the number of genes, either by reducing the FDR threshold or increasing the lfcThreshold argument.
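
To make the FDR point concrete, here is a small base R sketch on simulated p-values (not the data from this thread). It only shows that a stricter Benjamini-Hochberg cutoff can never lengthen the list. In DESeq2 the corresponding settings are the alpha and lfcThreshold arguments of results().

```r
set.seed(1)
# Simulated p-values: 800 nulls (uniform) plus 200 with real signal (near 0)
p <- c(runif(800), rbeta(200, 0.5, 50))

padj <- p.adjust(p, method = "BH")   # Benjamini-Hochberg adjusted p-values

# Number of genes called significant at progressively stricter FDR cutoffs
n_sig <- sapply(c(0.1, 0.05, 0.01), function(alpha) sum(padj < alpha))
names(n_sig) <- c("FDR 0.1", "FDR 0.05", "FDR 0.01")
print(n_sig)
```

A shorter list obtained this way is not automatically a better one, which is the caveat in the paragraph above.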

We know for sure, from the PCA plot, that the differences between the samples are very large compared to the variation between biological replicates.

Could you send the dds object to me privately, so I can have a look?

My email is listed here:

maintainer("DESeq2")
alyamahmoud replied

06-30-2014, 03:19 PM
Is it wrong to apply a hierarchical model? It reduces the number of sig genes drastically, but the MA plot looks better, I think?
alyamahmoud replied

06-30-2014, 03:19 PM
They are not multiple species, only one species (same as reference) but under different environmental conditions (different pH ranges, anaerobic, water vs wt that is aerobic)
Michael Love replied

06-30-2014, 04:21 AM
collapseReplicates() is for technical replicates only. We obviously do not recommend collapsing biological replicates, as you throw away information from the experiment.

The "problem" of too many small p-values, or a p-value distribution with a spike at 0, means that you have many large differences across the conditions.

Can you say more about the genes here? If you are sequencing multiple species, what is the relation of each species to the reference genome/transcriptome to which the reads were aligned?
alyamahmoud replied

06-30-2014, 12:50 AM
hierarchical model

Is there any objection to applying a hierarchical model to the normalized counts? I tried a limma analysis on the normalized counts; the MA plot is also attached.
Attached Files

hm_norm_counts.pdf (32.0 KB, 44 views)
gringer replied

06-29-2014, 01:08 AM
genome

Just a general question related to this (based on the mapping IDs you have provided), is there anyone here on seqanswers who has successfully done a DESeq / DESeq2 run on E. coli?
alyamahmoud replied

06-28-2014, 05:08 PM
If I use collapseReplicates the number of DEG decreases massively; however, this doesn't improve the MA plot (attached)!

any help ?
Attached Files

maplot_deseq2_collapsed_after_size.pdf (55.8 KB, 38 views)
alyamahmoud replied

06-28-2014, 11:08 AM
Hi Michael

These are biological replicates.

"The dispersion is calculated based on variance within conditions, so the dispersion is not necessarily large though you have large differences across conditions."

I am not sure I get what you mean here.

There is a set of genes that the biologists know should be varying, and these are non-metagenomics samples; single species per sample.

What would you suggest ?
Michael Love replied

06-27-2014, 12:02 PM
Note that the replicates are right on top of each other in the PCA plot. Are these technical or biological replicates?

The dispersion is calculated based on variance within conditions, so the dispersion is not necessarily large though you have large differences across conditions.
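
A small simulation may make this clearer. The sketch below uses simulated negative binomial counts for one hypothetical gene and a crude method-of-moments dispersion estimate (variance = mean + dispersion * mean^2). This is not DESeq2's estimator; it is only meant to show why the estimate reflects spread within a condition and not the gap between conditions.

```r
set.seed(42)
true_disp <- 0.01                      # assumed small biological dispersion
# One gene, two conditions with a 100-fold difference in mean, 50 replicates each
# (far more replicates than a real experiment, to keep the toy estimate stable)
low  <- rnbinom(50, mu = 100,   size = 1 / true_disp)
high <- rnbinom(50, mu = 10000, size = 1 / true_disp)

# Method of moments: var = mu + disp * mu^2  =>  disp = (var - mu) / mu^2
mom_disp <- function(x) (var(x) - mean(x)) / mean(x)^2

within <- c(low = mom_disp(low), high = mom_disp(high))  # per-condition estimates
pooled <- mom_disp(c(low, high))                         # wrongly ignores the groups

print(round(within, 3))   # both stay small despite the 100-fold mean difference
print(round(pooled, 3))   # inflated, because it mixes in the condition effect
```

Only the pooled value, which ignores the design, is large. The per-condition values are what matter for the test, which is why tight replicates combined with large condition effects give many small p-values.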

I'm not so familiar with microbial analysis. I'd guess, like others mentioned above, that you have many genes with counts for only one species. And there is not a clear group of genes which are not DE across the conditions. This makes normalization difficult, as the automatic methods within DESeq or edgeR are based on the assumption that there are enough genes that are not DE, such that robust measures like median or trimmed mean can find the center of the distribution of log ratios of samples.

Is there a set of genes that the biologists suspect might be equally expressed across the groups?
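
The normalization idea described above can be sketched in a few lines of base R. This is a simplified median-of-ratios calculation on a made-up count matrix, not DESeq2's actual code; the function name size_factors and its control argument are hypothetical. In DESeq2 itself, estimateSizeFactors() has a controlGenes argument for restricting the calculation to genes believed to be unchanged.

```r
# Simplified median-of-ratios size factors (hypothetical helper, base R only)
size_factors <- function(counts, control = seq_len(nrow(counts))) {
  log_counts <- log(counts[control, , drop = FALSE])
  log_geo_mean <- rowMeans(log_counts)          # per-gene log geometric mean
  keep <- is.finite(log_geo_mean)               # drop genes with any zero count
  # per sample: median log ratio to the geometric mean, back on the count scale
  apply(log_counts[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

# Made-up counts: s2 is s1 sequenced twice as deep; in s3 genes 1-4 are
# strongly up-regulated and only genes 5-6 are unchanged
counts <- cbind(s1 = c(10, 20, 30, 40, 50, 60),
                s2 = c(20, 40, 60, 80, 100, 120),
                s3 = c(200, 400, 600, 800, 50, 60))

sf_all  <- size_factors(counts)                  # median pulled by the DE majority
sf_ctrl <- size_factors(counts, control = 5:6)   # uses only the unchanged genes
print(round(sf_all, 2))
print(round(sf_ctrl, 2))
```

With all genes, s3 gets a size factor 20 times that of s1, so the up-regulation is mistaken for sequencing depth. With the two control genes, s1 and s3 get the same factor and s2 gets twice that, which matches how the toy data were built.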
alyamahmoud replied

06-27-2014, 11:40 AM
Hi Michael

Thanks for your reply.

I attached the PCA plot.

Yes, these are very different conditions for some bacterial species, paired-end standard RNAseq, % of mapped reads >= 97%.

How can I handle this data without spike-in controls ? How much effect does this massive dispersion have on the final DEG ?
Attached Files

pca_deseq2.pdf (5.0 KB, 50 views)
Michael Love replied

06-27-2014, 09:56 AM
Indeed, something looks strange about this experiment. Try running the PCA steps in the vignette to see how the samples are distributed. It looks like the different conditions are very different from each other for many rows. This can make the normalization difficult without spike-in controls.

Can you describe the experiment in more detail? Is this standard RNA-Seq data?
gringer replied

06-26-2014, 10:37 PM
Do you have mapping percentages for your genome? If they're wildly different, that might point to sample differences that could greatly influence the results.

What your MA plots seem to be showing is that one of the populations is producing normalised counts for genes that are effectively zero.

You can look at the count input data (or DESeq-normalised data) to see if this is the case -- just looking at total counts per experiment should show odd patterns. If the differences are as large as they look from the MA plot, you won't need DESeq at all to find differences.
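
A quick base R check along these lines might look like the sketch below. The count matrix and sample names are made up for illustration; with a DESeqDataSet one would start from counts(dds) instead.

```r
# Made-up raw counts: rows are genes, columns are samples
cnt <- cbind(sample_A = c(0, 1, 0, 2, 0),
             sample_B = c(150, 300, 80, 500, 40),
             sample_C = c(160, 280, 90, 450, 55))

total_counts <- colSums(cnt)         # library size per sample
frac_zero    <- colMeans(cnt == 0)   # fraction of genes with zero counts

print(total_counts)
print(round(frac_zero, 2))
```

A sample whose total is orders of magnitude below the others, or whose genes are mostly zero, would stand out here before any model is fitted.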
alyamahmoud replied

06-26-2014, 10:26 PM
I will check with the biologist who gave me the data, but I don't think they are coming from different populations.

How reliable or unreliable would the DEG results be in that case?