HTseq to DeSeq/EdgeR to Heatmap

dpryan replied

03-13-2014, 05:41 AM
Sure, so there are two approaches. The computational approach would be to generate some data where you know the underlying genes that are DE and then feed that into the programs and see how their output matches what you "know" to be the case. This method suffers from the down-side that it's completely dependent on how well your simulation data matches what's realistic...and getting this really correct is not always trivial.

The other method is the wet-lab approach. Take a handful of candidate genes that the various packages have called DE and then do some qPCR. To get a better idea how predictive that is, use different samples than you did for the sequencing. Also, don't bother with genes with a small fold-change, you're unlikely to reliably see small differences with qPCR.
Leave a comment:
super0925 replied

03-13-2014, 05:32 AM
Originally posted by dpryan View Post

1. I've never heard that cuffdiff can't handle unbalanced designs before, though I guess I've never tried.
2. I'm not sure how true that is (also, it'll be completely version dependent).
3. That's true, DESeq/edgeR are only targeted toward DE analysis. The actual reality is that it's often biologically unclear what it means when a given gene switches isoform use...

RPKM/FPKM are largely going to be replaced by TPM (transcripts per million). The benefit of RPKM/FPKM and TPM is that you can use an EM algorithm to come up with values, which is really helpful if you're interested in looking at transcripts. FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly. Counts have the benefit that you're not masking information by dividing by transcript/gene/whatever length. Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.

You're understanding of the false positive rate is incorrect. A false positive is a feature (gene, transcript, whatever) declared to be significantly different that isn't actually. There's also "true positive", "false negative" and "true negative". These are general statistical terms that you'd do well to familiarize yourself with. To actually find these values you either need to perform a simulation or, better yet, use real data and then perform independent validation (whether you use independent samples or not will depend on the goals of the validation, though using completely independent samples will yield more predictive results).

Thank you , I got the False positive meaning . It is helpful.
But I am still confused how to get the FPR value (what you said "use real data and then perform independent validation")? because from the output of edgeR/DESeq I only find the DE gene list with their P-value. Could you please illustrate it in detail?

Last edited by super0925; 03-13-2014, 05:39 AM.
Leave a comment:
dpryan replied

03-13-2014, 04:59 AM
1. I've never heard that cuffdiff can't handle unbalanced designs before, though I guess I've never tried.
2. I'm not sure how true that is (also, it'll be completely version dependent).
3. That's true, DESeq/edgeR are only targeted toward DE analysis. The actual reality is that it's often biologically unclear what it means when a given gene switches isoform use...

RPKM/FPKM are largely going to be replaced by TPM (transcripts per million). The benefit of RPKM/FPKM and TPM is that you can use an EM algorithm to come up with values, which is really helpful if you're interested in looking at transcripts. FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly. Counts have the benefit that you're not masking information by dividing by transcript/gene/whatever length. Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.

You're understanding of the false positive rate is incorrect. A false positive is a feature (gene, transcript, whatever) declared to be significantly different that isn't actually. There's also "true positive", "false negative" and "true negative". These are general statistical terms that you'd do well to familiarize yourself with. To actually find these values you either need to perform a simulation or, better yet, use real data and then perform independent validation (whether you use independent samples or not will depend on the goals of the validation, though using completely independent samples will yield more predictive results).
Leave a comment:
super0925 replied

03-13-2014, 03:00 AM
Originally posted by dpryan View Post

Yes, C1_R1 and C1_R2 are biological replicates (i.e., independent samples) of the C1 group. At least that's what appears to be the case from a cursory look through the paper. That would also make sense given how the data ought to normally be processed.

Hi dpryan
For my summary compare the software, what I get to know so far is that:
1. Cuffdiff cannot do unbalance analysis:
e.g. 3 control samples vs 4 case samples
or different time point or replicates. It could only do the pairwise analysis but edgeR/DESeq could do it.
2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust FDR )
3. Cuffdiff provides more information e.g. TSS, Transcripts, CDS but DESeq/edgeR only export DE genes.
Am I right? I am not sure.
And I haven't summarize advantage vs disadvantage of RFKM vs Count.
Do you have any ideas?

The second question is that : I found many paper's are talking false positive rate . It is another method to measure the software. So far I only know we could draw Venny Plot to observe the sharing significant Differential expressed(DE) genes.
What is the false positive? I think it represent number of non DE genes divided by total number of genes.
But so far I don't know how to find the FPR from the output of edgeR/Cuffdiff, cos they only output the gene ID and their P-value.
Thank you!
Leave a comment:
dpryan replied

03-12-2014, 08:22 AM
Yes, C1_R1 and C1_R2 are biological replicates (i.e., independent samples) of the C1 group. At least that's what appears to be the case from a cursory look through the paper. That would also make sense given how the data ought to normally be processed.
Leave a comment:
super0925 replied

03-12-2014, 08:20 AM
Originally posted by dpryan View Post

SRR031714 and SRR031715 are technical replicates (this is not obvious, have a look here), so merging the alignments together prior to counting makes sense.

In both of the cases you mentioned they processed each biological sample as a unit, so they end up being the same. Remember that in RNAseq technical replicates can simply be merged together.

So do you mean C1_R1 and C1_R2 are two samples but not the replicates of one sample in condition1(C1)?
And the replicates in each sample are better to be merged in Tophat before doing the Cufflinks. Thank you!
Leave a comment:
dpryan replied

03-12-2014, 06:33 AM
SRR031714 and SRR031715 are technical replicates (this is not obvious, have a look here), so merging the alignments together prior to counting makes sense.

In both of the cases you mentioned they processed each biological sample as a unit, so they end up being the same. Remember that in RNAseq technical replicates can simply be merged together.
Leave a comment:
super0925 replied

03-12-2014, 04:25 AM
Originally posted by dpryan View Post

That won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).

Hi I have another naive question.
In the classic paper introduced Cufflinks，http://www.nature.com/nprot/journal/....2012.016.html
tophat is used as
tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C1_R1_2.fq
tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C1_R2_2.fq
they mapping the reads per condition per replicate ，and then cuffmerge，to generate the merged.gtf and then do the Cuffdiff，

But on the classic paper introduced edgeR, http://www.nature.com/nprot/journal/....2013.099.html
tophat is used as
tophat2 -G genes.gtf -p 5 -o Untreated genome \ SRR031714_1.fastq,SRR031715_1.fastq SRR031714_2.fastq,SRR031715_2.fastq
tophat2 -G genes.gtf -p 5 -o treated genome \ SRR031724_1.fastq,SRR031725_1.fastq SRR031724_2.fastq,SRR031725_2.fastq

I think SRR031714,SRR031715 are replicates.

They generated the mapping result per sample and then go to HTseq to calculate the Counts and then do the edgeR.

So my question is what is the difference between these two tophat strategy (per sample per replicate VS per sample ) and is it really essential to mapping with replicate 1 by 1 when we use tophat-cufflinks-cuffmerge-cuffdiff pipeline?

Last edited by super0925; 03-12-2014, 04:29 AM.
Leave a comment:
super0925 replied

03-12-2014, 02:53 AM
Originally posted by Zapages View Post

I am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.

On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.

Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.

Also I created the counts using HTSEQ.

Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?

Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf

Thank you again for all the great help.

Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?

What I get to know so far is that:
1. Cuffdiff cannot do unbalance analysis:
e.g. 3 control samples vs 4 case samples
or different time point or replicates. It could only do the pairwise analysis but edgeR/DESeq could do it.
2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust FDR )
3. Cuffdiff provides more information e.g. TSS, Transcripts, CDS but DESeq/edgeR only export DE genes.
Am I right? I am not sure.
And I haven't summarize advantage vs disadvantage of RFKM vs Count.
Do you have any ideas?

Last edited by super0925; 03-12-2014, 02:56 AM.
Leave a comment:
Zapages replied

03-11-2014, 09:13 AM
Originally posted by dpryan View Post

That won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).

That should work as well. Thank you for pointing it out.

Anyway, I have used the Ensembl reference files (fasta and GTF) for my D. melanogaster data sets and it appears to be fine with flybase genome transcripts as their indicators for different genes in both DESeq and Cuffdiff.
Leave a comment:
dpryan replied

03-11-2014, 07:21 AM
Originally posted by Zapages View Post

Please use HT-Seq and use it on the BAM file outputs from Tophat2 with the GTF reference file from Ensembl. This is for EdgeR.

Furthermore, please use Ensembl Fasta reference chromosome and reference GTF for Tophat2/Cufflinks/Cuffmerge/Cuffdiff/cummeRbund.

This will allow you to have the same transcript ids. (FBtr#) across both methods.

I hope this helps a bit.

That won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).
Leave a comment:
Zapages replied

03-11-2014, 06:45 AM
Originally posted by super0925 View Post

Hi I found a big problem that gene ID output from Cuffdiff and egdeR are different so that I cannot find the common DE genes.
I use the data set with the Drosophila melanogaster genome.
The edgeR output gene ID (e.g. FBgn0000370,FBgn0000500,...)
But the Cuffdiff output gene ID is something like (XLOC_000028,XLOC_000038,...) and gene symbol (e.g.,KH1,RpLP1,...).
Could you please recommend any software or command to translate them automatically? or Could you have any good idea for me to generate the same gene ID?
Thank you!

Please use HT-Seq and use it on the BAM file outputs from Tophat2 with the GTF reference file from Ensembl. This is for EdgeR.

Furthermore, please use Ensembl Fasta reference chromosome and reference GTF for Tophat2/Cufflinks/Cuffmerge/Cuffdiff/cummeRbund.

This will allow you to have the same transcript ids. (FBtr#) across both methods.

I hope this helps a bit.

Last edited by Zapages; 03-11-2014, 06:51 AM.
Leave a comment:
super0925 replied

03-07-2014, 06:36 AM
Originally posted by Zapages View Post

I am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.

On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.

Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.

Also I created the counts using HTSEQ.

Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?

Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf

Thank you again for all the great help.

Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?

Hi I found a big problem that gene ID output from Cuffdiff and egdeR are different so that I cannot find the common DE genes.
I use the data set with the Drosophila melanogaster genome.
The edgeR output gene ID (e.g. FBgn0000370,FBgn0000500,...)
But the Cuffdiff output gene ID is something like (XLOC_000028,XLOC_000038,...) and gene symbol (e.g.,KH1,RpLP1,...).
Could you please recommend any software or command to translate them automatically? or Could you have any good idea for me to generate the same gene ID?
Thank you!

Last edited by super0925; 03-07-2014, 06:40 AM.
Leave a comment:
super0925 replied

02-24-2014, 02:10 AM
Thank you！ I will try your method to compare the performance of EdgeR/DeSeq/Tuxedo
Leave a comment:
sindrle replied

02-23-2014, 04:02 PM
Hi!
Depends on the version you use, but check out the _tracking - files
Leave a comment:

Previous 1 2 9 10 11 12 13 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News