Sure, so there are two approaches. The computational approach would be to generate some data where you know the underlying genes that are DE and then feed that into the programs and see how their output matches what you "know" to be the case. This method suffers from the down-side that it's completely dependent on how well your simulation data matches what's realistic...and getting this really correct is not always trivial.
The other method is the wet-lab approach. Take a handful of candidate genes that the various packages have called DE and then do some qPCR. To get a better idea how predictive that is, use different samples than you did for the sequencing. Also, don't bother with genes with a small fold-change, you're unlikely to reliably see small differences with qPCR.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by dpryan View Post1. I've never heard that cuffdiff can't handle unbalanced designs before, though I guess I've never tried.
2. I'm not sure how true that is (also, it'll be completely version dependent).
3. That's true, DESeq/edgeR are only targeted toward DE analysis. The actual reality is that it's often biologically unclear what it means when a given gene switches isoform use...
RPKM/FPKM are largely going to be replaced by TPM (transcripts per million). The benefit of RPKM/FPKM and TPM is that you can use an EM algorithm to come up with values, which is really helpful if you're interested in looking at transcripts. FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly. Counts have the benefit that you're not masking information by dividing by transcript/gene/whatever length. Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.
You're understanding of the false positive rate is incorrect. A false positive is a feature (gene, transcript, whatever) declared to be significantly different that isn't actually. There's also "true positive", "false negative" and "true negative". These are general statistical terms that you'd do well to familiarize yourself with. To actually find these values you either need to perform a simulation or, better yet, use real data and then perform independent validation (whether you use independent samples or not will depend on the goals of the validation, though using completely independent samples will yield more predictive results).
But I am still confused how to get the FPR value (what you said "use real data and then perform independent validation")? because from the output of edgeR/DESeq I only find the DE gene list with their P-value. Could you please illustrate it in detail?Last edited by super0925; 03-13-2014, 05:39 AM.
Leave a comment:
-
1. I've never heard that cuffdiff can't handle unbalanced designs before, though I guess I've never tried.
2. I'm not sure how true that is (also, it'll be completely version dependent).
3. That's true, DESeq/edgeR are only targeted toward DE analysis. The actual reality is that it's often biologically unclear what it means when a given gene switches isoform use...
RPKM/FPKM are largely going to be replaced by TPM (transcripts per million). The benefit of RPKM/FPKM and TPM is that you can use an EM algorithm to come up with values, which is really helpful if you're interested in looking at transcripts. FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly. Counts have the benefit that you're not masking information by dividing by transcript/gene/whatever length. Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.
You're understanding of the false positive rate is incorrect. A false positive is a feature (gene, transcript, whatever) declared to be significantly different that isn't actually. There's also "true positive", "false negative" and "true negative". These are general statistical terms that you'd do well to familiarize yourself with. To actually find these values you either need to perform a simulation or, better yet, use real data and then perform independent validation (whether you use independent samples or not will depend on the goals of the validation, though using completely independent samples will yield more predictive results).
Leave a comment:
-
Originally posted by dpryan View PostYes, C1_R1 and C1_R2 are biological replicates (i.e., independent samples) of the C1 group. At least that's what appears to be the case from a cursory look through the paper. That would also make sense given how the data ought to normally be processed.
For my summary compare the software, what I get to know so far is that:
1. Cuffdiff cannot do unbalance analysis:
e.g. 3 control samples vs 4 case samples
or different time point or replicates. It could only do the pairwise analysis but edgeR/DESeq could do it.
2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust FDR )
3. Cuffdiff provides more information e.g. TSS, Transcripts, CDS but DESeq/edgeR only export DE genes.
Am I right? I am not sure.
And I haven't summarize advantage vs disadvantage of RFKM vs Count.
Do you have any ideas?
The second question is that : I found many paper's are talking false positive rate . It is another method to measure the software. So far I only know we could draw Venny Plot to observe the sharing significant Differential expressed(DE) genes.
What is the false positive? I think it represent number of non DE genes divided by total number of genes.
But so far I don't know how to find the FPR from the output of edgeR/Cuffdiff, cos they only output the gene ID and their P-value.
Thank you!
Leave a comment:
-
Yes, C1_R1 and C1_R2 are biological replicates (i.e., independent samples) of the C1 group. At least that's what appears to be the case from a cursory look through the paper. That would also make sense given how the data ought to normally be processed.
Leave a comment:
-
Originally posted by dpryan View PostSRR031714 and SRR031715 are technical replicates (this is not obvious, have a look here), so merging the alignments together prior to counting makes sense.
In both of the cases you mentioned they processed each biological sample as a unit, so they end up being the same. Remember that in RNAseq technical replicates can simply be merged together.
And the replicates in each sample are better to be merged in Tophat before doing the Cufflinks. Thank you!
Leave a comment:
-
SRR031714 and SRR031715 are technical replicates (this is not obvious, have a look here), so merging the alignments together prior to counting makes sense.
In both of the cases you mentioned they processed each biological sample as a unit, so they end up being the same. Remember that in RNAseq technical replicates can simply be merged together.
Leave a comment:
-
Originally posted by dpryan View PostThat won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).
In the classic paper introduced Cufflinks,http://www.nature.com/nprot/journal/....2012.016.html
tophat is used as
tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C1_R1_2.fq
tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C1_R2_2.fq
they mapping the reads per condition per replicate ,and then cuffmerge,to generate the merged.gtf and then do the Cuffdiff,
But on the classic paper introduced edgeR, http://www.nature.com/nprot/journal/....2013.099.html
tophat is used as
tophat2 -G genes.gtf -p 5 -o Untreated genome \ SRR031714_1.fastq,SRR031715_1.fastq SRR031714_2.fastq,SRR031715_2.fastq
tophat2 -G genes.gtf -p 5 -o treated genome \ SRR031724_1.fastq,SRR031725_1.fastq SRR031724_2.fastq,SRR031725_2.fastq
I think SRR031714,SRR031715 are replicates.
They generated the mapping result per sample and then go to HTseq to calculate the Counts and then do the edgeR.
So my question is what is the difference between these two tophat strategy (per sample per replicate VS per sample ) and is it really essential to mapping with replicate 1 by 1 when we use tophat-cufflinks-cuffmerge-cuffdiff pipeline?Last edited by super0925; 03-12-2014, 04:29 AM.
Leave a comment:
-
Originally posted by Zapages View PostI am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.
On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.
Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.
Also I created the counts using HTSEQ.
Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?
Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf
Thank you again for all the great help.
Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?
What I get to know so far is that:
1. Cuffdiff cannot do unbalance analysis:
e.g. 3 control samples vs 4 case samples
or different time point or replicates. It could only do the pairwise analysis but edgeR/DESeq could do it.
2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust FDR )
3. Cuffdiff provides more information e.g. TSS, Transcripts, CDS but DESeq/edgeR only export DE genes.
Am I right? I am not sure.
And I haven't summarize advantage vs disadvantage of RFKM vs Count.
Do you have any ideas?Last edited by super0925; 03-12-2014, 02:56 AM.
Leave a comment:
-
Originally posted by dpryan View PostThat won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).
Anyway, I have used the Ensembl reference files (fasta and GTF) for my D. melanogaster data sets and it appears to be fine with flybase genome transcripts as their indicators for different genes in both DESeq and Cuffdiff.
Leave a comment:
-
Originally posted by Zapages View PostPlease use HT-Seq and use it on the BAM file outputs from Tophat2 with the GTF reference file from Ensembl. This is for EdgeR.
Furthermore, please use Ensembl Fasta reference chromosome and reference GTF for Tophat2/Cufflinks/Cuffmerge/Cuffdiff/cummeRbund.
This will allow you to have the same transcript ids. (FBtr#) across both methods.
I hope this helps a bit.
Leave a comment:
-
Originally posted by super0925 View PostHi I found a big problem that gene ID output from Cuffdiff and egdeR are different so that I cannot find the common DE genes.
I use the data set with the Drosophila melanogaster genome.
The edgeR output gene ID (e.g. FBgn0000370,FBgn0000500,...)
But the Cuffdiff output gene ID is something like (XLOC_000028,XLOC_000038,...) and gene symbol (e.g.,KH1,RpLP1,...).
Could you please recommend any software or command to translate them automatically? or Could you have any good idea for me to generate the same gene ID?
Thank you!
Furthermore, please use Ensembl Fasta reference chromosome and reference GTF for Tophat2/Cufflinks/Cuffmerge/Cuffdiff/cummeRbund.
This will allow you to have the same transcript ids. (FBtr#) across both methods.
I hope this helps a bit.Last edited by Zapages; 03-11-2014, 06:51 AM.
Leave a comment:
-
Originally posted by Zapages View PostI am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.
On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.
Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.
Also I created the counts using HTSEQ.
Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?
Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf
Thank you again for all the great help.
Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?
I use the data set with the Drosophila melanogaster genome.
The edgeR output gene ID (e.g. FBgn0000370,FBgn0000500,...)
But the Cuffdiff output gene ID is something like (XLOC_000028,XLOC_000038,...) and gene symbol (e.g.,KH1,RpLP1,...).
Could you please recommend any software or command to translate them automatically? or Could you have any good idea for me to generate the same gene ID?
Thank you!Last edited by super0925; 03-07-2014, 06:40 AM.
Leave a comment:
-
Thank you! I will try your method to compare the performance of EdgeR/DeSeq/Tuxedo
Leave a comment:
-
Hi!
Depends on the version you use, but check out the _tracking - files
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-25-2024, 11:49 AM
|
0 responses
19 views
0 likes
|
Last Post
by seqadmin
04-25-2024, 11:49 AM
|
||
Started by seqadmin, 04-24-2024, 08:47 AM
|
0 responses
19 views
0 likes
|
Last Post
by seqadmin
04-24-2024, 08:47 AM
|
||
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
62 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
60 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
Leave a comment: