
**Cole Trapnell** · 05-31-2012, 06:04 AM

Originally posted by francicco View Post

I personally do not trust cufflinks 2 results. For instance it gives 0 FPKM to transcript clearly expressed

Developers need to do something, sooner or later...

Can you please send a small data set to [email protected] that reproduces this behavior? We can't do much unless we can see what you're seeing (and being able to see it in the debugger makes fixing the problem vastly easier).

**glados** · 05-31-2012, 10:54 PM

Originally posted by Cole Trapnell View Post

Can you try re-running this analysis with --min-outlier-p 0 to see if it's the inline model checking that's causing the increase in NOTESTs?

Absolutely. Thank you so much for trying to help. I really need this to work soon.

I reran cuffdiff on the two sets of cufflinks output I had, with the same parameters (4+4 replicates) plus --min-outlier-p 0, and the result was exactly what I got before without --min-outlier-p: in gene_exp.diff, 35560 NOTESTs for one cufflinks set and 11240 NOTESTs for the other.

The difference between the two cufflinks groups I tested is that the first was run with --multi-read-correct --upper-quartile-norm --frag-bias-correct in both cufflinks and cuffdiff, and the other group only in cuffdiff (frag-bias-correct in cufflinks). Both were run with --GTF-guide. I am wondering if this has any influence on the number of NOTESTs. You can use these parameters in both programs, but is it better to use them in only one of them, and if so, which one? However, I still think that 11240 NOTESTs is too high, and it gives me practically 0 significant genes, which I'm sure is incorrect.

edit: I also tested using these parameters in cufflinks only and not in cuffdiff, and got the same results as when I used them in cuffdiff but not in cufflinks, i.e. 11240 NOTESTs.

Additionally, I wonder why the variance on every gene is huge. With so many replicates I would expect it to become smaller, but the error bars always reach the bottom (i.e. conf_lo = 0 and conf_hi is extremely high). I am wondering if this has anything to do with me not getting any significant DE genes.
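To put a number on "the error bars always reach the bottom", one could count how many expressed genes have a lower confidence bound of exactly 0. The following is a minimal sketch, not part of Cufflinks. It assumes a tab-separated genes.fpkm_tracking file whose header contains columns named `<label>_FPKM` and `<label>_conf_lo` (with `<label>` being one of the names passed to --labels); the function name `count_zero_conf_lo` is made up for illustration.

```bash
#!/usr/bin/env bash
# Count expressed genes (FPKM > 0) whose lower confidence bound is 0.
# Usage: count_zero_conf_lo genes.fpkm_tracking X
# ASSUMPTION: the header has columns "<label>_FPKM" and "<label>_conf_lo".
count_zero_conf_lo() {
    local file="$1" label="$2"
    awk -F'\t' -v label="$label" '
        NR == 1 {
            # Find the two columns by name instead of hard-coding positions.
            for (i = 1; i <= NF; i++) {
                if ($i == (label "_FPKM"))    fpkm = i
                if ($i == (label "_conf_lo")) lo = i
            }
            if (!fpkm || !lo) {
                print "columns not found for label " label > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        $fpkm + 0 > 0 { expressed++; if ($lo + 0 == 0) zero_lo++ }
        END {
            if (bad) exit 1
            printf "%d of %d expressed genes have conf_lo = 0\n", zero_lo, expressed
        }
    ' "$file"
}
```

Calling `count_zero_conf_lo output_path/genes.fpkm_tracking X` for each run (2+2, 3+3, 4+4) would show whether the share of genes with a zero lower bound really grows with the number of replicates.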

**Cole Trapnell** · 06-01-2012, 02:49 AM

Originally posted by glados View Post

Absolutely. Thank you so much for trying to help. I really need this to work soon.

I reran cuffdiff on the two sets of cufflinks output I had, with the same parameters (4+4 replicates) plus --min-outlier-p 0, and the result was exactly what I got before without --min-outlier-p: in gene_exp.diff, 35560 NOTESTs for one cufflinks set and 11240 NOTESTs for the other.

The difference between the two cufflinks groups I tested is that the first was run with --multi-read-correct --upper-quartile-norm --frag-bias-correct in both cufflinks and cuffdiff, and the other group only in cuffdiff (frag-bias-correct in cufflinks). Both were run with --GTF-guide. I am wondering if this has any influence on the number of NOTESTs. You can use these parameters in both programs, but is it better to use them in only one of them, and if so, which one? However, I still think that 11240 NOTESTs is too high, and it gives me practically 0 significant genes, which I'm sure is incorrect.

edit: I also tested using these parameters in cufflinks only and not in cuffdiff, and got the same results as when I used them in cuffdiff but not in cufflinks, i.e. 11240 NOTESTs.

I'm confused about what you did: are you following the protocol from the Cufflinks website (the Nature Protocols one)? If not, can you provide the full sequence of commands that you ran? Cufflinks doesn't emit NOTESTs - that's a Cuffdiff only thing.

**glados** · 06-01-2012, 03:39 AM

Originally posted by Cole Trapnell View Post

I'm confused about what you did: are you following the protocol from the Cufflinks website (the Nature Protocols one)? If not, can you provide the full sequence of commands that you ran? Cufflinks doesn't emit NOTESTs - that's a Cuffdiff only thing.

Yes, I'm following the protocol: Cufflinks on each individual sample's BAM file from tophat2, then cuffmerge on the assemblies text file with paths to the transcripts.gtf files. After that, Cuffdiff on the merged.gtf with 2 groups and paths to each sample's BAM file.

What I mean is that the parameters --multi-read-correct, --upper-quartile-norm and --frag-bias-correct are available for both cufflinks and cuffdiff, so I've tried using them only in cufflinks, only in cuffdiff, and in both. I get many more NOTESTs in the gene_exp.diff file from cuffdiff when I use these parameters in both cufflinks and cuffdiff (35560) than in only one of them (11240), so I wondered if that had anything to do with it. The number of NOTESTs is still high, though.

My cuffdiff command looks something like this:

Code:

cuffdiff -o output_path --labels X,Y --num-threads 12 --frag-bias-correct genome.fa --upper-quartile-norm --multi-read-correct merged.gtf X1.bam,X2.bam,X3.bam,X4.bam Y1.bam,Y2.bam,Y3.bam,Y4.bam
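After a run like the one above, the NOTEST counts quoted in this thread can be obtained by tallying the status column of gene_exp.diff. This is only a sketch: the function name `status_counts` is hypothetical, and the column is located by its header name `status` rather than by position.

```bash
#!/usr/bin/env bash
# Tally the values in the "status" column of a Cuffdiff *.diff file
# (e.g. OK, NOTEST, FAIL). Usage: status_counts output_path/gene_exp.diff
status_counts() {
    awk -F'\t' '
        NR == 1 {
            # Locate the status column from the header line.
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            if (!col) {
                print "no status column found" > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        { n[$col]++ }
        END {
            if (bad) exit 1
            for (s in n) print s "\t" n[s]
        }
    ' "$1" | LC_ALL=C sort
}
```

Running it on the gene_exp.diff of each parameter combination gives one small table per run, which makes the 35560 versus 11240 comparison easy to repeat.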

**Cole Trapnell** · 06-01-2012, 03:52 AM

Originally posted by glados View Post

Yes, I'm following the protocol: Cufflinks on each individual sample's BAM file from tophat2, then cuffmerge on the assemblies text file with paths to the transcripts.gtf files. After that, Cuffdiff on the merged.gtf with 2 groups and paths to each sample's BAM file.

What I mean is that the parameters --multi-read-correct, --upper-quartile-norm and --frag-bias-correct are available for both cufflinks and cuffdiff, so I've tried using them only in cufflinks, only in cuffdiff, and in both. I get many more NOTESTs in the gene_exp.diff file from cuffdiff when I use these parameters in both cufflinks and cuffdiff (35560) than in only one of them (11240), so I wondered if that had anything to do with it. The number of NOTESTs is still high, though.

Hmm. What happens when you cuffcompare the merged GTF files from cuffmerge produced using the different methods? Does Cufflinks produce substantially different assemblies when bias correction + multireads + quartile norm is enabled/disabled?

Based on your comment that the variances are huge, I'm wondering if the problem is with the assembly. Cuffdiff takes into consideration both cross-replicate variability and fragment assignment uncertainty (disambiguating how many reads came from each isoform). In general, the more isoforms a gene has, the more uncertainty there will be in assigning reads to each isoform, and the more uncertainty there will be in the overall gene expression level. That means more variance, so if you have a ton of isoforms (possibly because of a bad assembly), you'll see very few differentially expressed genes.

Another thing to check is whether you still see this when using a reference GTF. Have you tried that as a sanity check?
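One way to check whether an assembly has "a ton of isoforms" is to count distinct transcripts per gene in the GTF. The sketch below is an illustration only: the function name `isoforms_per_gene` is hypothetical, and it assumes the ninth GTF column carries `gene_id "..."` and `transcript_id "..."` attributes, as in the usual Cufflinks and cuffmerge output.

```bash
#!/usr/bin/env bash
# Print "<number of transcripts><TAB><gene_id>" for every gene in a GTF,
# most isoform-rich genes first. Usage: isoforms_per_gene merged.gtf
isoforms_per_gene() {
    awk -F'\t' '
        !/^#/ && NF >= 9 {
            g = t = ""
            # gene_id "VALUE": skip the 9-character prefix and the closing quote.
            if (match($9, /gene_id "[^"]+"/))
                g = substr($9, RSTART + 9, RLENGTH - 10)
            # transcript_id "VALUE": skip the 15-character prefix and the closing quote.
            if (match($9, /transcript_id "[^"]+"/))
                t = substr($9, RSTART + 15, RLENGTH - 16)
            # Count each (gene, transcript) pair once, however many exons it has.
            if (g != "" && t != "" && !((g, t) in seen)) {
                seen[g, t] = 1
                n[g]++
            }
        }
        END { for (g in n) print n[g] "\t" g }
    ' "$1" | LC_ALL=C sort -k1,1nr -k2,2
}
```

Comparing `isoforms_per_gene merged.gtf | head` with the same output for the reference annotation would show whether the assembled genes carry noticeably more isoforms than the reference.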

**glados** · 06-01-2012, 07:07 AM

Originally posted by Cole Trapnell View Post

Hmm. What happens when you cuffcompare the merged GTF files from cuffmerge produced using the different methods? Does Cufflinks produce substantially different assemblies when bias correction + multireads + quartile norm is enabled/disabled?

Based on your comment that the variances are huge, I'm wondering if the problem is with the assembly. Cuffdiff takes into consideration both cross-replicate variability and fragment assignment uncertainty (disambiguating how many reads came from each isoform). In general, the more isoforms a gene has, the more uncertainty there will be in assigning reads to each isoform, and the more uncertainty there will be in the overall gene expression level. That means more variance, so if you have a ton of isoforms (possibly because of a bad assembly), you'll see very few differentially expressed genes.

I have now run cuffcompare on the two different merged.gtf files. The .stats files look the same (also when I add the -r parameter). The last 5 columns in the .tracking file are just a bunch of zeros, and .tmap also shows just zeros for FPKM when I look at it. When I look manually at transcripts.gtf in the cuffmerge folder, the FPKM values always seem to lie between 0 and 1. When I look at transcripts.gtf in one of the cufflinks folders, the FPKM values are either 0 or a very big number (millions of FPKM). Is that how it should look?

I think the assembly went alright. The reads were quality filtered and trimmed before tophat. About 75% mapped in tophat 1.4.1 and many more in tophat2; I haven't checked the mapping statistics for all of tophat2 yet, but one sample gives me 86% mapped. I used the --GTF option in tophat. When I look at the BAM file visually in IGV, it looks good to me at least. A lot of reads seem to map to the exons. But I'm not an expert on how the assembly is supposed to look.

Another thing to check is whether you still see this when using a reference GTF. Have you tried that as a sanity check?

I'm not sure what you're asking. I have used the --GTF-guide option in cufflinks and the --ref-gtf option in cuffmerge. In tophat I used the --GTF option also. Do you want me to try to run cufflinks with --GTF instead of --GTF-guide?

What is weird is that I did not get this many NOTESTs with Cufflinks 1.3, but instead more FAILs and far fewer significant genes when I added more replicates in cuffdiff.

**robert-nci** · 06-01-2012, 10:49 AM

Hello all,

I have similar problems. First, the output of cuffdiff contains zeros for almost all the genes. I analyzed the same dataset with an older version and got non-zero FPKM values. I can even see the reads on the genes when I load the BAM files into IGV.
Another dataset with 2000 DEGs shows only 200 DEGs after analysis with cuffdiff2.

It would be appreciated if the developers of cuffdiff could help us figure out these issues.

Thanks,
Robert

**Cole Trapnell** · 06-02-2012, 03:15 PM

Originally posted by glados View Post

I'm not sure what you're asking. I have used the --GTF-guide option in cufflinks and the --ref-gtf option in cuffmerge. In tophat I used the --GTF option also. Do you want me to try to run cufflinks with --GTF instead of --GTF-guide?

What is weird is that I did not get this many NOTESTs with Cufflinks 1.3, but instead more FAILs and far fewer significant genes when I added more replicates in cuffdiff.

I was asking if you still see few significant genes and many NOTESTs when you use the reference GTF with Cuffdiff instead of the one you assembled.

It sounds like there are two different things going on here that aren't supposed to be happening:

1) When you run Cuffdiff with --frag-bias-correct, --multi-read-correct and --upper-quartile-norm, you see more NOTEST genes than when you leave all three off.

2) You see a very high number of NOTEST genes, and this number grows with more replicates.

I can't reproduce #1 with the datasets I've looked at. I have seen the number of NOTESTs grow with more replicates (see below for why this can happen), but I've not seen the number be so large.

A gene can be marked as NOTEST for one of several reasons:

1) There are not enough reads falling on the gene in either condition. The default threshold is 10 (though the threshold is applied to the common-scale normalized count). Genes with no detectable expression thus get marked NOTEST. You can control this behavior with the -c option.

2) Before testing, Cuffdiff 2 checks that its variance model is a good fit for the gene. For each gene, Cuffdiff 2 has a mean expression across replicates, a variance derived by its model (which gives you the confidence intervals), and an expression measurement from each replicate. If one or more of the replicates lies outside of the 99% confidence interval (the default; this is controlled with --min-outlier-p), Cuffdiff 2 thinks the variance model is a bad fit for the gene, and thus doesn't perform any testing and marks the gene NOTEST. Cuffdiff 1.3.0 doesn't do this; it's new behavior.

So what might be happening is that as you add more and more replicates, you're increasing the number of genes for which one of these replicates will lie outside of the model's variance estimate, causing the gene to get marked NOTEST. That's why I asked if you had set --min-outlier-p 0, because that should disable this whole model checking behavior. The model checking is meant to improve robustness of the results when you have very few replicates (2 or 3) - with 7 or 8 it's probably not helping much anyways.
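As a rough back-of-the-envelope illustration of that effect, assume (purely for the sake of the arithmetic) that the variance model is correct for a gene and that each of its n replicate measurements independently falls outside the 99% interval with probability 0.01. Cuffdiff's actual check may differ in detail, so the numbers below only show the direction of the trend:

```latex
\[
P(\text{at least one of } n \text{ replicates outside the 99\% interval}) = 1 - 0.99^{\,n}
\]
\[
n = 2:\ 1 - 0.99^{2} \approx 0.020, \qquad
n = 4:\ 1 - 0.99^{4} \approx 0.039, \qquad
n = 8:\ 1 - 0.99^{8} \approx 0.077
\]
```

Under these assumptions, going from 2 to 8 replicate measurements roughly quadruples the share of well-behaved genes that would still be flagged, and genes that fit the model poorly would be flagged more often than that.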

A few more questions to help me figure out where the problem is:

1) What happens when you set -c 0? Does the number of NOTESTs go down?
2) Can you figure out which of --multi-read-correct, --frag-bias-correct, or --upper-quartile-norm is causing the increase in NOTESTs in that 7+8 run?
3) Do the replicates segregate together when you cluster them using CummeRbund's csDendro function? You can check this easily by passing replicates=T to csDendro. I just want to rule out one of the replicates being bad.
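To isolate which of the three options is responsible (question 2), one approach is to run every on/off combination into its own output directory. The sketch below only prints the eight command lines so they can be reviewed before anything is executed. It reuses the option spellings from the cuffdiff command posted earlier in the thread; the function name `print_cuffdiff_grid`, the `diff_b*_q*_m*` directory names and the X,Y labels are placeholders, not anything Cuffdiff requires.

```bash
#!/usr/bin/env bash
# Print (do not run) one cuffdiff command per on/off combination of
# --frag-bias-correct, --upper-quartile-norm and --multi-read-correct.
# Usage: print_cuffdiff_grid genome.fa merged.gtf X1.bam,X2.bam Y1.bam,Y2.bam
print_cuffdiff_grid() {
    local genome="$1" gtf="$2" x_bams="$3" y_bams="$4"
    local b q m opts tag
    for b in 0 1; do
        for q in 0 1; do
            for m in 0 1; do
                opts=""
                tag="b${b}_q${q}_m${m}"   # hypothetical output directory suffix
                if [ "$b" = 1 ]; then opts="$opts --frag-bias-correct $genome"; fi
                if [ "$q" = 1 ]; then opts="$opts --upper-quartile-norm"; fi
                if [ "$m" = 1 ]; then opts="$opts --multi-read-correct"; fi
                echo "cuffdiff -o diff_${tag} --labels X,Y${opts} $gtf $x_bams $y_bams"
            done
        done
    done
}
```

The printed lines can be pasted into a job script, and the NOTEST count of each resulting gene_exp.diff then compared across the eight directories.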

**Cole Trapnell** · 06-02-2012, 03:18 PM

Originally posted by robert-nci View Post

Hello all,

I have similar problems. First, the output of cuffdiff contains zeros for almost all the genes. I analyzed the same dataset with an older version and got non-zero FPKM values. I can even see the reads on the genes when I load the BAM files into IGV.
Another dataset with 2000 DEGs shows only 200 DEGs after analysis with cuffdiff2.

It would be appreciated if the developers of cuffdiff could help us figure out these issues.

Thanks,
Robert

Another user has reported this to us, but we haven't been able to reproduce it ourselves. Can you send us a small test set of data that we can use to debug this? If you're willing to share the test data, you can email us instructions for getting it to [email protected]

**Cole Trapnell** · 06-02-2012, 03:21 PM

Originally posted by robert-nci View Post

Another dataset with 2000 DEGs shows only 200 DEGs after analysis with cuffdiff2.

I should also point out that this doesn't necessarily mean there's a problem - Cuffdiff 2 tends to report fewer DE genes than other tools such as DESeq or edgeR, because these tools don't incorporate read mapping uncertainty into the variance for each gene's expression estimate.

**gesdy** · 06-05-2012, 08:53 AM

Fpkm =0

Hey guys, I had the same problem...
When I used the -b option to correct for sequence bias, I got all zeros in my gene_exp.diff file. If I don't use the -b option, the FPKM values are normal.
Any idea?
thanks!
Mat

**glados** · 06-05-2012, 10:46 AM

Originally posted by Cole Trapnell View Post

A few more questions to help me figure out where the problem is:

1) What happens when you set -c 0? Does the number of NOTESTs go down?
2) Can you figure out which of --multi-read-correct, --frag-bias-correct, or --upper-quartile-norm is causing the increase in NOTESTs in that 7+8 run?
3) Do the replicates segregate together when you cluster them using CummeRbund's csDendro function? You can check this easily by passing replicates=T to csDendro. I just want to rule out one of the replicates being bad.

I have tried testing a lot this weekend. Here is the result:

1) Yes I got much fewer NOTESTs with -c 0 (1188 instead of 11240). --min-outlier-p 0 didn't affect NOTESTs as I mentioned in an earlier post.

2) I have finally figured out that it is --frag-bias-correct that gives extremely many NOTESTs (35560) if I use this parameter in both cufflinks and cuffdiff. It does not seem to affect NOTESTs when used in only one of them or disabled (11240). --multi-read-correct and --upper-quartile-norm do not seem to affect the number of NOTESTs whether used in cufflinks, cuffdiff, or both, at least as far as I have tested.

3) The csDendro plot with replicates=T looks good for the 2+2 and 3+3 replicate runs. The two conditions end up in different clades with almost equal branch lengths. For 4+4 it's the only plot in cummeRbund that doesn't work for me. I don't know why. It gives this error message:

Error in plot.window(...) : need finite 'ylim' values
In addition: There were 32 warnings (use warnings() to see them)

4) I have now tried using cuffmerge and then cuffdiff on the reference.gtf instead of the transcripts.gtf from cufflinks. The error bars look about the same. They grow when I add more replicates (the confidence intervals widen). I suspect this is why the number of significant genes goes down as well when adding more replicates: from 6000 (2+2) to 500 (3+3) to 70 (4+4). So I will get 0 significant genes by the time I get to 7+8.

Summarized:
--frag-bias-correct gave more NOTESTs when used in both cufflinks and cuffdiff. I will avoid doing this, lesson learned.
When adding parameter -c 0, it gave fewer NOTESTs. Should I use this?
Using reference.gtf in cuffmerge and then cuffdiff did not give better confidence intervals; it shows the same problem, with the error bars growing when adding more replicates.

Any idea what the problem might be? Again, thanks for helping me, I appreciate it a lot.

**gesdy** · 06-06-2012, 06:31 AM

Update:
Using cuffdiff 1.3, I obtained normal results with the -b option as well.
The same analysis on the same data with cuffdiff 2 gives me 99% of FPKM = 0; if I remove -b, it works.
Mat
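A figure like "99% of FPKM = 0" can be checked directly against gene_exp.diff, which makes runs with and without -b easy to compare. This is a minimal sketch with a hypothetical function name, `zero_fpkm_fraction`; it assumes the per-condition FPKM values sit in columns whose header names are `value_1` and `value_2`.

```bash
#!/usr/bin/env bash
# Report how many rows of a Cuffdiff gene_exp.diff have FPKM 0 in both
# conditions. Usage: zero_fpkm_fraction output_path/gene_exp.diff
# ASSUMPTION: FPKM values are in columns named "value_1" and "value_2".
zero_fpkm_fraction() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "value_1") v1 = i
                if ($i == "value_2") v2 = i
            }
            if (!v1 || !v2) {
                print "value_1/value_2 columns not found" > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        { total++; if ($v1 + 0 == 0 && $v2 + 0 == 0) zero++ }
        END {
            if (bad) exit 1
            if (total == 0) { print "no data rows"; exit }
            printf "%d of %d genes have FPKM 0 in both conditions (%.1f%%)\n",
                   zero, total, 100 * zero / total
        }
    ' "$1"
}
```

Running it once on the output produced with -b and once on the output produced without it would show the difference as two comparable percentages.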

**robert-nci** · 06-06-2012, 09:28 AM

Just for reference, I have made the same observation as gesdy.
That said, I'm not sure whether cuffdiff produces reliable results in its current state.

**glados** · 06-11-2012, 12:06 AM

I am still trying to figure out why the variance increases greatly with more replicates, and how I can counter it. Suggestions?
