**dpryan** · 07-27-2013, 01:39 AM

DESeq(2), edgeR, or limma/voom are your best bets, so you can ignore the RPKM calculation issue. Regarding cufflinks/cuffdiff, the earlier versions were pretty suboptimal, though the more recent versions seem vastly improved.

**chenjy** · 07-28-2013, 03:25 AM

Originally posted by dpryan:

DESeq(2), edgeR, or limma/voom are your best bets, so you can ignore the RPKM calculation issue. Regarding cufflinks/cuffdiff, the earlier versions were pretty suboptimal, though the more recent versions seem vastly improved.

Thanks for your reply!

Can you explain why earlier versions of cufflinks/cuffdiff were suboptimal? Is it because cufflinks/cuffdiff works at the transcript level, while it is not possible to accurately define all transcripts from RNA-Seq data? Or is it because the underlying statistical model is not good enough?

**dpryan** · 07-28-2013, 06:25 AM

It seems to have been more of a change in implementation, though I can't say for sure. Early on, different point versions were giving very different results, some of which were nonsensical. From reading the forums here, it seems that the most recent versions are more stable and use a somewhat different approach, though I'm not an expert on cufflinks/cuffdiff.

**NGSfan** · 07-30-2013, 07:38 AM

You should read the paper "Computational methods for transcriptome annotation and quantification using RNA-seq" in Nature Methods.

In that paper, the cuffdiff authors argue why the transcript expression method is conceptually better than the exon union/intersection methods.

**chenjy** · 07-30-2013, 06:44 PM

Originally posted by NGSfan:

You should read the paper "Computational methods for transcriptome annotation and quantification using RNA-seq" in Nature Methods.

In that paper, the cuffdiff authors argue why the transcript expression method is conceptually better than the exon union/intersection methods.

Actually, the recent publication in Nature Biotechnology titled "Differential analysis of gene regulation at transcript resolution with RNA-seq" explicitly points out the advantage of the transcript-level method over the union/intersection methods. However, transcript-level expression quantification requires accurately allocating each read to a specific transcript, and I am not sure how well cuffdiff performs at this. Is it possible to know the exact transcript a read came from, given that RNA-Seq reads are short and unevenly distributed across the transcript?

**chenjy** · 07-30-2013, 06:56 PM

Originally posted by dpryan:

It seems to have been more of a change in implementation, though I can't say for sure. Early on, different point versions were giving very different results, some of which were nonsensical. From reading the forums here, it seems that the most recent versions are more stable and use a somewhat different approach, though I'm not an expert on cufflinks/cuffdiff.

I see. The new version of Cuffdiff may then require further evaluation.

**jparsons** · 07-31-2013, 07:48 AM

Originally posted by chenjy:

Actually, the recent publication in Nature Biotechnology titled "Differential analysis of gene regulation at transcript resolution with RNA-seq" explicitly points out the advantage of the transcript-level method over the union/intersection methods. However, transcript-level expression quantification requires accurately allocating each read to a specific transcript, and I am not sure how well cuffdiff performs at this. Is it possible to know the exact transcript a read came from, given that RNA-Seq reads are short and unevenly distributed across the transcript?

The paper only provides a theoretical explanation for an advantage: there's no information whatsoever on whether those types of reads actually exist in data. While I assume they occur, I find it difficult to assume that they occur in greater proportion than the error rate of transcript-level read assignment, especially since I've seen replicate data (from *technical* reps of the exact same library, no less) where cufflinks assigns 100% of reads to transcript X in overlapping transcripts X/Y in rep A but 90% of them to transcript Y in rep B.

**NGSfan** · 07-31-2013, 08:08 AM

Originally posted by jparsons:

The paper only provides a theoretical explanation for an advantage: there's no information whatsoever on whether those types of reads actually exist in data. While I assume they occur, I find it difficult to assume that they occur in greater proportion than the error rate of transcript-level read assignment, especially since I've seen replicate data (from *technical* reps of the exact same library, no less) where cufflinks assigns 100% of reads to transcript X in overlapping transcripts X/Y in rep A but 90% of them to transcript Y in rep B.

Thanks for sharing this! What were the RPKMs of these transcripts?

RNA-seq suffers from random sampling, so if the overall gene expression level is low, the chance is higher that reads will be assigned to another transcript, even in technical replicates. E.g., if you have only 2-4 supporting reads for a splice junction and get 0 supporting reads for that same junction the next time you sample it, I can imagine the reads being flipped to another transcript in the other technical replicate.

What was the FPKM status of this transcript? If it is "OK" then that would be troubling...
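To put a rough number on the sampling argument above, here is a minimal sketch that assumes junction read counts follow a Poisson distribution (an assumption made here for illustration; real RNA-seq counts are usually overdispersed, so these figures are, if anything, optimistic). The function name `prob_zero_reads` is hypothetical.

```python
import math


def prob_zero_reads(expected_reads: float) -> float:
    """P(X = 0) for X ~ Poisson(expected_reads).

    Assumption: junction-spanning read counts are Poisson distributed.
    """
    return math.exp(-expected_reads)


# Expected supporting reads for a splice junction, as in the 2-4 read example.
for expected in (2, 3, 4):
    p0 = prob_zero_reads(expected)
    print(f"expected {expected} junction reads -> P(0 observed) = {p0:.3f}")
```

Under this simple model, a junction with 2 expected reads is missed entirely in roughly 1 in 7 resamplings, and one with 4 expected reads in roughly 1 in 55, so an occasional flip between replicates at low coverage would not be surprising.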

**jparsons** · 07-31-2013, 08:43 AM

The two transcripts happen to be a regular transcript and its accompanying snoRNA; it's certainly understandable that the reads are misassigned. At least it's not being called DE, but when the FPKM varies between 0 and 20 (or 20 and 40) it casts considerable doubt on the accuracy of the assignment. There are over 1000 reads in the 12kb region.

Code:

SNORA5A	SNORA5A	-	chr7:45139698-45151317	1a	1aU	OK	42.1291	0	-1.79769e+308	-1.79769e+308	0.402342	1	no
TBRG4	TBRG4	-	chr7:45139698-45151317	1a	1aU	OK	20.5787	22.3127	0.116716	-0.308804	0.75747	1	no

I have no idea why the sequences are 100% overlapping like this; in the GTF, the SNO is only 45143948-45144081.

Anyway, this was one of my more depressing observations while attempting to figure out how the heck to validate RNA-seq quantifications. It may just be a bug in the software (given that the genomic regions are changing, it seems like it must be), but even conceptually, no matter how good (or complicated) you make your statistical models, you're going to misassign some reads.

I'll go ahead and re-run this with 2.11 tonight just for kicks. It was done in 2.02 originally.
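For readers who want to inspect rows like the two quoted above, the following is a small sketch that parses them and recomputes the fold change. The column names are an assumption: they follow the usual layout of cuffdiff's gene_exp.diff file and are not stated in the thread. The value -1.79769e+308 appears to be the most negative double printed as a stand-in for minus infinity, which is what a log2 fold change becomes when the second sample's FPKM is 0.

```python
import math

# Assumed column order (typical cuffdiff gene_exp.diff layout; not given in the thread).
COLUMNS = [
    "test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
    "value_1", "value_2", "log2_fold_change", "test_stat", "p_value",
    "q_value", "significant",
]
NUMERIC = ("value_1", "value_2", "log2_fold_change", "test_stat", "p_value", "q_value")

# The two rows exactly as posted, tab-separated.
RAW = (
    "SNORA5A\tSNORA5A\t-\tchr7:45139698-45151317\t1a\t1aU\tOK\t42.1291\t0\t"
    "-1.79769e+308\t-1.79769e+308\t0.402342\t1\tno\n"
    "TBRG4\tTBRG4\t-\tchr7:45139698-45151317\t1a\t1aU\tOK\t20.5787\t22.3127\t"
    "0.116716\t-0.308804\t0.75747\t1\tno\n"
)


def parse_rows(text):
    """Turn tab-separated rows into dicts with numeric fields as floats."""
    records = []
    for line in text.strip().splitlines():
        rec = dict(zip(COLUMNS, line.split("\t")))
        for key in NUMERIC:
            rec[key] = float(rec[key])
        records.append(rec)
    return records


def is_sentinel(x, threshold=1e300):
    """True for values that are infinite or so large they only make sense as placeholders."""
    return math.isinf(x) or abs(x) > threshold


def recomputed_log2_fc(rec):
    """log2(value_2 / value_1), or None when either FPKM is zero."""
    v1, v2 = rec["value_1"], rec["value_2"]
    if v1 > 0 and v2 > 0:
        return math.log2(v2 / v1)
    return None


records = parse_rows(RAW)
for rec in records:
    print(rec["test_id"], rec["value_1"], rec["value_2"],
          "sentinel" if is_sentinel(rec["log2_fold_change"]) else rec["log2_fold_change"],
          recomputed_log2_fc(rec))
```

The TBRG4 row is internally consistent (the recomputed log2 ratio matches the reported 0.116716 to within rounding), while the SNORA5A row carries the placeholder because its FPKM in 1aU is 0.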

**NGSfan** · 07-31-2013, 09:25 AM

Wow, that one is a tricky case. They are both transcribed off the minus strand. I looked at it in the genome browser:

http://genome-euro.ucsc.edu/cgi-bin/hgTracks?position=chr7:45138588-45149441&hgsid=192323643&ensGene=pack

But I'm confused about the annotation: the RefSeq track shows SNORA5A sitting between TBRG4 exons, while the UCSC Genes track shows TBRG4 exons overlapping with a (larger) SNORA5C.

Ensembl also has the smaller SNORA5A and SNORA5C in between exons.

What are your settings for cuffdiff? Are you using --compatible-hits-norm?
I would hope it would make things agree more with the GTF annotation. In my experience, the defaults are not always the best, and some are actually flipped between cufflinks and cuffdiff. Also, the documented defaults are not always the true defaults. For example, --max-frag-multihits is listed as unlimited in cuffdiff, but it's actually set to 1 in the program.

**jparsons** · 07-31-2013, 09:51 AM

Unless "default=TRUE" isn't actually true for compatible-hits-norm, yes, I am using it. I wasn't passing the flag explicitly, however.

Interestingly, RSEM also says that the SNO disappears in 1aU. It gets the lengths correct, though.

(Different reference because obviously these programs are picky about references so you can't directly compare them without doing way too much extra work):

Code:

 
sample	gene_id	transcript_id	length	effectivelength	exp.count	TPM	FPKM
1a:	ENSG00000206838	ENST00000384111	134.00	84.34	1.00	1.58	0.95
1aU:	ENSG00000206838	ENST00000384111	0.00	0.00	0.00	0.00	0.00
1a:	ENSG00000136270	ENST00000258770...	2020.44	1966.21	690.00	44.79	27.03
1aU:	ENSG00000136270	ENST00000258770...	1953.33	1900.96	585.00	47.69	28.96
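As a quick sanity check on the RSEM rows above: within a single sample, TPM equals FPKM multiplied by 1e6 / sum(FPKM), so the TPM/FPKM ratio should be the same for every transcript in that sample. The sketch below only reuses the numbers posted above; the variable and function names are hypothetical, and the elided transcript ID is kept as posted.

```python
# (sample, transcript_id, TPM, FPKM), copied from the rows above.
rows = [
    ("1a", "ENST00000384111", 1.58, 0.95),
    ("1a", "ENST00000258770...", 44.79, 27.03),
    ("1aU", "ENST00000258770...", 47.69, 28.96),
]


def tpm_per_fpkm(tpm: float, fpkm: float) -> float:
    """Per-sample scaling factor between TPM and FPKM (1e6 / sum of FPKM)."""
    return tpm / fpkm


# Group the ratios by sample; rows with FPKM == 0 would need to be skipped.
ratios = {}
for sample, transcript_id, tpm, fpkm in rows:
    ratios.setdefault(sample, []).append(tpm_per_fpkm(tpm, fpkm))

for sample, values in ratios.items():
    print(sample, [round(v, 3) for v in values])
```

The two 1a ratios agree to within rounding of the posted values, which suggests the RSEM numbers are at least internally consistent, whatever is happening to the snoRNA in 1aU.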

Cuffdiff 2.11, with or without the compatible-hits-norm flag, gives the same quantifications as the older version, although they're now NOTEST rather than OK. Whatever was going on with the length of this poor little SNO is still happening in the latest version of the software.

**NGSfan** · 08-02-2013, 03:06 AM

Originally posted by jparsons:

Cuffdiff 2.11, with or without the compatible-hits-norm flag, gives the same quantifications as the older version, although they're now NOTEST rather than OK. Whatever was going on with the length of this poor little SNO is still happening in the latest version of the software.

Thanks for the update, jparsons. Interesting that it switched to NOTEST in the newer version.

Are you using the same Ensembl GTF with Cuffdiff as you do with RSEM?


Expression quantification/differential expression gene analysis by RNA-Seq
