
**Xi Wang** · 11-30-2009, 09:02 PM

You can use DEGseq for the analysis you want to do. The input for DEGseq can be mapped reads rather than RPKM values.
Have a look:

http://bioinfo.au.tsinghua.edu.cn/software/degseq/

and the related paper:

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp612v1

best,
Xi

**svl** · 12-01-2009, 01:29 AM

Originally posted by beliefbio:

Is it necessary to calculate RPKM for each gene?

Because transcripts (or genes) vary in length (in kilobases) and sequencing runs vary in the number of reads produced, you want to account for both sources of variation when comparing runs or samples. RPKM is a measure that accounts for them (up to a point, of course).
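A minimal sketch of the calculation described above, assuming the usual definition of RPKM as reads per kilobase of transcript per million mapped reads. The function name and the example numbers are hypothetical and are not taken from any of the tools mentioned in this thread.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000        # corrects for transcript length
    millions = total_mapped_reads / 1_000_000       # corrects for sequencing depth
    return read_count / (kilobases * millions)


# Hypothetical numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
example = rpkm(500, 2_000, 10_000_000)
print(example)  # 25.0
```

Dividing by kilobases handles the length difference between transcripts, and dividing by millions of mapped reads handles the depth difference between runs.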

If so what is the best tool to calculate RPKM? ERANGE, TopHat or Cufflinks?

I haven't used ERANGE yet. TopHat is for mapping, not counting (it does count, but its creator has said this will be removed from future versions now that Cufflinks exists), so Cufflinks is the tool meant for RPKM determination.

So, you could map with TopHat and then feed the resulting "accepted_hits.sam" file to Cufflinks, which will count and return RPKM values. But do realize that TopHat does more than just mapping: it tries to find exon-exon splice junctions (and is therefore potentially slow if all you need is mapping).

-svl

Update: by the way, once you have the RPKM values from Cufflinks you could also use the DEGseq package mentioned above to determine which transcripts are differentially expressed.

**beliefbio** · 12-01-2009, 01:37 AM

Thanks a lot svl!!

**kmcarr** · 12-01-2009, 06:22 AM

Originally posted by svl:

Because transcripts (or genes) vary in length (in kilobases) and sequencing runs vary in the number of reads produced, you want to account for both sources of variation when comparing runs or samples. RPKM is a measure that accounts for them (up to a point, of course).

If you are examining differential expression of genes between samples, you don't really need to normalize for transcript length. When comparing the same gene between samples, the length of the transcript is constant (let's ignore the possibility of differential isoform expression). In this case you only need to normalize for the total number of reads in each sample pool.
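The point about length cancelling out can be checked with a short sketch. The read counts, library sizes, and gene length below are hypothetical, and the helper functions are illustrative only.

```python
def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads (depth normalization only)."""
    return read_count * 1_000_000 / total_mapped_reads


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (depth and length)."""
    return rpm(read_count, total_mapped_reads) * 1_000 / length_bp


# Hypothetical gene of 1,500 bp measured in two samples.
length_bp = 1_500
reads_a, total_a = 300, 10_000_000
reads_b, total_b = 900, 15_000_000

ratio_rpm = rpm(reads_b, total_b) / rpm(reads_a, total_a)
ratio_rpkm = rpkm(reads_b, length_bp, total_b) / rpkm(reads_a, length_bp, total_a)

# The same length divides both samples, so it drops out of the ratio.
print(ratio_rpm, ratio_rpkm)
```

Both ratios come out the same because the length term appears in the numerator and the denominator. This holds only as long as the effective transcript length is the same in both samples, which is the isoform caveat noted above.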

**Xi Wang** · 12-01-2009, 06:35 AM

Originally posted by kmcarr:

If you are examining differential expression of genes between samples, you don't really need to normalize for transcript length. When comparing the same gene between samples, the length of the transcript is constant (let's ignore the possibility of differential isoform expression). In this case you only need to normalize for the total number of reads in each sample pool.

I totally agree with your point. DEGseq follows this to identify differentially expressed genes.

**svl** · 12-01-2009, 07:28 AM

Agreed. Looking at other things, though, like the top 100 expressed genes/transcripts, is then impossible, so for the sake of future comparisons it's nice to use RPKM instead of RPM. It's not hard to calculate anyway. But you're absolutely right!

**tebuffer** · 12-07-2009, 06:03 PM

CuffCompare

Originally posted by Xi Wang:

I totally agree with your point. DEGseq follows this to identify differentially expressed genes.

Cuffcompare (which is part of the Cufflinks package) could be used to identify differentially expressed genes.

**yvan.wenger** · 12-09-2009, 08:05 AM

Hello everybody,

Some quick questions about the topic. I have numbered them because they are quite different from each other. Any input appreciated!

1. Can tophat/cufflinks be used with a de-novo transcriptome assembly if no good genome is available (assuming that SOME contigs are actually long isoforms containing most exons)?

2. Is it correct that the model behind Cufflinks tries to allocate reads that map to multiple locations, thus giving a more precise result when two isoforms are almost identical (e.g. premature stops)?

3. I understand that RPKM (Reads Per Kilobase of exon model per Million mapped reads) is:
3a. the number of reads normalized per kilobase of exon (to make it more comparable to qPCR results, although with caveats --> good for relative comparison of transcript abundance within one sample)
3b. per million mapped reads (to normalize between different sequenced libraries)
(3c. limited to uniquely mapped reads, except in the case of Cufflinks???)

I think that point 3a cannot really be detrimental, although it can give a false sense of absolute quantitation, for example in the case of premature stops if only unambiguously mapped reads are taken into account. However, it can be useful, as svl mentioned above.

On 3b, which is my main question: I am not sure that normalizing on the total number of mapped reads is fully satisfying when gene expression is massively altered for highly expressed transcripts. Does anybody know of a package for RNA-seq (or adapted from microarrays) that can do quantile regression, ideally with outlier removal? Or whether this method would perform worse than normalization on the total mapped count in certain cases?

Cheers,

Yvan

**jiwu2573** · 01-20-2010, 09:22 PM

Cuffcompare output for DE genes

Originally posted by tebuffer:

Cuffcompare (which is part of the Cufflinks package) could be used to identify differentially expressed genes.

Can Cuffcompare directly output the list of differentially expressed genes?

If not, how can its output be used to identify DE genes?

**mkatari** · 02-15-2010, 07:26 AM

Originally posted by svl:

Agreed. Looking at other things, though, like the top 100 expressed genes/transcripts, is then impossible, so for the sake of future comparisons it's nice to use RPKM instead of RPM. It's not hard to calculate anyway. But you're absolutely right!

If you are interested in differential expression, then once you have calculated the log ratio you may be more interested in the top 100 induced/repressed transcripts than in the 100 most highly expressed transcripts.
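A minimal sketch of ranking by log ratio, assuming depth-normalized counts (reads per million) and a pseudocount to avoid dividing by zero. The pseudocount, the function names, and the gene counts are all hypothetical additions for illustration.

```python
import math


def log2_ratios(counts_a, counts_b, total_a, total_b, pseudocount=1.0):
    """Return {gene: log2(B / A)} from reads-per-million values."""
    ratios = {}
    for gene in counts_a.keys() & counts_b.keys():
        rpm_a = (counts_a[gene] + pseudocount) * 1e6 / total_a
        rpm_b = (counts_b[gene] + pseudocount) * 1e6 / total_b
        ratios[gene] = math.log2(rpm_b / rpm_a)
    return ratios


def top_changed(ratios, n):
    """Return (most induced, most repressed) gene lists of length n."""
    ranked = sorted(ratios, key=ratios.get)   # ascending log ratio
    return ranked[-n:][::-1], ranked[:n]


# Hypothetical counts for three genes in two samples of equal depth.
counts_a = {"g1": 99, "g2": 399, "g3": 49}
counts_b = {"g1": 399, "g2": 99, "g3": 49}
ratios = log2_ratios(counts_a, counts_b, 1_000_000, 1_000_000)
induced, repressed = top_changed(ratios, 1)
print(induced, repressed)
```

Ranking by log ratio alone says nothing about statistical significance, so in practice it would be combined with one of the tests discussed in this thread.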

**Cole Trapnell** · 02-15-2010, 04:57 PM

Originally posted by jiwu2573:

Can Cuffcompare directly output the list of differentially expressed genes?

If not, how can its output be used to identify DE genes?

I just wanted to point out that we just released a standalone tool, "cuffdiff", as part of the Cufflinks package to help you test for differential expression and regulation in your samples. Cuffdiff does differential expression on genes and transcripts, and a few other tests you may find helpful.

**Simon Anders** · 02-17-2010, 01:04 PM

Hi,

as already pointed out, it is not necessary to normalize for transcript length. It is even advantageous to not do so, as you can then use a statistical test that takes the specificities of count data into account, which gives you much better power at low count rates.

We have recently released a tool to do this, called DESeq: http://www-huber.embl.de/users/anders/DESeq/

DESeq is based on the so-called negative binomial distribution, which allows a powerful test for count data. Furthermore, it can estimate the variance between the samples from the data and uses this information in the test. The basic idea is older and has, e.g., already been used in the edgeR package (Robinson and Smyth), but we added an improved variance estimation that does a better job if the amount of noise depends on the expression strength as is often the case.

Note that this variance estimation is crucial. It is often claimed (e.g. by the DEGseq package suggested above) that a Poisson-based test, such as the binomial or the chi-squared test, is suitable. But then the p value will only tell you whether your difference is stronger than what you would expect between _technical_ replicates, which is not biologically meaningful.
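A minimal sketch of why the variance estimate matters, assuming the common mean-dispersion form of the negative binomial variance (variance = mean + dispersion × mean²). The dispersion value is hypothetical, and this is not DESeq's actual variance estimator.

```python
def poisson_variance(mean):
    """Variance of a Poisson count: equal to the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Variance of a negative binomial count in the mean-dispersion form."""
    return mean + dispersion * mean ** 2


# Hypothetical dispersion of 0.1 (a biological coefficient of variation near 32%).
dispersion = 0.1
for mean in (10, 1_000):
    ratio = negative_binomial_variance(mean, dispersion) / poisson_variance(mean)
    print(f"mean={mean}: NB variance is {ratio:.0f}x the Poisson variance")
```

Under these assumptions the gap between the two variances widens as the mean count grows, so a Poisson-based test becomes increasingly overconfident for highly expressed genes.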

**Fabien Campagne** · 02-18-2010, 05:33 AM

You would need biological replicates to assess biological variability. One sample in each group limits your ability to see how much biological variability you should expect in future experiments, irrespective of the statistical test being used.

Regarding benchmarking of statistical methods for RNA-Seq data, I would recommend this paper from the Dudoit lab:


http://www.bepress.com/ucbbiostat/paper247/


On the practical side of things, we have recently released a set of tools that includes a program to estimate various statistics of differential expression. It can compute RPKMs, run Fisher exact tests to compare low counts across groups, and run t-tests when you have several samples per group. All statistics are corrected for multiple testing with a Benjamini-Hochberg FDR correction. We've tried to make it easy and fast to go from reads to differential expression results.
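For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, here is a minimal sketch of the standard step-up adjustment. It is a generic illustration with hypothetical p-values, not Goby's implementation.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Each adjusted value is the smallest false discovery rate at which that gene would be called significant.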

See the Goby home page at http://icbtools.med.cornell.edu/goby/ and a tutorial at http://icb.med.cornell.edu/wiki/index.php/Goby/DE

**lpachter** · 03-08-2010, 02:10 PM

I'd just like to clarify some of the discussion on this thread regarding how to normalize reads, how to measure expression, and then how to find differential expression.

First of all, RPKM is a unit, not a method. It stands for "reads per kilobase of transcript per million sequenced reads". As we point out in the Cufflinks paper (to appear shortly), this unit is flawed, as the objects being sequenced are fragments, not reads. We use the unit FPKM (expected fragments per kilobase of transcript per million fragments sequenced). This is not just a technicality: it is crucial to use units that are proportional to (i.e. a scalar multiple of) the estimated proportion of each transcript. FPKM has this property; RPKM cannot.

Secondly, regarding expression estimates, a currently favored method is to "count" the reads that map to a gene and normalize by length. If the gene has a single isoform, this is well-defined, but it's problematic with multiple isoforms that may have different lengths and share different exons. The currently favored method I allude to, counting all reads that map somewhere in the locus and dividing by the number of exonic bases, _provably underestimates gene expression_. It is essential not only to normalize by transcript length but also to probabilistically assign fragments to isoforms. This is what Cufflinks does.

Regarding differential expression tests, one has to keep in mind that in genes with multiple isoforms the relative abundances may change, making it crucial to have correctly estimated individual expression levels.
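A small worked example of the underestimation, using a hypothetical two-isoform locus. It illustrates one simple case only and is neither the general proof nor the Cufflinks model.

```python
def per_kb_per_million(reads, length_bp, total_reads):
    """Reads per kilobase per million, for a given normalizing length."""
    return reads * 1e9 / (length_bp * total_reads)


# Hypothetical locus: isoform A is 1,000 bp; isoform B has A's exons plus an
# extra 1,000 bp exon (2,000 bp in total). Only isoform A is expressed.
reads_in_locus = 200
total_reads = 10_000_000
isoform_a_length = 1_000
union_exonic_bases = 2_000   # all exonic bases in the locus

# Normalizing by the isoform that actually produced the reads.
per_isoform = per_kb_per_million(reads_in_locus, isoform_a_length, total_reads)
# Normalizing by every exonic base in the locus, expressed or not.
per_union = per_kb_per_million(reads_in_locus, union_exonic_bases, total_reads)

print(per_isoform, per_union)
```

In this example the union-of-exons value is half the per-isoform value, because the unexpressed exon adds length without adding reads.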


how to study differential expression?
