RNA-seq and normalization numbers

lpachter replied

04-07-2010, 07:56 AM
I'd like to point out again that raw counts are in fact not advantageous over FPKM.

FPKM is a unit for reporting an estimate of expression. Reports of expression estimates (whether in FPKM or other units) are by definition based on statistical inference from raw read counts.
Leave a comment:
Simon Anders replied

04-06-2010, 10:01 AM
In case this got lost in my lengthy post #12:

The reason why raw counts are advantageous over FPKM values for statistical inference is discussed in this thread, from post #6 onwards: http://seqanswers.com/forums/showthread.php?t=4349
Leave a comment:
Siva replied

04-06-2010, 09:28 AM
Hi RCJ
Thanks. Btw, I did single end sequencing and not paired ends. I don't think that that should be a problem in calculating integer counts per transcript. I have contacted our statistics collaborator regarding FPKM and will get back after I get a reply.

best
Siva
Leave a comment:
RockChalkJayhawk replied

04-06-2010, 05:08 AM
Unfortunately, I just noticed that when Cufflinks calls transcripts, it doesn't report their length, except in the tmap reference files. All the other files show only the genomic coordinates, which isn't as helpful.

I will e-mail the author to see if this functionality can be easily added or worked around.
Leave a comment:
RockChalkJayhawk replied

04-06-2010, 04:46 AM
To answer your question, see this page: http://cufflinks.cbcb.umd.edu/howitworks.html

To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments.

Therefore, Cufflinks uses probabilities to assign individual fragments to individual transcripts. Since the values are in fragments per kilobase of transcript per million reads, you can convert them back to estimated raw counts by multiplying by the length of a given transcript (in kilobases) and the number of reads (in millions).
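The multiplication described above could be sketched roughly as follows. This is only an illustration: the function name, the argument names, and the example numbers are made up, and it assumes the length is the plain transcript length in base pairs and that the total is the number of mapped fragments. The result is an estimate and will usually not be an integer.

```python
def fpkm_to_estimated_count(fpkm, transcript_length_bp, total_mapped_fragments):
    """Invert the FPKM scaling to get an estimated fragment count (hypothetical helper)."""
    length_kb = transcript_length_bp / 1_000                   # "per kilobase of transcript"
    fragments_millions = total_mapped_fragments / 1_000_000    # "per million fragments"
    return fpkm * length_kb * fragments_millions


# Made-up example values: a 2 kb transcript at 12.5 FPKM in a 20-million-fragment library.
estimate = fpkm_to_estimated_count(
    fpkm=12.5, transcript_length_bp=2_000, total_mapped_fragments=20_000_000
)
print(estimate)  # 12.5 * 2 * 20 = 500.0
```

Because the FPKM value itself comes from a probabilistic assignment of ambiguous reads, the number obtained this way inherits that uncertainty.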
Leave a comment:
Siva replied

04-05-2010, 07:32 PM
Originally posted by lpachter

Dear Siva,

I'd like to understand why your statistics collaborator cannot use the Cufflinks FPKM values together with their confidence intervals?

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.

Dear L Pachter
Thank you very much for your reply; thanks also to RCJ. I understand your point about probabilistically assigning reads to transcripts. I will get back to our statistical group about this. However, the procedure suggested by RCJ and seconded by you has me a bit confused. I thought that the algorithm Cufflinks uses assigns reads based on probability, and that the assignment differs for each transcript and also varies with genome location. So how can a mere multiplication of FPKM by transcript length and number of aligned sequences give us the tag count per transcript? Am I missing something obvious here?

thanks
Siva
Leave a comment:
lpachter replied

04-05-2010, 08:58 AM
That's correct: the procedure RCJ suggests will give you an estimate of the actual tag count for each transcript.
Leave a comment:
RockChalkJayhawk replied

04-05-2010, 08:23 AM
Originally posted by lpachter

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.

Why not just take the FPKM values reported by Cufflinks for each individual transcript and multiply them by the transcript length (in kilobases) and the number of aligned sequences (in millions)? That would give you the tag count per transcript.
Leave a comment:
lpachter replied

04-05-2010, 07:50 AM
Dear Siva,

I'd like to understand why your statistics collaborator cannot use the Cufflinks FPKM values together with their confidence intervals?

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.
Leave a comment:
Siva replied

04-04-2010, 08:52 PM
Re: Bowtie-Tophat SAM output for read count assembly

Originally posted by Simon Anders

If you want to test for differential expression, it is a good idea to stay on the level of raw, integer counts, and not use RPKM or related data that is normalized by transcript length. This is because significance depends on the number of actual reads that you count. If you have low counts, you need to see a high fold change to call significance.

See this thread for more details: http://seqanswers.com/forums/showthread.php?t=4349 (especially from post #6 onwards)

If you work with count data, your testing procedure needs to be aware of the ratios of sequencing depths of the libraries. This functionality is offered by several tools, namely edgeR, DESeq, and cuffdiff. I recommend DESeq, of course, as it is our work. ;-)

Hi Simon
I have used Bowtie, Tophat and Cufflinks to align and assemble maize RNA-seq data. Cufflinks reports FPKM values, which our statistics collaborator is not able to use because they have already been normalized. Can I use the 'SAM' output generated by Bowtie and Tophat in some other program to assemble the data in a way that gives me absolute read counts per gene or transcript? If so, what other programs would you recommend?

thanks
Siva
Leave a comment:
Simon Anders replied

04-03-2010, 01:05 AM
To estimate the library size, simply taking the total number of (mapped or unmapped) reads is, in our experience, not a good idea.

Sometimes, a few very strongly expressed genes are differentially expressed, and as they make up a good part of the total counts, they skew this number. After you divide by total counts, these few strongly expressed genes appear nearly unchanged, and all the remaining genes look differentially expressed.
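A small made-up example may help to see the effect. The counts below are invented for illustration: one strongly expressed gene goes up five-fold between two samples, and four other genes do not change at all. Dividing by the total count then hides most of the real change and creates an apparent change in the other genes.

```python
# Invented counts: gene 0 is strongly expressed and truly changes 5-fold;
# genes 1-4 have identical counts in both samples.
sample_a = [1000, 100, 100, 100, 100]
sample_b = [5000, 100, 100, 100, 100]


def total_count_normalize(counts):
    """Divide each count by the library total (the approach cautioned against above)."""
    total = sum(counts)
    return [c / total for c in counts]


norm_a = total_count_normalize(sample_a)
norm_b = total_count_normalize(sample_b)

# Fold change B/A after total-count normalization.
apparent_fold_changes = [b / a for a, b in zip(norm_a, norm_b)]

for gene, fc in enumerate(apparent_fold_changes):
    print(f"gene {gene}: raw {sample_a[gene]} -> {sample_b[gene]}, "
          f"apparent fold change {fc:.2f}")
```

With these numbers the five-fold change of gene 0 shrinks to about 1.3-fold, while the unchanged genes appear to drop to about a quarter of their level.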

The following simple alternative works much better:

- Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.

- To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample. Now you have, for each gene, an estimate of the depth ratio.

- Simply take the median of all the quotients to get the relative depth of the library.

This is what the 'estimateSizeFactors' function of our DESeq package does.
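The three steps listed above could be sketched in plain Python roughly like this. It is not the DESeq code itself, only an illustration of the description; the function name and the example counts are made up, and skipping genes that have a zero count in any sample (where the geometric mean would be zero) is an assumption of this sketch.

```python
import math
from statistics import median


def size_factors(counts):
    """Relative sequencing depth per sample by the median-of-quotients idea.

    counts: list of rows, one row per gene, each row holding one raw count per sample.
    """
    n_samples = len(counts[0])
    # Assumption of this sketch: drop genes with a zero count in any sample,
    # because their geometric mean is zero and the quotient is undefined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")

    quotients = [[] for _ in range(n_samples)]
    for row in usable:
        # Step 1: the "reference sample" value is the geometric mean across samples.
        reference = math.exp(sum(math.log(c) for c in row) / n_samples)
        # Step 2: quotient of each sample's count over the reference.
        for j, c in enumerate(row):
            quotients[j].append(c / reference)

    # Step 3: the median quotient is the relative depth of each library.
    return [median(q) for q in quotients]


# Invented counts: one strongly expressed gene changes 5-fold, four genes are unchanged.
counts = [[1000, 5000], [100, 100], [100, 100], [100, 100], [100, 100]]
print(size_factors(counts))
```

In this invented data set both libraries come out with a relative depth of about 1, because the median ignores the single strongly changing gene.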
Leave a comment:
Simon Anders replied

04-03-2010, 12:55 AM
If you want to test for differential expression, it is a good idea to stay on the level of raw, integer counts, and not use RPKM or related data that is normalized by transcript length. This is because significance depends on the number of actual reads that you count. If you have low counts, you need to see a high fold change to call significance.
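To illustrate why the number of reads matters, here is a deliberately simplified sketch. It is not what DESeq does: it assumes pure Poisson counting noise, no biological variation, and two libraries of equal depth, and it uses the standard conditional binomial test for comparing two Poisson counts. The function name and the example counts are made up.

```python
from math import comb


def two_sample_poisson_pvalue(count_a, count_b):
    """Two-sided p-value for 'same rate in both samples', assuming equal depth.

    Under the null hypothesis, given the total n = count_a + count_b,
    count_a follows a Binomial(n, 0.5) distribution.
    """
    n = count_a + count_b
    if n == 0:
        return 1.0
    k = min(count_a, count_b)
    # Probability of a split at least as uneven as observed, on one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double it for a two-sided test (the null distribution is symmetric).
    return min(1.0, 2 * tail)


# The same two-fold change at three different count levels.
for a, b in [(10, 20), (100, 200), (1000, 2000)]:
    print(f"{a} vs {b}: p = {two_sample_poisson_pvalue(a, b):.3g}")
```

Under these assumptions, a two-fold change at 10 versus 20 reads is not significant at the 0.05 level, while the same fold change at 100 versus 200 reads is highly significant.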

See this thread for more details: http://seqanswers.com/forums/showthread.php?t=4349 (especially from post #6 onwards)

If you work with count data, your testing procedure needs to be aware of the ratios of sequencing depths of the libraries. This functionality is offered by several tools, namely edgeR, DESeq, and cuffdiff. I recommend DESeq, of course, as it is our work. ;-)

Note that this does not alleviate the bias towards longer genes: if two genes have the same expression (same number of transcript molecules per volume) in two samples and hence the same fold change, the longer one may be called significant and the shorter one not, because the longer one produces more fragments.

If this bothers you, you have a couple of options:
- use Tag-Seq instead of RNA-Seq
- additionally filter with a rather large threshold on log fold change
- for GO enrichment test and the like, use the test by Young et al. (2010), which takes the length bias into account: http://genomebiology.com/2010/11/2/R14

Here is a figure that shows how the log fold change required for significance depends on the counts when using DESeq for testing (red dots: genes with significant DE; black dots: other genes):

For more information, see our paper: http://dx.doi.org/10.1038/npre.2010.4282.1

Last edited by Simon Anders; 04-03-2010, 12:57 AM.
Leave a comment:
RockChalkJayhawk replied

04-02-2010, 06:42 PM
The point I do not like about RPKM is that I do not feel it can handle different coverages.

I totally agree with you, Chema. Testing for differential expression with RPKM is biased towards longer genes. See the following paper:

http://www.biology-direct.com/content/4/1/14

and also other papers by Oshlack and Wakefield.
Leave a comment:
steffenp replied

01-15-2010, 03:55 AM
RPKM for Tag-Seq data

Hello everyone!
My question is also related to RPKM normalization. Is this measure suited for Tag-Seq data, where we have reads only at the 3' end and not across the whole transcript? My guess is no. What would be an adequate normalization for this kind of data?
Leave a comment:
doxologist replied

02-27-2009, 06:01 AM
Hi Victor:

It seems that, for your problem, looking at the percentage of the tags that a certain gene occupies would solve it. ALSO, as already suggested, quantile normalization would help (assuming that the differentially regulated genes do not make up a large percentage of your genes).
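For readers who have not seen it, quantile normalization could be sketched roughly as below. This is only a minimal illustration with made-up numbers and a made-up function name; ties are simply broken by gene order here, which is an assumption of the sketch (other implementations may average tied ranks).

```python
def quantile_normalize(samples):
    """Force every sample to share the same distribution of values.

    samples: list of equal-length lists, one list of per-gene values per sample.
    Returns new lists in the same gene order.
    """
    n_genes = len(samples[0])
    if any(len(s) != n_genes for s in samples):
        raise ValueError("all samples must cover the same genes")

    # Mean of the r-th smallest value across samples, for every rank r.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_genes)
    ]

    normalized = []
    for s in samples:
        # Gene indices ordered by value; ties keep gene order (assumption).
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


# Made-up values for three genes in two samples.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# -> [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Each gene keeps its rank within its own sample, but the values assigned to the ranks are shared across samples, which is why the method relies on most genes not being differentially regulated.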
Leave a comment:
