Originally posted by lpachter
View Post
Header Leaderboard Ad
Collapse
RNA-seq and normalization numbers
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by lpachter View PostDear Siva,
I'd like to understand why your statistics collaborator cannot use the Cufflinks FPKM values together with their confidence intervals?
Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.
Thank you very much for your reply thanks also to RCJ. I understand your point about probabilistically assigning reads to transcripts. I will get back to our statistical group about this. However the procedure suggested by RCJ and seconded by you has me a bit confused. I used to think that the algorithm that Cufflinks uses assigns reads based on probability and it will be different for each transcript and also vary according to genome location. So how can a mere multiplication of FPKM by transcript length and aligned sequences give us tag count per transcript? Am I missing something too obvious here?
thanks
Siva
Comment
-
To answer your question, see this page <a href="http://cufflinks.cbcb.umd.edu/howitworks.html"> here</a>
To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments.
Comment
-
Unfortunately I just noticed that when cufflinks calls transcripts, it doesn't report thier length. Only in the tmap reference files. All the other files only show the genomic coordinates, which isn't as helpful.
I will e-mail the author to see if this functional;ity can be easily added or worked around.
Comment
-
Hi RCJ
Thanks. Btw, I did single end sequencing and not paired ends. I don't think that that should be a problem in calculating integer counts per transcript. I have contacted our statistics collaborator regarding FPKM and will get back after I get a reply.
best
Siva
Comment
-
In case this got lost in my lengthy post #12:
The reason why raw counts are advantageous to FPKM values for statistical inference is discussed in this thread, from post #6 onwards: http://seqanswers.com/forums/showthread.php?t=4349
Comment
-
I'd like to point out again that in fact raw counts are not advantageous to FPKM.
FPKM is a unit for reporting an estimate of expression. Reports of expression estimates (whether in FPKM or other units) are by definition based on statistical inference from raw read counts.
Comment
-
Of course, FPKM is a proper unit to report expression, more so than raw counts.
However, if you want to perform a statistical test to decide the question whether expression in two conditions is significantly different, then a test that works on the level of raw counts is more powerful than one that works on the level of normalized expression, for the reasons that we discuss in our paper.
Of course, if I know transcript lengths and library sizes, I can always convert between one type of information and the other. If I have two test procedures, one that only asks for raw read counts, or one that only asks for FPKM values, with neither asking for transcript lengths, this two procedures will necessarily do something different.
Actually, not only DESeq but also cuffdiff works on the level of raw counts, as Cole commented here: http://seqanswers.com/forums/showpos...59&postcount=9
Finally, on the risk of being nit-picking: If one divides counts by transcript length or total reads, one not only changes the unit but also the type of quantity reported. If I tell you either that I traveled 10 miles or 13 km, this is the same quantity in different units. If it took me 2 hours and I only tell you that made 5 miles per hour, this is a different quantity, namely velocity rather than distance. Similarly, FPKM is a unit of expression, i.e., it is deemed to be proportional to molar concentration of transcript molecules. Counts is not a unit of expression. Nevertheless, it is advantageous to compare counts rather then expression values to do inference on differential expression.
Comment
-
Perhaps I'll risk being nitpicky as well regarding a sentence in your nitpick.
You write that "Nevertheless, it is advantageous to compare counts rather then expression values to do inference on differential expression."
Now before you wrote _raw_ counts and now just counts. I agree with the latter (counts) but not with your statement before of using raw counts.
Consider the following simple example:
A gene with two isoforms A and B (with equal lengths), with the property that in experiment 1 isoform A is expressed at double the amount of B, vs. experiment 2 in which B is double A. The number of read counts from this gene may be identical in both experiments. However it may be possible to infer the changes in the expression levels of the individual transcripts. That is because of where the reads occur (e.g. one transcript may contain a splice junction and reads there may help to infer its abundance).
Now given _expression values_ rather than the raw counts, it may be clear that there has been differential expression. But looking at counts alone there would be no way to tell.
Having said this, I'm aware of your paper and I think your approach is valuable (and I agree with your argument that its better than the Poisson assumption). But what needs to be used is the probabilistic assignment of read counts to transcripts, rather than just raw read counts. And these _can_ be obtained from estimates of expression together with transcript lengths (as you point out). But again, they cannot be obtained from raw read counts alone.
Comment
-
Maybe our discussion revolved a bit on a misunderstanding regarding what we are aiming for. I want to get expression of genes, summing over transcripts, while you want to distinguish isoforms. (I may have worked too much with yeast and so got used to not having much alternative splicing going on in my data. )
If I don't care about isoforms or think that my coverage is too low to distinguish isoforms anyway, I expect to get optimal power by simply summing everything up.
Our (and edgeR's) negative binomial approach depends on the counts being "raw" (not normalized by anything) to get the shot noise right. So, if we want to use DESeq to distinguish isoforms, we would have to count reads for each exon and work on this level. (Actually, this might not work too well, as we would run into trouble with exons having different lengths in different isoforms.)
Cuffdiff is, as I understand it, designed to deal with such issues, while our approach ignores them. I expect that DESeq, in compensation for being unsuitable to detect differences in isoform proportions as in your example, achieves much better detection power for differences in total expression (per gene, summing over isoforms), especially at very low counts.
As I am not clear on how biological noise is taken into account by cuffdiff I cannot be fully sure whether this expectation will hold (and I'm quite curious to learn more about cuffdiff once your paper is out).
Comment
-
We are all learning a great deal from your discussions. Thanks, I now clearly understand many aspects of RNA-seq data normalization. Unfortunately, I am unable to convince our statistician to use cufflinks FPKM values even with the confidence intervals.
So I have two questions:
1. I would like to know if we can get transcript length in Cufflinks outputs so I can convert FPKM values to absolute read counts? I would assume this would need a script to automate.
2. Has someone got a script that uses the Bowtie SAM output to directly report counts per transcript and transcript length? Is there a software that I can use?
Siva
Comment
-
Hi Siva
Originally posted by Siva View Post2. Has someone got a script that uses the Bowtie SAM output to directly report counts per transcript and transcript length?
It is part of HTSeq, a Python framework to process HTS data, which I am working on currently. I'm still rounding some corners and will announce its release soon, but you can already use it, of course. Just tell me if you find any bugs or problems. For full documentation, see here: http://www-huber.embl.de/users/anders/HTSeq
Simon
Comment
-
one way to do it
Code:bamToBed -i accepted_hits.sorted.bam > accepted_hits.sorted.bed coverageBed -a accepted_hits.sorted.bed -b mm9.bed | sort -r -n -k7 > counts.txt
bamToBed and coverageBed are from bedtools and mm9.bed is a bed file describing gene positions. I'm using this for genes though, not transcripts.
On second thought, maybe this is too crude.Last edited by mgogol; 04-08-2010, 06:01 AM.
Comment
-
My personal advice to biologists performing RNA-Seq experiments is to care about isoforms. But that issue aside, as the paper by Li et al. from the Dewey lab recently pointed out, probabilistic assignment of reads is necessary even for genes, because many reads map to multiple genes. I therefore contest the statement that DESeq outperforms other methods simply because it uses raw read counts, even in yeast.
Comment
Latest Articles
Collapse
-
Differential Expression and Data Visualization: Recommended Tools for Next-Level Sequencing Analysisby seqadmin
After covering QC and alignment tools in the first segment and variant analysis and genome assembly in the second segment, we’re wrapping up with a discussion about tools for differential gene expression analysis and data visualization. In this article, we include recommendations from the following experts: Dr. Mark Ziemann, Senior Lecturer in Biotechnology and Bioinformatics, Deakin University; Dr. Medhat Mahmoud Postdoctoral Research Fellow at Baylor College of Medicine;...-
Channel: Articles
05-23-2023, 12:26 PM -
-
by seqadmin
Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
Variant detection and analysis tools
Mahmoud classifies variant detection work into two main groups: short variants (<50...-
Channel: Articles
05-19-2023, 10:03 AM -
-
by seqadmin
With new tools and computational resources being released regularly, it can be hard to determine which are best suited for the analysis process and which older tools continue to be maintained. In an effort to assist the sequencing community, we interviewed three highly skilled bioinformaticians about their recommended tools for several important analysis applications.
Quality control and preprocessing tools
“Garbage in, garbage out” is a popular...-
Channel: Articles
05-16-2023, 10:11 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Exploring French-Canadian Ancestry: Insights into Migration, Settlement Patterns, and Genetic Structure
by seqadmin
Started by seqadmin, 05-26-2023, 09:22 AM
|
0 responses
8 views
0 likes
|
Last Post
by seqadmin
05-26-2023, 09:22 AM
|
||
Started by seqadmin, 05-24-2023, 09:49 AM
|
0 responses
15 views
0 likes
|
Last Post
by seqadmin
05-24-2023, 09:49 AM
|
||
Introducing ProtVar: A Web Tool for Contextualizing and Interpreting Human Missense Variation in Proteins
by seqadmin
Started by seqadmin, 05-23-2023, 07:14 AM
|
0 responses
30 views
0 likes
|
Last Post
by seqadmin
05-23-2023, 07:14 AM
|
||
Started by seqadmin, 05-18-2023, 11:36 AM
|
0 responses
116 views
0 likes
|
Last Post
by seqadmin
05-18-2023, 11:36 AM
|
Comment