Hi,
I ran Cufflinks on a TopHat-generated BAM file, using either the --GTF or the --GTF-guide option, with the same reference annotation file (derived from RefSeq hg19).
The --GTF option estimates isoform expression but does not try to assemble novel transcripts. The --GTF-guide option uses the reference annotation file to "guide RABT assembly".
The other options are standard: -q -p 20. I am still using Cufflinks v1.1.0; please let me know if upgrading to v1.2 would solve this issue (I will upgrade soon anyway).
The RABT option assembles >62000 novel transcripts, half of which are shorter than 250 bp. Some of the novel transcripts have an extremely high FPKM (>1 million), whereas the maximum FPKM obtained with the simple --GTF option is 1700. In IGV I clearly do not see an aberrant accumulation of reads at these novel transcripts (usually some reads, but not an unusually high number).
I have a feeling that Cufflinks in general mis-estimates the expression of short transcripts, so I filtered those out (mainly microRNA, snoRNA and RPL genes) of my reference annotation file.
Could it be a similar issue here?
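To make the suspicion concrete, here is a minimal sketch of the textbook FPKM formula (fragments per kilobase of transcript per million mapped fragments). It is not Cufflinks' actual estimator, which uses an effective length and a likelihood model; the function name `fpkm`, the library size of 20 million fragments and the count of 50 fragments are made-up values for illustration only. It just shows that the same handful of fragments gives a much larger FPKM as the length in the denominator shrinks.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical numbers, not taken from my data.
TOTAL_MAPPED = 20_000_000
FRAGMENTS = 50

for length_bp in (2000, 250, 50, 5):
    value = fpkm(FRAGMENTS, length_bp, TOTAL_MAPPED)
    print(f"length {length_bp:>5} bp -> FPKM {value:g}")
```

If the length used internally for a short transcript becomes very small, the FPKM can grow without bound even when IGV shows only a modest number of reads.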
Has anybody had a similar problem, or does anyone understand how to solve it (or can suggest an alternative way to estimate expression from an alignment file)?
Attached is a comparison of transcript size vs. FPKM for both runs.
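For anyone who wants to build a similar size vs. FPKM table from their own output, here is a minimal sketch. It assumes a GTF file whose exon records carry `transcript_id` and `FPKM` attributes (as I understand the Cufflinks transcripts.gtf output to do); the function name `size_vs_fpkm` and the two example records are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def size_vs_fpkm(gtf_lines):
    """Return {transcript_id: (spliced_length_bp, fpkm_or_None)} from exon records."""
    lengths = defaultdict(int)
    fpkms = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths[tid] += end - start + 1  # GTF coordinates are 1-based, inclusive
        if "FPKM" in attrs:
            fpkms[tid] = float(attrs["FPKM"])
    return {tid: (lengths[tid], fpkms.get(tid)) for tid in lengths}


# Two made-up exon records for one transcript.
example = [
    'chr1\tCufflinks\texon\t100\t199\t1000\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1"; FPKM "12.5";',
    'chr1\tCufflinks\texon\t300\t349\t1000\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "2"; FPKM "12.5";',
]
for tid, (length_bp, fpkm_value) in size_vs_fpkm(example).items():
    print(tid, length_bp, fpkm_value)
```

On a real file the function can be given an open file handle, for example `size_vs_fpkm(open("transcripts.gtf"))`, and the resulting pairs plotted on log axes to spot short transcripts with implausibly high FPKM.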