Seqanswers Leaderboard Ad

**Cole Trapnell** · 03-15-2010, 10:48 AM

Originally posted by kmcarr View Post

Cole,

First, thanks for an excellent software stack.

Was the release you are referring to > 0.8.1? I am using 0.8.1 (the latest available on the web site) and am experiencing this problem. It seems that since 0.8.1 was released on 2/13/2010 and you wrote the above on 2/22/2010 the the fix would be in a version later than 0.8.1. I hate to be a pest; I have no doubt you are very busy and dealing with (L)users is the last thing you need, but I'm a little stymied by this bug.

Thanks again.

P.S. Yes, I just subscribed to the mailing list.

Sorry for being unclear - I meant that I fixed the bug in version control, not that I fixed the bug and released a new version. The fix will be released in the upcoming 0.8.2. I had hoped to have it out early this week, but a number of things came up (including me getting pretty sick), and I'm still in the middle of some changes, which then need to be tested. Thanks for your patience.

**kmcarr** · 03-15-2010, 11:32 AM

Originally posted by Cole Trapnell View Post

Sorry for being unclear - I meant that I fixed the bug in version control, not that I fixed the bug and released a new version. The fix will be released in the upcoming 0.8.2. I had hoped to have it out early this week, but a number of things came up (including me getting pretty sick), and I'm still in the middle of some changes, which then need to be tested. Thanks for your patience.

Thanks for the update Cole; I'll keep an eye out for the new version.

(Hope you're feeling better.)

**jebe** · 03-18-2010, 11:42 AM

cuffcompare tracking file

The output from cuffcompare stdout.tracking, according to the manual and replies on this website is as follows:

qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo >|<conf_hi>|<read coverage>

In my output I have the following:
q1:s5.202|s5.202.0|100|0.000000|0.409935|4.057937|4.537815

I'm just wondering how the conf_lo and conf_hi can be non-zero values, but the fpkm is zero. Has anyone else seen this result?

Thank you.

**xiaomimeow** · 03-24-2010, 09:14 PM

I was able to get cuffdiff to work by going to the source code and writing some lines to handle the divisions by zero. Installing boost and then compiling thereafter took a great deal of trial and error, though.

I am a bit puzzled by the results. There are some genes for which cufflinks/cuffcompare/cuffdiff asserts multiple isoforms (I used the Ensembl gtf). However, plots of the tophat mappings do not support such an interpretation. In other words, there is nothing going for some splice variants, yet they are deemed to be present. Does anyone else feel that the isoform calls are too permissive?

**Cole Trapnell** · 03-26-2010, 03:08 PM

I just wanted to announce that v0.8.2 of Cufflinks addresses the divide by zero, along with a number of other issues.

**Cole Trapnell** · 03-26-2010, 03:10 PM

Originally posted by xiaomimeow View Post

I was able to get cuffdiff to work by going to the source code and writing some lines to handle the divisions by zero. Installing boost and then compiling thereafter took a great deal of trial and error, though.

I am a bit puzzled by the results. There are some genes for which cufflinks/cuffcompare/cuffdiff asserts multiple isoforms (I used the Ensembl gtf). However, plots of the tophat mappings do not support such an interpretation. In other words, there is nothing going for some splice variants, yet they are deemed to be present. Does anyone else feel that the isoform calls are too permissive?

Can you be a bit more specific about what you're seeing? Some screenshots of data in a browser for example?

**RockChalkJayhawk** · 03-26-2010, 03:46 PM

Originally posted by Cole Trapnell View Post

Optionally, you may wish to clean up the stdout.combined.gtf before running cuffdiff, to remove partial transfrags that resulted from low depth of sequencing coverage in one of the samples. We like to perform differential testing only on transcripts that are either already known to annotation or that we've assembled in two different samples independently.

What metrics do you use to filter out incomplete transcripts?

**xiaomimeow** · 03-27-2010, 02:07 PM

I am re-running my data with the new version of cuffcompare. However, I am getting the error below. If I get rid of either the -s or the -r flag, it will work. If I have both, I get the below error.

*** glibc detected *** cuffcompare: double free or corruption (!prev): 0x000000003f018dc0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3b56e722ef]
/lib64/libc.so.6(cfree+0x4b)[0x3b56e7273b]
cuffcompare[0x41a0b9]
cuffcompare[0x4034ae]
cuffcompare[0x406d7c]
cuffcompare[0x40f70d]
cuffcompare[0x4109e9]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3b56e1d994]
cuffcompare(__gxx_personality_v0+0x5a)[0x4015aa]
======= Memory map: ========
00400000-0043c000 r-xp 00000000 68:05 48562737 /home/shinlin/src/cufflinks-0.8.2.Linux_x86_64/cuffcompare
0053b000-0053c000 rw-p 0003b000 68:05 48562737 /home/shinlin/src/cufflinks-0.8.2.Linux_x86_64/cuffcompare
0053c000-00545000 rw-p 0053c000 00:00 0
1883d000-3f1fc000 rw-p 1883d000 00:00 0 [heap]
3b56a00000-3b56a1c000 r-xp 00000000 68:01 1334308 /lib64/ld-2.5.so
3b56c1b000-3b56c1c000 r--p 0001b000 68:01 1334308 /lib64/ld-2.5.so
3b56c1c000-3b56c1d000 rw-p 0001c000 68:01 1334308 /lib64/ld-2.5.so
3b56e00000-3b56f4d000 r-xp 00000000 68:01 1334328 /lib64/libc-2.5.so
3b56f4d000-3b5714d000 ---p 0014d000 68:01 1334328 /lib64/libc-2.5.so
3b5714d000-3b57151000 r--p 0014d000 68:01 1334328 /lib64/libc-2.5.so
3b57151000-3b57152000 rw-p 00151000 68:01 1334328 /lib64/libc-2.5.so
3b57152000-3b57157000 rw-p 3b57152000 00:00 0
3b57600000-3b57682000 r-xp 00000000 68:01 1334517 /lib64/libm-2.5.so
3b57682000-3b57881000 ---p 00082000 68:01 1334517 /lib64/libm-2.5.so
3b57881000-3b57882000 r--p 00081000 68:01 1334517 /lib64/libm-2.5.so
3b57882000-3b57883000 rw-p 00082000 68:01 1334517 /lib64/libm-2.5.so
3b57a00000-3b57a16000 r-xp 00000000 68:01 1334330 /lib64/libpthread-2.5.so
3b57a16000-3b57c15000 ---p 00016000 68:01 1334330 /lib64/libpthread-2.5.so
3b57c15000-3b57c16000 r--p 00015000 68:01 1334330 /lib64/libpthread-2.5.so
3b57c16000-3b57c17000 rw-p 00016000 68:01 1334330 /lib64/libpthread-2.5.so
3b57c17000-3b57c1b000 rw-p 3b57c17000 00:00 0
3b66800000-3b6680d000 r-xp 00000000 68:01 1334627 /lib64/libgcc_s-4.1.2-20080825.so.1
3b6680d000-3b66a0d000 ---p 0000d000 68:01 1334627 /lib64/libgcc_s-4.1.2-20080825.so.1
3b66a0d000-3b66a0e000 rw-p 0000d000 68:01 1334627 /lib64/libgcc_s-4.1.2-20080825.so.1
3b69800000-3b698e6000 r-xp 00000000 68:01 207246 /usr/lib64/libstdc++.so.6.0.8
3b698e6000-3b69ae5000 ---p 000e6000 68:01 207246 /usr/lib64/libstdc++.so.6.0.8
3b69ae5000-3b69aeb000 r--p 000e5000 68:01 207246 /usr/lib64/libstdc++.so.6.0.8
3b69aeb000-3b69aee000 rw-p 000eb000 68:01 207246 /usr/lib64/libstdc++.so.6.0.8
3b69aee000-3b69b00000 rw-p 3b69aee000 00:00 0
2b2ce7635000-2b2ce7637000 rw-p 2b2ce7635000 00:00 0
2b2ce764e000-2b2cf765a000 rw-p 2b2ce764e000 00:00 0
7fff7619b000-7fff761b0000 rw-p 7ffffffea000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
Abort

**xiaomimeow** · 03-28-2010, 09:54 AM

Cole:

Thanks for answering. Attached is a plot of a gene on chromosome 11. For a clearer picture, go to

http://bifrost.wustl.edu/~ylin/s/ATHL1.jpeg

1. The black lines are the exons according to Ensembl. There are 9 different transcripts. I will refer to the first as the one closest to the top; the last, at the bottom.
2. Green is the coverage of sequence mapped by Tophat.
3. The blue are reads that span greater than 101bp. Some of these should be mappings across exon junctions.

Cufflinks and cuffcompare asserts the presence of transcripts 1,3,4,6, and 8. I disagree with some of these calls.

a. For transcripts 3 and 4, there are exons with only partial coverage. By the eyeball test, I would disfavor asserting their presence. Especially 3 since there is almost no coverage over a large portion of the first exon.
b. For transcript 8, why is this transcript favored over 9? There is a portion of the first exon that is not covered. If I had to choose between 8 and 9, I would favor 9.
c. For transcript 6, there is no read spanning the junction between exons 8 and 9. Why posit the presence of this transcript?

By my eyeball, I would assert the presence of only 5, 7, 9. 7 & 9 are the same thing except for the CDS so I would only represent both with one call. What excess coverage there seems to be at the end could be accounted for by 5. In total, I guess what I'm saying is that I would favor more parsimony in the event that so many transcripts are being called for this gene.

Thanks for your help.

Attached Files

ATHL1.jpg (19.9 KB, 61 views)

**jgarbe** · 03-30-2010, 09:36 AM

Proper TopHat-Cufflinks workflow

Originally posted by Cole Trapnell View Post

So the basic workflow we recommend is:
1) Assemble each sample with cufflinks
2) Run cuffcompare on the sample transfrags all at the same time, providing a reference annotation if you want to classify your transfrags according to known, novel, etc.
3) Give the stdout.combined.gtf to cuffdiff, along with your original SAM alignments from the samples. Cuffdiff will re-estimate the abundances of the transfrags in the GTF using the alignments in each sample, and do the differential expression testing at the same time.

So if I have reads from 6 samples I should run TopHat separately on each sample, then run cufflinks on each of the transcripts.expr files output by TopHat?

I'm not using a reference transcriptome, and it seems like I would get a better de novo transcriptome by running TopHat on the six samples together, but then I don't get output files needed to run cuffcompare.

**Cole Trapnell** · 03-30-2010, 03:13 PM

Originally posted by xiaomimeow View Post

Cole:

Thanks for answering. Attached is a plot of a gene on chromosome 11. For a clearer picture, go to

http://bifrost.wustl.edu/~ylin/s/ATHL1.jpeg

1. The black lines are the exons according to Ensembl. There are 9 different transcripts. I will refer to the first as the one closest to the top; the last, at the bottom.
2. Green is the coverage of sequence mapped by Tophat.
3. The blue are reads that span greater than 101bp. Some of these should be mappings across exon junctions.

Cufflinks and cuffcompare asserts the presence of transcripts 1,3,4,6, and 8. I disagree with some of these calls.

a. For transcripts 3 and 4, there are exons with only partial coverage. By the eyeball test, I would disfavor asserting their presence. Especially 3 since there is almost no coverage over a large portion of the first exon.
b. For transcript 8, why is this transcript favored over 9? There is a portion of the first exon that is not covered. If I had to choose between 8 and 9, I would favor 9.
c. For transcript 6, there is no read spanning the junction between exons 8 and 9. Why posit the presence of this transcript?

By my eyeball, I would assert the presence of only 5, 7, 9. 7 & 9 are the same thing except for the CDS so I would only represent both with one call. What excess coverage there seems to be at the end could be accounted for by 5. In total, I guess what I'm saying is that I would favor more parsimony in the event that so many transcripts are being called for this gene.

Thanks for your help.

You raise some good questions. I want to clarify one thing: Cufflinks is only "parsimonious" when it is assembling it's own transcripts. When you give it the Ensembl annotation, it is going to give you it's best guess for how much of each of those transcripts is present. Only transcripts that have *no* overlapping reads will be left out of the output.

Because it's common for transcripts sequenced at low abundance to receive spotty, incomplete sequencing coverage, one can't rely on the absence of reads on the "salient" features of a transcript (such as non-constitutive exons) as a means of saying that the transcript isn't there.

In this example, what are the confidence intervals for the FPKM for each transcript? I suspect that many of them have confidence intervals that include zero (or where the lower bound is close to zero). That's Cufflinks saying that those isoforms could be present, but it can't be sure. So it reports its best guess under the abundance model along with its own level of uncertainty.

**Cole Trapnell** · 03-30-2010, 03:38 PM

Originally posted by jgarbe View Post

So if I have reads from 6 samples I should run TopHat separately on each sample, then run cufflinks on each of the transcripts.expr files output by TopHat?

I'm not using a reference transcriptome, and it seems like I would get a better de novo transcriptome by running TopHat on the six samples together, but then I don't get output files needed to run cuffcompare.

Whether to pool the reads depends on how long they are, oddly enough. The algorithm TopHat uses for longer reads (>=75bp with default settings) is totally different than what it uses for shorter reads. The long read algorithm is much more sensitive, and scales fine with any amount of reads. The short read algorithm performs less well (both in running time and in alignment sensitivity) when you throw a ton of reads at it at once. So if your reads are long, you could pool. If you don't actually pool them all together, you could still map them twice. Do each sample once, get the junctions.bed file for each, convert it into TopHat's .juncs format using the included script bed_to_juncs (this is an undocumented script for now), and pool all the .juncs files. Then remap each sample using the -j flag with this pooled .juncs file. This will improve your spliced alignment sensitivity, no matter how long your reads are.

You can get even fancier with this kind of thing, including splice junctions from other evidence (are there any ESTs for this organism? You could make .juncs files from their alignments to the genome...). TopHat will check the reads against all the junctions in the .juncs file you provide.

If you do pool the reads, you could also rename them to tag them by sample, so you can split the sample alignments up again after the TopHat run if needed.

As far as pooling for Cufflinks, I don't usually recommend it, because the more alternative splice variants that are present in a given sample, the more likely it is that Cufflinks will misassemble some of the isoforms. If you give it a mixture of 6 samples from different conditions or tissues, you'll present a more complicated picture for Cufflinks in the moderately and highly expressed genes where you could be seeing many isoforms. On the low end of the expression profile, you *are* more likely to get a less fragmented assembly by pooling. So I think it's up to you and depends on what your experiment focuses on. If you're trying to annotate the genome with RNA-Seq, you may want to pool.

**xiaomimeow** · 03-30-2010, 08:00 PM

more questions on transcript calls

Originally posted by Cole Trapnell View Post

You raise some good questions. I want to clarify one thing: Cufflinks is only "parsimonious" when it is assembling it's own transcripts. When you give it the Ensembl annotation, it is going to give you it's best guess for how much of each of those transcripts is present. Only transcripts that have *no* overlapping reads will be left out of the output.

Because it's common for transcripts sequenced at low abundance to receive spotty, incomplete sequencing coverage, one can't rely on the absence of reads on the "salient" features of a transcript (such as non-constitutive exons) as a means of saying that the transcript isn't there.

In this example, what are the confidence intervals for the FPKM for each transcript? I suspect that many of them have confidence intervals that include zero (or where the lower bound is close to zero). That's Cufflinks saying that those isoforms could be present, but it can't be sure. So it reports its best guess under the abundance model along with its own level of uncertainty.

Thanks for replying.

Actually, the estimates do not cover zero.

transcript id estimate low high
ENST00000474221 TSS5 0.117914 0.0534213 0.182407
ENST00000382625 TSS5 0.265343 0.0575158 0.47317
ENST00000409655 TSS5 0.417567 0.222686 0.612448
ENST00000409479 TSS6 0.660262 0.483473 0.837051
ENST00000482937 TSS7 0.547119 0.426929 0.667309

Moreover, I don't fully understand your statement of how "Only transcripts that have *no* overlapping reads will be left out of the output." Why then are 5, 7, and 9 not asserted to be present? They have some overlapping reads, right?

I hope you don't feel I'm disparaging your program, because I really like what you've done with tophat and cufflinks. I just want to understand the programs better.

Thanks again.

**vanil** · 05-09-2010, 03:32 AM

Segmentation fault with cuffdiff

Has anyone encountered a segmentation fault immediately after running cuffdiff?

./cuffdiff transcripts.gtf sample1.SAM.sorted sample2.SAM.sorted

output:
Segmentation fault

thanks

**ghosr** · 11-17-2010, 11:03 PM

Error while running CuffDiff

Hi Cole,

I have been getting this error msg while running CuffDiff with the latest version of cufflinks. The .bam files are generated by Tophat.

I see this error has been reported earlier, but unfortunately I haven't been able to get around it yet. I have successfully run CuffDiff with the same annotation file for another data set though. Not sure what is going wrong in this case.

I have copied the error msg below.
Thanks.

[ghoshr@compute-0-0 cuffdiff_9.2]$ /home/ghoshr/cufflinks-0.9.2.Linux_x86_64/./cuffdiff -N -r ../../../Chlre4_genomic_scaffolds.fasta ../../augustus.u9.mod1.gff3 ../RAS_Lane1_50/new.tophat.lane1/accepted_hits.bam ../RAS_Lane2_50/new.tophat.lane2/accepted_hits.bam ../RAS_Lane3_50/new.tophat.lane3/accepted_hits.bam ../RAS_Lane4_50/new.tophat.lane4/accepted_hits.bam ../RAS_Lane6_50/new.tophat.lane5/accepted_hits.bam
/home/ghoshr/cufflinks-0.9.2.Linux_x86_64/./cuffdiff: /usr/lib64/libz.so.1: no version information available (required by /home/ghoshr/cufflinks-0.9.2.Linux_x86_64/./cuffdiff)
[15:53:50] Loading reference annotation and sequence
[15:53:56] Inspecting maps and determining fragment length distributions.
[15:59:12] Calculating initial abundance estimates.
> Processed 15935 loci. [*************************] 100%
[16:40:31] Learning bias parameters.
> Processed 15934 loci. [*************************] 100%
[16:41:51] Learning bias parameters.
> Processed 15934 loci. [*************************] 100%
[16:43:57] Learning bias parameters.
> Processed 15934 loci. [*************************] 100%
[16:46:02] Learning bias parameters.
> Processed 15934 loci. [*************************] 100%
[16:48:45] Learning bias parameters.
> Processed 15934 loci. [*************************] 100%
[16:48:49] Testing for differential expression and regulation in locus.
> Processing Locus chromosome_1:602228-627910 [ ] 0%terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
what(): Error in function boost::math::cdf(const normal_distribution<d>&, d): Random variate x is nan, but must be finite!
Aborted

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News