Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

cufflinks - having difficulty reading reference GTF

  • Filter
  • Time
  • Show
Clear All
new posts

  • cufflinks - having difficulty reading reference GTF

    Hi all,

    I'm using cufflinks 0.9.0 beta. But the problem persists when I use v0.8.4.

    Everything seems to work fine if I don't use a reference GTF.
    But when I do use a reference (downloaded from the Ensembl FTP "current GTF" folder, Mouse release 59), cuffcompare doesn't seem to read from it.
    (I did manually delete the giant transcript ENSMUST00000127664 from the GTF file to preclude the error message.)

    When I run:
    cuffcompare -r ../Mus_musculus.NCBIM37.56.gtf A.gtf B.gtf
    where A.gtf and B.gtf are the two "transcripts.gtf" generated by cufflinks

    the resulting "combined.gtf" file from cuffcompare does not contain any gene ID or gene name from the mouse reference GTF, and the refmap file is completely empty, too, besides the header row.

    Does anyone have any idea what is going on?

    Another weird thing is, if I run cufflinks with
    "cufflinks -G ../Mus_musculus.NCBIM37.56.gtf A.sam",
    then cufflinks does read the reference GTF, but in the output "transcripts.gtf", the estimated transcript levels are 0 for all transcripts...
    However, if I run cufflinks without "-G", then the level estimation is correct.

    Any ideas or help are highly appreciated!

    Last edited by tguo; 09-30-2010, 10:03 PM.

  • #2
    Hi, tguo,
    It looks like you don't have enough reads as input. Please check how many reads are in your fastq file? To test cufflinks, I first used 1000 reads and got the same problem as yours. Later on, I used 10000 reads, then cuffcompare produces transcript expression level. But the problem is that I don't know how to get the gene expression level. Do you have any idea on this?


    • #3
      Hi ldong,

      I guess gene expression levels are in the "genes.expr" output file.

      I didn't think too few reads were the reason causing the problem. I was using a 10M read RNA-seq result as an input.. the SAM input into cufflinks were several gigabytes..




      • #4
        New version of tophat works!
        Last edited by ldong; 10-08-2010, 07:49 AM.


        • #5
          Originally posted by tguo View Post
          Hi all,

          Everything seems to work fine if I don't use a reference GTF.
          I have the same problem. I'm using cufflinks-0.9.1.Linux_x86_64 and trying to get it to produce expression levels based on a GTF file. If I run it without a reference GTF file, it works fine, gives some expressions. But with reference:
          $cufflinks -G hg18_annotation.gtf MyFile.sam
          it gives:
          [20:24:17] Inspecting reads and determining fragment length distribution.
          > Processing Locus chr6:27032750-27099732      [*****************        ]  69%
          Error: this SAM file doesn't appear to be correctly sorted!
                  current hit is at chr10:272501, last one was at chr9:139953334
          Cufflinks requires that if your file has SQ records in
          the SAM header that they appear in the same order as the chromosomes names 
          in the alignments.
          If there are no SQ records in the header, or if the header is missing,
          the alignments must be sorted lexicographically by chromsome
          name and by position.

          The GTF file I downloaded from UCSC genomebrowser portal and .sam file is produced by Bowtie, and then sorted by Samtools. Does anybody have any clue?


          • #6
            Hi, Pejman,
            I got the same error. To walk around it I use the gtf file when I call cuffcompare. See our wiki site for detail:


            • #7

              Try remapping with the newest version of TopHat and your problem should disappear. Alternatively, you can add a proper header to your SAM file.



              • #8
                @ ldong Thanks for the nice link!

                I managed to get around the problem, by manually adding the SQ headers to the sorted same file! They were dumped out during the SAM->BAM->sort->SAM process

                But about TopHAt, I have tried to use it, so far have not been so successful, it gives errors. I'm using SOLiD data.


                • #9
                  I'm not interested in new splice junctions, just looking for differential expression of the known transcripts using Cufflinks. However, I kinda think that it would be still better to use tophat (without new splice junction prediction) for the alignment part than bowtie, in order to get more accurate results on the genes that have isoforms. What do you think as a Cufflinker?


                  • #10
                    Hi adarob,
                    Thank you for pointing to the right direction. I downloaded new version tophat binary. Now it can read gtf file. I assume in this way, tophat will be more accurate. But when I compare the output file accepted_hit.sam between using and not using gtf on a small data set, there is no difference. By the way, the new tophat does not output sam, I have to use samtools to convert the file:
                    samtools view -h -o accepted_hits.sam accepted_hits.bam
                    Any commnents on this? Best, ldong


                    • #11

                      I don't work on TopHat so I can't answer your question about the GTF. You can contact Daehwan from the TopHat page. As for the bam output, conversion is not necessary since Cufflinks 0.9.x takes bam as input.



                      • #12
                        Hi, Adam,
                        Thanks. I didn't realize new cufflnks can read bam. Now I am trying to see why UCSC genome browser can not read the bam file. Best, ldong


                        • #13
                          I think I've figured out what the problem was...

                          I have used a UCSC mm9 reference assembly against an EMSEMBL GTF, and that apparently was the problem. Some attributes of the two apparently doesn't match (didn't look in to see exactly why).

                          If instead, I use a UCSC GTF for mm9 reference, or conversely, use the EMSEMBL m_musculus_ncbi37 reference for EMSEMBL GTF, it seems to work.


                          • #14
                            Hi, tguo,
                            I think the reason is that reference sequence ID in mm9 does not match the chromosome Id in the GTF file.
                            By the way, where did you download the UCSC GTF file? Could you remind me? I could not find it anywhere. I use a GFF file from AceView.


                            • #15
                              Sure, from the UCSC table browser: