Header Leaderboard Ad

Collapse

cufflinks with -G option give 0 FPKM, why??

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cufflinks with -G option give 0 FPKM, why??

    I have the same problem as other guys posted before: Cufflinks works well without a reference GTF file, while with reference, it gives 0 FPKM values. And this is not always for all genes, only some genes are like that. I am really confused.

    Here are my input files:

    htt.sam (output from Tophat, for example purpose, I only contain reads located in HTT genes regions, >3000 reads. See the screenshot of these aligned reads in UCSC)
    htt.gtf (GTF for HTT genes, extracted from ensemble gtf file. I think the format is same as the GTF format Cufflinks website linked)

    Here is my code (my cufflinks version: 0.9.3.Linux_x86_64):
    Code:
    cufflinks -G htt.gtf htt.sam
    You can test it easily. Here is the output from screen:
    Code:
    $ /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks -G htt.gtf htt.sam 
    /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks: /usr/lib64/libz.so.1: no version information available (required by /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks)
    [bam_header_read] EOF marker is absent.
    File htt.sam doesn't appear to be a valid BAM file, trying SAM...
    [17:33:52] Inspecting reads and determining fragment length distribution.
    > Processed 1 loci.                            [*************************] 100%
    > Map Properties:
    >	Total Map Mass: 3542.44
    >	Read Type: 40bp single-end
    >	Fragment Length Distribution: Gaussian (default)
    >	              Estimated Mean: 204.01
    >	           Estimated Std Dev: 74.81
    [17:33:52] Estimating transcript abundances.
    > Processed 1 loci.                            [*************************] 100%
    Here is the content in output genes.expr file:
    Code:
    $ cat genes.expr
    gene_id	bundle_id	chr	left	right	FPKM	FPKM_conf_lo	FPKM_conf_hi	status
    [COLOR="Red"]ENSG00000197386	3	chr4	3076406	3245676	0	0	0	OK[/COLOR]
    Here is the content of output transcripts.expr file:
    Code:
    [[email protected] ~]$ cat transcripts.expr
    trans_id	bundle_id	chr	left	right	FPKM	FMI	frac	FPKM_conf_lo	FPKM_conf_hi	coverage	length	effective_length	status
    ENST00000355072	3	chr4	3076406	3245676	0	0	0	0	0	0	13531	13531	OK
    ENST00000506137	3	chr4	3117054	3123439	0	0	0	0	0	0	784	784	OK
    ENST00000512909	3	chr4	3123125	3125193	0	0	0	0	0	0	613	613	OK
    ENST00000510626	3	chr4	3130064	3245667	0	0	0	0	0	0	14491	14491	OK
    ENST00000509618	3	chr4	3162087	3174954	0	0	0	0	0	0	432	432	OK
    ENST00000513639	3	chr4	3180072	3182400	0	0	0	0	0	0	261	261	OK
    ENST00000513326	3	chr4	3180072	3182514	0	0	0	0	0	0	375	375	OK
    ENST00000509043	3	chr4	3180072	3182521	0	0	0	0	0	0	382	382	OK
    ENST00000502820	3	chr4	3204731	3208306	0	0	0	0	0	0	290	290	OK
    ENST00000509751	3	chr4	3213637	3214782	0	0	0	0	0	0	725	725	OK
    ENST00000512068	3	chr4	3230370	3231649	0	0	0	0	0	0	403	403	OK
    ENST00000513806	3	chr4	3230438	3237123	0	0	0	0	0	0	436	436	OK
    ENST00000508321	3	chr4	3237149	3237906	0	0	0	0	0	0	388	388	OK
    It seems that cufflinks has read the GTF file correctly and the reads are obviously mapped to the gene. But why the FPKM is 0 ??


    Anyone have clue for this? You can test the above code in your machine. Pls let me know if you got different results. THANKS!!


    Originally posted by Pejman View Post
    I have the same problem. I'm using cufflinks-0.9.1.Linux_x86_64 and trying to get it to produce expression levels based on a GTF file. If I run it without a reference GTF file, it works fine, gives some expressions. But with reference:
    Code:
    $cufflinks -G hg18_annotation.gtf MyFile.sam
    it gives:
    Code:
    [20:24:17] Inspecting reads and determining fragment length distribution.
    > Processing Locus chr6:27032750-27099732      [*****************        ]  69%
    Error: this SAM file doesn't appear to be correctly sorted!
            current hit is at chr10:272501, last one was at chr9:139953334
    Cufflinks requires that if your file has SQ records in
    the SAM header that they appear in the same order as the chromosomes names 
    in the alignments.
    If there are no SQ records in the header, or if the header is missing,
    the alignments must be sorted lexicographically by chromsome
    name and by position.

    The GTF file I downloaded from UCSC genomebrowser portal and .sam file is produced by Bowtie, and then sorted by Samtools. Does anybody have any clue?
    Last edited by sterding; 12-03-2010, 03:35 PM.

  • #2
    What even more confused me is, this works file for some other genes (see below):

    DEAF1:
    http://zlab.umassmed.edu/~dongx/deaf1.sam
    http://zlab.umassmed.edu/~dongx/deaf1.gtf
    http://zlab.umassmed.edu/~dongx/toph...nshot.ucsc.png (sam file in UCSC)

    $ cufflinks -G deaf1.gtf deaf1.sam

    This gives correct expression values (at least it's not zero as the HTT gene):
    Code:
    $cat transcripts.expr
    trans_id	bundle_id	chr	left	right	FPKM	FMI	frac	FPKM_conf_lo	FPKM_conf_hi	coverage	length	effective_length	status
    ENST00000359958	3	chr11	644223	695047	423827	1	0.987446	0	430526	27.5172	1994	1955	FAIL
    ENST00000382409	3	chr11	644223	695740	0.00391839	9.24527e-09	1.25567e-08	0	313172	2.54404e-07	2728	2689	FAIL
    ENST00000338675	3	chr11	644223	695740	0.00521126	1.22957e-08	1.66936e-08	0	313289	3.38343e-07	2727	2688	FAIL
    ENST00000388804	3	chr11	651313	695740	4478.96	0.0105679	0.0125543	0	357962	0.290799	2391	2352	FAIL

    Comment


    • #3
      Similar problem with Drosophila Annotation file

      I am having a similar problem with the Drosophila annotation GTF file. However my problem only occurs with certain chromosomes....the Drosophila genome build contains the following chromosomes:
      2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, Uextra, X, XHet, YHet, dmel_mitochondrion_genome.
      The annotation GTF file from ensemble contains chromosome ids by the same name.

      Like sterding, running cufflinks with no annotation works well. I can then run cuffcompare with the Annotation file mentioned above to associate transcripts with cufflinks data. However, the cufflinks data doesn't always align well with known transcripts, thus the need for running cufflinks with -G.

      When running cufflinks with the annotation file, chromosomes 2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, and 4 are processed just fine. U matches very little, Uextra doesn't show up, and the remaining chromosomes have 0 matches. This is using the same .sam file as above, so I know the data is there.

      Is it possible cufflinks has a problem with chromosome IDs being letters? cuffdiff also produces the same problem.

      Can anyone help?

      Comment


      • #4
        I'm not using cufflinks anymore. But, I got it working with GTF files. It works nicely with .bam outputs from tophat. I can't remeber exactly where this exact problem was coming from, but I guess it was related to the order of the cromosomes, you can sort the cromosomes alphabetically or numerically, which has different results. ANyway, If you addd a header to your .sam file to explicitly give the chromosome orders, your problem should be solved. If you are doing your alignment with bowtie, you have the header, but it gets lost during the sam to bam conversion(if you do it for sorting.)
        I know this was not the most clear answer but hope it helps

        pej

        Comment


        • #5
          jfofly,

          I think you should contact the Cufflinks authors directly to address this problem.

          Addiotionally, I think sterding's problem was related to an error in the output of an earlier TopHat release and was resolved by mapping with the latest TopHat release. Others might try the same approach and see if this resolves some of the odd cases of 0 FPKM values.

          Comment


          • #6
            I had this same issue and discussed it with Cole Trapnell. This issue arises because Cufflinks requires a tab delimited header in the SAM file you are using. Without the header, the GTF and SAM files are processed in the same order. So, if annotations for chromosome 4 appear in the GTF file before the annotations for chromosome 2, but the chromosome 2 reads appear before the chromosome 4 reads in the SAM file, all of the genes/transcripts for one of those chromosomes will have FPKMs of 0 because by the time Cufflinks starts calculating FPKMs for one of those chromosomes, the corresponding reads have already been passed by in the SAM file - thus FPKM=0. A tab-delimited header file solves this. This also explains why it works fine without a GTF.

            Comment


            • #7
              Solved: It's a bug for OLD version of Tophat! Use version after v.1.1.3

              Several things I think might be useful to share here:

              1. For Illumina strand-unspecific single-end reads, when you used Tophat to map the reads, better NOT use the default option "--library-type fr-unstranded", simply because I found if I used the option (even though it's said to be default), the "XS:A:-" will add to each line in the output SAM, even though there might already have a XS:A:+ (decided by spliced alignment records) on that line, which looks weird, such as :
              HWUSI-EAS623:1:1:2356:14048#0 0 chr4 3106211 255 29M1747N11M * 0 0 CTGTAGTCCTGGCTACTCGGGAGGCTGAGGTGGGAGGATC GGGGGGGGGEAFFFFGDFGEFEGGGECBEEBFEEBDDAAD NM:i:1 XS:A:+ NH:i:1 XS:A:-

              2. My solution to the 0 FPKM problem is simply to use the latest version of Tophat (v. 1.1.4). I was using v. 1.1.2, where there was a serious bug related to the strand assignment issue (see Tophat Release News page).
              Version 1.1.2 suffered from a bug related to the addition of strand specific read processing that could result in some reads being incorrectly assigned to the wrong strand. This bug affected users of unstranded RNA-Seq data as well as users of stranded reads, so 1.1.3 is a recommended update for all users.
              After using the new version of Tophat, the error was gone:
              Code:
              $ cufflinks -G [URL="http://zlab.umassmed.edu/~dongx/htt.gtf"]htt.gtf[/URL] [URL="http://zlab.umassmed.edu/~dongx/htt.new.sam"]htt.new.sam[/URL]
              The output is :
              Code:
              $ cat genes.expr 
              gene_id	bundle_id	chr	left	right	FPKM	FPKM_conf_lo	FPKM_conf_hi	status
              ENSG00000197386	3	chr4	3076406	3245676	234400	227603	241197	OK
              Thanks to Cole/Tom for the help.

              3. Regarding to graveley's problem, I guess it's issue of 'sorting' the sam file. Cufflinks requires sorted SAM as input, as
              The SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already. If you used another tool, you may want to make sure they are properly sorted as follows:

              sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
              The output SAM from Tophat (actually, converted from SAM to SAM by Samtools) is already sorted, which you can see it by :
              Code:
              $cut -f3 tophat.mm2.unique.hg19.accepted_hits.sam | uniq
              chr1
              chr10
              chr11
              chr12
              chr13
              chr14
              chr15
              chr16
              chr17
              ...
              Last edited by sterding; 12-08-2010, 10:40 AM.

              Comment


              • #8
                Thank you all for the advice.

                Pejman, on your suggestion I started playing around a bit. I discovered that while my genome build contains the chromosome "Uextra" my annotation GTF does not.

                I removed "Uextra" from my genome build and things are aligning properly now. The absence of that chromosome from the Annotation must've been somehow causing the error.

                Comment


                • #9
                  Hello All,

                  I have a drosophila GTF file which I created, since I wanted to use the latest release from flyable and the UCSC annotated GTF is from 2006. Anyway, here is what it looks like…

                  3R FlyBase exon 24574105 24575330 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "7"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24575574 24575753 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "6"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24575893 24576062 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "5"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24576188 24576651 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "4"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24576707 24576885 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "3"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24576947 24577107 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "2"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24577166 24577313 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "1"; transcript_id "FBtr0085315"; parent_type=mRNA;
                  3R FlyBase exon 24562831 24563368 . - . gene_id "FBgn0039595"; gene_name "AR-2"; exon_number "4"; transcript_id "FBtr0085316"; parent_type=mRNA;
                  3R FlyBase exon 24566194 24566352 . - . gene_id "FBgn0039595"; gene_name "AR-2"; exon_number "3"; transcript_id "FBtr0085316"; parent_type=mRNA;
                  3R FlyBase exon 24566428 24566706 . - . gene_id "FBgn0039595"; gene_name "AR-2"; exon_number "2"; transcript_id "FBtr0085316"; parent_type=mRNA;

                  After running the latest version of Tophat with the one-directional RNA-Seq reads, I get a .bam file. Using this my input to cufflinks was…

                  cufflinks -p 4 -G ALLEXONS.gtf -o ./outputdir accepted_hits.bam

                  After running this using LSF command 'bsub', I get an error output file, the start of which looks like this

                  GFF warning: merging adjacent/overlapping segments of FBtr0084817 on 3R (21094383-21094697, 21094700-21095435)
                  [20:04:14] Inspecting reads and determining fragment length distribution.
                  > Processing Locus 2L:21918-25151 [ ] 0%
                  > Processing Locus 2L:76445-77211 [ ] 0%
                  > Processing Locus 2L:102381-106718 [ ] 0%

                  I have three output files, the smallest of which is the genes.expr file and the largest is the transcripts.gtf file.
                  Questions:

                  1) How will I know if cufflinks has accepted the GTF file that I created correctly or not? I used a perl script from the internet to check the validity of my GTF file. Its output says that barring the non-availability of CDS coordinates the GTF file is ok. Do I need to change the positions of some fields? or add CDS coordinates to the file.

                  2) Why am I getting such a huge error file? I see that in the UCSD GTF file the chromosome names begin with chr. Do I need to change the annotation to chr3L, chr3R, chr2L, chr2R and so on?

                  3) When I supply the reference genome, to bowtie-build, I send a single fasta file containing the chromosome names and their sequences. It looks like this.

                  >Y
                  AGCTAGCT
                  >2L
                  GCTGCTGCAGTC
                  >2R
                  CGATGATGA

                  Do I need to break this up into separate chromosomes/files and build separate indexes? Sounds a bit illogical but I thought I'd ask.

                  4) I get FPKM values of zero for some genes in the genes.expr file. Do I interpret that as too low expression to call? or is the program not running proper.

                  Thank you for taking the time to read such a long post. I am relatively new to the cufflinks program, and GTF so any inputs and suggestions would be much appreciated.

                  Regards
                  Abhijit

                  Comment


                  • #10
                    Using Cufflinks with -G option

                    Dear gen2prot (Abhijit),

                    I encountered the same problems as you have (assuming that you have not already found a solution) and did several things that solved the problem:

                    Q2) "Why am I getting such a huge error file? I see that in the UCSD GTF file the chromosome names begin with chr. Do I need to change the annotation to chr3L, chr3R, chr2L, chr2R and so on?"

                    I found that prepending 'chr' on the names of the chromosomes in the GTF file and excluding regions of note that overlap with other annotations e.g. HSCHR6_MHC_SSTO (MHC region on human chromosome 6) allowed FPKM values to be estimated. This will need a short script.

                    Q3) "When I supply the reference genome, to bowtie-build, I send a single fasta file containing the chromosome names and their sequences."

                    You might need to prepend the chromosome names in the FASTA file with 'chr' like so:
                    [email protected]_node ~$ for i in Y 2L 2R; do sed -i 's/>'$i'/>chr'$i'/g' <fasta_file>; done

                    To reverse this you can run it in reverse:
                    [email protected]_node ~$ for i in Y 2L 2R; do sed -i 's/>chr'$i'/>'$i'/g' <fasta_file>; done

                    I don't think it is necessary to break the file into the respective chromosomes since Cufflinks takes one reference FASTA file.

                    Q4) "I get FPKM values of zero for some genes in the genes.expr file. Do I interpret that as too low expression to call? or is the program not running proper."

                    The author/maintainer of Cufflinks (C. Trapnell) has given a good pipeline of running Cufflinks here. In summary, (i) run cufflinks for each sample without annotation, (ii) run cuffcompare across all samples with annotation, (iii) run cuffdiff on cuffcompare GTF output together with BAM files for each sample (in the same order they were supplied to cuffcompare).

                    Paul Korir

                    Comment


                    • #11
                      hi all,
                      I have encountered same problem using cufflinks-2.2.0.Linux_x86_64,, I am getting 0 FPKM values when i m using -G option.

                      I have used .bed files which were directly obtained from the epigenome data and converted the bed file to bam file using bedtools.i am using Homo_sapiens.GRCh37.75.gtf
                      I have sorted the .bam file and still its giving the same reult. ????

                      can someone help me out.???

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
                        by seqadmin


                        ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

                        01-24-2023, 01:19 PM
                      • seqadmin
                        Introduction to Single-Cell Sequencing
                        by seqadmin
                        Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

                        The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
                        ...
                        01-09-2023, 03:10 PM

                      ad_right_rmr

                      Collapse
                      Working...
                      X