Seqanswers Leaderboard Ad

**sterding** · 12-03-2010, 03:42 PM

What even more confused me is, this works file for some other genes (see below):

DEAF1:

http://zlab.umassmed.edu/~dongx/deaf1.sam

http://zlab.umassmed.edu/~dongx/deaf1.gtf

http://zlab.umassmed.edu/~dongx/toph...nshot.ucsc.png (sam file in UCSC)

$ cufflinks -G deaf1.gtf deaf1.sam

This gives correct expression values (at least it's not zero as the HTT gene):

Code:

$cat transcripts.expr
trans_id	bundle_id	chr	left	right	FPKM	FMI	frac	FPKM_conf_lo	FPKM_conf_hi	coverage	length	effective_length	status
ENST00000359958	3	chr11	644223	695047	423827	1	0.987446	0	430526	27.5172	1994	1955	FAIL
ENST00000382409	3	chr11	644223	695740	0.00391839	9.24527e-09	1.25567e-08	0	313172	2.54404e-07	2728	2689	FAIL
ENST00000338675	3	chr11	644223	695740	0.00521126	1.22957e-08	1.66936e-08	0	313289	3.38343e-07	2727	2688	FAIL
ENST00000388804	3	chr11	651313	695740	4478.96	0.0105679	0.0125543	0	357962	0.290799	2391	2352	FAIL

**jfofly** · 12-07-2010, 12:58 PM

Similar problem with Drosophila Annotation file

I am having a similar problem with the Drosophila annotation GTF file. However my problem only occurs with certain chromosomes....the Drosophila genome build contains the following chromosomes:
2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, Uextra, X, XHet, YHet, dmel_mitochondrion_genome.
The annotation GTF file from ensemble contains chromosome ids by the same name.

Like sterding, running cufflinks with no annotation works well. I can then run cuffcompare with the Annotation file mentioned above to associate transcripts with cufflinks data. However, the cufflinks data doesn't always align well with known transcripts, thus the need for running cufflinks with -G.

When running cufflinks with the annotation file, chromosomes 2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, and 4 are processed just fine. U matches very little, Uextra doesn't show up, and the remaining chromosomes have 0 matches. This is using the same .sam file as above, so I know the data is there.

Is it possible cufflinks has a problem with chromosome IDs being letters? cuffdiff also produces the same problem.

Can anyone help?

**Pejman** · 12-08-2010, 02:40 AM

I'm not using cufflinks anymore. But, I got it working with GTF files. It works nicely with .bam outputs from tophat. I can't remeber exactly where this exact problem was coming from, but I guess it was related to the order of the cromosomes, you can sort the cromosomes alphabetically or numerically, which has different results. ANyway, If you addd a header to your .sam file to explicitly give the chromosome orders, your problem should be solved. If you are doing your alignment with bowtie, you have the header, but it gets lost during the sam to bam conversion(if you do it for sorting.)
I know this was not the most clear answer but hope it helps

pej

**Thomas Doktor** · 12-08-2010, 04:04 AM

jfofly,

I think you should contact the Cufflinks authors directly to address this problem.

Addiotionally, I think sterding's problem was related to an error in the output of an earlier TopHat release and was resolved by mapping with the latest TopHat release. Others might try the same approach and see if this resolves some of the odd cases of 0 FPKM values.

**graveley** · 12-08-2010, 04:57 AM

I had this same issue and discussed it with Cole Trapnell. This issue arises because Cufflinks requires a tab delimited header in the SAM file you are using. Without the header, the GTF and SAM files are processed in the same order. So, if annotations for chromosome 4 appear in the GTF file before the annotations for chromosome 2, but the chromosome 2 reads appear before the chromosome 4 reads in the SAM file, all of the genes/transcripts for one of those chromosomes will have FPKMs of 0 because by the time Cufflinks starts calculating FPKMs for one of those chromosomes, the corresponding reads have already been passed by in the SAM file - thus FPKM=0. A tab-delimited header file solves this. This also explains why it works fine without a GTF.

**sterding** · 12-08-2010, 10:36 AM

Solved: It's a bug for OLD version of Tophat! Use version after v.1.1.3

Several things I think might be useful to share here:

1. For Illumina strand-unspecific single-end reads, when you used Tophat to map the reads, better NOT use the default option "--library-type fr-unstranded", simply because I found if I used the option (even though it's said to be default), the "XS:A:-" will add to each line in the output SAM, even though there might already have a XS:A:+ (decided by spliced alignment records) on that line, which looks weird, such as :
HWUSI-EAS623:1:1:2356:14048#0 0 chr4 3106211 255 29M1747N11M * 0 0 CTGTAGTCCTGGCTACTCGGGAGGCTGAGGTGGGAGGATC GGGGGGGGGEAFFFFGDFGEFEGGGECBEEBFEEBDDAAD NM:i:1 XS:A:+ NH:i:1 XS:A:-

2. My solution to the 0 FPKM problem is simply to use the latest version of Tophat (v. 1.1.4). I was using v. 1.1.2, where there was a serious bug related to the strand assignment issue (see Tophat Release News page).

Version 1.1.2 suffered from a bug related to the addition of strand specific read processing that could result in some reads being incorrectly assigned to the wrong strand. This bug affected users of unstranded RNA-Seq data as well as users of stranded reads, so 1.1.3 is a recommended update for all users.

After using the new version of Tophat, the error was gone:

Code:

$ cufflinks -G [URL="http://zlab.umassmed.edu/~dongx/htt.gtf"]htt.gtf[/URL] [URL="http://zlab.umassmed.edu/~dongx/htt.new.sam"]htt.new.sam[/URL]

The output is :

Code:

$ cat genes.expr 
gene_id	bundle_id	chr	left	right	FPKM	FPKM_conf_lo	FPKM_conf_hi	status
ENSG00000197386	3	chr4	3076406	3245676	234400	227603	241197	OK

Thanks to Cole/Tom for the help.

3. Regarding to graveley's problem, I guess it's issue of 'sorting' the sam file. Cufflinks requires sorted SAM as input, as

The SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already. If you used another tool, you may want to make sure they are properly sorted as follows:

sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted

The output SAM from Tophat (actually, converted from SAM to SAM by Samtools) is already sorted, which you can see it by :

Code:

$cut -f3 tophat.mm2.unique.hg19.accepted_hits.sam | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
...

**jfofly** · 12-08-2010, 02:06 PM

Thank you all for the advice.

Pejman, on your suggestion I started playing around a bit. I discovered that while my genome build contains the chromosome "Uextra" my annotation GTF does not.

I removed "Uextra" from my genome build and things are aligning properly now. The absence of that chromosome from the Annotation must've been somehow causing the error.

**gen2prot** · 12-22-2010, 08:50 AM

Hello All,

I have a drosophila GTF file which I created, since I wanted to use the latest release from flyable and the UCSC annotated GTF is from 2006. Anyway, here is what it looks like…

3R FlyBase exon 24574105 24575330 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "7"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24575574 24575753 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "6"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24575893 24576062 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "5"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24576188 24576651 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "4"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24576707 24576885 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "3"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24576947 24577107 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "2"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24577166 24577313 . - . gene_id "FBgn0039596"; gene_name "CG10000"; exon_number "1"; transcript_id "FBtr0085315"; parent_type=mRNA;
3R FlyBase exon 24562831 24563368 . - . gene_id "FBgn0039595"; gene_name "AR-2"; exon_number "4"; transcript_id "FBtr0085316"; parent_type=mRNA;
3R FlyBase exon 24566194 24566352 . - . gene_id "FBgn0039595"; gene_name "AR-2"; exon_number "3"; transcript_id "FBtr0085316"; parent_type=mRNA;
3R FlyBase exon 24566428 24566706 . - . gene_id "FBgn0039595"; gene_name "AR-2"; exon_number "2"; transcript_id "FBtr0085316"; parent_type=mRNA;

After running the latest version of Tophat with the one-directional RNA-Seq reads, I get a .bam file. Using this my input to cufflinks was…

cufflinks -p 4 -G ALLEXONS.gtf -o ./outputdir accepted_hits.bam

After running this using LSF command 'bsub', I get an error output file, the start of which looks like this

GFF warning: merging adjacent/overlapping segments of FBtr0084817 on 3R (21094383-21094697, 21094700-21095435)
[20:04:14] Inspecting reads and determining fragment length distribution.
> Processing Locus 2L:21918-25151 [ ] 0%
> Processing Locus 2L:76445-77211 [ ] 0%
> Processing Locus 2L:102381-106718 [ ] 0%

I have three output files, the smallest of which is the genes.expr file and the largest is the transcripts.gtf file.
Questions:

1) How will I know if cufflinks has accepted the GTF file that I created correctly or not? I used a perl script from the internet to check the validity of my GTF file. Its output says that barring the non-availability of CDS coordinates the GTF file is ok. Do I need to change the positions of some fields? or add CDS coordinates to the file.

2) Why am I getting such a huge error file? I see that in the UCSD GTF file the chromosome names begin with chr. Do I need to change the annotation to chr3L, chr3R, chr2L, chr2R and so on?

3) When I supply the reference genome, to bowtie-build, I send a single fasta file containing the chromosome names and their sequences. It looks like this.

>Y
AGCTAGCT
>2L
GCTGCTGCAGTC
>2R
CGATGATGA

Do I need to break this up into separate chromosomes/files and build separate indexes? Sounds a bit illogical but I thought I'd ask.

4) I get FPKM values of zero for some genes in the genes.expr file. Do I interpret that as too low expression to call? or is the program not running proper.

Thank you for taking the time to read such a long post. I am relatively new to the cufflinks program, and GTF so any inputs and suggestions would be much appreciated.

Regards
Abhijit

**polarise** · 03-01-2011, 06:02 AM

Using Cufflinks with -G option

Dear gen2prot (Abhijit),

I encountered the same problems as you have (assuming that you have not already found a solution) and did several things that solved the problem:

Q2) "Why am I getting such a huge error file? I see that in the UCSD GTF file the chromosome names begin with chr. Do I need to change the annotation to chr3L, chr3R, chr2L, chr2R and so on?"

I found that prepending 'chr' on the names of the chromosomes in the GTF file and excluding regions of note that overlap with other annotations e.g. HSCHR6_MHC_SSTO (MHC region on human chromosome 6) allowed FPKM values to be estimated. This will need a short script.

Q3) "When I supply the reference genome, to bowtie-build, I send a single fasta file containing the chromosome names and their sequences."

You might need to prepend the chromosome names in the FASTA file with 'chr' like so:
you@your_node ~$ for i in Y 2L 2R; do sed -i 's/>'$i'/>chr'$i'/g' <fasta_file>; done

To reverse this you can run it in reverse:
you@your_node ~$ for i in Y 2L 2R; do sed -i 's/>chr'$i'/>'$i'/g' <fasta_file>; done

I don't think it is necessary to break the file into the respective chromosomes since Cufflinks takes one reference FASTA file.

Q4) "I get FPKM values of zero for some genes in the genes.expr file. Do I interpret that as too low expression to call? or is the program not running proper."

The author/maintainer of Cufflinks (C. Trapnell) has given a good pipeline of running Cufflinks here. In summary, (i) run cufflinks for each sample without annotation, (ii) run cuffcompare across all samples with annotation, (iii) run cuffdiff on cuffcompare GTF output together with BAM files for each sample (in the same order they were supplied to cuffcompare).

Paul Korir

**krish91** · 04-14-2014, 10:49 PM

hi all,
I have encountered same problem using cufflinks-2.2.0.Linux_x86_64,, I am getting 0 FPKM values when i m using -G option.

I have used .bed files which were directly obtained from the epigenome data and converted the bed file to bam file using bedtools.i am using Homo_sapiens.GRCh37.75.gtf
I have sorted the .bam file and still its giving the same reult. ????

can someone help me out.???

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 14 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

cufflinks with -G option give 0 FPKM, why??

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News