I have a question about cuffdiff using only a reference gtf annotation file. I am seeing that the gene start/stop in the output of cuffdiff does not always match the reference gtf that I provide. Here is an outline of what I am doing:
-map reads using tophat2 to igenome UCSC mm9 fasta sequence with igenome UCSC mm9 gtf files from ftp.illumina.com/Mus_musculus/UCSC/mm9/Mus_musculus_UCSC_mm9.tar.gz
-run cuffdiff 2.0.2 using igenome UCSC mm9 gtf file using a command like
$cuffdir/cuffdiff -p 4 --upper-quartile-norm --multi-read-correct -M $rRNAgtf --frag-bias-correct $genome -o $datadir/cuffdiffRefConVsPrimed3 -L Control,Primed $gtfFile \
$datadir/Control_7/tophat2_out/accepted_hitsSD.bam,$datadir/Control_11/tophat2_out/accepted_hitsSD.bam \
$datadir/Primed_4/tophat2_out/accepted_hitsSD.bam,$datadir/Primed_9/tophat2_out/accepted_hitsSD.bam
An example:
grep 0610005C13Rik $gtfFile
chr7 unknown exon 52823165 52823749 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52823165 52823749 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52826356 52826562 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829783 52829892 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829783 52829892 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829978 52830147 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829978 52830147 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52830497 52830546 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52830497 52830546 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
grep 0610005C13Rik gene_exp.diff
0610005C13Rik 0610005C13Rik 0610005C13Rik chr7:52823164-52845080 Control Primed NOTEST 0 0 0 0 1 1 no
Note that one end 52823164 is correct (52823165 off by one due to end convention?), but the other end is not 52845080 vs. 52830546. Where did the size come from? It seems to be from the overlapping Bcat2 gene
grep Bcat2 ../cuffdiffRefConVsPrimed2/gene_exp.diff
Bcat2 Bcat2 Bcat2 chr7:52823164-52845080 Control Primed OK 42.9202 96.0403 1.16198 -0.909377 0.363151 0.999861 no
Is the data for the two genes somehow separated out, and I don't need to worry about the strange start/end points?
Thanks,
Vince
-map reads using tophat2 to igenome UCSC mm9 fasta sequence with igenome UCSC mm9 gtf files from ftp.illumina.com/Mus_musculus/UCSC/mm9/Mus_musculus_UCSC_mm9.tar.gz
-run cuffdiff 2.0.2 using igenome UCSC mm9 gtf file using a command like
$cuffdir/cuffdiff -p 4 --upper-quartile-norm --multi-read-correct -M $rRNAgtf --frag-bias-correct $genome -o $datadir/cuffdiffRefConVsPrimed3 -L Control,Primed $gtfFile \
$datadir/Control_7/tophat2_out/accepted_hitsSD.bam,$datadir/Control_11/tophat2_out/accepted_hitsSD.bam \
$datadir/Primed_4/tophat2_out/accepted_hitsSD.bam,$datadir/Primed_9/tophat2_out/accepted_hitsSD.bam
An example:
grep 0610005C13Rik $gtfFile
chr7 unknown exon 52823165 52823749 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52823165 52823749 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52826356 52826562 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829783 52829892 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829783 52829892 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829978 52830147 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52829978 52830147 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52830497 52830546 . - . gene_id "0610005C13Rik"; transcript_id "NR_038166"; gene_name "0610005C13Rik"; tss_id "TSS24565";
chr7 unknown exon 52830497 52830546 . - . gene_id "0610005C13Rik"; transcript_id "NR_038165"; gene_name "0610005C13Rik"; tss_id "TSS24565";
grep 0610005C13Rik gene_exp.diff
0610005C13Rik 0610005C13Rik 0610005C13Rik chr7:52823164-52845080 Control Primed NOTEST 0 0 0 0 1 1 no
Note that one end 52823164 is correct (52823165 off by one due to end convention?), but the other end is not 52845080 vs. 52830546. Where did the size come from? It seems to be from the overlapping Bcat2 gene
grep Bcat2 ../cuffdiffRefConVsPrimed2/gene_exp.diff
Bcat2 Bcat2 Bcat2 chr7:52823164-52845080 Control Primed OK 42.9202 96.0403 1.16198 -0.909377 0.363151 0.999861 no
Is the data for the two genes somehow separated out, and I don't need to worry about the strange start/end points?
Thanks,
Vince