Hi,
Edit : I felt the question was a little too long-winded. Just to be clear :
a) Even though paired reads fall within an exon, cufflinks seems to be reporting zero coverage for that exon.
b) cufflinks does not report the transcript length for any of the genes in my data set.
I have been running a tophat (2.0.11) -> cufflinks (2.2.1) -> cuffnorm/cuffdiff pipeline for looking at some mouse RNA-seq data.
When looking at the output of cufflinks, I found some genes where the expression level of a gene was zero, but there were many reads that mapped to the locus. This was a bug that was supposed to have been fixed in v1.2.0. This issue has been brought up in earlier threads, but not resolved clearly :
I first ran the tophat->cufflinks pipeline on my data. I then wrote a shell script to look through the transcripts.gtf file produced by cufflinks for transcripts whose FPKM readout was 0. This gave me a fairly large number of gene_id's, of the order of a few thousand.
I then extracted the exon definitions for each of these gene_ids from the merged transcriptome. I looked for reads falling into these exon loci, and used samtools to pull out the information on all reads that fell into these loci.
Roughly, I found two kinds of instances in which cufflinks reports zero fpkm for a locus when reads map to that locus.
i) Read alignments imply a fragment length much much larger than the exon or transcript definition implies. For example, I found that some tophat alignment reports a template length of 10^6 nt even though the transcript length is about 10^2 nt in length in the supplied gtf file. A lot of microRNAs and lincRNAs fell into this category. When I filtered out all these reads, I found zero reads mapping to these loci. The cufflinks output seemed reasonable here.
ii) This was the category that confused me. For one of the genes whose expression was zero fpkm, I found that 184 reads aligned to one of the exons. This number dropped to 16 when I counted only paired alignments and ignoring supplementary alignments.
Of these 16 alignments, I found that the fragment lengths implied by the alignments were are "reasonable" - 118 nt, 130 nt, etc. The cigar strings were all 101M, and these were the highest scoring alignments for these reads. An example read :
HWI-1KL120:1171C63ACXX:3:1108:3575:87870 99 chr13 21924938 3 101M = 21925038 201 GTGAACGGCTGGTCTTATTTTCCCTTGGCCTTGTGGTGGCTCTCGGTCTTCTTGGGCAGCAGCACGGCCTGGATGTTGGGCAGGACGCCGCCCTGCTCGAT @@@BDEFFHHGH+AABFGGIIJCHBHEGGGGGIGGGFCHICBBD@B;AGHIIIIIIBHHEHHDBBDCDB?@AACAACC?DACDDBBD>@B@>BBABCC?@> AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G4 YT:Z:UU XS:A:- NH:i:2 HI:i:1
I understand that out of millions of reads, 16 reads falling on a gene is more likely to be explained by sequencing and alignment errors than true gene expression. But, the transcripts.gtf file still reported zero coverage across any exon. For example the read above maps to chr13:21924938. But, the transcripts.gtf file states that the coverage of the exon containing this coordinate is zero :
chr13 Cufflinks exon 21924893 21925444 1 - . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao_dup1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
What's weirder is that the read shown above has two possible alignments. One is to chr13:21924938 (alignment score AS:i:-5), and the other is to chr13:21902542 (As:i:0). This second alignment falls into this exon :
chr13 Cufflinks exon 21902236 21902787 1 + . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
Given that the tophat alignment states that the read came from either one of these exons, why is the coverage of *both* exons zero? I'm guessing the multi-read correct option had something to do with it?
Also, oddly, cufflinks does not report the transcript length for any of the transcripts in the genes.fpkm_tracking file. All transcript length entries are a "-". For example :
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
Rp1 - - Rp1 - - chr1:4280926-4399322 - - 0 0 0 OK
Is there something that I am missing on how cufflinks calculates transcript lengths, or am I using the wrong inputs in some way?
My cufflinks command :
cufflinks --no-update-check -p $NCORES -o $OUTPUTDIR $INPUTFILE -G ${GTF[$INDEX]} --multi-read-correct --frag-bias-correct ${GENOME_FA[$INDEX]}
The GTF file I used is from the UCSC mm9 refFlat tables.
Edit : I felt the question was a little too long-winded. Just to be clear :
a) Even though paired reads fall within an exon, cufflinks seems to be reporting zero coverage for that exon.
b) cufflinks does not report the transcript length for any of the genes in my data set.
I have been running a tophat (2.0.11) -> cufflinks (2.2.1) -> cuffnorm/cuffdiff pipeline for looking at some mouse RNA-seq data.
When looking at the output of cufflinks, I found some genes where the expression level of a gene was zero, but there were many reads that mapped to the locus. This was a bug that was supposed to have been fixed in v1.2.0. This issue has been brought up in earlier threads, but not resolved clearly :
I first ran the tophat->cufflinks pipeline on my data. I then wrote a shell script to look through the transcripts.gtf file produced by cufflinks for transcripts whose FPKM readout was 0. This gave me a fairly large number of gene_id's, of the order of a few thousand.
I then extracted the exon definitions for each of these gene_ids from the merged transcriptome. I looked for reads falling into these exon loci, and used samtools to pull out the information on all reads that fell into these loci.
Roughly, I found two kinds of instances in which cufflinks reports zero fpkm for a locus when reads map to that locus.
i) Read alignments imply a fragment length much much larger than the exon or transcript definition implies. For example, I found that some tophat alignment reports a template length of 10^6 nt even though the transcript length is about 10^2 nt in length in the supplied gtf file. A lot of microRNAs and lincRNAs fell into this category. When I filtered out all these reads, I found zero reads mapping to these loci. The cufflinks output seemed reasonable here.
ii) This was the category that confused me. For one of the genes whose expression was zero fpkm, I found that 184 reads aligned to one of the exons. This number dropped to 16 when I counted only paired alignments and ignoring supplementary alignments.
Of these 16 alignments, I found that the fragment lengths implied by the alignments were are "reasonable" - 118 nt, 130 nt, etc. The cigar strings were all 101M, and these were the highest scoring alignments for these reads. An example read :
HWI-1KL120:1171C63ACXX:3:1108:3575:87870 99 chr13 21924938 3 101M = 21925038 201 GTGAACGGCTGGTCTTATTTTCCCTTGGCCTTGTGGTGGCTCTCGGTCTTCTTGGGCAGCAGCACGGCCTGGATGTTGGGCAGGACGCCGCCCTGCTCGAT @@@BDEFFHHGH+AABFGGIIJCHBHEGGGGGIGGGFCHICBBD@B;AGHIIIIIIBHHEHHDBBDCDB?@AACAACC?DACDDBBD>@B@>BBABCC?@> AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G4 YT:Z:UU XS:A:- NH:i:2 HI:i:1
I understand that out of millions of reads, 16 reads falling on a gene is more likely to be explained by sequencing and alignment errors than true gene expression. But, the transcripts.gtf file still reported zero coverage across any exon. For example the read above maps to chr13:21924938. But, the transcripts.gtf file states that the coverage of the exon containing this coordinate is zero :
chr13 Cufflinks exon 21924893 21925444 1 - . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao_dup1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
What's weirder is that the read shown above has two possible alignments. One is to chr13:21924938 (alignment score AS:i:-5), and the other is to chr13:21902542 (As:i:0). This second alignment falls into this exon :
chr13 Cufflinks exon 21902236 21902787 1 + . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
Given that the tophat alignment states that the read came from either one of these exons, why is the coverage of *both* exons zero? I'm guessing the multi-read correct option had something to do with it?
Also, oddly, cufflinks does not report the transcript length for any of the transcripts in the genes.fpkm_tracking file. All transcript length entries are a "-". For example :
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
Rp1 - - Rp1 - - chr1:4280926-4399322 - - 0 0 0 OK
Is there something that I am missing on how cufflinks calculates transcript lengths, or am I using the wrong inputs in some way?
My cufflinks command :
cufflinks --no-update-check -p $NCORES -o $OUTPUTDIR $INPUTFILE -G ${GTF[$INDEX]} --multi-read-correct --frag-bias-correct ${GENOME_FA[$INDEX]}
The GTF file I used is from the UCSC mm9 refFlat tables.
Comment