Hi there,
I was wondering what levels of Sn/Sp (sensitivity/specificity) people are seeing from Cufflinks output, and what you would consider acceptable.
I am working with 36bp Illumina data and processing using the TopHat/Cufflinks pipeline.
The latest dataset I have run through Cufflinks (v0.9.1) gives the following result in the stats file:
#= Summary for dataset:
# Query mRNAs : 67342 in 66642 loci (21607 multi-exon transcripts)
# (635 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 52287 in 14604 loci (50514 multi-exon)
# Corresponding super-loci: 13723
#--------------------|   Sn |   Sp |  fSn |  fSp
          Base level:   53.5   59.2      -      -
          Exon level:    7.7   16.1   13.4   27.8
        Intron level:   18.6   87.1   18.9   88.3
  Intron chain level:    0.7    1.5    1.3    2.9
    Transcript level:    0.0    0.0    0.0    0.0
         Locus level:    2.2    0.5    3.6    0.8
Missed exons: 110793/219160 ( 50.6%)
Wrong exons: 24548/105391 ( 23.3%)
Missed introns: 133243/179883 ( 74.1%)
Wrong introns: 2273/38400 ( 5.9%)
Missed loci: 0/14604 ( 0.0%)
Wrong loci: 18822/66642 ( 28.2%)
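For reference, the Missed/Wrong percentages appear to be plain ratios of the counts printed next to them. A minimal Python sketch that recomputes them from the numbers above (the names `counts` and `pct` are only illustrative and are not part of Cufflinks):

```python
# Counts copied from the summary above: (label, numerator, denominator).
counts = [
    ("Missed exons", 110793, 219160),
    ("Wrong exons", 24548, 105391),
    ("Missed introns", 133243, 179883),
    ("Wrong introns", 2273, 38400),
    ("Missed loci", 0, 14604),
    ("Wrong loci", 18822, 66642),
]


def pct(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1) if total else 0.0


# Recompute every percentage so it can be compared with the stats file.
rates = {label: pct(part, total) for label, part, total in counts}

for label, value in rates.items():
    print(f"{label:>15}: {value:5.1f}%")
```

The recomputed values agree with the percentages in the stats file, so the denominators are the totals being compared against: reference exons, introns and loci for "Missed", query ones for "Wrong".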
I have seen worse results in previous runs, and I always see 0 for the Transcript level. Is this something I should focus on, or should I rather just look at the % values for missed and wrong exons? (I have seen other posters focus on these.)
Another point: surely I should see a total number of reference loci of around 23,000 (the number of human protein-coding genes)? I have seen a greater number of loci with previous versions of Cufflinks and the exact same Ensembl reference .gtf file.
What are other people seeing for the Ensembl human reference?
Another problem I am having: Cufflinks produces 'u' matches (and no error messages) for every transcript when I use the Homo_sapiens.GRCh37.60.gtf reference file, but it gave good results (a range of match types) when I used Homo_sapiens.GRCh37.55.gtf. Both files have been formatted so that 'chr' is in front of the chromosome number. I can't see any obvious formatting difference between the two files, so I am a bit stumped.
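One way to look for a less obvious difference between the two files is to tally the first three GTF columns (seqname, source, feature) in each and list the values that occur in only one of them. This is a rough sketch, assuming standard tab-separated, 9-column GTF; the helper names `summarise_gtf` and `diff_keys` and the two example lines are made up for illustration and are not taken from either Ensembl file:

```python
from collections import Counter


def summarise_gtf(lines):
    """Tally the seqname, source and feature columns of GTF lines."""
    seqnames, sources, features = Counter(), Counter(), Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        seqnames[fields[0]] += 1
        sources[fields[1]] += 1
        features[fields[2]] += 1
    return seqnames, sources, features


def diff_keys(a, b):
    """Return (values seen only in a, values seen only in b)."""
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


# Hypothetical one-line "files", only to show the mechanics.
old = ['chr1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";']
new = ['1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";']

old_seq, old_src, old_feat = summarise_gtf(old)
new_seq, new_src, new_feat = summarise_gtf(new)

only_old, only_new = diff_keys(old_seq, new_seq)
print("seqnames only in old:", only_old)
print("seqnames only in new:", only_new)
```

With the real files, an open file handle can be passed in place of the lists, e.g. `summarise_gtf(open("Homo_sapiens.GRCh37.55.gtf"))`, and the same `diff_keys` comparison can be repeated for the source and feature counters.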
Cheers