Hi,
I'm new to RNASeq but I've successfully managed to run through a tophat/cufflinks workflow, however I'm getting some results which I find slightly difficult to believe as well as some general oddness.
I was hoping to run it past people who have more experience with cufflinks and might be able to spot any mistakes I'm making.
Our intention is to compare our treatment (gene knockdown using siRNA - s_6_sequence.txt) vs. control (s_7_sequence.txt).
First, what I've done:
Aligned single ended reads with tophat
Downloaded Ensembl Homo_sapiens NCBI36 release 51 GTF file and changed chromosomes to match hg18 format
From: ftp://ftp.ensembl.org/pub/release-51...BI36.51.gtf.gz
Run Cufflinks
Cuffcompare
Cuffdiff
I wanted to be able to view the data myself so I created bigWig and bigBed files that I could use on the UCSC genome browser from the coverage.wig and junctions.bed files created for each of the lanes.
For some reason the coverage.wig files occasionally have an entry at the end of a chromosome where the start and end positions are identical (which would imply a length of 0) which I take to be a bug so my bigWig and bigBed files are generated as below:
Using:
cufflinks v0.8.2
TopHat v1.0.13
bowtie version 0.12.5
So, I believe this is roughly a correct workflow for this data. Now, for my questions.
Locus Annotation
Looking at 0_1_gene_expr.diff, it seems that quite a few loci aren't annotated correctly even though they are sitting right on top of a gene and several are wrongly annotated.
One example is a locus found at chr16:2017016-2029027. This is annotated as NTHL1 however it should be SLC9A3R2 which is at that position.
NTHL1 is however just next door at chr16:2,029,817-2,037,868 so I thought perhaps SLC9A3R2 is just missing from the annotation and it just picked the closest one, however this is not the case as SLC9A3R2 is there and in the correct place.
There are also a large number of loci (~2700) that don't have any annotated transcript from the reference at all despite many of them being right on top of a single gene.
Is this just what I should expect or is there something wrong with my GTF annotation or parameters?
Weird significance
Some loci seem to be classed as significantly different despite having a very small number of reads in them.
i.e. the locus:
Looks like this:
which by eye I'd have called as a NOTEST. However there are many genes with loads of coverage which are being called as NOTEST while this isn't.
Why do I get loci with very few reads being tested but loci with large numbers of reads as NOTEST?
Weird expression values
I've got some really really really high fold change differences between genes.
Take for example
A ln(fold change) of 19 ! Surely that'd be a regular fold change of about 178 million? Am I misunderstanding that?
Obviously coverage isn't directly equatable with FPKM, but still, I'm hard pressed to believe there is such a huge fold change on that gene as cufflinks reports.
Why do I get huge fold changes for loci? Presumably it occurs because one of the samples is being assigned a FPKM of 7.55E-009 when in reality it should be much higher than this, why would this be the case?
Summary / TLDR
I'm obviously doing something a bit suspect and getting weird results for both annotation, fold change and significance. It probably stems from one or more error I've made somewhere along the way and was hoping someone could provide a second set of eyes to spot it. Any help would be appreciated.
I'm new to RNASeq but I've successfully managed to run through a tophat/cufflinks workflow, however I'm getting some results which I find slightly difficult to believe as well as some general oddness.
I was hoping to run it past people who have more experience with cufflinks and might be able to spot any mistakes I'm making.
Our intention is to compare our treatment (gene knockdown using siRNA - s_6_sequence.txt) vs. control (s_7_sequence.txt).
First, what I've done:
Aligned single ended reads with tophat
tophat --solexa1.3-quals -p 14 -o s_6_sequence hg18 s_6_sequence.txt
tophat --solexa1.3-quals -p 14 -o s_7_sequence hg18 s_7_sequence.txt
tophat --solexa1.3-quals -p 14 -o s_7_sequence hg18 s_7_sequence.txt
From: ftp://ftp.ensembl.org/pub/release-51...BI36.51.gtf.gz
cat Homo_sapiens.NCBI36.51.gtf | sed "s/\(.*\)/chr\1/" | sed "s/chrMT/chrM/" > Homo_sapiens.NCBI36.51.fixed.gtf
cd s_6_sequence
cufflinks -p 14 accepted_hits.sam
cd ..
cd s_7_sequence
cufflinks -p 14 accepted_hits.sam
cufflinks -p 14 accepted_hits.sam
cd ..
cd s_7_sequence
cufflinks -p 14 accepted_hits.sam
cuffcompare -r Homo_sapiens.NCBI36.51.fixed.gtf -R -o summarystats.txt s_6_sequence/transcripts.gtf s_7_sequence/transcripts.gtf
cuffdiff -p 14 summarystats.combined.gtf s_6_sequence/accepted_hits.sam s_7_sequence/accepted_hits.sam
For some reason the coverage.wig files occasionally have an entry at the end of a chromosome where the start and end positions are identical (which would imply a length of 0) which I take to be a bug so my bigWig and bigBed files are generated as below:
cat coverage.wig | awk 'FNR > 1' | awk '{if ($2==$3){$3 = $3 +1}; print $1 "\t" $2 "\t" $3 "\t" $4}' > coverage.bedGraph
UCSC/bedGraphToBigWig coverage.bedGraph chrmSizes.hg18 coverage.bigWig
rm coverage.bedGraph
UCSC/bedGraphToBigWig coverage.bedGraph chrmSizes.hg18 coverage.bigWig
rm coverage.bedGraph
cat junctions.bed | awk 'FNR > 1' | sort -k1,1 -k2,2n > junctions.bedTrim
/UCSC/bedToBigBed junctions.bedTrim chrmSizes.hg18 junctions.bigBed
rm junctions.bedTrim
/UCSC/bedToBigBed junctions.bedTrim chrmSizes.hg18 junctions.bigBed
rm junctions.bedTrim
cufflinks v0.8.2
TopHat v1.0.13
bowtie version 0.12.5
So, I believe this is roughly a correct workflow for this data. Now, for my questions.
Locus Annotation
Looking at 0_1_gene_expr.diff, it seems that quite a few loci aren't annotated correctly even though they are sitting right on top of a gene and several are wrongly annotated.
One example is a locus found at chr16:2017016-2029027. This is annotated as NTHL1 however it should be SLC9A3R2 which is at that position.
NTHL1 is however just next door at chr16:2,029,817-2,037,868 so I thought perhaps SLC9A3R2 is just missing from the annotation and it just picked the closest one, however this is not the case as SLC9A3R2 is there and in the correct place.
$> grep SLC9A3R2 Homo_sapiens.NCBI36.51.fixed.gtf
chr16 protein_coding exon 2016924 2017220 . + . gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
chr16 protein_coding CDS 2017008 2017220 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201"; protein_id "ENSP00000191922";
chr16 protein_coding start_codon 2017008 2017010 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
chr16 protein_coding exon 2019584 2019784 . + . gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "2"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
chr16 protein_coding CDS 2019584 2019784 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "2"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201"; protein_id "ENSP00000191922";
... etc
chr16 protein_coding exon 2016924 2017220 . + . gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
chr16 protein_coding CDS 2017008 2017220 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201"; protein_id "ENSP00000191922";
chr16 protein_coding start_codon 2017008 2017010 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "1"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
chr16 protein_coding exon 2019584 2019784 . + . gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "2"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201";
chr16 protein_coding CDS 2019584 2019784 . + 0 gene_id "ENSG00000065054"; transcript_id "ENST00000191922"; exon_number "2"; gene_name "SLC9A3R2"; transcript_name "SLC9A3R2-201"; protein_id "ENSP00000191922";
... etc
Is this just what I should expect or is there something wrong with my GTF annotation or parameters?
Weird significance
Some loci seem to be classed as significantly different despite having a very small number of reads in them.
i.e. the locus:
locus status value_1 value_2 log(fold_change) test_stat p_value significant
chr1:100503947-100504182 OK 0.728869 0 0 0 0 yes
chr1:100503947-100504182 OK 0.728869 0 0 0 0 yes
which by eye I'd have called as a NOTEST. However there are many genes with loads of coverage which are being called as NOTEST while this isn't.
gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
RPL7 chr8:74367477-74368779 NOTEST 208.194 206.278 -0.00924444 -0.00573799 0.995422 no
RPL7 chr8:74367477-74368779 NOTEST 208.194 206.278 -0.00924444 -0.00573799 0.995422 no
Why do I get loci with very few reads being tested but loci with large numbers of reads as NOTEST?
Weird expression values
I've got some really really really high fold change differences between genes.
Take for example
gene locus status value_1 value_2 log(fold_change) test_stat p_value significant
CCDC36 chr3:49108372-49117225 OK 2.33E-008 5.91318 19.3523 2546.34 0 yes
CCDC36 chr3:49108372-49117225 OK 2.33E-008 5.91318 19.3523 2546.34 0 yes
A ln(fold change) of 19 ! Surely that'd be a regular fold change of about 178 million? Am I misunderstanding that?
Obviously coverage isn't directly equatable with FPKM, but still, I'm hard pressed to believe there is such a huge fold change on that gene as cufflinks reports.
Why do I get huge fold changes for loci? Presumably it occurs because one of the samples is being assigned a FPKM of 7.55E-009 when in reality it should be much higher than this, why would this be the case?
Summary / TLDR
I'm obviously doing something a bit suspect and getting weird results for both annotation, fold change and significance. It probably stems from one or more error I've made somewhere along the way and was hoping someone could provide a second set of eyes to spot it. Any help would be appreciated.
Comment