Hi all,

I am very new to RNA-seq, and while I know Perl and some R, I am not exactly a computer wizard, so please bear with me; it's probably something stupid.

We have paired-end RNA-seq data generated from a mouse tissue on an Illumina HiSeq 2000: 50 bp reads, ~180M reads for each of the 4 conditions (both ends).
We want to do several things. The first is to identify and quantify expressed isoforms (preferably finding new ones as well) and to call differential expression of genes between the conditions. Because of some size/memory constraints, we run each lane as several files and merge the Cufflinks assemblies at the end (is that OK?).
These are the commands we used:
tophat command:
Code:
tophat-1.3.0.Linux_x86_64/tophat -r 50 .../data/all .../read1_X ..../read2_0
cufflinks command:
Code:
cufflinks-1.0.3.Linux_x86_64/cufflinks -g ..../mm9_refGene ..../accepted_hits.bam
cuffmerge command:
Code:
cuffmerge -s ..../all.fa -g ..../mm9_refGene assemblies.txt
1. After running TopHat + Cufflinks we get very low FPKM values (from 4.96066e-324 up to ~32), with FPKM_lo and FPKM_hi being 0 for all of them. This makes no sense to me, but maybe I am absolutely wrong? Can that happen if the TopHat insert size is not accurate? I am asking because we first ran TopHat with -r 200, which was too large, and all insert sizes in the SAM files were 0. We then reran using a newer version of TopHat (1.3.0) with -r 50 (which is smaller than the true value) and used the insert size column to estimate the parameter (which seems to be ~120); this is still being processed.
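To make the insert size estimate in question 1 concrete, here is a minimal sketch in Python (not the code we actually ran). It assumes a plain-text SAM file, takes the observed template length from column 9 (TLEN) of properly paired reads, and subtracts both read lengths to get an inner distance. The function name and the 50 bp default are only illustrative, and spliced alignments will inflate TLEN, which is why the median is used rather than the mean.

```python
import statistics

def estimate_inner_distance(sam_lines, read_length=50):
    """Return (median fragment length, median inner distance) from SAM records.

    Hypothetical helper: assumes TLEN (column 9) is the full fragment length
    and that both mates have the same length.
    """
    fragment_lengths = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        # Keep properly paired reads (flag bit 0x2); TLEN > 0 counts each pair once.
        if flag & 0x2 and tlen > 0:
            fragment_lengths.append(tlen)
    if not fragment_lengths:
        return None
    median_fragment = statistics.median(fragment_lengths)
    return median_fragment, median_fragment - 2 * read_length
```

As far as I understand the TopHat manual, -r is the expected inner distance between the mates, so if the ~120 above is a fragment length, 50 bp reads would give -r 20; whether ~120 is a fragment length or already an inner distance is part of what I am unsure about.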
2. Is there a simple way to get summary data on how many known genes are expressed, and how many known and new isoforms of these genes were identified? Are there novel transcripts (not from known genes), and if so, how many? Are there confidence criteria for these expression values?
3. Also, at first we did something much more simple-minded: we used a different aligner (mrFAST/mrsFAST) to map the reads to the mouse genome, without using the pairing info, and calculated RPKM values. These bear absolutely no relation to the FPKM values from Cufflinks (which we suspect are not right anyway).
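For clarity, by RPKM in question 3 I mean the usual definition: reads mapped to a gene, per kilobase of gene length, per million mapped reads. A minimal sketch in Python (the function name and the numbers in the example are made up for illustration, not taken from our data):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Illustrative numbers only: 500 reads on a 2 kb gene, 10M mapped reads in total.
example_value = rpkm(500, 2000, 10_000_000)
```

FPKM counts fragments (read pairs) instead of single reads, but since both the per-gene count and the total scale in roughly the same way, I would have expected the two measures to be at least on a similar scale.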
Thanks in advance for the help

Yehudit