Hi all,
I will be performing RNAseq transcriptome analysis on a certain organism under a specified number of conditions (let's say 10). The goal is to construct a tab-delimited file which contains the expression values (the raw read counts, not RPKM/FPKM values) for each gene under all conditions.
I am the first one in the lab to perform such an experiment and we don't have a standard workflow developed. Therefore I would ask you, the community, to review what I sketched so far and respond to my questions if possible!
My design so far:
1) generate RNAseq data
2) preprocessing the data: FASTX-toolkit (quality check, trimming, clipping, filtering)
3) aligning the reads -> TopHat (SAM format output)
4) repeat steps 1-3 for each condition
5) construct the tab-delimited read count file
6) further analysis
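To make step 5 concrete, here is a minimal sketch of how the final file could be assembled once per-gene counts exist for each condition. The function names and the input layout (a dict of condition -> gene -> count) are my own assumptions for illustration, not part of any existing tool.

```python
import csv
import io


def build_count_matrix(counts_by_condition):
    """Combine per-condition counts into rows of a gene x condition table.

    counts_by_condition: dict mapping condition name -> {gene_id: raw count}
    (hypothetical layout). Genes missing from a condition get a count of 0.
    """
    conditions = sorted(counts_by_condition)
    genes = sorted({g for c in counts_by_condition.values() for g in c})
    rows = [["gene"] + conditions]
    for gene in genes:
        rows.append([gene] + [counts_by_condition[c].get(gene, 0) for c in conditions])
    return rows


def write_tsv(rows, handle):
    """Write the rows as tab-delimited text to an open file handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    example = {"cond1": {"geneA": 5}, "cond2": {"geneA": 2, "geneB": 7}}
    buffer = io.StringIO()
    write_tsv(build_count_matrix(example), buffer)
    print(buffer.getvalue(), end="")
```

With 10 conditions the dict would simply have 10 entries, giving one header row and one row per gene.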
Questions:
a) What to do with isoforms? Should I take them into consideration (using Cufflinks or similar), or not? My organism has very few introns, so I expect to see few alternative isoforms. Nevertheless, any isoform information is valuable.
b) How to tackle multireads? As far as I understand it, TopHat does not redistribute multireads the way ERANGE does...
c) How to get to raw read counts? TopHat reports RPKM values, but I need raw read counts. Could I use some sort of comparison script that uses my annotation files to compute a read count for each gene? (I think BEDtools can do this.)
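To illustrate what I mean in (b) and (c), here is a rough sketch of such a comparison script. It rests on several assumptions: the SAM records carry the standard optional `NH:i:` tag (number of reported alignments) so multireads can be recognised, a read is assigned by its leftmost mapped position only, and strand, spliced alignments and paired-end mates are ignored. The function name and the gene tuple layout are hypothetical.

```python
def count_reads(sam_lines, genes, keep_multireads=False):
    """Count reads whose leftmost mapped position falls inside a gene.

    sam_lines: iterable of SAM text lines (header lines start with '@').
    genes: list of (gene_id, chrom, start, end), 1-based inclusive coordinates.
    keep_multireads: if False, reads with NH > 1 are skipped entirely.
    """
    counts = {gene_id: 0 for gene_id, _, _, _ in genes}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        chrom, pos = fields[2], int(fields[3])
        hits = 1  # assume a unique hit when no NH tag is present
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                hits = int(tag[5:])
        if hits > 1 and not keep_multireads:
            continue  # simplest policy: drop multireads
        for gene_id, g_chrom, start, end in genes:
            if chrom == g_chrom and start <= pos <= end:
                counts[gene_id] += 1
    return counts
```

Dropping multireads is only the simplest policy; with `keep_multireads=True` every reported alignment is counted once, which overcounts rather than redistributes. The linear scan over genes is also slow for large annotations, so a dedicated tool would be more robust for real data.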
If anybody has a better suggestion for a workflow and/or possible answers to my questions, please post them here.
Thanks!