I am starting out analyzing unpaired Illumina RNA-seq reads to determine differential expression between two samples (and eventually want to look at alternative splicing).
From all my reading so far, it seems these are the steps to follow, and there are TONS of programs for each step. There are also integrated software packages out there which will supposedly do all of this for you.
The point of this thread is to find out (1) Are these steps right? (2) Which programs work best for all of you? I am asking the masters out there what worked for them, and to fill in the gaps.
Steps:
(1) Obtain a data set, preferably a small one that your computer and brain can handle. Tons of data are available now in the NCBI Sequence Read Archive (SRA).
(2) Align it to the reference genome (I am using Bowtie; I know there are many other aligners). Tweak the parameters to get the maximum number of reads aligned. Understand what is happening to the unmapped reads as well as the mapped reads. Get familiar with the output. Ultimately, get it into a format you understand. I am using SAM format. This step is critical, I think.
(3) Since I am interested in alternative splicing, I plan on using TopHat (which uses Bowtie) to align reads to known junctions. In the future, I plan on providing my own set of junctions.
(4) Determine differential expression. Which program is best for this: Cufflinks, DEGseq? I understand that this is the step where you perform any normalization and apply the more complex statistics (a Poisson model or a likelihood calculation) to decide whether a gene is differentially expressed between two samples. Which program has worked well for you, and what are the pros and cons?
(5) Use a visualization tool to look at your data. You can also do this after step (2).
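To make step (2) a bit more concrete, here is a minimal Python sketch of tallying mapped versus unmapped records in SAM output. It relies only on two facts from the SAM format: header lines start with "@", and bit 0x4 of the FLAG column (the second tab-separated field) marks an unmapped read. The function name count_mapped and the example records are made up for illustration; they are not real aligner output.

```python
def count_mapped(sam_lines):
    """Tally mapped and unmapped alignment records from SAM-formatted lines."""
    counts = {"mapped": 0, "unmapped": 0}
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second column
        if flag & 0x4:         # bit 0x4 set means the read is unmapped
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts


# Hypothetical records, invented for this example.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t200\t255\t4M\t*\t0\t0\tTTGA\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]
print(count_mapped(example))  # {'mapped': 2, 'unmapped': 1}
```

Note that this counts alignment records, not unique reads: if the aligner reports several alignments for one read, that read is counted more than once. For a real file, you could pass an open file handle instead of the list.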
So, does this sound right? What are the challenges in analyzing RNA-seq data, besides choosing from a large number of options? I am most interested in hearing from people who have successfully determined differential expression of genes: which step requires caution, which is time consuming, and wherein lies the challenge?
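As a rough illustration of what the normalization and Poisson statistics in step (4) might look like for a single gene in two samples, here is a toy Python sketch. It assumes each count is Poisson distributed; under the null hypothesis of equal expression, the count in sample A, given the combined count, is then binomial with probability equal to sample A's share of the total mapped reads. This is not a description of what Cufflinks or DEGseq do internally, the function names are hypothetical, and the numbers are invented. It ignores replicates and overdispersion, and it is only practical for small counts.

```python
from math import comb


def binom_two_sided_p(x, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed x."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    threshold = pmf[x] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(v for v in pmf if v <= threshold))


def compare_gene(count_a, count_b, total_a, total_b):
    """Return reads-per-million in each sample and a p-value for one gene.

    count_a, count_b: reads assigned to the gene in samples A and B.
    total_a, total_b: total mapped reads in each sample (library sizes).
    """
    # Library-size normalization: reads per million mapped reads.
    rpm_a = count_a * 1e6 / total_a
    rpm_b = count_b * 1e6 / total_b
    # Under the null, count_a | (count_a + count_b) ~ Binomial(n, p_null).
    n = count_a + count_b
    p_null = total_a / (total_a + total_b)
    return rpm_a, rpm_b, binom_two_sided_p(count_a, n, p_null)


# Invented numbers: 30 vs 10 reads, one million mapped reads per sample.
rpm_a, rpm_b, p_value = compare_gene(30, 10, 1_000_000, 1_000_000)
print(rpm_a, rpm_b, p_value)
```

With thousands of genes tested at once, the p-values would also need a multiple-testing correction, which this sketch leaves out.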