Hello all. I'm starting off here with something fairly complex, I guess.
I'm looking for some advice because I'm starting a new project and I haven't done this sort of thing before.
I've got a few samples' worth of RNA-seq reads and I'd like to generate expression information from them. They are human, but likely to contain sequences that are not in existing reference transcriptomes. The read depth is also not very high, so I'm hesitant to just use the genome as a reference when generating expression data.
My plan right now is to map them to the current version of the human genome using STAR, then to use the alignments generated to produce a fasta file that has consensus sequences for everything aligned.
From there I can use Sailfish or Salmon to get read counts, RPKM, etc., and compare my samples using some kind of differential expression analysis in R. This part I'm solid on; it's the middle step that I'm not so sure about.
Does anyone think that generating a reference transcriptome this way is inadvisable (and why)?
If this sounds reasonable, what do you think is the best way to do so? I see that Trinity has a genome-guided transcriptome assembly option, and I'd like to try that out. Failing that, I also see that there are ways to get just the aligned portions of a .bam file in .fasta format. That seems like a relatively convoluted process, though, and I'd prefer to keep things simple where possible.
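To make the plan concrete, here is roughly the sequence of commands I have in mind for one paired-end sample, written as a small Python helper that only builds and prints the commands (it doesn't run anything). Treat it as an untested sketch: the file and directory names, thread count, memory limit, and max intron size are placeholders I made up, I'm assuming gzipped FASTQ input and an already-built STAR index, and the `Trinity-GG.fasta` output name is what I understand genome-guided Trinity to write, so check that against your version.

```python
import shlex


def build_pipeline(sample, r1, r2, star_index="star_index", threads=8):
    """Return the commands (as argument lists) for one paired-end sample.

    All paths and resource values here are placeholders, not recommendations.
    """
    star_prefix = f"{sample}_star/"  # this directory must exist before STAR runs
    # STAR names the coordinate-sorted BAM like this when given the prefix above.
    bam = f"{star_prefix}Aligned.sortedByCoord.out.bam"
    trinity_dir = f"{sample}_trinity"
    # Assumed name of the genome-guided assembly; verify for your Trinity version.
    transcripts = f"{trinity_dir}/Trinity-GG.fasta"
    salmon_index = f"{sample}_salmon_index"

    return [
        # 1. Align reads to the genome; Trinity needs a coordinate-sorted BAM.
        ["STAR", "--runThreadN", str(threads),
         "--genomeDir", star_index,
         "--readFilesIn", r1, r2,
         "--readFilesCommand", "zcat",  # assumes gzipped FASTQ
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", star_prefix],
        # 2. Genome-guided assembly from the alignments.
        ["Trinity", "--genome_guided_bam", bam,
         "--genome_guided_max_intron", "10000",
         "--max_memory", "32G", "--CPU", str(threads),
         "--output", trinity_dir],
        # 3. Index the assembled transcripts and quantify the original reads.
        ["salmon", "index", "-t", transcripts, "-i", salmon_index],
        ["salmon", "quant", "-i", salmon_index, "-l", "A",
         "-1", r1, "-2", r2, "-o", f"{sample}_salmon"],
    ]


for cmd in build_pipeline("sample1", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz"):
    print(shlex.join(cmd))
```

The printed lines could be pasted into a shell script once the placeholders are replaced with real paths.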
One last thing: I'm not sure how to preserve annotations with either of these options, so I'm open to advice on how to do that whichever route I end up taking.
Thanks in advance!