Hi. After mapping an RNA-Seq study with Bowtie2/TopHat2, I ran my data through HTSeq and am now aiming for methods like edgeR/DESeq, which model counts with a negative binomial distribution. I gathered that for differential gene expression this may be more appropriate than Cufflinks/Cuffdiff...?
However, because we only have single-end reads (100 bp, HiSeq), there are a lot of multi-reads (= multi-hits / multi-mapped reads). HTSeq flags these as not uniquely aligned and discards them; I'd like to rescue the multi-reads. The Mortazavi method E-RANGE does this, but it reports RPKM, and I'm unsure whether I could reliably convert that back to raw counts for edgeR/DESeq. I found a newer package (BM-MAP) but have three concerns. (1) How strongly does it depend on the estimate of polymorphism rates? If this is critical, that could be a problem, because I'm not sure I can estimate those rates reliably transcriptome-wide. (2) I'm not sufficiently adept at coding to know how to translate the sam+ file it generates (a SAM file with a mapping probability appended to the end of each line) into something I can put through HTSeq or some other program to generate counts, or to write my own script to do this. (3) Even if I CAN make this work, is it viable, or is the uncertainty introduced by using a probability for multi-mapped reads enough to make the DESeq/edgeR output unreliable?
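To make point (2) concrete, the kind of script I have in mind would look roughly like the sketch below. It rests on assumptions I have not verified: that the mapping probability is the last tab-separated field on each alignment line, and that genes can be looked up from simple intervals. The `GENES` table and the function names `gene_for` and `expected_counts` are hypothetical and are not part of BM-MAP or HTSeq. Each alignment adds its probability to the gene it falls in, so a multi-read is split across its candidate genes.

```python
from collections import defaultdict

# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
# A real script would load this from a GTF/GFF file instead.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 1000, 1500),
}


def gene_for(chrom, pos):
    """Return the gene whose interval contains the alignment start, or None."""
    for gene, (g_chrom, start, end) in GENES.items():
        if chrom == g_chrom and start <= pos <= end:
            return gene
    return None


def expected_counts(sam_plus_lines):
    """Sum per-alignment mapping probabilities into fractional gene counts.

    Assumes the mapping probability is the LAST tab-separated field
    of each alignment line (an assumption about the sam+ layout).
    """
    counts = defaultdict(float)
    for line in sam_plus_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # 11 mandatory SAM fields + probability
            continue
        chrom, pos = fields[2], int(fields[3])
        if chrom == "*":  # unmapped read
            continue
        gene = gene_for(chrom, pos)
        if gene is not None:
            counts[gene] += float(fields[-1])
    return dict(counts)


def make_line(read, chrom, pos, prob):
    """Build a toy sam+ line for the example below."""
    return "\t".join(
        [read, "0", chrom, str(pos), "1", "4M", "*", "0", "0", "ACGT", "IIII", str(prob)]
    )


example = [
    "@HD\tVN:1.0",
    make_line("r1", "chr1", 150, 0.5),   # multi-read, first location
    make_line("r1", "chr1", 1100, 0.5),  # multi-read, second location
    make_line("r2", "chr1", 200, 1.0),   # uniquely mapped read
]
print(expected_counts(example))
```

The resulting counts are fractional, and as far as I understand DESeq expects integer counts, so they would presumably have to be rounded before use, which is part of what worries me in point (3).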
I'm tempted to just run Cuffdiff instead, to avoid losing the information that could be gained by including multi-reads (for downstream analysis it may be enough to know whether a certain gene family is up- or down-regulated, even if we can't pin down the exact gene). Then again, I gather that Cuffdiff may be aimed more at isoform-specific comparisons than at differential gene expression?
If anyone has time, suggestions are very welcome!
~Hilary