Hi All,
I've trawled the forums but have not found a complete discussion around this question: For RNA-seq DE analysis in non-model species, where a de novo transcriptome is the only mapping reference available, what's the most legitimate approach for DE testing? Transcript-level or 'gene'-level analysis?
This is my understanding: In most cases, the non-model species community uses Trinity pipelines to assemble a reference transcriptome de novo (typically from the same reads used for downstream DE analysis), then uses RSEM for alignment-based abundance estimation to generate the count tables for downstream DE analysis in whatever software you choose. Obviously, the success of DE analysis hinges on the accuracy of the count data used as input.
There's a choice between using counts at the level of Trinity transcripts (i.e., contigs in the de novo assembly, theoretically equivalent to isoforms) (RSEM.isoforms.results), or at the level of Trinity 'components', which are a proxy for genes (RSEM.genes.results). (Compared to mapping against a genome, there are obvious inaccuracies with assembling genes and isoforms de novo, but it's what we have.)
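To make the transcript-versus-'gene' distinction concrete, here is a minimal sketch of how transcript-level counts could be summed to the component ('gene') level. It assumes IDs follow the newer Trinity naming pattern (e.g. TRINITY_DN1000_c115_g5_i1, where the trailing _i<N> marks the isoform); the function names and the counts are made up for illustration, and RSEM already produces a genes.results file itself, so this only illustrates the idea.

```python
import re
from collections import defaultdict

# Assumption: newer Trinity IDs look like TRINITY_DN1000_c115_g5_i1,
# where the trailing "_i<N>" identifies the isoform within a 'gene'.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id):
    """Strip the isoform suffix to get the Trinity 'gene' identifier."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def sum_to_gene_level(transcript_counts):
    """Sum per-transcript expected counts into per-'gene' totals."""
    gene_counts = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene_counts[gene_id(transcript)] += count
    return dict(gene_counts)


# Hypothetical expected counts for one sample (not real data).
example = {
    "TRINITY_DN1000_c115_g5_i1": 120.0,
    "TRINITY_DN1000_c115_g5_i2": 30.5,
    "TRINITY_DN2041_c0_g1_i1": 7.0,
}

gene_level = sum_to_gene_level(example)
print(gene_level)
```

The two isoforms of the first 'gene' collapse into a single row, which is why 'gene'-level counts are less sensitive to how reads were split between near-identical contigs.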
Obviously, a transcript-level analysis is preferred biologically but tricky in practice.
* I'm aware that transcript-level analysis in the popular edgeR and DESeq2 packages violates key assumptions of these programs. Many people go ahead anyway, and publish such results.
* DEXSeq is recommended for exon-level analysis, but appears to require mapping to a genome.
* Alternatively, the 'gene'-level counts from RSEM can be used in e.g. DESeq2, although this brings its own issues because the Trinity components are only a proxy for genes. Is this nevertheless the most legitimate approach for counts derived from de novo transcriptome mapping?
* I've recently read about the alignment-free, k-mer-based approach of kallisto, with downstream DE analysis in sleuth, which is suitable at the transcript level. Is this new approach perhaps the best yet for non-model species?
Like most, I'm relatively new to RNA-seq and am not a biostatistician. I realise there are issues with all of the above options, but I'm hoping some of the program developers and those with statistical minds can share some advice on what might be the most legitimate approach for non-model species.
Many thanks.