(Cross posted to the Trinity mailing list, but I wanted to see what SEQanswers thought about the problem)
I’m running an RNA-seq experiment using a de novo assembled transcriptome for a non-model organism (a beetle), where we have multiple treatments (diet and sex) and 4 individuals per treatment. Furthermore, we have sequenced 4 different tissues per individual (barcoded separately). I’ve encountered an interesting situation and wanted some suggestions on how to resolve it. After using Trimmomatic to remove adapter sequences (but not to quality trim) and diginorm to normalize, I assembled the transcriptome in two different ways: first, I generated assemblies from each individual specimen, using all the tissue libraries from that individual, with the intention of combining all of the assemblies together. Second, I also pooled all the reads (post diginorm) across all individuals (followed by a second round of diginorm) and then assembled a transcriptome from those reads. I modified the recommendations in the Nature Protocols paper (Haas et al 2013) slightly (see below).
I used Trinity version: trinityrnaseq-r2013-02-25
In both methods of assembly generation I used the commands below, with the only difference being an increase in --JM to 60gb for the pooled assembly.
Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 --left left.fq --right right.fq --min_contig_length 300
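To make the two runs concrete, here is a sketch of both invocations as I ran them, differing only in --JM as noted above (the left/right file names are placeholders for the actual trimmed, normalized read files):

```shell
# Per-individual assembly: all tissue libraries from one individual,
# adapter-trimmed (Trimmomatic) and normalized (diginorm) beforehand
Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 \
    --left indiv1_left.fq --right indiv1_right.fq --min_contig_length 300

# Pooled assembly: reads combined across all individuals, after a
# second round of diginorm; only the memory cap (--JM) is raised
Trinity.pl --seqType fq --JM 60G --min_kmer_cov 2 --CPU 4 \
    --left pooled_left.fq --right pooled_right.fq --min_contig_length 300
```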
When I compared assembly metrics, something stood out to me: each individually assembled transcriptome contained a similar number of components (~16,000 components per library). This is well in line with the number of “genes” in other related beetles (Tribolium and the dung beetles, for instance). It is also quite similar to what we observed from our previous assembly based on 454 sequencing (http://dx.plos.org/10.1371/journal.pone.0088364) for this same species.
However, the assembly that came from the reads pooled across individuals had a remarkably high number of components (~40,000 components!). Clearly this is artificially high, almost certainly due to the degree of polymorphism among individuals. Yet we want a single transcriptome (this transcriptome will be used for mapping reads for differential expression analysis, at least at the gene (well, component) level).
My question is this: what sorts of parameters should I vary when using Trinity to reduce chimeric transcript reconstructions that are likely due to polymorphism? I’m not specifically concerned about alternative transcripts at the moment, just generating a more biologically reasonable set of components that is not inflated due to polymorphism. More specifically, I guess I want to know how to make the component generation and selection process more conservative.
Would running Trinity with the --CuffFly option reduce the number of components generated, or does that only affect the alternatively spliced transcripts? Similarly, do parameters such as --min_per_id_same_path (and related options) only affect alternative splice variants?
Or is it a better idea to run Inchworm with the --jaccard_clip flag?
As I mentioned above, the data I’m using are adapter-trimmed (Trimmomatic), normalized (diginorm), Illumina 50 bp paired-end reads. Thanks in advance!
-Robert