Hi there,
I have some questions regarding CLC Bio vs. Trinity for de novo RNA-seq assembly. When a postdoc in my lab told me how to process my data, he insisted on using CLC Bio with the default settings. I've looked into CLC Bio and I'm not sure this is the correct way to go about things. But before I step away from a published methodology, I have to convince my supervisor too. That's where I hope you can help me.
The data:
1/ Eukaryotic, single-celled algae - dinoflagellates. The phylum is particularly known for bizarre genetic features:
- genome size is 0.5 to 40x that of the human haploid genome
- mRNA is frequently reinserted into the genome, i.e. a minefield of truncated paralogs. They are the hoarders of the genetic world.
- ancient lineage, so they've had a long time to accumulate paralogs. Some rDNA genes have in excess of 2000 copies; most phylogenetic analyses of the order that I work with are rubbish because of this.
- they have a different, still unknown mode of gene regulation that appears to be post-transcriptional, i.e. the mRNA-seq data is massive and gives us a pretty good idea about the genome. We think.
- hence, no reference genomes or even transcriptomes available.
2/ Working with sequencing data from both a public database (MMETSP) and my own work. Some of the former is really quite low quality.
- public: Illumina HiSeq 2000, PE, 50bp inserts
- mine: NextSeq 500, PE, 75bp inserts, HO
- mine, second round of sequencing occurring now: NextSeq 500, PE, 150bp inserts, HO
The Problem:
I've come across someone else's (Lisa Cohen, GitHub - really cool project) use of the publicly available data, assembled with Trinity and then assessed with the same quality control tool that I had run - BUSCO (looks for single-copy genes via HMMER libraries, successor of CEGMA). So I have a direct comparison point between the BUSCO scores of my CLC Bio assemblies and her Trinity assemblies built from the same RNA-seq libraries. Hers are better across the board for single-copy hits. For some transcriptomes the difference is only 2 genes, but in one or two transcriptomes it is 50 single-copy genes out of the 450 tested.
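To put those gaps in proportion, here is a quick sketch of the arithmetic. The function name `busco_gap_percent` is made up for illustration; the numbers (2, 50 and 450) are the ones quoted above, not new results.

```python
def busco_gap_percent(extra_single_copy: int, total_buscos: int) -> float:
    """Gap in single-copy BUSCO hits, as a percentage of the BUSCO set tested."""
    if total_buscos <= 0:
        raise ValueError("total_buscos must be positive")
    return 100.0 * extra_single_copy / total_buscos


TOTAL_BUSCOS = 450  # size of the BUSCO set tested, as stated in the post

# Smallest and largest gaps between the Trinity and CLC Bio assemblies
for gap in (2, 50):
    pct = busco_gap_percent(gap, TOTAL_BUSCOS)
    print(f"{gap} of {TOTAL_BUSCOS} single-copy genes = {pct:.1f}% of the set")
```

So the small gaps are well under 1% of the set, while the largest is roughly a ninth of it.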
The questions:
- what is the general knowledge/feeling about CLC Bio and Trinity? Preferences or horror stories?
- is either of the assemblers known for making mistakes?
- more directly, is either of them prone to misassembling paralogs - if one gives me more single-copy genes, is that a 'true' result or are they actually a mash-up of paralogs?
Thanks, y'all!