(Cross posted to the Trinity mailing list, but I wanted to see what SEQanswers thought about the problem)
I’m running an RNA-seq experiment using a de novo assembled transcriptome for a non-model organism (a beetle), where we have multiple treatments (diet and sex) and 4 individuals per treatment. Furthermore, we have sequenced 4 different tissues per individual (barcoded separately). I’ve encountered an interesting situation and wanted some suggestions on how to resolve it. After using Trimmomatic to remove adapter sequences (but not to quality trim) and diginorm to normalize, I assembled the transcriptome in two different ways: first, I generated assemblies from each individual specimen, using all the tissue libraries from that individual, with the intention of combining all of the assemblies together. Second, I also pooled all the reads (post diginorm) across all individuals (followed by a second round of diginorm) and then assembled a transcriptome from those reads. I modified the recommendations in the Nature Protocols paper (Haas et al 2013) slightly (see below).
I used Trinity version: trinityrnaseq-r2013-02-25
In both methods of assembly generation I used the commands below, with the only difference being an increase in --JM to 60gb for the pooled assembly.
Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 --left left.fq --right right.fq --min_contig_length 300
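To make the two runs concrete, here is a sketch of both invocations as I ran them, differing only in --JM as noted above (the left/right file names are placeholders for the actual trimmed, normalized read files):

```shell
# Per-individual assembly: all tissue libraries from one individual,
# adapter-trimmed (Trimmomatic) and normalized (diginorm) beforehand
Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 \
    --left indiv1_left.fq --right indiv1_right.fq --min_contig_length 300

# Pooled assembly: reads combined across all individuals, after a
# second round of diginorm; only the memory cap (--JM) is raised
Trinity.pl --seqType fq --JM 60G --min_kmer_cov 2 --CPU 4 \
    --left pooled_left.fq --right pooled_right.fq --min_contig_length 300
```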
When I compared assembly metrics, something stood out to me: each individually assembled transcriptome contained a similar number of components (~16,000 components per library). This is well in line with the number of “genes” in other related beetles (Tribolium and the dung beetles, for instance). It is also quite similar to what we observed from our previous assembly based on 454 sequencing (http://dx.plos.org/10.1371/journal.pone.0088364) for this same species.
However, the assembly that came from the reads pooled across individuals had a remarkably high number of components (~40,000 components!). Clearly this is artificially high, almost certainly due to the degree of polymorphism among individuals. Yet we want a single transcriptome (this transcriptome will be used for mapping reads for differential expression analysis, at least at the gene (well, component) level).
My question is this: what sorts of parameters should I vary when using Trinity to reduce chimeric transcript reconstructions that are likely due to polymorphism? I’m not specifically concerned about alternative transcripts at the moment, just generating a more biologically reasonable set of components that is not inflated due to polymorphism. More specifically, I guess I want to know how to make the component generation and selection process more conservative.
Would running Trinity with the --CuffFly option reduce the number of components generated, or does that only affect the alternatively spliced transcripts? Similarly, do parameters such as --min_per_id_same_path (and related options) only affect alternative splice variants?
Or is it a better idea to run Inchworm with the --jaccard_clip flag?
As I mentioned above, the data I’m using are adapter-trimmed (Trimmomatic), normalized (diginorm), Illumina 50 bp paired-end reads. Thanks in advance!
-Robert