  • Reducing potentially chimeric contigs during assembly

    (Cross-posted to the Trinity mailing list, but I wanted to see what SEQanswers thought about the problem)

    I'm running an RNA-seq experiment using a de novo assembled transcriptome for a non-model organism (a beetle), where we have multiple treatments (diet and sex) and 4 individuals per treatment. Furthermore, we have sequenced 4 different tissues per individual (barcoded separately). I've encountered an interesting situation and wanted some suggestions on how to resolve it. After using Trimmomatic to remove adapter sequences (but not to trim by quality) and diginorm to normalize, I assembled the transcriptome in two different ways. First, I generated assemblies from each individual specimen, using all the tissue libraries from that individual, with the intention of combining all of the libraries together. Second, I also pooled all the reads (post-diginorm) across all individuals (followed by a second round of diginorm) and then assembled a transcriptome from those reads. I modified the recommendations in the Nature Protocols paper (Haas et al. 2013) slightly (see below).

    I used Trinity version: trinityrnaseq-r2013-02-25
    In both methods of assembly generation I used the command below, with the only difference being an increase of --JM to 60G for the pooled assembly.

    Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 --left left.fq --right right.fq --min_contig_length 300

    When I compare assembly metrics, something stood out to me: each individually assembled transcriptome contained a similar number of components (~16,000 per assembly). This is well in line with the number of “genes” in related beetles (Tribolium and the dung beetles, for instance). It is also quite similar to what we observed from our previous assembly based on 454 sequencing (http://dx.plos.org/10.1371/journal.pone.0088364) for this same species.

    However, the assembly that came from the reads pooled across individuals had an incredible number of components (~40,000!). Clearly this is artificially high, almost certainly due to the degree of polymorphism among individuals. Yet we want a single transcriptome (it will be used for mapping reads for differential expression analysis, at least at the gene (well, component) level).

    My question is this: what sorts of parameters should I vary when using Trinity to reduce chimeric transcript reconstructions that are likely due to polymorphism? I'm not specifically concerned about alternative transcripts at the moment, just about generating a more biologically reasonable set of components not inflated by polymorphism. More specifically, I guess I want to know how to make the component generation and selection process more conservative.

    Would running Trinity with the --CuffFly option reduce the number of components generated, or does that only affect the alternatively spliced transcripts? Similarly, do parameters such as --min_per_id_same_path (and related options) affect only alternative splice variants?

    Or is it a better idea to run Inchworm with the --jaccard_clip flag?
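    For concreteness, the two variants I'm considering would look something like this (just a sketch reusing the base command from above; I haven't tested whether either actually helps):

    Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 --left left.fq --right right.fq --min_contig_length 300 --CuffFly
    Trinity.pl --seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 --left left.fq --right right.fq --min_contig_length 300 --jaccard_clip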

    As I mentioned above, the data I'm using are adapter-trimmed (Trimmomatic), normalized (diginorm) Illumina 50 bp paired-end reads. Thanks in advance!

    -Robert

  • #2
    Firstly, I would say 40,000 is not at all an unrealistic number of genes. Your previous assemblies are quite likely to be massive underestimates of the true set of transcripts. 454-based assemblies tend to reconstruct many fewer transcripts, and there is no reason to assemble each sample individually; pooling is much more likely to assemble a higher proportion of true transcripts. I would trust your pooled assembly more than any of the others.

    Secondly, why didn't you do any quality trimming of your reads? You should inspect the read quality distributions with FastQC or a similar tool, as almost all read sets require a bit of quality trimming. There's some argument that very stringent trimming is a bad thing, but no trimming at all will lead to errors causing problematic false isoforms.
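
    For example, a light trim along those lines might look something like this (a sketch only; the file names are placeholders, the jar version is whatever you have installed, and the thresholds should be tuned to what FastQC shows):

    java -jar trimmomatic-0.32.jar PE left.fq right.fq \
        left.paired.fq left.unpaired.fq right.paired.fq right.unpaired.fq \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:36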

    Finally, why do you think polymorphism would cause chimeras? It should cause bubbles in the graph, which will lead to more isoforms, but not more components or chimeras. If you do have chimeras, the best thing to do with them is to split them after the assembly. If you have isoform inflation due to polymorphisms, you can collapse those by clustering with CD-HIT-EST with the identity threshold set to 99%.
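
    Something along these lines should do it (a sketch; Trinity.fasta stands in for your assembly, and the memory/thread settings are arbitrary):

    cd-hit-est -i Trinity.fasta -o Trinity.nr.fasta -c 0.99 -n 10 -M 8000 -T 4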
    Last edited by Blahah404; 04-16-2014, 02:47 AM.



    • #3
      Thanks so much for the comments.

      The reason we think the 40,000 components ("genes") figure is suspect is based on observations from many related beetle species. Both previous transcriptome assemblies and, importantly, reasonably well assembled/annotated genomes like that of Tribolium (the flour beetle, http://www.nature.com/nature/journal...ture06784.html) are consistent with the ~16,000 number for genes.

      As far as we know there has been no whole-genome duplication (or triplication) leading up to the lineage we are studying, and our previous analysis and assembly from 454 data (also across multiple individuals) also had a similar number of genes for this species.

      It is only when we take all of the samples ("pooled", then run through diginorm) and do a Trinity assembly that the problem occurs. While it is possible that arriving at ~16,000 components in each separate individual assembly is a coincidence, this seems unlikely given the vagaries of sequencing depth. Both these a priori considerations and our own recent observations suggest that the problem is specific to the assembly from the pooled data, and thus the most likely culprit is genetic variation among individuals causing reads that differ only by polymorphism to be assembled into multiple components.

      As for your other question (why we only trimmed adapters, not for sequence quality): this is based on some discussions that started with these posts (http://genomebio.org/optimal-trimming/ & http://genomebio.org/is-trimming-is-...al-in-rna-seq/) and this paper (http://www.ncbi.nlm.nih.gov/pubmed/24567737). We plan to go back and do some light quality trimming as a check.



      • #4
        One general comment: worry first about getting the correct gene assemblies before you worry about a proper number of genes. Each of your treatments and tissues is expected to express a somewhat different gene set. You will need to do orthology measures on any final gene collection, but each of your assemblies will have some of the best models for some loci.

        You can find advice and software here on how to best select your beetle genes from multiple transcript assemblies.
        This includes a pine beetle example and other insects, plants, and animals (the pine beetle mRNA-assembled gene set is more ortho-complete than the pine beetle genome-based gene set, where ortho-complete means both more ortholog loci and longer, fuller proteins). If you add more assembly methods, e.g. Oases/Velvet, SOAPdenovo-Trans, and data slices, that will give you the most complete gene set after selecting out the best models for each locus from your input assemblies. I repeatedly find that Velvet/Oases and SOAPdenovo-Trans give more complete gene sets than Trinity, but drawing on them all gives you the most complete set. I typically need to generate several million transcript assemblies to get "just right" accurate gene sets for animals and plants.
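
        As a rough sketch of the pooling step only (the file names are placeholders, and CD-HIT-EST, mentioned earlier in this thread, is just one way to remove near-exact redundancy before model selection; it is not the selection software I refer to above):

        cat trinity.fasta oases.fasta soaptrans.fasta > all_assemblies.fasta
        cd-hit-est -i all_assemblies.fasta -o all_assemblies.nr.fasta -c 0.99 -n 10 -M 16000 -T 8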
