  • upendra_35
    Senior Member
    • Apr 2010
    • 102

    Trinity transcriptome assembly

    After the Trinity assembler finished, I calculated the basic statistics of the assembly, shown below:

    File           Number  Total Size  Min Size  Max Size  Average Size  Median Size  N50
    Trinity.fasta  158863  176660784   201       22887     1112.03       665          1863

    Size@1Mbp  Number@1Mbp  Size@2Mbp  Number@2Mbp  Size@4Mbp  Number@4Mbp  Size@10Mbp  Number@10Mbp
    11440      65           8461       170          7088       430          5424        1417

    Now my question is: do these values look reasonable? Although the N50 looks good, I am worried that transcripts shorter than 1 kb make up about 60% of all transcripts. Is this normal for Trinity?

    Also, how do people normally do downstream analysis after the assembly to select the best transcripts? I ask because the number of transcripts is far higher than the expected number of genes in related species.

    Thanks.
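For anyone wanting to recompute figures like the N50, the median, or the share of transcripts under 1 kb, here is a minimal Python sketch. It is not the tool used to produce the table above; the function names and the toy input are made up for illustration, and N50 is taken as the contig length at which the length-sorted contigs first cover half of all bases.

```python
from statistics import median


def read_fasta_lengths(lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def fraction_shorter_than(lengths, cutoff):
    """Fraction of sequences strictly shorter than `cutoff` bases."""
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


# Toy input standing in for a real file; with a real assembly you would use
# `with open("Trinity.fasta") as fh: lengths = read_fasta_lengths(fh)`.
toy = [">a", "ACGT" * 3, ">b", "AC", "GT", ">c", "A" * 6]
lengths = read_fasta_lengths(toy)
print(len(lengths), sum(lengths), median(lengths), n50(lengths))
print(fraction_shorter_than(lengths, 1000))
```

A median of 665 bp already implies that more than half of the transcripts are below 1 kb, so the ~60% figure is consistent with the table.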
  • mbayer
    Member
    • Mar 2009
    • 31

    #2
    Hi,

    Personally, I think that looks reasonable, assuming you have a eukaryotic organism -- the average gene length in eukaryotes is supposed to be in the 1,500 bp region. What organism is this, and do you know how common alternative splicing is in that species? That would obviously affect your number of transcripts relative to the number of genes. Also, given that most alternative splicing produces transcripts that are shorter than the full-length mRNA, an average transcript length of 1,112 bp seems reasonable.

    To evaluate your assembly, I would run something on the transcripts that predicts proteins, like getorf from the EMBOSS suite of tools, then select the longest predicted protein and BLAST it against related species. This will give you an idea of how good your assembly is.
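To make the "predict ORFs, keep the longest" idea concrete, here is a rough Python sketch. It is not how getorf is implemented and does not reproduce its options; the function names are hypothetical, and an ORF is assumed here to be any run of codons between two stop codons (or a sequence end) in any of the six reading frames, so partial ORFs are counted too.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return (codons, strand, frame) for the longest stop-free codon run.

    All six reading frames are scanned. Ties keep the first run found
    (forward strand before reverse, frame 0 before frames 1 and 2).
    """
    seq = seq.upper()
    best = (0, "+", 0)
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run = 0  # codons seen since the last stop codon
            for i in range(frame, len(strand_seq) - 2, 3):
                if strand_seq[i:i + 3] in STOP_CODONS:
                    run = 0
                else:
                    run += 1
                    if run > best[0]:
                        best = (run, strand, frame)
    return best


codons, strand, frame = longest_orf("ATGGCCGCCGCCTAA")
print(codons, strand, frame)
```

The codon count equals the length of the predicted peptide in amino acids, so a length filter is just a comparison against `longest_orf(seq)[0]`. The selected peptides would then be searched against proteins from related species with BLAST, which this sketch does not attempt.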

    cheers

    Micha


    • upendra_35
      Senior Member
      • Apr 2010
      • 102

      #3
      Hi mbayer

      Thanks a lot for your response.

      The organism I am working on is Brassica rapa, a plant very close to the model plant Arabidopsis. When I got my final Trinity output, I didn't realize that it includes alternatively spliced transcripts. Anyway, over the last few days I have learnt a lot about downstream analysis, and this is what I plan to do to get the final (i.e. best) transcripts:

      1. Expression-based: after running the abundance estimation (bowtie-express), retain those that have some minimum FPKM value (such as 1).

      2. Run the ORF extraction pipeline included in Trinity (don't restrict it to complete ORFs; get both complete and partial ones) and retain those that encode long ORFs (e.g. 200 aa).

      3. blastx the Trinity transcripts against RefSeq and retain those that have homology to known proteins (E<=1e-10).

      Take the union of {1,2,3} above and call it 'best'.
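As a sketch of how the three filters combine, the snippet below builds one ID set per criterion and takes their union. The transcript IDs and values are invented for illustration, and `passing_ids` is a hypothetical helper; the thresholds are the ones listed above (FPKM >= 1, ORF >= 200 aa, E <= 1e-10).

```python
def passing_ids(values, keep):
    """Return the IDs whose value satisfies the predicate `keep`."""
    return {seq_id for seq_id, value in values.items() if keep(value)}


# Invented per-transcript measurements, for illustration only.
fpkm = {"t1": 5.2, "t2": 0.3, "t3": 0.0, "t4": 12.0}
orf_aa = {"t1": 350, "t2": 220, "t3": 80, "t4": 150}
best_evalue = {"t1": 1e-50, "t3": 1e-12}  # transcripts without a hit are absent

expressed = passing_ids(fpkm, lambda v: v >= 1)
long_orf = passing_ids(orf_aa, lambda v: v >= 200)
homolog = passing_ids(best_evalue, lambda v: v <= 1e-10)

# Union: a transcript is kept if it passes ANY of the three filters.
best = expressed | long_orf | homolog
# Intersection, for comparison: kept only if it passes ALL three.
strict = expressed & long_orf & homolog

print(sorted(best))
print(sorted(strict))
```

Because the union keeps anything that passes at least one filter, a low-FPKM transcript with a long ORF or a good BLAST hit still survives; only the intersection would drop it.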

      Do you have any further comments on this?


      • mbayer
        Member
        • Mar 2009
        • 31

        #4
        Hi,

        2 and 3 sound reasonable. As to point 1, I wouldn't exclude transcripts for being expressed at low levels -- you may end up removing genuine transcripts from your final set. Remember that some transcripts really are expressed at very low levels. Illumina sequencing also contains an element of randomness, so at the lower end of the expression range there may be transcripts that were present in the sample at very low levels but were not caught by the sequencing and/or the data analysis.
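A back-of-the-envelope way to see the randomness argument: if read sampling is treated as a Poisson process (an assumption made here for illustration, not something stated in this thread), a transcript expected to receive `m` reads gets zero reads with probability exp(-m). The function names and numbers below are hypothetical.

```python
import math


def expected_reads(total_reads, transcript_fraction):
    """Mean number of reads from a transcript making up this fraction of the library."""
    return total_reads * transcript_fraction


def prob_zero_reads(mean_reads):
    """Poisson probability of observing no reads when `mean_reads` are expected."""
    return math.exp(-mean_reads)


# Hypothetical run: 20 million reads, transcript at 1 in 10 million fragments.
mean = expected_reads(20_000_000, 1e-7)
print(round(mean, 3), round(prob_zero_reads(mean), 3))
```

With about two reads expected, the transcript is missed entirely in roughly one run out of seven under this model, even though it was present in the sample.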

        cheers

        Micha


        • Jan_R
          Junior Member
          • Jul 2011
          • 9

          #5
          Originally posted by upendra_35
          The organism I am working on is Brassica rapa, a plant very close to the model plant Arabidopsis. When I got my final Trinity output, I didn't realize that it includes alternatively spliced transcripts.

          Might be a bit late but anyway:


          Most of the time the sequences in one comp(onent) do resemble each other. It is likely they are isoforms of the same gene.
          At least that's my impression after sequence comparison by blastn and blastx. You can also put them in clustalw.
          In my experience we could easily amplify the few sequences we were interested in out of cDNA. So at least for these few examples Trinity worked pretty nicely for us.

          A colleague of mine used different smaller sequences from one component to assemble larger fragments. The fragments he obtained from Trinity all had similarity to a known gene but were much smaller than expected. He had to try two or three combinations of alignments before he got the expected fragment in his PCR.

          So the number of components could give you a rough impression of the total number of genes represented in your assembly.
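Counting components can be done straight from the FASTA headers. The sketch below assumes contig IDs of the form `comp<N>_c<N>_seq<N>`, which is an assumption about the Trinity release in use (other releases may name contigs differently, in which case the pattern needs adjusting). The helper names and example headers are hypothetical.

```python
import re
from collections import Counter

# Assumed ID layout: comp<N>_c<N>_seq<N>; adjust for other naming schemes.
ID_PATTERN = re.compile(r"^(comp\d+_c\d+)_seq\d+$")


def component_of(header):
    """Map a FASTA header to its component ID, or to the full ID if it does not match."""
    parts = header.lstrip(">").split()
    if not parts:
        return ""
    match = ID_PATTERN.match(parts[0])
    return match.group(1) if match else parts[0]


def isoforms_per_component(headers):
    """Count how many sequences fall into each component."""
    return Counter(component_of(h) for h in headers)


# Invented headers, for illustration only.
headers = [
    ">comp1_c0_seq1 len=900",
    ">comp1_c0_seq2 len=750",
    ">comp2_c0_seq1 len=1200",
    ">comp2_c1_seq1 len=400",
]
counts = isoforms_per_component(headers)
print(len(counts), "components for", sum(counts.values()), "sequences")
```

The number of distinct components is then a rough proxy for the number of genes, with the sequences inside a component treated as candidate isoforms.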

          Good luck with your assembly!


          • upendra_35
            Senior Member
            • Apr 2010
            • 102

            #6
            Originally posted by mbayer
            Hi,

            2 and 3 sound reasonable. As to point 1, I wouldn't exclude transcripts for being expressed at low levels -- you may end up removing genuine transcripts from your final set. Remember that some transcripts really are expressed at very low levels. Illumina sequencing also contains an element of randomness, so at the lower end of the expression range there may be transcripts that were present in the sample at very low levels but were not caught by the sequencing and/or the data analysis.

            cheers

            Micha
            Does that mean I should use a more relaxed FPKM cut-off? What do you think would be ideal, by the way?


            • upendra_35
              Senior Member
              • Apr 2010
              • 102

              #7
              Originally posted by Jan_R
              Might be a bit late but anyway:

              Most of the time the sequences in one comp(onent) do resemble each other. It is likely they are isoforms of the same gene.
              At least that's my impression after sequence comparison by blastn and blastx. You can also put them in clustalw.
              In my experience we could easily amplify the few sequences we were interested in out of cDNA. So at least for these few examples Trinity worked pretty nicely for us.

              A colleague of mine used different smaller sequences from one component to assemble larger fragments. The fragments he obtained from Trinity all had similarity to a known gene but were much smaller than expected. He had to try two or three combinations of alignments before he got the expected fragment in his PCR.

              So the number of components could give you a rough impression of the total number of genes represented in your assembly.

              Good luck with your assembly!
              Thanks for the info. Yes, I did work on this a bit, but not in the way your colleague did. I have a feeling myself that the Trinity transcripts are shorter than the known genes, so I might have to do the same as your colleague. Do you know if he wrote a script for this? Thanks in advance.


              • Jan_R
                Junior Member
                • Jul 2011
                • 9

                #8
                Seems to be Murphy's law that people are on vacation when one comes up with questions.

                He was interested in one specific gene, so he didn't write a script for that. He just put the sequences into a classic sequence alignment program (clustalw) together with a sequence from a closely related species.
                After that he ran an ordinary BLAST with the sequence of the related species and the reads. He also aligned the best-matching reads onto the sequence to close gaps. So he more or less refined the work already done by Trinity.
                I lack the experience to judge whether this was a wild guess, but the approach worked for him and he could amplify his gene at the second or third attempt.

                But that is the part that bothers you when you already have a specific gene you're interested in.

                I got small fragments assembled when the transcript just was not that abundant, which makes perfect sense.

                What do you have in mind with your 'best of all times'-collection?


                • mbayer
                  Member
                  • Mar 2009
                  • 31

                  #9
                  Originally posted by upendra_35
                  Does that mean I should use a more relaxed FPKM cut-off? What do you think would be ideal, by the way?
                  I would not set a cut-off at all. Like I said, I don't think you can assume that a low expression level means there is an artefact -- it could be real.

                  Micha


                  • kurban910
                    Member
                    • Jul 2014
                    • 58

                    #10
                    Hi guys,
                    I assembled the raw reads and got the Trinity.fasta file, then got the basic statistics using TrinityStats.pl. I would also like somewhat more detailed statistics, such as the length distribution of the transcripts in the FASTA file, with an image if possible.
                    Any suggestions?
                    Thanks.


                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Try stats.sh from BBMap: http://seqanswers.com/forums/showthread.php?t=43529
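If you would rather compute the length distribution yourself, a minimal Python sketch of a binned histogram is shown below. It is independent of stats.sh and TrinityStats.pl; the function names, the 500 bp bin width, and the example lengths are arbitrary choices for illustration, and the "image" is only a text bar chart.

```python
from collections import Counter


def length_histogram(lengths, bin_width=500):
    """Return sorted (bin_start, count) pairs for the given sequence lengths."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return sorted(counts.items())


def render(hist, bin_width=500, width=40):
    """Draw the histogram as text, scaling the fullest bin to `width` characters."""
    if not hist:
        return ""
    peak = max(count for _, count in hist)
    lines = []
    for start, count in hist:
        bar = "#" * max(1, round(width * count / peak))
        lines.append(f"{start:>6}-{start + bin_width - 1:<6} {count:>7} {bar}")
    return "\n".join(lines)


# Invented lengths; in practice, collect one length per FASTA record.
example_lengths = [201, 450, 665, 1112, 1863]
hist = length_histogram(example_lengths)
print(render(hist))
```

The `(bin_start, count)` pairs can also be passed to any plotting library to produce an actual figure.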


                      • kurban910
                        Member
                        • Jul 2014
                        • 58

                        #12
                        Thanks for the link, @GenoMax.
