  • galata44
    Junior Member
    • Mar 2013
    • 4

    Transcript redundancy in de novo assembly

    Hi There,

    I am analyzing de novo assembled transcriptome data from a plant and see that there are various loci, each with one to multiple transcripts. When aligned against each other, all the transcripts belonging to the same locus are >95% identical. Why do these redundant transcripts show up? And how do I account for this transcript redundancy when I look for differential gene expression between tissues of this non-model plant at three developmental stages?

    Thank you, galata44
  • mbayer
    Member
    • Mar 2009
    • 31

    #2
    Hi galata44,

    What software did you use to assemble the transcripts? If you used a dedicated transcriptome assembler, your clusters of similar transcripts probably represent alternative splice products.

    It is also worth bearing in mind that de novo assembly is a computationally difficult problem, and assemblers never get it 100% right. Did you quality trim your reads before assembly? That usually makes a vast difference in terms of assembly quality.

    cheers

    Micha

    • galata44
      Junior Member
      • Mar 2013
      • 4

      #3
      Hi Micha,

      Thanks for your reply. The reads were assembled into contigs with Velvet, and the transcripts were then assembled from those contigs with the Oases transcript assembler. The reads were indeed quality trimmed. Do you think it is possible that these redundant transcripts are splice variants even though their nucleotide sequence identity is >95%? Do you know of other reasons for transcript redundancy within a designated locus?

      Thank you, galata44

      • tuan_pham_6885
        Junior Member
        • Mar 2013
        • 1

        #4
        Hi,

        I am in the same situation as galata44. I got all the information from a company that analyzed the data for us. They also used Velvet and Oases for the assembly. They provided various loci, each with one to multiple transcripts. Then they chose one representative transcript per locus (I do not know how they chose it, maybe the longest transcript) for the differential gene expression analysis. I also wonder why the UTRs of the transcripts are so long (sometimes >3000 bp).

        Because this is a Korean company, I do not understand their methods well. My question is: can I describe the loci as unigenes? For the further analysis (annotation, Gene Ontology, KEGG), I will just use the representative transcript. Is that OK, given that all the transcripts belonging to the same locus are >95% identical?

        Thank you very much

        • pengchy
          Senior Member
          • Feb 2009
          • 116

          #5
          I think it is reasonable to select one representative transcript per locus. Alternatively, you can cluster the assembly with TGICL and then filter the redundancy with cd-hit. The two tools use different algorithms.
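For anyone wanting to try the "one representative per locus" idea by hand, here is a minimal sketch. It assumes Oases-style transcript names of the form `Locus_12_Transcript_2/3_Confidence_0.667_Length_1503` and picks the longest transcript per locus. Both the ID pattern and the "longest wins" rule are assumptions made for illustration, not something the thread confirms the company actually did, and `longest_per_locus` is a hypothetical helper name.

```python
import re
from typing import Dict

# Assumed Oases-style header, e.g.
# "Locus_12_Transcript_2/3_Confidence_0.667_Length_1503".
LOCUS_RE = re.compile(r"^(Locus_\d+)_Transcript_")


def longest_per_locus(transcripts: Dict[str, str]) -> Dict[str, str]:
    """Map each locus to the ID of its longest transcript.

    `transcripts` maps transcript ID -> nucleotide sequence.
    Ties are resolved in favour of the transcript seen first.
    """
    best: Dict[str, str] = {}
    for tid, seq in transcripts.items():
        match = LOCUS_RE.match(tid)
        # IDs that do not follow the assumed pattern form their own locus.
        locus = match.group(1) if match else tid
        if locus not in best or len(seq) > len(transcripts[best[locus]]):
            best[locus] = tid
    return best
```

Note that this discards the reads mapped to the other transcripts in the locus unless they are re-mapped to the representative set.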

          Best

          • Jeremy
            Senior Member
            • Nov 2009
            • 190

            #6
            When I looked at differential expression from a de novo assembly I did two analyses
            1. All transcripts
            2. Genes (summing transcript reads from the same locus)
            Arbitrarily choosing one representative transcript may cause you to exclude important data.
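The gene-level analysis described in this post (summing transcript reads from the same locus) can be sketched roughly as follows. This is only an illustration: it assumes Oases-style IDs such as `Locus_12_Transcript_2/3_Confidence_0.667_Length_1503` and a plain dictionary of per-transcript read counts, and `sum_counts_by_locus` is a hypothetical helper name.

```python
import re
from collections import defaultdict
from typing import Dict

# Assumed Oases-style transcript ID; the locus is the "Locus_<n>" prefix.
LOCUS_RE = re.compile(r"^(Locus_\d+)_Transcript_")


def sum_counts_by_locus(transcript_counts: Dict[str, int]) -> Dict[str, int]:
    """Collapse per-transcript read counts into per-locus ("gene") counts."""
    gene_counts: Dict[str, int] = defaultdict(int)
    for tid, count in transcript_counts.items():
        match = LOCUS_RE.match(tid)
        # Fall back to the full ID if it does not follow the assumed pattern.
        locus = match.group(1) if match else tid
        gene_counts[locus] += count
    return dict(gene_counts)
```

The transcript-level table stays untouched, so both analyses can be run side by side on the same counts.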

            • galata44
              Junior Member
              • Mar 2013
              • 4

              #7
              summing transcripts in locus for DE analysis

              Hi Jeremy,

              When you summed all the transcripts in one locus for DE analysis, did you exclude transcripts which did not have complete ORFs? Or did you only sum those transcripts within a locus that shared an identical ORF? Also, did you have a cutoff length excluding analysis of transcripts shorter than a particular length?

              Thank you,
              galata

              • Jeremy
                Senior Member
                • Nov 2009
                • 190

                #8
                Firstly, a disclaimer, this is just what I did, I'm not claiming it is the best way to do it.

                Originally posted by galata44:
                When you summed all the transcripts in one locus for DE analysis, did you exclude transcripts which did not have complete ORFs?
                No, because that part is to look at gene level expression. Part 1 looks at transcript level expression. The idea is to look at what is in the data without imposing your own biases by selecting parts of it.

                Originally posted by galata44:
                Or did you only sum those transcripts within a locus that shared an identical ORF?
                No, for the same reason above.
                The thing to remember with de novo RNA assembly is that some of the transcripts will be real and some will be assembly artifacts, but the reads all came from the transcriptome so they are all important.

                Originally posted by galata44:
                Also, did you have a cutoff length excluding analysis of transcripts shorter than a particular length?
                No, I only applied a cut off of a minimum read count to exclude transcripts that likely represent transcriptional noise or assembly errors. There is no rule for what the cut-off should be, base it on the type of data you have. My exclusion resulted in a reduction from about 350000 assembled sequences to about 90000 but still used about 99% of the mapped reads.
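A minimum-read-count filter like the one described here can be sketched as below, together with the fraction of mapped reads that survives the filter (the "about 99%" figure quoted above is that kind of number). The threshold is a free parameter to be chosen per data set, as the post says, and `filter_by_min_count` is a hypothetical name used only for this illustration.

```python
from typing import Dict, Tuple


def filter_by_min_count(
    counts: Dict[str, int], min_count: int
) -> Tuple[Dict[str, int], float]:
    """Keep sequences with at least `min_count` mapped reads.

    Returns the filtered counts and the fraction of all mapped reads
    that belong to the retained sequences.
    """
    kept = {seq_id: n for seq_id, n in counts.items() if n >= min_count}
    total = sum(counts.values())
    retained_fraction = sum(kept.values()) / total if total else 0.0
    return kept, retained_fraction
```

Checking the retained fraction for a few candidate thresholds is a quick way to see how much real signal a given cutoff would throw away.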

                • galata44
                  Junior Member
                  • Mar 2013
                  • 4

                  #9
                  Thanks for your reply Jeremy,

                  In your opinion, if I am looking to analyze the DE of certain mRNA transcripts in the transcriptome, would it not make sense to exclude transcripts without an ORF, as these are not being actively expressed? I have transcripts within a locus that have hits to a particular protein's nucleotide sequence; however, not all of the transcripts in this locus have a full ORF. Am I correct in thinking these "ORF-less" transcripts are not being actively expressed? Also, do you know of any reasons why transcripts that show homology to protein-coding nucleotide sequences would not have an ORF?

                  Thank you again, galata

                  • Cofactor Genomics
                    Registered Vendor
                    • Jan 2010
                    • 52

                    #10
                    We see similar things in many of our RNA-seq projects. I agree with Jeremy that it is very important to employ a coverage filter to separate noise from signal. We see expressed regions with homology to gene sequences but no ORF in RNA-seq projects where the organism's genes have long UTRs. There is also the possibility of a stochastic transcript product in the area of a pseudogene. Ultimately, if I were in your shoes I would not pay too much attention unless the locus was clearly showing DE. Hope my 2c contributes.

                    Jarret Glasscock
                    Cofactor Genomics

                    • Jeremy
                      Senior Member
                      • Nov 2009
                      • 190

                      #11
                      Originally posted by galata44:
                      Thanks for your reply Jeremy,

                      In your opinion, if I am looking to analyze the DE of certain mRNA transcripts in the transcriptome, would it not make sense to exclude transcripts without an ORF, as these are not being actively expressed? I have transcripts within a locus that have hits to a particular protein's nucleotide sequence; however, not all of the transcripts in this locus have a full ORF. Am I correct in thinking these "ORF-less" transcripts are not being actively expressed? Also, do you know of any reasons why transcripts that show homology to protein-coding nucleotide sequences would not have an ORF?

                      Thank you again, galata
                      Unless you have a nice polished genome how do you really know that the reads are from a transcript that doesn't have an open reading frame?
                      What I was trying to get at before is that the reads represent what was actually in the sample, the assembly is just an interpretation (and a very error prone one) and can include sequences that were never in the sample. It doesn't make any sense to me to throw out real data based on an error prone interpretation that says it probably isn't real...

                      That aside, recent advances have shown that an RNA doesn't need an ORF to be functionally relevant. The benefit of RNA seq is that you see everything that is there (depending on RNA isolation/purification methods).

                      • nako
                        Junior Member
                        • Apr 2013
                        • 5

                        #12
                        pengchy - would you be willing to expand on your suggestion?

                        I have the same problem - I have de novo assemblies of 454, Sanger, and Illumina data, and I would like to detect splice isoforms, and collapse different splice isoforms into one representative transcript. The data I have is already assembled. Thank you, Nako

                        • pengchy
                          Senior Member
                          • Feb 2009
                          • 116

                          #13
                          Trinity assembles all the transcripts while retaining the isoform-to-"gene" relationship in its results, as encoded in the transcript IDs. Using the RSEM package, Trinity can also calculate "gene"- and isoform-level expression values.



                          If you want to use Trinity to assemble the Sanger and 454 reads, you can fragment them into single-end or paired-end reads and then feed those to Trinity.
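To illustrate how Trinity transcript IDs carry the isoform-to-"gene" relationship, here is a small sketch that strips the isoform suffix to recover the gene-level ID. It assumes two naming styles, the older `comp12_c0_seq3` and the newer `TRINITY_DN100_c0_g1_i2`. Check the headers in your own Trinity.fasta before relying on it, since the exact format depends on the Trinity version, and note that `trinity_gene_id` is a hypothetical helper, not part of Trinity.

```python
import re

# Assumed ID styles (verify against your own Trinity output):
#   older: "comp12_c0_seq3"          -> gene "comp12_c0"
#   newer: "TRINITY_DN100_c0_g1_i2"  -> gene "TRINITY_DN100_c0_g1"
_ISOFORM_SUFFIX = re.compile(r"_(?:seq|i)\d+$")


def trinity_gene_id(transcript_id: str) -> str:
    """Return the gene-level ID by removing the trailing isoform suffix.

    IDs without a recognised suffix are returned unchanged.
    """
    return _ISOFORM_SUFFIX.sub("", transcript_id)
```

Grouping transcripts by the returned ID gives the same kind of locus-level clusters discussed earlier in the thread for Oases output.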
