  • gfmgfm
    Member
    • Jun 2010
    • 64

    de novo transcriptome and differential expression

    Hello,

    We have Illumina de novo transcriptome data from 3 different samples. We pooled the 3 samples, assembled contigs from them using several methods, and merged the resulting contig sets with CAP3.
    Now we want to check for differential expression among the 3 samples using the contigs we defined. The problem is that the contigs are redundant (due either to incomplete assembly or to genuinely different transcripts from the same locus).
    This makes it hard to map the reads uniquely to our contigs.
    Any suggestions on how to check for differential expression?
  • petang
    Member
    • Nov 2008
    • 13

    #2
    You can merge all 3 datasets and assemble them together. Then use the assembled contigs as a reference and re-map the reads from each dataset to that reference.


    • gfmgfm
      Member
      • Jun 2010
      • 64

      #3
      Thanks a lot for the reply!
      This is what we did. But now we are not sure how to map the reads to the contigs as a reference. If we consider only unique tags, we get a very low percentage of uniquely aligned reads (probably because of some redundancy in the contigs, and maybe because of real different transcripts from the same locus).
      Any suggestions?


      • schmima
        Member
        • Apr 2010
        • 56

        #4
        Hm, it depends a bit on what you want to do. You could either try to distribute multireads in proportion to the unique reads (which is a problem if the majority are multireads) or create a "non-redundant" reference (where you may end up sacrificing truly different transcripts from a gene). For the latter you would have to group your transcripts together based on similarity and assemble them. The TGI clustering tool may help you to do this: http://compbio.dfci.harvard.edu/tgi/software/ .
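
The first option can be sketched roughly as follows. This is only a minimal illustration, not any particular tool's method: the function name `distribute_multireads` and the input layout (a dict of unique-read counts per contig, plus one tuple of candidate contigs per multiread) are assumptions made for the example, and contig length is ignored. A multiread whose candidate contigs have no unique reads at all is split evenly, which shows why the approach gets shaky when most reads are multireads.

```python
from collections import defaultdict


def distribute_multireads(unique_counts, multireads):
    """Split each multiread across its candidate contigs.

    unique_counts: dict mapping contig id -> number of uniquely mapped reads
    multireads: iterable of tuples, each listing the contigs one read aligned to
    Returns a dict mapping contig id -> estimated (fractional) read count.
    """
    totals = defaultdict(float)
    for contig, n in unique_counts.items():
        totals[contig] += n

    for hits in multireads:
        weights = [unique_counts.get(contig, 0) for contig in hits]
        weight_sum = sum(weights)
        for contig, weight in zip(hits, weights):
            if weight_sum > 0:
                share = weight / weight_sum  # proportional to unique evidence
            else:
                share = 1.0 / len(hits)  # no unique evidence: split evenly
            totals[contig] += share
    return dict(totals)


# Hypothetical counts: four reads map equally well to contigA and contigB.
unique = {"contigA": 30, "contigB": 10}
multi = [("contigA", "contigB")] * 4
estimated = distribute_multireads(unique, multi)
```

Here contigA holds 75% of the unique reads, so it receives 0.75 of each shared read. The per-sample estimates could then be passed on to whatever differential expression test you prefer.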


        • gfmgfm
          Member
          • Jun 2010
          • 64

          #5
          Thanks a lot! The TGI clustering tool looks very interesting. I am trying to run it.


          • LizBent
            Member
            • Jan 2012
            • 31

            #6
            I am also going to be mapping short reads to assembled contigs from multiple samples. My strategy is to assemble the contigs together in Trinity, then map the reads to the contigs. I would assume that a clustering step would improve the quality of the data.

            One question: I have tissue from two different organisms in some samples, so I have two transcriptomes. Would clustering take transcripts from different organisms for the same genes and cluster those?


            • oxydeepu
              Member
              • Jul 2011
              • 41

              #7
              De novo transcriptome assembly.

              Hi all,

              I have a paired-end RNA-Seq TopHat run, so now I have to run Cufflinks on the results. I don't have a reference GTF file, but I have the genome and transcriptome files for the same organism. Can anyone please tell me how to create a reference transcript annotation file from the genome and transcriptome files?

              Thanking you in advance
              Regards
              Deepak.


              • LizBent
                Member
                • Jan 2012
                • 31

                #8
                Deepak, I suggest you post your question in a thread that is relevant. If you have a reference genome you are not doing de novo transcriptome assembly, and you are also not looking at differential gene expression unless you have multiple samples.


                • gfmgfm
                  Member
                  • Jun 2010
                  • 64

                  #9
                  Hi LizBent,

                  I guess this depends on the overlap between the 2 genomes you are analyzing. If there are very similar genes, I guess they might cluster together.


                  • Wallysb01
                    Senior Member
                    • Feb 2011
                    • 286

                    #10
                    Have you thought about just using the average kmer coverage from your original, pre-CAP3 assemblies? Even with the CAP3 assemblies you could use the log files to determine which sequences got merged, their lengths, and their average kmer coverage, and then compute a weighted average of the kmer coverage for each CAP3-merged transcript.

                    Then, you could go back through these averages and flag ones that have relatively large variances in the kmer coverage of the merged transcripts. That could be a clue into either isoforms being merged or spurious merging.

                    I thought about using CAP3 with our transcriptome assemblies for things without a reference, but I just didn't trust it. What program are you using to assemble this, btw? I've noticed that while Trinity is very selective and maybe "under-assembles" some things, it's not very redundant, especially compared to the strategy taken by ABySS/trans-abyss.

                    You'll still hit similar downstream problems with estimating abundance, but it might be a little easier if you get rid of the redundancy earlier in the assembly process.
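
A rough sketch of the weighted-average idea from this post, under some assumptions: the CAP3 log has already been parsed into a dict (not shown here), each merged contig is described by hypothetical `(length, mean_kmer_coverage)` pairs for its input sequences, the weighting is by sequence length, and the cutoff of 0.5 on the coefficient of variation is an arbitrary placeholder.

```python
def merged_coverage(parts):
    """Length-weighted mean and variance of kmer coverage.

    parts: list of (length, mean_kmer_coverage) pairs, one per input
    sequence that was merged into a single contig.
    """
    total_len = sum(length for length, _ in parts)
    mean = sum(length * cov for length, cov in parts) / total_len
    var = sum(length * (cov - mean) ** 2 for length, cov in parts) / total_len
    return mean, var


def flag_suspicious(merged, max_cv=0.5):
    """Names of merged contigs whose coverage varies a lot between parts.

    merged: dict mapping merged contig name -> list of (length, coverage)
    max_cv: assumed cutoff on the coefficient of variation (sd / mean)
    """
    flagged = []
    for name, parts in merged.items():
        mean, var = merged_coverage(parts)
        if mean > 0 and (var ** 0.5) / mean > max_cv:
            flagged.append(name)
    return flagged


# Hypothetical merges: m1 joins two similarly covered pieces, m2 does not.
merged = {
    "m1": [(1000, 20.0), (500, 22.0)],
    "m2": [(800, 5.0), (1200, 60.0)],
}
suspicious = flag_suspicious(merged)
```

A flagged contig such as m2 is only a candidate for a closer look. On its own the variance cannot tell merged isoforms apart from a spurious merge.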


                    • LizBent
                      Member
                      • Jan 2012
                      • 31

                      #11
                      Hi, so far I've been testing Trinity for my assemblies, though I was also thinking of using the Rnnotator pipeline (JGI Galaxy server), which uses Velvet. I'm not sure I understand what you mean by "redundant". I'm new to all this, so would you mind explaining?


                      • Wallysb01
                        Senior Member
                        • Feb 2011
                        • 286

                        #12
                        Liz,

                        Differential coverage along your transcript and alternate splicing (plus the usual SNPs/indels) can lead to assemblers making several contigs out of the same gene. Sometimes they are alternate splice forms and sometimes it's just an assembly artifact. Usually assemblers have some sort of merging step to try to reduce this, but again, because of alternate splicing, you don't want to do this as aggressively as you can with genomic DNA.

                        From my experience, Trinity does a pretty good job of giving you transcripts that are as complete as possible with minimal redundancy. However, that comes at the cost of completeness of the overall set. ABySS/trans-abyss does a very good job of just giving you everything, but it's kinda messy. I haven't used Velvet-based programs, so I can't speak to them.

                        If you don't have a reference genome, you're not done after assembly. I think you have to accept some attrition by doing things like extracting ORFs and only keeping long ones (or even only "complete" ones). You can also filter the contigs so that the remaining ones are <XX% similar, keeping only the longest contig of each group, using a tool like CD-HIT. Plus, you can run BLAST and keep the contigs that match well to a closely related species. You could even filter your results to take only the best hit for each "reference" transcript, whatever you determine your reference to be.

                        It all depends on what you want the output to look like. Would you rather have fewer, more complete, non-redundant contigs at the cost of losing alternate splicing and incomplete transcripts? Or do you want as much as possible, knowing you'll have to deal with redundancy?
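
To make the filtering described in this post concrete, here is a minimal sketch that combines two of the steps: dropping contigs without a long enough complete ORF, then keeping the longest contig per similarity cluster. It does not call CD-HIT or BLAST. The `cluster_of` mapping is assumed to have been parsed from a clustering tool's output beforehand, and the function names and the `min_orf_nt` default are made up for illustration. Only ATG-to-stop ORFs are counted, and sequences are assumed to contain only A, C, G and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_complete_orf(seq):
    """Length in nt of the longest ATG..stop ORF over all six frames."""
    seq = seq.upper()
    best = 0
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def filter_contigs(contigs, cluster_of, min_orf_nt=300):
    """Keep the longest contig per cluster among those with a long ORF.

    contigs: dict mapping contig name -> nucleotide sequence
    cluster_of: dict mapping contig name -> cluster id (assumed precomputed)
    min_orf_nt: assumed minimum complete-ORF length, in nucleotides
    """
    representative = {}
    for name, seq in contigs.items():
        if longest_complete_orf(seq) < min_orf_nt:
            continue  # no sufficiently long complete ORF
        cluster = cluster_of[name]
        current = representative.get(cluster)
        if current is None or len(seq) > len(contigs[current]):
            representative[cluster] = name
    return sorted(representative.values())


# Toy data: c1 and c2 share a cluster, c3 has no ORF at all.
contigs = {
    "c1": "ATG" + "GCT" * 5 + "TAA",
    "c2": "CC" + "ATG" + "GCT" * 8 + "TGA" + "GG",
    "c3": "GCT" * 10,
}
cluster_of = {"c1": 0, "c2": 0, "c3": 1}
kept = filter_contigs(contigs, cluster_of, min_orf_nt=20)
```

This is the "fewer, non-redundant contigs" end of the trade-off: c1 is discarded even if it is a real alternate isoform of c2.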


                        • RNAddict
                          Member
                          • Mar 2012
                          • 17

                          #13
                          We are having a similar experience. We de novo assembled a transcriptome that we are using as a "reference", but when we map reads to it we get so many multi-mapped reads that many transcripts we know are there (RT-PCR, Northerns, in situs) do not even show up as present in our in silico analysis.

                          We have tried various methods of reducing redundancy in our reference such as taking only the longest sequence from each cluster, using various contig assembly programs (CAP3 etc.)... these help... but they do not seem to solve the problem completely.

                          Since it has been some time since your original post, I was wondering what your experience has been with this issue.

                          How far did you take your elimination of redundant transcripts?
