
  • In Silico read normalization prior to de novo assembly

    I am about to use Trinity for de novo transcriptome assembly prior to differential expression analyses.

    I have 11 individuals (5 control, 6 treated) with 3 tissue types = 33 samples, each with ~20 million ~80 bp single-end reads (after trimming and QC), so that's about 660 million single-end reads!

    To shorten what is likely to be a LONG Trinity run, would you suggest using Trinity's normalization script or something similar (e.g. khmer) prior to assembly?
    Or should I just take a small subset of samples for the assembly?

    I don't know how much individual genetic variability there is so I'm worried that using a subset for assembly will miss rarer transcripts.

    Does anyone here have any experience with normalization? Are there any downsides to this method over using a subset of samples?

    Any advice or experiences much appreciated!

  • #2
    It seems Trinity's in silico read normalization hasn't been published.


    • #3
      the following links may be helpful:

      DigiNorm on Paired-end samples
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

      What is digital normalization, anyway?
      I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization....

      Digital normalization of short-read shotgun data
      We just posted a pre-submission paper to A single pass approach to reducing sampling variation, removing errors, and scaling de novo...

      Basic Digital Normalization

      A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
      Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes has enabled investigation of a wide range of organisms and ecosystems. However, sampling variation in short-read data sets and high sequencing error rates of modern sequencers present many new computational challenges in data interpretation. These challenges have led to the development of new classes of mapping tools and de novo assemblers. These algorithms are challenged by the continued improvement in sequencing throughput. We here describe digital normalization, a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors. Digital normalization substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. We apply digital normalization to the assembly of microbial genomic data, amplified single-cell genomic data, and transcriptomic data. Our implementation is freely available for use and modification.

      What does Trinity's In Silico normalization do?
      This post can be referenced and cited at the following DOI: For a few months, the Trinity list was...
      Last edited by pengchy; 10-16-2013, 12:06 AM.
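For anyone who wants to see the idea behind the digital normalization links above in a few lines, here is a minimal sketch, not the khmer or Trinity implementation. It assumes exact k-mer counts in an in-memory dictionary (the real tools use far more memory-efficient counting), ignores reverse complements, and uses made-up default values for `k` and `cutoff`. The function names are hypothetical. A read is kept only if the median count of its k-mers, among the reads kept so far, is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Single-pass normalization sketch.

    Keeps a read only while the median abundance of its k-mers
    (counted over previously kept reads) is below `cutoff`.
    Assumption: exact counting in a Counter, no reverse complements.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate from
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


if __name__ == "__main__":
    # Toy data: one highly redundant read plus one rare read.
    reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
    kept = digital_normalize(reads, k=4, cutoff=3)
    print(len(reads), "reads in,", len(kept), "reads kept")
```

The point of the median is that a read from a highly expressed transcript quickly stops being added, while a read from a rare transcript is still kept, which is why normalization is less likely to lose rare transcripts than taking a random subset of reads.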


      • #4
        Thanks pengchy!

        I have read all of those already and was wondering whether anyone has experience with their own data?

        It has also been suggested to me to take all the reads from 1 individual (all tissue types) and assemble as there may be too much ambiguity with using multiple individuals.

        The samples are clutch mates (frogs), but not inbred lines so there will be some variability between them and heterozygosity. But the question is with that option - which individual? Control or treated?

        Any thoughts from you knowledgeable lot on seqanswers much appreciated!!


        • #5
          Hi Amy,

          I am preparing to do this work and will be glad to share my experience with you here when I finish the test.



          • #6
            Great thanks!

            I am running Trinity's method at the moment (would have liked to use Titus's more efficient version of Trinity's method but waiting for that to be installed) - it's been running for 2 days now.

            I gave it 100 GB RAM and 10 CPUs, which seems to have been OK for jellyfish, reading k-mer occurrences. It's been writing the .stats file for a looooong time now, but it's not maxing out the memory and is only using 1 CPU.


            • #7
              I find that Trinity's normalization takes a long time to run. It almost defeats the purpose of normalization in the first place. Days of run time -- yep. We need to fix this some day.


              • #8
                If anyone is interested:

                Trinity normalization on ~854 million reads took about 2 days on a high-memory machine (gave it 300GB memory and 40 cores)

                Got it down from 854 to just 66 million reads!


                • #9
                  That's great that you got the number of reads reduced, but how is that reduction expected to improve performance on Trinity? Will it cut the time down considerably (enough to justify normalization)?

                  Best regards and great to find others working on similar projects!


                  • #10
                    Well, that's the part where I'd like others' experiences! This is my first de novo assembly.

                    To assemble the normalized reads (using same number of cores, memory etc) took less than a day. I'm running the full 854 million now to see how the assemblies compare - it's been going 2 days already.

                    It was also suggested I try assembling using all tissues from 1 individual (but to be careful for further analysis as this individual will map better to the assembly) as variability between individuals could create ambiguity in assembly.

                    I tried this: the assembly from all normalized reads has N50 = 1596, and the single-individual assembly has N50 = 2029. Mapping reads back with Bowtie, 80.65% map to the normalized assembly and 79.02% to the single-individual assembly. I'm currently running BLAST to see which annotates better.

                    Does anyone have any other thoughts on how to test which is the "best" assembly?
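Since N50 is being used above to compare assemblies, here is a small sketch of how it can be computed from a FASTA file of contigs. It is only an illustration: the function names are hypothetical, and it uses the common definition (the length of the contig at which the running total, going from longest to shortest, first reaches half of the total assembly length).

```python
def contig_lengths(fasta_lines):
    """Collect sequence lengths from an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the cumulative sum, taken from
    longest to shortest, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    example = [">c1", "ACGTACGT", "ACGT", ">c2", "ACGTAC", ">c3", "AC"]
    print(n50(contig_lengths(example)))
```

N50 on its own can mislead for transcriptomes, because a few long or chimeric contigs raise it, so it is worth reading alongside the mapping rate and annotation results rather than instead of them.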


                    • #11
                      I'm sorry I don't have any answers for you. I've got ~270 million reads so I'm not doing the normalization step for this run, but I will continue to watch this thread to see how your experiment comes out in the end. I'll be posting a question about installation of Trinity with regard to jellyfish... feel free to take a peek, maybe it's something you encountered?


                      • #12
                        In the end the assembly of the full set of reads took only about 3 days - so 2 days normalizing and 1 day assembly amounts to little or no saving on time.

                        The full read assembly only gave rise to marginally more contigs (~455,000 vs ~447,000 from normalized reads) and a lower N50 (1227 vs 1596).

                        I think Titus Brown's version of Trinity's method (which unfortunately I could not get installed on our machines here yet) probably does make normalizing worth it for my kind of sized read set.


                        • #13
                          FYI for those with access to more computing power: I used 24 cores and 119G of memory on my 270 million reads without normalization and it finished in 1 day.
                          But it may also have gone a bit faster because I ran it with the --min_kmer_cov 2 parameter.
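To illustrate why a minimum k-mer coverage of 2 can speed things up, here is a rough sketch of the filtering idea, not Trinity's code. As far as I understand it, `--min_kmer_cov` sets the minimum count a k-mer needs before it is used in assembly, so with a value of 2 every k-mer seen only once (often a sequencing error) is dropped. The function name and the default `k=25` are assumptions made for this example.

```python
from collections import Counter


def solid_kmers(reads, k=25, min_kmer_cov=2):
    """Count k-mers across reads and keep those seen at least
    `min_kmer_cov` times. Assumption: exact counting, one strand only."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_kmer_cov}


if __name__ == "__main__":
    # The second read differs from the first at its last base,
    # so the k-mers covering that position occur only once.
    reads = ["ACGTA", "ACGTT"]
    print(solid_kmers(reads, k=4, min_kmer_cov=2))
```

The trade-off is that genuinely rare transcripts covered by single reads are discarded along with the errors.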


                          • #14
                            map back to the assembly

                            Hello everybody, interesting discussion.
                            Here we used Trinity on 10 samples, 5 tissues from 2 animals, sick and not sick. Total reads were >600M and on a 'big machine', sorry not sure of RAM and cores, it took <3 days.

                            The problem is that when we map the reads back to the contigs as suggested, only 30% map back!! Any clue? Does anyone else have this problem? Is this an issue, or is it normal due to tissue diversity?

                            Thanks for your help!


                            • #15
                              Hi eppi,

                              I'm afraid I don't have that problem but out of interest, how many transcripts and components did you get?

                              I have produced ~447,000 transcripts (~350,000 components), which seems far too many. I'm worried it's from pooling tissues and individuals together for assembly? Has anyone else got such a large number?? Any suggestions on how to reduce redundancy?

