  • Pre-assembly for short-reads to minimize RAM usage

    Hello everybody!

    I'm looking forward to assembling de novo ~1-5 Gb of short reads from next-generation sequencer. Data is of metagenomic character, hundreds of species. The amount of RAM required by assembly program (Velvet, SOAPdenovo, etc.) for such analysis is few hundred Gb. Is there a known way to cluster the initial reads into associated related portions , so that assembly is performed in portions and RAM peak usage is decreased?

    Thanks ahead,
    Alex

  • #2
    Yes, this is exactly the question I have in mind.

    I have around 400 million 36 bp paired-end reads. I am trying to assemble them with Velvet, but I was wondering whether the input is too large and a pre-clustering step is needed.

    If so, what type of clustering approach?

    Thanks


    • #3
      Hi.
      I think the clustering must be done before sequencing (for example, by selecting specific regions of the genome with enzymes), with each small data set then assembled independently.

      The only way to reduce the amount of memory needed is to perform an error correction step. The problem is that the error correction step may require more RAM than the de novo assembly itself.

      Francesco
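
To illustrate the trade-off described above, here is a minimal sketch of one simple form of error filtering: dropping reads that contain rare k-mers. This is an assumption about what an "error correction step" could look like, not the method of any particular tool, and the function names and the `k` and `min_count` values are hypothetical. It ignores reverse complements and discards reads rather than correcting them. Note that the k-mer count table is held entirely in memory, which is why such a step can itself be RAM-hungry.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_reads_by_kmer_abundance(reads, k=5, min_count=2):
    """Keep reads whose k-mers all occur at least min_count times overall.

    Returns the kept reads and the number of distinct k-mers counted.
    Reads shorter than k have no k-mers and are kept unchanged.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    kept = [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]
    return kept, len(counts)
```

On real data the number of distinct k-mers, and therefore the size of `counts`, grows with both genome diversity and sequencing errors, so this table can be as large as the assembler's own graph.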



      • #4
        I came upon the following discussion:

        The idea is to pre-cluster k-mers into non-overlapping de Bruijn subgraphs and assemble them separately (with lower memory requirements), then combine the results.
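
A rough sketch of the pre-clustering idea, under simplifying assumptions: reads are grouped into the same cluster whenever they share at least one exact k-mer, using a union-find structure. This only approximates the connected components of a de Bruijn graph (two reads whose k-mers merely overlap by k-1 bases are not joined here), it ignores reverse complements, and the function name and `k` value are hypothetical.

```python
def cluster_reads_by_shared_kmers(reads, k=5):
    """Group reads into clusters linked by at least one shared exact k-mer."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx, read in enumerate(reads):
        clusters.setdefault(find(idx), []).append(read)
    return list(clusters.values())
```

Each returned cluster could then be written to its own file and assembled separately. Two caveats: the `first_seen` table itself holds every distinct k-mer in memory, and in real data repeats or conserved genes may link most reads into one large cluster, which would limit the savings.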



        • #5
          Thanks Alex8, that is quite an interesting discussion; it seems Curtain is worth looking at.

          Dear Francesco, I came across your post in this discussion thread.

          "...The trick usually is to work with a subset of 10% of the reads. Make multiple assemblies of several random subsets and then merge together the results."

          Can you please explain more about the "random subsets"? Say we assemble 10% of our reads at a time: am I correct that we will end up with 10 separate sub-assembly results for assembly/scaffolding? Or are the subsets supposed to be random, so that the same read can exist in more than one subset?

          thanks!



          • #6
            The post is quite old, and this approach was useful due to the lack of software able to assemble more than one lane.

            The idea was to PARTITION (this is your point) the data into 10 or fewer independent subsets and assemble each of these subsets independently. This was, and still is, meaningful when the coverage is very high. If a microbe is sequenced at an expected coverage of 800X, then this approach is useful.

            Francesco
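
The partitioning described above can be sketched as follows. This is a minimal illustration, not code from the original post: the function name, the `seed` parameter and the round-robin assignment are assumptions. Every read lands in exactly one subset, so the subsets are disjoint. For paired-end data, each item in `reads` would need to be a whole pair so that mates stay together.

```python
import random


def partition_reads(reads, n_subsets=10, seed=0):
    """Randomly split reads into n_subsets disjoint, near-equal subsets."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    order = list(range(len(reads)))
    rng.shuffle(order)
    subsets = [[] for _ in range(n_subsets)]
    for position, read_index in enumerate(order):
        # Deal the shuffled reads out round-robin, like cards.
        subsets[position % n_subsets].append(reads[read_index])
    return subsets
```

With the 800X example, splitting into 10 subsets leaves roughly 80X in each, which is still a workable depth for assembling each subset on its own.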



            • #7
              I think that if yours is a metagenomic sample, your RAM requirement is likely to be large, and I am guessing there will be low coverage per species/contig.

              If you can already cluster the reads by k-mers, then you can do mini-assemblies using any program.

              Have a look at SoftGenetics' NextGENe to do the clustering. It looks useful, but I can't comment much as I have limited experience with it.
              http://kevin-gattaca.blogspot.com/
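
The guess about low coverage per species can be checked with back-of-the-envelope arithmetic. The sketch below assumes all species are equally abundant, and the example values (300 species, 4 Mb mean genome size) are purely illustrative assumptions, not figures from this thread. Only the ~1-5 Gb total comes from the original question.

```python
def mean_coverage(total_bases, n_genomes, mean_genome_size):
    """Expected per-genome coverage if all genomes are equally abundant."""
    return total_bases / (n_genomes * mean_genome_size)


# Illustrative assumptions: 300 species with genomes of about 4 Mb each.
low = mean_coverage(1e9, n_genomes=300, mean_genome_size=4e6)
high = mean_coverage(5e9, n_genomes=300, mean_genome_size=4e6)
print(f"Expected coverage per species: {low:.1f}X to {high:.1f}X")
```

Under these assumptions the coverage per species is in the single digits or below, far from the 800X case where splitting the reads into subsets is attractive. In a real sample abundances are uneven, so a few species may be well covered while most are covered even less.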
