  • anyone1985
    Member
    • Mar 2009
    • 68

    How to assemble two different length Solexa data?

    I have two Solexa data sets with read lengths of 35 and 75 bp, respectively. The insert lengths also differ. How should I assemble them?
  • Chien-Yuan Chen
    Member
    • Feb 2009
    • 19

    #2
    If you use CLC genome workbench, the software can manage this problem. But you should specify the insert length to prevent incorrect alignment.


    • anyone1985
      Member
      • Mar 2009
      • 68

      #3
      Maybe there is a free or open-source assembler that is suited to this task. I tried AllPaths, but it eventually hit a fatal error. I would like to know whether any other tool can do the same job!

      Originally posted by Chien-Yuan Chen
      If you use CLC genome workbench, the software can manage this problem. But you should specify the insert length to prevent incorrect alignment.


      • caddymob
        Member
        • Apr 2009
        • 36

        #4
        Have you tried Maq map merge?



        I am guessing you could make a map for the 35 and 75bp reads separately, then merge them. Or maybe try samtools merge? Align with BWA or other favorite aligner, then merge the sam/bam files?
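For the "align separately, then merge" idea, here is a minimal sketch of just the merge step. It only builds and prints the `samtools merge` command line; the alignment, SAM-to-BAM conversion and sorting steps are left out, and the BAM file names are hypothetical placeholders.

```python
def samtools_merge_command(merged_bam, input_bams):
    """Build the argv for: samtools merge <out.bam> <in1.bam> <in2.bam> ..."""
    if len(input_bams) < 2:
        raise ValueError("need at least two BAM files to merge")
    return ["samtools", "merge", merged_bam, *input_bams]


# Hypothetical file names: one sorted BAM per library (35 bp and 75 bp reads).
cmd = samtools_merge_command(
    "merged.bam", ["lib35.sorted.bam", "lib75.sorted.bam"]
)
print(" ".join(cmd))
```

To actually run it you could pass `cmd` to `subprocess.run(cmd, check=True)`, assuming samtools is installed and both inputs are sorted BAM files.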


        • anyone1985
          Member
          • Mar 2009
          • 68

          #5
          I am trying to assemble de novo. I think I would like to assemble them separately with velvet or edena, then assemble the contigs with CAP3 or Phrap?
          Originally posted by caddymob
          Have you tried Maq map merge?



          I am guessing you could make a map for the 35 and 75bp reads separately, then merge them. Or maybe try samtools merge? Align with BWA or other favorite aligner, then merge the sam/bam files?

          http://samtools.sourceforge.net/samtools.shtml


          • jkbonfield
            Senior Member
            • Jul 2008
            • 146

            #6
            In an ideal world you'd have an assembler that just understands short-read data, mixed libraries with varying insert sizes, etc and just gives you the optimal answer. Some of the tools make a fair stab at this (eg velvet), but the system resources required can be HUGE.

            Therefore a more pragmatic approach used by many is to start with some sort of basic "read extension", where you lose track of the individual fragments but build up contig consensus sequences by following overlapping k-mers for as long as there are no branch points, much like ssake, fuzzypaths, etc.

            From here you can then either take these contigs as-is or throw them into another assembly tool more appropriate for longer sequences to attempt to resolve further.

            Finally, map your individual reads (both 75 and 35) back to your consensus sequences again to get a true assembly rather than just consensus sequences.

            You could even iterate: find reads that uniquely overlap contig ends, use them to edit and extend the "reference", and remap the reads that failed to map previously. This technique also works in more "usual" cases where the reference doesn't precisely match the organism you're mapping against it. Not pretty though.
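To make the "read extension" step concrete, here is a toy sketch of k-mer extension that stops at the first branch point or dead end. It is not how ssake or any other named tool is actually implemented: it ignores reverse complements, sequencing errors and coverage, and the reads and function names are made up for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def extend_right(seed, reads, k):
    """Extend seed to the right while exactly one k-mer continues it."""
    if len(seed) < k - 1:
        raise ValueError("seed must be at least k-1 bases long")
    # Index every k-mer seen in the reads by its (k-1)-base prefix.
    successors = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            successors[kmer[:-1]].add(kmer)
    contig = seed
    visited = set()
    while True:
        suffix = contig[-(k - 1):]
        if suffix in visited:      # guard against looping through a repeat
            break
        visited.add(suffix)
        candidates = successors.get(suffix, set())
        if len(candidates) != 1:   # dead end (0) or branch point (>1)
            break
        contig += next(iter(candidates))[-1]
    return contig


# Three overlapping toy reads from one made-up sequence.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
clean = extend_right("ATGG", reads, k=4)
# A fourth read that diverges after "CGT" creates a branch point.
branched = extend_right("ATGG", reads + ["GCGTTTT"], k=4)
print(clean)
print(branched)
```

With the extra diverging read the extension halts early, which is exactly where a real assembler would need pairing information or longer reads to decide which way to go.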


            • BaCh
              Member
              • May 2008
              • 81

              #7
              Originally posted by anyone1985
              I have two Solexa data sets with read lengths of 35 and 75 bp, respectively. The insert lengths also differ. How should I assemble them?
              You could play guinea pig and try MIRA (2.9.45): in theory, it should work. You can give the assembler all the necessary ancillary information (like sequencing technology, insert size, quality clips, etc.) on a per-read basis using an XML file in TRACEINFO format as standardized by the NCBI.

              MIRA will know how to treat Solexa data and handle many things almost automatically (like clipping) and even know of sequencing technology dependent errors (like the "GGC" problem in Solexa data).

              However, I would try this only for organisms of bacterial size and on a machine with lots and lots of memory.

              And you might want to try assembling the 75mers first: if you have an average coverage of >= 30x with the 75mers and the insert size of the 75mer library is larger than that of the 36mer library, the 36mers probably won't improve the assembly.
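The rule of thumb above can be written down as a quick check. This is only a sketch of that heuristic: the read count, genome size and insert sizes below are invented numbers, and the function names are hypothetical.

```python
def average_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def short_reads_probably_redundant(cov_long, insert_long, insert_short,
                                   min_coverage=30.0):
    """True if the longer-read library alone meets the rule of thumb:
    coverage of at least min_coverage and the larger insert size."""
    return cov_long >= min_coverage and insert_long > insert_short


# Invented example: 2 million 75-mers on a 5 Mb bacterial genome.
cov75 = average_coverage(2_000_000, 75, 5_000_000)
skip_36mers = short_reads_probably_redundant(cov75, insert_long=400,
                                             insert_short=200)
print(cov75, skip_36mers)
```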

              PS: Disclaimer: I wrote MIRA and might not be objective


              • jnfass
                Member
                • Aug 2008
                • 88

                #8
                I'd have to say that velvet is still your best bet for de novo assembly. It can accept different read lengths with no problem, and you can feed it 2 different sets of paired reads, with 2 different insert sizes, "out of the box". However, you can also make a trivial change to the source code and recompile so that it accepts more than 2 sets of insert lengths.

                Also note that when you tell velvet the insert length (" -ins_length 280 "), you need to use the entire length of the fragment, so in this case if you told it 280, that would correspond to two 40bp reads with a 200bp "insert".
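The arithmetic behind that number, as a small sketch (the function name is made up; only the `-ins_length` option itself comes from the post above):

```python
def velvet_ins_length(read_length, inner_gap):
    """Full fragment length: both reads plus the unsequenced gap between them."""
    return 2 * read_length + inner_gap


# Two 40 bp reads separated by a 200 bp "insert", as in the example above.
ins = velvet_ins_length(40, 200)
print(f"-ins_length {ins}")
```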

                Consult the velvet-users list for details on these two issues.


                • jnfass
                  Member
                  • Aug 2008
                  • 88

                  #9
                  oh, and note that I'm not countering BaCh's suggestion! I've been wanting to try MIRA for a while, and velvet won't incorporate 454 reads well, like MIRA can ...


                  • bioinfosm
                    Senior Member
                    • Jan 2008
                    • 483

                    #10
                    any de novo assembly tools that can iteratively assemble reads instead of eating up a whole lot of RAM?

                    my limitation is less than 60 GB of RAM for a 1 Gb+ genome, to be assembled de novo from 20x Solexa coverage worth of reads
                    --
                    bioinfosm


                    • anyone1985
                      Member
                      • Mar 2009
                      • 68

                      #11
                      Thanks to jnfass for the suggestion. After reading the Velvet manual, I also found that it can handle libraries with different insert lengths.


                      • BaCh
                        Member
                        • May 2008
                        • 81

                        #12
                        Originally posted by bioinfosm
                        any de novo assembly tools that can iteratively assemble reads instead of eating up a whole lot of RAM?

                        my limitation is less than 60 GB of RAM for a 1 Gb+ genome, to be assembled de novo from 20x Solexa coverage worth of reads
                        Uh ... I missed that post. No, no program I know of.

                        But just to be sure I understood you right: you have ~550 million 36mers that you want to assemble de-novo? That's (in terms of reads) almost 15-20 times more reads than the Human Genome Project or Celera had ... and they had *large* computing clusters to tackle the problems.
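For reference, the ~550 million figure follows from genome size times coverage divided by read length. A small sketch of that estimate, taking the genome as exactly 1 Gb (the post only says "1GB+"):

```python
import math


def reads_for_coverage(genome_size, coverage, read_length):
    """Number of reads needed to reach a given average coverage."""
    return math.ceil(genome_size * coverage / read_length)


# 1 Gb genome, 20x coverage, 36 bp reads.
n_reads = reads_for_coverage(1_000_000_000, 20, 36)
print(f"{n_reads:,} reads")
```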

                        Even memory optimised programs with very simple assembly logic would need to keep lots of data in memory to be even decently efficient ... and you would still be in for *a lot* of disk reads/writes which would probably mean it'd literally take ages to get the thing assembled.

                        Correct me if I'm wrong or if you found some program which performs such a wonder ... but I don't think this is possible with 60 GB of RAM.

                        Regards,
                        B.


                        • jkbonfield
                          Senior Member
                          • Jul 2008
                          • 146

                          #13
                          Well, parallel algorithms like ABySS could possibly work if you have enough machines in a cluster. It's far cheaper and easier to get lots of small machines than a few truly humongous ones. However, I've no idea what the upper limit is on an ABySS assembly.

                          However the iterative approach sounds more sensible. I'm not sure of any official programs that do a decent job of this yet, although lots have manually done similar things by successive rounds of mapping to close genomes, shredding of close genomic data, etc.

                          James


                          • cloughlab
                            Junior Member
                            • Aug 2011
                            • 1

                            #14
                            I am new to this as well and I am trying to set up an RNASeq pipeline for my lab. I've run into an issue though. I'm confused about why one would run Velvet and then run Phrap on the resulting contigs. Why not just go to Phrap straight away? Any help would be appreciated.

                            Cheers,
                            Addison


                            • westerman
                              Rick Westerman
                              • Jun 2008
                              • 1104

                              #15
                              Originally posted by cloughlab
                              I am new to this as well and I am trying to set up an RNASeq pipeline for my lab. I've run into an issue though. I'm confused about why one would run Velvet and then run Phrap on the resulting contigs. Why not just go to Phrap straight away? Any help would be appreciated.

                              Cheers,
                              Addison
                              Phrap is slow and not optimized for large NGS datasets.
