Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Best approach- genome de novo assembly with reference

    I have some new Illumina data (HiSeq 100b reads- one paired-end (94xe6 reads) and one mate-pair (54xe6 reads) lib.) for a fungal genome (ca. 30MB) for which a pretty good reference is already assembled/available.

    My coverage is about 400X, and I have de novo assembled the new data with both Velvet (VelvetOptimiser) and Soapdenovo, but based on simple metrics, e.g. # scaffolds, largest scaffold, N50, this new assembly doesn't appear to be quite as good as the reference.

    I don't have access to the read data used to assemble the original reference, and I would like see if I can improve it with this additional data. It looks like you can give Velvet a -long switch for a reference seq, but the documentation isn't very clear on this. And, I'm not sure how to go about generating a "new" reference sequence/scaffolds after, for example, using an aligner, e.g. Bowtie or BWA, to align the new read data to the reference seq.

    Can someone suggest/describe the best approach or a pipeline to get where I want to go with this dataset?

  • #2
    Some questions:

    Q. What was the mate-pair insert/fragment size?
    Q. Is the reference still a draft (contigs/scaffolds) or is it closed?
    Q. What does "not quite as good" mean?
    Q. Are you de novo assembling your genome, or trying to improve the reference?

    Some comments:

    * the Velvet -long option might help, but look at the Columbus module instead, which is the reference-guided part of Velvet
    * the two genome assemblies are likely broken at the same places (mostly large/tricky repeats) so fixing one with the other won't gain much unless there were other coverage/bias problems
    * there are tools like GapFiller, IMAGE etc for improving assemblies via read alignment back
    * you may wish to use SSPACE etc instead of Velvet for the mate pair scaffolding
    * Illumina Mate Pair data can be very dodgy sometimes: http://thegenomefactory.blogspot.com...sequences.html
    * Have you tried SoapDenovo2 (just released)
    Last edited by Torst; 01-07-2013, 08:42 PM. Reason: typo

    Comment


    • #3
      Originally posted by Torst View Post
      Some questions:

      Q. What was the mate-pair insert/fragment size?
      Mean insert = 4400 b

      Q. Is the reference still a draft (contigs/scaffolds) or is it closed?
      Still a draft- 1386 scaffolds

      Q. What does "not quite as good" mean?
      Based on common assembly metrics, some are better some are worse. For example, scaffold metrics:

      ####scaffs, total_size, longest, mean, scaf>0.1M, scaf>1M, N50

      new1 1101, 51.8 Mb, 2.08 Mb, 47 kb, 101(9.2%), 16(1.5%), 0.6 Mb

      new2 2172, 5.22 Mb, 2.08 Mb, 21 kb, 101(4.0%), 13 (0.5%), 0.6 Mb

      oldref 1386, 50.9 Mb, 5.05 Mb, 36 kb, 64 (4.6%), 14(1%), 1.2 Mb

      The "new1" assembly is the PE data run with VelvetOptimiser (optimal kmer size = 85). The new2 assembly is PE + MP data (optimal kmer size = 87)
      I have also run all draft sets through the cegma pipeline as a benchmarking exercise, and they all give ~98-99% completeness for the KOG set of conserved protein seqs.

      Upon inspection, it appears that, except for the longest scaf and N50, the new1 assembly is pretty good (relative to the oldref assembly). Adding the MP data clearly splits and increases scaffolds, and slightly reduces scafs > 1 Mb. So, I'm unsure if this a good or a bad result at this point, i.e. are they breaking up because the PE contigs were poorly scaffolded, or because the MP data is problematic.

      I would also add that I have tried to run the data using these same kmer sizes with SOAPdenovo (see below).

      Q. Are you de novo assembling your genome, or trying to improve the reference?
      Actually, both... or either! Ultimately, whichever is determined to be the "best" assembly will be used downstream to map several expt'l RNA-Seq PE libs. via Tophat/Cufflinks for diff. exp analysis. So, while I'd like my assembly to come out better , I have no problem with trying to improve what's already available with my data. But, I'm beginning to think that there is a fair amount of nasty repeats, or perhaps my MP lib. is problematic

      Some comments:

      * the Velvet -long option might help, but look at the Columbus module instead, which is the reference-guided part of Velvet
      Thanks! The Velvet manual is a bit thin w.r.t. performing reference-based assembly.

      * the two genome assemblies are likely broken at the same places (mostly large/tricky repeats) so fixing one with the other won't gain much unless there were other coverage/bias problems.
      Probably so, but a better comparison/analysis is needed to know for sure.

      * there are tools like GapFiller, IMAGE etc for improving assemblies via read alignment back
      I have run Gapfiller against all three of the above assemblies, using both my PE and MP reads, and, other than resolving/removing some N's, no gaps were closed....

      * you may wish to use SSPACE etc instead of Velvet for the mate pair scaffolding
      Haven't used that before, I will take a look.

      * Illumina Mate Pair data can be very dodgy sometimes: http://thegenomefactory.blogspot.com...sequences.html
      Yes, you and I have chatted about this recently on that blog (me as WallyCuda).

      Have you tried SoapDenovo2 (just released)
      I have not, but I am installing it now and will attempt very soon, thanks!

      I have run Soapdenovo with the same kmer size as in Velvet; however, if I include the MP lib. (using the exact same rev. comp. read files that Velvet is quite happy with) Soap throws a segmentation fault/core dump error, which perhaps is another reason to suspect that something isn't quite cricket with the MP lib? The Soap PE assembly is not nearly as good w.r.t. above metrics as either the PE or PE+MP Velvet assemblies. I am running Gapfiller to see if I can improve it, but I'm not optimistic...

      Thanks for your recommendations, Toresten. Now that you have a few more details, maybe you have some thoughts regarding the effect the MP lib. has when included in the assembly.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Exploring the Dynamics of the Tumor Microenvironment
        by seqadmin




        The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
        07-08-2024, 03:19 PM
      • seqadmin
        Exploring Human Diversity Through Large-Scale Omics
        by seqadmin


        In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
        06-25-2024, 06:43 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 06:53 AM
      0 responses
      12 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 07-10-2024, 07:30 AM
      0 responses
      34 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 07-03-2024, 09:45 AM
      0 responses
      204 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 07-03-2024, 08:54 AM
      0 responses
      213 views
      0 likes
      Last Post seqadmin  
      Working...
      X