  • glacerda
    Member
    • Aug 2008
    • 27

    Zorro: The Masked Assembler (first public release)

    ZORRO is a hybrid sequencing technology assembler. It takes two sets of pre-assembled contigs and merges them into a more contiguous and consistent assembly. We have already tested Zorro with Illumina Solexa and 454 data from organisms with genomes ranging from 3 Mb to 100 Mb. The main characteristic of Zorro is the processing it applies before and after assembly to avoid errors.

    The ZORRO project is maintained by Gustavo Lacerda, Ramon Vidal and Marcelo Carazzole and was first used in this yeast assembly: Genome structure of a Saccharomyces cerevisiae strain widely used in bioethanol production

    ZORRO needs to be better documented and has not undergone enough testing. If you want to discuss the pipeline you can join the mailing list: zorro-google group

    ZORRO PIPELINE
    Zorro is based on the minimus2 pipeline (AMOS package) and uses MUMmer,
    AMOS and bowtie internally. Zorro takes two contig FASTA files as
    input (each representing the assembled contigs from a whole genome
    assembly) and one FASTA file containing some of the reads used for the
    assembly (10X coverage is enough; more will slow down the pipeline and
    consume more resources).
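
As a rough illustration of preparing the reads file, the sketch below estimates what fraction of a read set to keep to reach about 10X coverage and then subsamples at random. It is not part of Zorro; the function names `subsample_fraction` and `subsample_reads`, and the idea of estimating coverage as total read bases divided by genome size, are assumptions made for this example.

```python
import random

def subsample_fraction(genome_size, total_read_bases, target_coverage=10.0):
    """Fraction of reads to keep so that expected coverage is about target_coverage.

    Coverage is estimated as total_read_bases / genome_size (an assumption of
    this sketch). The result is capped at 1.0 when the data set is already
    at or below the target.
    """
    current_coverage = total_read_bases / genome_size
    return min(1.0, target_coverage / current_coverage)

def subsample_reads(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`.

    `records` can be any list of reads (for example, (header, sequence) tuples).
    A fixed seed makes the subsample reproducible.
    """
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]
```

For example, a 12 Mb genome with 600 Mb of reads is at about 50X, so `subsample_fraction(12_000_000, 600_000_000)` gives a keep fraction of roughly 0.2.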

    Zorro's initial phase detects inconsistencies in the assemblies and splits
    the contigs where they occur. Next, Zorro counts k-mers (default k=22)
    in the reads and uses the k-mer count table to detect and mask repeats
    in both assembly1 and assembly2. After repeat masking, Zorro uses nucmer
    to detect overlaps between assembly1 and assembly2 (no overlaps between
    contigs from the same assembly are allowed). All overlaps found in this
    phase are expected to be between unique regions (because repeats are
    masked). The overlaps are used to lay out and generate the consensus for
    the merged contigs, using AMOS tools. The merged contigs are built from
    the unmasked contigs, so the final merged assembly should include the
    repeat regions.
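
The masking step can be pictured with a small in-memory sketch. Zorro itself maps the repeat k-mers to the contigs with bowtie; the version below simply scans each contig window by window, which is only practical for toy inputs. Masking with `N` and checking the reverse complement are assumptions of this example, and `mask_repeats` is a hypothetical helper, not a Zorro function.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def mask_repeats(contig, repeat_kmers, k=22, mask_char="N"):
    """Return `contig` with every window matching a repeat k-mer masked.

    `repeat_kmers` is a set of upper-case k-mers flagged as repetitive.
    A window is masked if it, or its reverse complement, is in the set.
    """
    seq = contig.upper()
    masked = list(contig)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in repeat_kmers or revcomp(kmer) in repeat_kmers:
            # Overwrite all k positions covered by this window.
            masked[i:i + k] = mask_char * k
    return "".join(masked)
```

With `k=4` and the repeat set `{"ACGG"}`, the contig `TTTTACGGAAAA` becomes `TTTTNNNNAAAA`. Only the masked copy would be used for overlap detection; as described above, the merged contigs are built from the unmasked sequences.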

    A second, less stringent round of assembly tries to merge contigs that
    were not included in the first Zorro phase. All the contigs are written
    to <prefix>.ZORRO.fasta. We recommend using SSPACE to scaffold the
    ZORRO contigs.


    Zorro Website: www.lge.ibi.unicamp.br/zorro
  • RLB_84
    Junior Member
    • Nov 2010
    • 4

    #2
    Hi, I read about your pipeline and it seems quite interesting, as I'm a newbie user of MUMmer. I hope I can try your assembler as soon as possible. By the way, just one question about "masking": when I consider a contig dataset returned by Newbler, I should keep in mind that many of the contigs can be repeated, and I can't figure that out from the contig FASTA file alone (I need the other Newbler outputs). So, how does your assembler deal with the probable occurrence of a contig in more than one genome locus?


    • glacerda
      Member
      • Aug 2008
      • 27

      #3
      Hi RLB_84, that's a very good question.

      In addition to the contig FASTA files, Zorro takes as input the reads file (a subsample of WGS reads). The reads are used only to allow us to identify repeats in the contigs. Here are the technical details:

      1-Zorro counts the occurrences of 22-mers in the reads file supplied by the user
      2-22-mer words that are unique in the genome should occur in proportion to the genome coverage
      3-22-mer words that represent repeats should occur at least twice as often as the peak coverage
      4-we select the 22-mer words that occur at least twice as often as the mode of the distribution. These 22-mer words are used to mask the contig files using bowtie.

      This technique is used by many ab initio repeat detection tools. We do not need to screen repeat libraries and, even if Newbler (or other software) has collapsed the repeats, we can still detect them.
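
The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Zorro's actual code: the function names are hypothetical, k-mers are counted on the forward strand only, and counts below `min_count` are ignored when finding the mode so that rare, error-derived k-mers do not dominate the histogram.

```python
from collections import Counter

def count_kmers(reads, k=22):
    """Step 1: count every k-mer in the reads (windows containing N are skipped)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def coverage_mode(counts, min_count=2):
    """Steps 2-3: the most common k-mer count, taken as the peak coverage.

    Counts below `min_count` are ignored (an assumption of this sketch).
    Ties are resolved in favour of the smaller count.
    """
    histogram = Counter(c for c in counts.values() if c >= min_count)
    if not histogram:
        raise ValueError("no k-mer occurs at least min_count times")
    return max(histogram, key=lambda c: (histogram[c], -c))

def repeat_kmers(counts, mode, factor=2):
    """Step 4: k-mers seen at least `factor` times the mode are flagged as repeats."""
    return {kmer for kmer, c in counts.items() if c >= factor * mode}
```

If most k-mers are seen about 10 times, the mode is 10 and any k-mer seen 20 times or more is selected for masking, regardless of whether the assembler collapsed the repeat into a single contig.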


      • RLB_84
        Junior Member
        • Nov 2010
        • 4

        #4
        Thanks for the clarification, that sounds good! So, considering I'm trying to merge two sets of contigs returned by Newbler and ABySS, this approach can make it feasible to compare the coverage of the two datasets directly, as Newbler produces a bunch of files (very useful for coverage and so on), while the ABySS output needs more processing.

        Does the "less stringent" assembly phase take into account the repeat prediction and try to infer this information in the assembly itself?
