  • horvathdp
    Member
    • Dec 2011
    • 66

    Novice looking to use PacBio data

    Hey all,

    I am a complete novice with only the barest understanding of working with command-line interfaces (currently using resources at iPlant). I have a bunch of PacBio sequences and about 30X coverage with Illumina (100-base paired-end). I'd like to be able to correct the PacBio sequences with my Illumina reads. Anyone care to give me a step by step? Alternatively, I'd happily give authorship rights to anyone who wants to help me with my correction when I publish this work. I am dealing with a non-model invasive weed species (leafy spurge, Euphorbia esula), mostly trying to assemble gene space with promoters to help leverage a bunch of transcriptomics (microarray) data we have generated over the last 5 years.
  • samanta
    Senior Member
    • Feb 2010
    • 108

    #2
    Hello,

    Seems like an interesting problem. Here is what you need to do.

    (i) Please draw a k-mer distribution of the Illumina reads. I think your Illumina coverage (30X) is slightly on the lower side, but we will not know until we see the chart. You can compute the k-mer distribution using SOAPdenovo, DSK (http://minia.genouest.org/dsk/), or one of many other k-mer counting packages.


    (ii) Use any de Bruijn graph-based assembler to assemble the Illumina reads first, up to the contig level. My favorites are SOAPdenovo (because it can handle paired-end reads) and Minia (http://minia.genouest.org/) for being lightweight. Ideally, you should run the assembly at multiple k-mer values.


    (iii) Once you have the Illumina reads assembled, use BLASR (a tool distributed by PacBio) to map the Illumina contigs onto the long PacBio reads.

    Only after we have the results of this step can we talk about error correction of the PacBio reads.
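
    For step (i), here is a minimal sketch in plain Python of what a k-mer abundance histogram is. It is only an illustration on made-up toy reads. It is not how SOAPdenovo or DSK work internally, it does not merge a k-mer with its reverse complement as many counters do, and it would be far too slow and memory-hungry for a real 30X data set. The function names count_kmers and abundance_histogram are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def abundance_histogram(kmer_counts):
    """Map each abundance to the number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(kmer_counts.values()).items()))


# Toy reads, made up for illustration only.
toy_reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
hist = abundance_histogram(count_kmers(toy_reads, k=4))
for abundance, n_kmers in hist.items():
    print(abundance, n_kmers)
```

    Plotting abundance (x) against the number of distinct k-mers (y) gives the chart. With enough coverage, one would typically expect a peak well away from the low-abundance k-mers that mostly come from sequencing errors.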

    Also check the following commentary and discussions in the comment sections.

    When we started working on PacBio data one year back, everyone recommended PacBioToCA. Pause for a moment to imagine how summer of 2012 was. Everyone was talking about Illumina, 454, de Bruijn graph, Velvet assembler and so on, and these ‘weird’ reads show up from nowhere. Using an analogy, everyone is talking about pizza and BioMickWatson shows five other foods that are like genome assembly, namely Eton mess, spaghetti Bolognese, Marmite, ‘macaroni’ cheese and anchovite. The initial impulse is to turn all those into toppings for pizza to make them attractive.



    If all of that is too complicated, please email me at samanta at homolog.us, and we can discuss further.
    Last edited by samanta; 07-30-2013, 08:40 AM. Reason: error in text
    http://homolog.us


    • horvathdp
      Member
      • Dec 2011
      • 66

      #3
      Thanks for the reply! So, the readme file is really sparse, and I could not find a link to a manual (even in the associated paper published in BMC). Any chance you have a link to the manual? Also, let me run my process by you just to see if I am on the right track:

      Open an instance in iPlant Atmosphere (Ubuntu or linuxbiocloud-32bit?).
      Run the script:

      wget -L "http://minia.genouest.org/dsk/dsk-1.5280.tar.gz"
      tar -xzf dsk-1.5280.tar.gz
      cd ./dsk
      make

      From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I’d be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA, no problem there. However, it is unclear whether the paired-end reads should/could be interlaced, or whether I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?


      • samanta
        Senior Member
        • Feb 2010
        • 108

        #4
        Originally posted by horvathdp
        From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I’d be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA, no problem there. However, it is unclear whether the paired-end reads should/could be interlaced, or whether I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?
        Oh well.

        Please send me an email and I will try to walk you through the steps. Maybe we need to start in a different way.
        http://homolog.us


        • rchikhi
          Member
          • Jan 2013
          • 11

          #5
          Originally posted by horvathdp
          wget -L "http://minia.genouest.org/dsk/dsk-1.5280.tar.gz"
          tar -xzf dsk-1.5280.tar.gz
          cd ./dsk
          make

          From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I’d be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA, no problem there. However, it is unclear whether the paired-end reads should/could be interlaced, or whether I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?
          Hello,

          I regret that you had issues running DSK. You were on the right track though.
          If anyone reads this and wonders what the answers to his questions are:
          • The input data need not be FASTA. The README file provides some guidance:
            * File input can be fasta, fastq, gzipped or not.
            * To pass several files as input : create a file with the list of file names (one per line), and pass this file to dsk
          • Format of paired-end reads (interlaced or not), and whether to combine libraries of different insert sizes or not: how the reads are paired does not matter, because DSK sees the reads as a multiset of k-mers.
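
          To illustrate both points, here is a minimal Python sketch under stated assumptions. It is not DSK code. The reads are made-up toys, the library file names are hypothetical placeholders, and kmer_multiset and write_file_list are names invented for this example. It shows that the multiset of k-mers is the same whether paired reads are kept in separate files or interleaved, and it writes a list file with one input file name per line, as the README describes. How that list file is then passed to dsk on the command line is not shown here.

```python
import os
import tempfile
from collections import Counter


def kmer_multiset(reads, k):
    """Return the multiset (k-mer -> count) over all reads; read order is irrelevant."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def write_file_list(paths, list_path):
    """Write one input file name per line (the list format quoted from the README)."""
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")


# Toy paired-end reads, made up for illustration.
forward = ["ACGTAC", "GGGTTT"]
reverse = ["TTGACA", "CCATGG"]

separate = forward + reverse  # all forward reads, then all reverse reads
interleaved = [read for pair in zip(forward, reverse) for read in pair]

same = kmer_multiset(separate, 3) == kmer_multiset(interleaved, 3)
print("same k-mer multiset:", same)

# Hypothetical library file names; substitute your own.
libraries = ["lib270_a.fastq", "lib270_b.fastq", "lib390_a.fastq", "lib390_b.fastq"]
list_path = os.path.join(tempfile.mkdtemp(), "illumina_files.txt")
write_file_list(libraries, list_path)
```

          Since only the k-mer content matters for counting, combining the four libraries into one list file should be fine for this step; the pairing and insert-size information only becomes relevant later, at assembly or scaffolding time.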


          However, there are easier ways to correct PacBio reads using Illumina reads than re-inventing the wheel. There are at least two existing tools, PacBioToCA and LSC.
