  • Novice looking to use PacBio data

    Hey all,

    I am a complete novice with only the barest understanding of working with command line interfaces (currently using resources at iPlant). I have a bunch of PacBio sequences and about 30X coverage with Illumina (100 bp paired-end). I'd like to be able to correct the PacBio sequences with my Illumina reads. Anyone care to give me a step by step? Alternatively, I'd happily give authorship rights to anyone who wants to help me with my correction when I publish this work. I am dealing with a non-model invasive weed species (leafy spurge, Euphorbia esula), mostly trying to assemble gene space with promoters to help leverage a bunch of transcriptomics (microarray) data we have generated over the last 5 years.

  • #2
    Hello,

    Seems like an interesting problem. Here is what you need to do.

    (i) Please plot a k-mer distribution of the Illumina reads. I think your Illumina coverage (30X) is slightly on the low side, but we will not know until we see the chart. You can generate the k-mer distribution with SOAPdenovo, DSK (http://minia.genouest.org/dsk/), or many other k-mer counting packages.
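
For readers unsure what a k-mer distribution is, here is a minimal Python sketch of the idea. It is an illustration only, not how DSK or SOAPdenovo work internally: the function names and the toy reads are made up for this example, it holds everything in memory (so it will not scale to a real 30X dataset), and unlike most real counters it does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    # Toy reads (hypothetical); real input would come from FASTQ/FASTA files.
    toy_reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
    histo = kmer_histogram(count_kmers(toy_reads, k=4))
    for abundance in sorted(histo):
        print(abundance, histo[abundance])
```

Plotting abundance (x-axis) against the number of distinct k-mers (y-axis) gives the chart asked for in step (i). On real data one would typically expect a spike of low-abundance, mostly erroneous k-mers, followed by a peak near the effective k-mer coverage.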

    (ii) Use any de Bruijn graph-based assembler to assemble the Illumina reads first, up to the contig level. My favorites are SOAPdenovo (because it can handle paired-end reads) and Minia (http://minia.genouest.org/), which is lightweight. Ideally, you should run the assembly at multiple k-mer values.


    (iii) Once you have the Illumina reads assembled, use BLASR (a tool distributed by PacBio) to map the Illumina contigs onto the long PacBio reads.

    Only once we have the results of this step can we talk about error correction of the PacBio reads.

    Also check the following commentary and discussions in the comment sections.

    When we started working on PacBio data one year back, everyone recommended PacBioToCA. Pause for a moment to imagine how summer of 2012 was. Everyone was talking about Illumina, 454, de Bruijn graph, Velvet assembler and so on, and these ‘weird’ reads show up from nowhere. Using an analogy, everyone is talking about pizza and BioMickWatson shows five other foods that are like genome assembly, namely Eton mess, spaghetti Bolognese, Marmite, ‘macaroni’ cheese and anchovite. The initial impulse is to turn all those into toppings for pizza to make them attractive.



    If all of that is too complicated, please email me at samanta at homolog.us, and we can discuss further.
    Last edited by samanta; 07-30-2013, 08:40 AM. Reason: error in text
    http://homolog.us


    • #3
      Thanks for the reply! So, the README file is really sparse, and I could not find a link to a manual (even in the associated paper published in BMC). Any chance you have a link to the manual? Also, let me run my process by you just to see if I am on the right track:

      open an instance in iPlant atmosphere (ubuntu or linuxbiocloud-32bit?).
      run the script:

      wget -L "http://minia.genouest.org/dsk/dsk-1.5280.tar.gz"
      tar -xzf dsk-1.5280.tar.gz
      cd ./dsk
      make

      From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I’d be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA; no problem there. However, it is unclear if the paired-end reads should/could be interlaced, or if I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?


      • #4
        Originally posted by horvathdp
        From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I’d be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA; no problem there. However, it is unclear if the paired-end reads should/could be interlaced, or if I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?
        Oh well.

        Please send me an email and I will try to walk you through the steps. Maybe we need to start in a different way.
        http://homolog.us


        • #5
          Originally posted by horvathdp
          wget -L "http://minia.genouest.org/dsk/dsk-1.5280.tar.gz"
          tar -xzf dsk-1.5280.tar.gz
          cd ./dsk
          make

          From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I’d be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA; no problem there. However, it is unclear if the paired-end reads should/could be interlaced, or if I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?
          Hello,

          I regret that you had issues running DSK. You were on the right track though.
          If anyone reads this and wonders what the answers to his questions are:
          • The input data need not be FASTA. The README file provides some guidance:
            * File input can be fasta, fastq, gzipped or not.
            * To pass several files as input : create a file with the list of file names (one per line), and pass this file to dsk
          • Format of paired-end reads (interlaced or not), and whether to combine libraries with different insert sizes: for k-mer counting, neither matters. DSK sees the reads as a multiset of k-mers, so pairing and insert size are ignored.
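
The two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names are hypothetical placeholders for the four libraries, `write_file_list` and `kmer_multiset` are helper names invented here, and nothing in it calls DSK itself. The first function writes the one-name-per-line list file that the README describes; the second shows on toy reads why read order (interlaced or not) cannot change the k-mer counts.

```python
import tempfile
from collections import Counter
from pathlib import Path


def write_file_list(read_files, list_path):
    """Write one read-file name per line, as the DSK README describes."""
    Path(list_path).write_text("".join(f"{name}\n" for name in read_files))
    return list_path


def kmer_multiset(reads, k):
    """Multiset of all k-mers in the reads; read order and pairing are ignored."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


if __name__ == "__main__":
    # Hypothetical file names standing in for the four libraries.
    libraries = ["lib270_a.fastq", "lib270_b.fastq",
                 "lib390_a.fastq", "lib390_b.fastq"]
    with tempfile.TemporaryDirectory() as tmp:
        list_file = write_file_list(libraries, Path(tmp) / "reads.list")
        print(Path(list_file).read_text(), end="")

    # Toy paired reads: interlaced vs. separate order gives the same multiset.
    forward = ["ACGTAC", "GGGTTT"]
    reverse = ["TTACGG", "CCCAAA"]
    separate = forward + reverse
    interlaced = [forward[0], reverse[0], forward[1], reverse[1]]
    print(kmer_multiset(separate, 3) == kmer_multiset(interlaced, 3))
```

The resulting list file is what would be passed to dsk in place of a single read file; check the README shipped with your DSK version for the exact command-line arguments.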


          However, there are easier ways to correct PacBio reads using Illumina than re-inventing the wheel. There are at least two existing tools, PacBioToCA and LSC.
