
  • Novice looking to use PacBio data

    Hey all,

    I am a complete novice with only the barest understanding of working with command-line interfaces (currently using resources at iPlant). I have a bunch of PacBio sequences and about 30X coverage with Illumina (100 base pair paired-end). I'd like to be able to correct the PacBio sequences with my Illumina reads. Would anyone care to give me a step-by-step? Alternatively, I'd happily give authorship rights to anyone who wants to help me with my correction when I publish this work. I am dealing with a non-model invasive weed species (leafy spurge, Euphorbia esula), mostly trying to assemble gene space with promoters to help leverage a bunch of transcriptomics (microarray) data we have generated over the last 5 years.

  • #2
    Hello,

    Seems like an interesting problem. Here is what you need to do.

    (i) Please plot a k-mer distribution of the Illumina reads. I think your Illumina coverage (30X) is slightly on the low side, but we will not know until we see the chart. You can generate the k-mer distribution with SOAPdenovo, DSK (http://minia.genouest.org/dsk/) or many other k-mer counting packages.
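
For readers unsure what a k-mer distribution is, here is a minimal Python sketch of the idea. It is only an illustration on toy reads: the function names and the example reads are made up for this post, it keeps everything in memory, and it does not merge reverse complements, so it is not how DSK or SOAPdenovo work internally and it will not scale to a real 30X data set.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer as written in the reads (no reverse-complement merging)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases.
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Hypothetical toy reads, only to show the shape of the output.
reads = ["ACGTACGTAC", "CGTACGTACG", "ACGTTCGTAC"]
spectrum = kmer_spectrum(count_kmers(reads, k=5))
for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

Plotting abundance (x axis) against the number of distinct k-mers (y axis) gives the chart asked for above. On real data one would normally expect a spike of low-abundance k-mers from sequencing errors and a peak near the effective k-mer coverage.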



    (ii) Use any de Bruijn graph-based assembler to assemble the Illumina reads first, up to contig level. My favorites are SOAPdenovo (because it can handle paired-end reads) and Minia (http://minia.genouest.org/) for being lightweight. Ideally you should run the assembly at multiple k-mer values.
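
To give a feel for why the choice of k matters, here is a toy Python sketch of a de Bruijn graph built from reads. It is an assumption-laden illustration, not the algorithm of SOAPdenovo or Minia: the function names and reads are hypothetical, and real assemblers also handle reverse complements, sequencing errors and graph simplification.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Return {prefix (k-1)-mer: set of suffix (k-1)-mers}, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def summarize(reads, k):
    """Report node, edge and branching-node counts of the graph for one k."""
    graph = de_bruijn_edges(reads, k)
    nodes = set(graph)
    for targets in graph.values():
        nodes |= targets
    edges = sum(len(targets) for targets in graph.values())
    # A node with more than one outgoing edge is a point where contigs must break.
    branching = sum(1 for targets in graph.values() if len(targets) > 1)
    return {"k": k, "nodes": len(nodes), "edges": edges, "branching_nodes": branching}


# Hypothetical overlapping toy reads.
reads = ["ACGTACGGA", "GTACGGATT"]
for k in (3, 4, 5):
    print(summarize(reads, k))
```

Running the summary at several k values shows the trade-off in miniature: a smaller k tends to create more branching nodes (repeats collapse together), while a larger k needs more coverage for consecutive k-mers to overlap.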


    (iii) Once you have the Illumina reads assembled, use BLASR (a tool distributed by PacBio) to map the Illumina contigs onto the long PacBio reads.

    Only after we have the results of this step can we talk about error correction of the PacBio reads.

    Also check the following commentary and discussions in the comment sections.

    When we started working on PacBio data one year back, everyone recommended PacBioToCA. Pause for a moment to imagine how summer of 2012 was. Everyone was talking about Illumina, 454, de Bruijn graph, Velvet assembler and so on, and these ‘weird’ reads show up from nowhere. Using an analogy, everyone is talking about pizza and BioMickWatson shows five other foods that are like genome assembly, namely Eton mess, spaghetti Bolognese, Marmite, ‘macaroni’ cheese and anchovite. The initial impulse is to turn all those into toppings for pizza to make them attractive.



    If all those are too complicated, please email me at samanta at homolog.us, and we can discuss further.
    Last edited by samanta; 07-30-2013, 08:40 AM. Reason: error in text
    http://homolog.us



    • #3
      Thanks for the reply! The readme file is really sparse, and I could not find a link to a manual (even in the associated paper published in BMC). Any chance you have a link to the manual? Also, let me run my process by you just to see if I am on the right track:

      Open an instance in iPlant Atmosphere (Ubuntu or linuxbiocloud-32bit?).
      Run the script:

      wget -L "http://minia.genouest.org/dsk/dsk-1.5280.tar.gz"
      tar -xzf dsk-1.5280.tar.gz
      cd ./dsk
      make

      From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I'd be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA; no problem there. However, it is unclear whether the paired-end reads should/could be interlaced, or whether I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?



      • #4
        Originally posted by horvathdp View Post
        From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I'd be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA; no problem there. However, it is unclear whether the paired-end reads should/could be interlaced, or whether I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?
        Oh well.

        Please send me an email and I will try to walk you through the steps. Maybe we need to start in a different way.
        http://homolog.us



        • #5
          Originally posted by horvathdp View Post
          wget -L "http://minia.genouest.org/dsk/dsk-1.5280.tar.gz"
          tar -xzf dsk-1.5280.tar.gz
          cd ./dsk
          make

          From here I am pretty lost. Without a manual, I am not even sure what format my input files need to be in, nor do I have a list of the arguments or the order in which they are supposed to be given. If you had a model script, I'd be most appreciative. From the paper, it seems clear that I need to convert my FASTQ files into FASTA; no problem there. However, it is unclear whether the paired-end reads should/could be interlaced, or whether I should/could combine my four libraries (two have inserts of about 270 bases and two have inserts of about 390 bases). Any thoughts or suggestions?
          Hello,

          I regret that you had issues running DSK. You were on the right track though.
          If anyone reads this and wonders what the answers to his questions are:
          • The input data need not be FASTA. The README file provides some guidance:
            * File input can be fasta, fastq, gzipped or not.
            * To pass several files as input : create a file with the list of file names (one per line), and pass this file to dsk
          • Format of paired-end reads (interlaced or not), and whether to combine libraries with different insert sizes: how the reads are paired does not matter, because DSK sees the reads as a multiset of k-mers.
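
The "multiset of k-mers" point can be checked on toy data with a small Python sketch. This is only an illustration of the principle, with hypothetical reads and function names; it is not DSK's disk-based implementation, and it counts k-mers as written without any reverse-complement handling.

```python
from collections import Counter


def kmer_multiset(reads, k):
    """Return the multiset of k-mers (k-mer -> count) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def interleave(forward, reverse):
    """Interlace mates: forward 1, reverse 1, forward 2, reverse 2, and so on."""
    out = []
    for fwd, rev in zip(forward, reverse):
        out.extend([fwd, rev])
    return out


# Hypothetical mates from two toy libraries with different insert sizes.
lib_a_fwd, lib_a_rev = ["ACGTAC", "GGATCC"], ["TTGACA", "CATGCA"]
lib_b_fwd, lib_b_rev = ["ACGTTT"], ["GGGACG"]

k = 4
separate = kmer_multiset(lib_a_fwd + lib_a_rev + lib_b_fwd + lib_b_rev, k)
interlaced = kmer_multiset(
    interleave(lib_a_fwd, lib_a_rev) + interleave(lib_b_fwd, lib_b_rev), k
)
print(separate == interlaced)
```

Because the counts depend only on which reads are present, not on their order or pairing, the comparison holds whether the mates are kept in separate files, interlaced, or pooled across libraries.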


          However, there are easier ways to correct PacBio reads using Illumina than re-inventing the wheel. There are at least two existing tools, PacBioToCA and LSC:
