  • jpummil
    Member
    • Apr 2014
    • 85

    Genome Size Estimation from PacBio Raw Reads

    I'm working on a de novo assembly using Canu, and it seems to be VERY sensitive to the required genomeSize=XXX parameter. As this is a new project, no one has an actual size for the genome (I checked T. Ryan Gregory's site and found nothing similar there either).

    So I am using the BBMap suite, specifically the "kmercountexact.sh" component. I'm waiting on a compute node with >64GB of RAM to run it, but I have it set up as follows:

    kmercountexact.sh in=filtered_subreads.fastq khist=khist.txt peaks=peaks.txt out=genomesize.txt

    As Brian Bushnell is active on here, I was hoping to ask about using this on PacBio data specifically. Is there anything I need to be more specific about in the options? Also, can I specify both of my PacBio files as arguments? I have a .fastq of the long reads as well as a .fasta of much shorter reads supplied by the sequencer people. I know it can take PE files as in= and in2=, but what about files that are essentially "single" reads?
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Hi Jeff,

    Unfortunately, I don't have a good method for this. I've tried kmercountexact, and it does not work on raw PacBio reads due to the high error rate. I do not know of a better method for genome size estimation than assembling, with Falcon, for example. Sorry!

    If you have multiple files, though, you can enter them comma-delimited, like this:

    Code:
    kmercountexact.sh in=filtered1.fq,filtered2.fq
    Not all tools support that, but Tadpole, KmerCountExact, and Dedupe do.


    • jpummil
      Member
      • Apr 2014
      • 85

      #3
      Thanks for the quick response, Brian!

      Good to know about the comma-delimited method for multiple entries. Unfortunate to hear about the PacBio error issue when trying to determine genome size. I thought about this a bit and am wondering if the pre-processing Canu does to the data could be used prior to trying kmercountexact? It outputs a couple of files during its run as it corrects, then trims the reads:

      <filename>.correctedReads.fasta.gz

      then

      <filename>.trimmedReads.fasta.gz

      Of course, they have been processed WITH the genomeSize estimate provided at run time and I'm not certain of how much that might have influenced any trimming or correction. I might try and contact Phillippy or Koren and inquire further ;-)


      • wdecoster
        Member
        • Oct 2015
        • 97

        #4
        Perhaps a quick and dirty assembly with miniasm can give you an idea? https://github.com/lh3/miniasm


        • jpummil
          Member
          • Apr 2014
          • 85

          #5
          Originally posted by wdecoster:
          Perhaps a quick and dirty assembly with miniasm can give you an idea? https://github.com/lh3/miniasm
          Thanks for the suggestion, wdecoster. I think I've avoided miniasm thus far because it appears to only output .gfa files? That limits further evaluation of the assembly, as most common tools still seem to take only .fasta.

          Update: Found a note in another thread about converting .gfa to .fasta. Trying it now...

          awk '/^S/{print ">"$2"\n"$3}' in.gfa | fold > out.fa
          Last edited by jpummil; 07-15-2016, 09:41 AM. Reason: Additional info
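As a rough follow-up to the conversion above, the summed contig length of the converted FASTA gives a first-pass size figure. This is only a sketch: the tiny GFA, its segment names, and the variable names are made up for illustration so the commands run without a real assembly, and the awk conversion is the one quoted in the post.

```shell
#!/bin/sh
# Build a tiny, made-up GFA so the commands can run without a real assembly.
workdir=$(mktemp -d)
printf 'H\tVN:Z:1.0\n' > "$workdir/in.gfa"
printf 'S\tutg000001l\tACGTACGTAC\n' >> "$workdir/in.gfa"
printf 'S\tutg000002l\tGGGTTTAAACCCGGG\n' >> "$workdir/in.gfa"
printf 'L\tutg000001l\t+\tutg000002l\t-\t4M\n' >> "$workdir/in.gfa"

# Conversion from the post: each S (segment) line becomes one FASTA record.
awk '/^S/{print ">"$2"\n"$3}' "$workdir/in.gfa" | fold > "$workdir/out.fa"

# Count records and sum the lengths of all non-header lines.
n_contigs=$(grep -c '^>' "$workdir/out.fa")
total_bp=$(awk '!/^>/{n += length($0)} END{print n + 0}' "$workdir/out.fa")
echo "contigs: $n_contigs  total bp: $total_bp"
```

On a real miniasm assembly of uncorrected reads, treat the total as a ballpark only: collapsed repeats tend to pull it down, and unmerged haplotypes or redundant contigs tend to push it up.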


          • jsoghigian
            Junior Member
            • Sep 2016
            • 1

            #6
            Originally posted by jpummil:
            Thanks for the suggestion, wdecoster. I think I've avoided miniasm thus far because it appears to only output .gfa files? That limits further evaluation of the assembly, as most common tools still seem to take only .fasta.

            Update: Found a note in another thread about converting .gfa to .fasta. Trying it now...

            awk '/^S/{print ">"$2"\n"$3}' in.gfa | fold > out.fa
            About to try this method myself - jpummil, were you successful in estimating genome size from your raw reads?


            • jpummil
              Member
              • Apr 2014
              • 85

              #7
              Originally posted by jsoghigian:
              About to try this method myself - jpummil, were you successful in estimating genome size from your raw reads?
              The assembly itself using miniasm and the conversion script from gfa to fasta worked fine, though the assembly isn't as "good" as from Canu.

              Still no really good way to estimate a genome size from the PacBio reads. Schatz put together a really nice tool called GenomeScope, but it currently only works with Illumina reads.


              • Markiyan
                Senior Member
                • Sep 2010
                • 126

                #8
                Try analysing the reads from the short inserts (multipass ones).

                You can try extracting the long raw reads from the short library inserts, which pass over the insert multiple times (CCS-like reads), doing self error correction, and then using k-mer counting software designed for Illumina, 454 or Sanger data.

                Also, please be aware that you may have to screen out the high copy number DNA (mitochondrial/plastid genomes) before doing k-mer counting.

                You could also get some PCR-free MiSeq data to complement your PacBio assembly. (This can be cheaper if your coverage is still too low.)
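For the k-mer counting route on error-corrected reads, the usual back-of-the-envelope estimate is total k-mer occurrences divided by the depth of the main coverage peak. The sketch below assumes a plain two-column histogram (depth, number of distinct k-mers at that depth) and a fixed error cutoff. The file contents, the `min_depth` value, and the variable names are invented for illustration, and a real histogram from any particular counter may use different columns, so check its header first.

```shell
#!/bin/sh
# Toy k-mer histogram: column 1 = depth, column 2 = number of distinct k-mers
# seen at that depth. All values are invented for illustration only.
hist=$(mktemp)
cat > "$hist" <<'EOF'
1 5000
2 800
3 50
8 100
9 300
10 600
11 300
12 100
EOF

# Depths below this cutoff are treated as sequencing-error k-mers (assumed value).
min_depth=4

# Find the tallest bin at or above the cutoff, then divide the total number of
# k-mer occurrences (depth * count) by that peak depth.
est=$(awk -v min="$min_depth" '
    $1 >= min {
        total += $1 * $2
        if ($2 > best) { best = $2; peak = $1 }
    }
    END { if (peak > 0) printf "%d %d\n", peak, total / peak }
' "$hist")
peak_depth=${est% *}
genome_size=${est#* }
echo "peak depth: $peak_depth  estimated size: $genome_size bp"
```

This treats the genome as having a single coverage peak, so heterozygous or highly repetitive genomes would need a proper model rather than this shortcut. High copy number organelle k-mers, as noted above, would also inflate the total if they are not screened out first.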


                • kartika
                  Banned
                  • Oct 2019
                  • 1

                  #9
                  Thank you.
