Low read counts from old PacBio data


  • Low read counts from old PacBio data

    I've been helping other people process PacBio data using the SMRT Analysis software. For our purposes, I've just been using the Reads of Insert protocol. This has worked well for PacBio data we have from several species, but now I'm having trouble with the data for one particular species, where it produces an order of magnitude fewer reads than expected (about 300-500 versus 5000+). Adjusting the quality and coverage parameters makes minimal difference. The data is originally from 2013, but data for another species sequenced at the same time appears to be fine.

    Comparing the folders of the problematic species side by side with those of species where we had no issues, all the files appear to be present, and the raw data files are consistent in size. However, the generated CCS and subread FASTA/FASTQ files that we received with the raw data are all an order of magnitude smaller for the problematic species, which leads me to believe that the problem lies not with the analysis but with the original sequencing process.
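The file-size comparison above can be made more precise by counting records instead of bytes. The following is a minimal sketch, not part of SMRT Analysis: the function names are hypothetical, the FASTQ counter assumes the common four-lines-per-record layout, and the demo uses in-memory strings where real use would pass open file handles.

```python
import io


def count_fasta_records(lines):
    """Count FASTA records; each record starts with a '>' header line."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fastq_records(lines):
    """Count FASTQ records, assuming exactly four lines per record.

    Counting '@' lines would be unreliable because a quality string
    may itself begin with '@'.
    """
    non_empty = sum(1 for line in lines if line.strip())
    if non_empty % 4 != 0:
        raise ValueError("line count is not a multiple of 4; "
                         "file may be truncated or use multi-line records")
    return non_empty // 4


def fold_difference(expected, observed):
    """Return how many times smaller the observed count is than expected."""
    if observed <= 0:
        raise ValueError("observed count must be positive")
    return expected / observed


# Tiny in-memory examples; the second FASTQ quality line starts with '@'.
fasta = io.StringIO(">read1\nACGT\n>read2\nGGCC\n")
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@III\n")

print(count_fasta_records(fasta))
print(count_fastq_records(fastq))
# Read counts of the magnitude mentioned in the post (illustrative values).
print(fold_difference(5000, 400))
```

Running the same counters on the files for the good and the problematic species gives a direct read-count ratio, which is easier to interpret than file sizes.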

    So, the question: what could have gone wrong such that the raw .h5 data files all appear to be a typical size (~1GB each), but the analysis software is only detecting <10% of the expected reads? Coverage and quality seem fine for the reads that it does detect.

    Thank you.
    Last edited by anjama; 09-20-2016, 08:34 AM.

  • #2
    PacBio machines do not generate a fixed number of reads. The number of reads from the Reads of Insert protocol is highly dependent on loading, which can vary between libraries. Do you have the P0, P1 and P2 statistics? These are generated by the Reads of Insert protocol (loading report) when running on the command line or GUI.


    • #3
      I don't see anything called a loading report, or any report that gives P0, P1, P2 statistics. I've attached screenshots of the reports I do have. The left side is the problematic species. The right side is a similar species that was submitted at the same time and produced results similar to what I've seen from several other species sequenced recently.

      For the problematic species, this was apparently the second attempt at sequencing it. The first attempt produced abnormally small output files (about 50% the size of everything else we have typically gotten), so they did it again. Running the Reads of Insert protocol on the original bad run gives similar results to the supposedly good run.

      I'm not really involved with any of the sample preparation or sequencing, so I only have a basic understanding of how PacBio works. I don't know how to tell whether this was an issue with the sample preparation itself, the sequencing machine, or the data output. What has me curious/confused is why the raw data files are so large, yet the analysis tools seem to see only a small amount of data, particularly because the data they do see has perfectly fine coverage and read quality. Mean length is a touch lower than I typically see, but not enough that I can confidently call it abnormal.


      • #4
        This is characteristic of a loading issue, as rhall indicated.

        Unfortunately, the ReadsOfInsert protocols don't give you the loading report information. However, you can lump all of the cells together in one job (or run them separately) and map them to a reference using the RS_Resequencing workflow. One of the reports generated is a loading efficiency report, which should indicate how well each SMRT Cell was loaded relative to the others.

        You can use any reference if you don't care about the mapping and only want the loading report.

        Attached is a sample loading report:
        productivity 0 - empty ZMWs
        productivity 1 - ZMWs loaded with a single polymerase (what you want)
        productivity 2 - ZMWs loaded with more than one polymerase (mostly unusable)
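To compare cells, the three productivity counts can be turned into fractions of all ZMWs. This is a minimal sketch under stated assumptions: `productivity_fractions` is a hypothetical helper, not a SMRT Analysis function, and the counts in the example are made up for illustration and should be replaced with the values from your own loading report.

```python
def productivity_fractions(p0, p1, p2):
    """Convert ZMW counts per productivity class into fractions.

    p0: empty ZMWs
    p1: ZMWs loaded with a single polymerase
    p2: ZMWs loaded with more than one polymerase
    """
    if min(p0, p1, p2) < 0:
        raise ValueError("counts must be non-negative")
    total = p0 + p1 + p2
    if total == 0:
        raise ValueError("at least one count must be positive")
    return {"P0": p0 / total, "P1": p1 / total, "P2": p2 / total}


# Illustrative counts only (not taken from any real loading report).
fractions = productivity_fractions(p0=600, p1=300, p2=100)
for name, value in fractions.items():
    print(f"{name}: {value:.1%}")
```

Working with fractions makes cells comparable even if the reports list raw ZMW counts.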


        • #5
          Okay, I generated a loading report. The first row is the species from which we got typical results. The second row is the first attempt at the problematic species. The third row is the second attempt.

          Judging by this thread:

          None of these results look particularly good. Even the one that I thought was good looks like it might be underloaded. Granted, I don't know what typical values are, or how they might vary across techniques.

          What should I be taking away from these numbers? Thanks
          Last edited by anjama; 09-20-2016, 08:35 AM.


          • #6
            It looks like the first two cells were underloaded and produced mostly nothing, while the third cell was massively overloaded, so most of its data was unusable. My understanding is that when you overload cells, the usable portion tends to be short inserts, because shorter molecules diffuse into the ZMWs faster.
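The reading above (mostly empty cells versus a mostly multiply loaded cell) can be expressed as a rough rule of thumb. The sketch below is only an assumption-laden illustration: it labels a cell by whichever productivity class holds the most ZMWs, uses hypothetical function names, and does not encode any official PacBio threshold. The example counts are made up.

```python
# Labels keyed by the dominant productivity class. The wording is
# deliberately hedged: a dominant class suggests, but does not prove,
# a loading problem.
LABELS = {
    "P0": "mostly empty ZMWs (possibly underloaded)",
    "P1": "mostly single-polymerase ZMWs",
    "P2": "mostly multi-polymerase ZMWs (possibly overloaded)",
}


def dominant_productivity(p0, p1, p2):
    """Return the productivity class with the largest ZMW count.

    Ties resolve in the order P0, P1, P2.
    """
    counts = {"P0": p0, "P1": p1, "P2": p2}
    if min(counts.values()) < 0:
        raise ValueError("counts must be non-negative")
    if sum(counts.values()) == 0:
        raise ValueError("at least one count must be positive")
    return max(counts, key=counts.get)


def describe_loading(p0, p1, p2):
    """Return a hedged description of the cell's loading."""
    return LABELS[dominant_productivity(p0, p1, p2)]


# Illustrative counts only.
print(describe_loading(p0=900, p1=80, p2=20))
print(describe_loading(p0=50, p1=150, p2=800))
```

A cell flagged as possibly underloaded or overloaded is worth raising with whoever prepared and loaded the library, since the fix lies on the sequencing side and not in the analysis parameters.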