Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Systematic errors in PacBio?

    Please post what are the typical issues with data quality re:PacBio sequencing data (along with other sequencing technologies), specifically RS using C2 chemistry.

    For example, I know that 454 is pretty horrendous when it comes to short mononucleotide repeats (4-10 bases) due to in-well/on-bead PCR and a single combined signal, whereas Sanger, using dNTPs typically gives well separated peaks (at least for 400 or so bases). When I'm looking at my PacBio data, I often see a disparity in repeat length between CCS reads and the "filtered" (and "error corrected) reads where the CCS reads will frequently have a repeat length 1 longer than MOST of the filtered and EC'd reads. Often these appear to be in runs of "T".

    Error correction is being performed on the filtered data using PacBioToCA with the 454 data as the reference dataset. In the locations where I have issue with repeat length, most (but not all) of the 454 reads are typically the longer length, whereas the filtered (but not CCS) reads are shorter by one base. I have not ruled out that PacBioToCA is introducing a bias toward shorter repeats.

    (edit: says "3" for my following question, which is now in italics)
    How many times does a circular piece of DNA have to be sequenced for the PacBio algorithms to declare it as a CCS read? Is just partial overlap needed, or does the whole circle need to be sequenced many times? In other words, do I trust each CCS read about 3-5x more than an individual other filtered read? Or is the quality non-linear with respect to coverage.

    Is there a sweet-spot for accuracy when it comes to total length in PacBio? Is the middle 50% more accurately read than the 25% on each "end"? Is my observation that SN-indels occur mostly with a "T" accurate? Can I use that information to bias my base call from the reverse direction (which should have a run of "A"s)?

    Please let me know of other systematic biases that you know of when it comes to PacBio data

    PacBio Advantages vs 454/2nd gen
    Longer read
    No pre-sequencing amplification step
    Less subject to large indels

    PB Disadvantages
    Lower individual read accuracy - more read-to-read variability
    More algorithmically intense to assemble
    Last edited by pag; 06-20-2013, 07:20 PM.

  • #2
    Base call accuracy in PacBio should be independent of read position; I believe this has been shown but don't know the reference.

    What fold coverage of your sample do you have? Have your tried using HGAP instead of PacBioToCA? Have you tried Quiver?


    • #3
      Actually, you start to get closer to correct assemblies using greedy overlap (forming unitigs), so in a sense it is less algorithmically intense to assemble. Finding the overlaps is a bit more complicated than building a hash table (as in de Bruijn assembly), but it is just pairwise alignment and fast.


      Originally posted by pag View Post
      PB Disadvantages
      Lower individual read accuracy - more read-to-read variability
      More algorithmically intense to assemble


      • #4
        Fold coverage from MIRA during a mapping assembly to a single chromosome (manually assembled from MIRA and newbler runs, along with some Sanger data):
        Coverage assessment (calculated from contigs >= 5000):
        Avg. total coverage: 61.41
        Avg. coverage per sequencing technology
        454: 29.99
        PcBioHQ: 30.47

        denovo assembly using MIRA or PacBioToCA appeared to "split" too much rather than "lump": basically generating multiple contigs that were covering the same region of the genome. MIRA in particular appeared to then erroneously join across long repeats where there was no evidence that two (non-repetitive) regions of the genome were actually adjacent to the same copy of the repeat.

        I have not used HGAP, as I understand that would work with only the PacBio data and not the 454. Quiver I have not used yet as I do not have access to an installation of the SMRTPortal and "RS_Mapping_QVs." It is my understanding that Quiver operates on the data set of filtered reads from PacBio, not the PacBioToCA-corrected reads. Is this correct?


        • #5
          Quiver author here. You are correct that Quiver works from the raw PacBio (sub)reads, not corrected reads, and you need to provide it with a cmp.h5 file with all QVs loaded for best results.

          To get this cmp.h5 file without SMRTportal, you will have to (presently) use
          - compareSequences [to create the cmp.h5 without QVs], and
          - loadPulses [to import the QV tracks into the cmp.h5]

          Both of these tools are in the SMRTanalysis package.

          In the near future, we will be releasing a new tool called pbalign that makes this procedure easier, more automatic.

          Feel free to message me and we can help walk you through the exact details of using compareSequences, loadPulses ... and then I can add these details to the docs on GitHub.


          • #6
            Last edited by dalexander; 06-23-2013, 08:46 AM.


            • #7
              I believe I have Quiver installed now, but I get the following perplexing error
              TypeError: run() takes exactly 1 argument (2 given)
              File "smrtanalysis-2.0.1/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/EGG-INFO/scripts/", line 895, in run
     self.options.tmpDir )

              log.debug( self.options.tmpDir ) immediately above gives "/scratch"
              specifying self.options.tmpDir[1] only sends the "s" character, but I get the same error with run. How is one character 2 arguments? Also, why are we telling a directory to run?

              Edit: this was without options --algorithm=blasr --h5fn=aligned.cmp.h5 specified
              Last edited by pag; 06-21-2013, 07:29 PM.


              • #8
                running --algorithm=blasr --h5fn=aligned.cmp.h5 --debug --circularReference filtered_subreads.fasta reference_genome.fasta worked

                Edit: running loadPulses on the reads pre-filter seemed to "work" but see next post.

                but then loadPulses blah_s1_p0.bas.h5 aligned.cmp.h5
                WARNING: There is insufficient data to compute metric: ClassifierQV in the file blah_s1_p0.bas.h5 It will be ignored.
                WARNING: There is insufficient data to compute metric: pkmid in the file blah_s1_p0.bas.h5 It will be ignored.
                loading 33396 alignments for movie 1
                ERROR, the query sequence does not match the aligned query sequence.
                HoleNumber: 46311, MovieName: blah_s1_p0, ReadIndex: 46311, qStart: 17, qEnd: 2852
                Last edited by pag; 06-22-2013, 06:38 AM.


                • #9
                  after reconfiguring SWIG and several other items, I get
                  "Input CmpH5 file must be sorted." when running Quiver

                  I cannot seem to locate a sort script. Also, is it supposed to sort by quality, start position on the map, or by what criteria?
                  Last edited by pag; 06-22-2013, 05:49 PM.


                  • #10
                    Originally posted by dalexander View Post
                    I guess there isn't messaging functionality in this forum
                    Look at the top right where you see your name "dalexander". "private message" link is right below it.

                    People can use your forum name to email you privately.

                    If you want to take out your email address feel free.


                    • #11
                      Hi Matt, so it looks like you've got quiver and SMRTanalysis installed andyour current issue is sorting the cmp.h5.

                      In your SMRTanalysis install, there is a program called If you do

                      $ sort my.cmp.h5

                      it should produce a file out.cmp.h5 that is sorted.



                      • #12
                        That didn't appear to generate a new cmp.h5, but rather altered the existing one (aligned5; at least I think so based on the timestamp)

                        I then tried
                        loadPulses blah.bas.h5 aligned5.cmp.h5
                        WARNING: There is insufficient data to compute metric: ClassifierQV in the file blah.bas.h5 It will be ignored.
                        WARNING: There is insufficient data to compute metric: pkmid in the file blah.bas.h5 It will be ignored.

                        I also tried with switches -metrics QualityValue,ClassifierQV,InsertionQV,MergeQV,DeletionQV,SubstitutionQV,ClassifierQV

                        When trying to run Quiver, I get:
                        2013-06-23 13:20:26,158 [WARNING] This .cmp.h5 file lacks some of the QV data tracks that are required for optimal performance of the Quiver algorithm. For optimal results use the ResequencingQVs workflow in SMRTPortal with bas.h5 files from an instrument using software version 1.3.1 or later.


                        • #13
                          looks like I can get quiver to run if I specify every single track...I'm not sure which are the important ones other than the QV ones.

                          however, it runs in about 2 seconds flat, which seems like it probably didn't do anything other than generate the fasta file. I had been specifying my consensus in the parameters (-r), not my entire assembly. As my assembly was performed in MIRA and consed on PacBioToCA error-corrected filtered reads (and renamed), what file should be used?

                          Or did one of the other programs already perform the polishing and quiver just put it in fasta format?
                          Last edited by pag; 06-23-2013, 02:45 PM.


                          • #14
                            Try running quiver with -v to see what it is doing.

                            The QVs you need for quiver are listed here:
                            GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

                            In particular, it looks like you are missing the SubstitutionTag.


                            • #15
                              quiver -v -j4 cp6.cmp.h5 -r consensus.fasta -o quiver-assembly.fasta
                              2013-06-23 17:03:40,711 [INFO] h5py version: 2.0.1
                              2013-06-23 17:03:40,711 [INFO] hdf5 version: 1.8.4
                              2013-06-23 17:03:40,714 [INFO] ConsensusCore version: 0.6.1
                              2013-06-23 17:03:40,714 [INFO] Starting.
                              2013-06-23 17:03:41,224 [INFO] Peeking at CmpH5 file cp6.cmp.h5
                              2013-06-23 17:03:41,225 [INFO] Input CmpH5 data: numAlnHits=28557
                              2013-06-23 17:03:41,225 [INFO] Loading reference
                              2013-06-23 17:03:41,437 [INFO] Loaded 1 of 1 reference groups from consensus.fasta
                              2013-06-23 17:03:41,491 [INFO] Using Quiver parameter set unknown.AllQVsModel
                              2013-06-23 17:03:41,506 [INFO] Available CPUs: 4
                              2013-06-23 17:03:41,506 [INFO] Requested workers: 4
                              2013-06-23 17:03:41,506 [INFO] Parallel Mode: Process
                              2013-06-23 17:03:41,560 [INFO] Launched compute slaves.
                              2013-06-23 17:03:41,563 [INFO] Launched collector slave.
                              2013-06-23 17:03:41,648 [INFO] Quiver operating on (consensus fasta):14-112
                              2013-06-23 17:03:41,649 [INFO] Quiver operating on (consensus fasta):928102-928132
                              2013-06-23 17:03:41,651 [INFO] Usable coverage in (consensus fasta):14-112: [(14, 107)]
                              2013-06-23 17:03:41,651 [INFO] Quiver operating on (consensus fasta):1121891-1121935
                              2013-06-23 17:03:41,652 [INFO] Usable coverage in (consensus fasta):928102-928132: [(928107, 928127)]
                              2013-06-23 17:03:41,664 [INFO] Usable coverage in (consensus fasta):1121891-1121935: [(1121896, 1121919), (1121919, 1121935)]
                              2013-06-23 17:03:41,664 [INFO] Quiver operating on (consensus fasta):1121949-1121964
                              2013-06-23 17:03:41,668 [INFO] Usable coverage in (consensus fasta):1121949-1121964: [(1121949, 1121962)]
                              2013-06-23 17:03:44,407 [INFO] Quiver operating on (consensus fasta):1198804-1198837
                              2013-06-23 17:03:44,415 [INFO] Usable coverage in (consensus fasta):1198804-1198837: [(1198809, 1198829)]
                              2013-06-23 17:03:47,709 [INFO] Analysis completed.
                              2013-06-23 17:03:47,709 [INFO] Output files completed.
                              2013-06-23 17:03:48,625 [INFO] Finished.

                              what does this mean? only very restricted regions were usably covered? Or these were the only regions it saw fit to correct? Edit: setting -X 5 or --minCoverage 5 doesn't seem to make a difference.
                              Last edited by pag; 06-23-2013, 04:37 PM.


                              Latest Articles


                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin

                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin

                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM





                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              Last Post seqadmin