
  • how to measure the quality of PacBio data

    Hi,

    How do I measure the quality of PacBio reads? If you have multiple SMRT cells, which tools can you use, and how do you judge from the read quality whether you can proceed with assembly and further analysis or whether you need to resequence?
    Last edited by Medhat; 08-30-2014, 02:19 AM.

  • #2
    Do you have access to the analysis reports generated by the SMRTanalysis package? The simple filtering and "reads of insert" analyses are good ones to start with. You could ask your sequence provider for them.

    If you have the *.h5 and metadata.xml files for the SMRTcells, you could generate the reports yourself using the SMRTanalysis/SMRTPortal software. Fair warning: if you are not IT savvy, getting SMRTanalysis set up correctly is not a simple task.

    There is information/wiki available here: https://github.com/PacificBioscience...-Training/wiki

    Also video tutorials/webinars: https://github.com/PacificBioscience...atics-Workshop
    Last edited by GenoMax; 08-30-2014, 09:38 AM.



    • #3
      I have already installed SMRTPortal on my PC and I have the .h5 files, but which protocol should I use to generate the data?



      • #4
        Originally posted by Medhat:
        I already installed SMRTPortal on my PC and I also have the .h5 files, but what protocol shall I use to generate the data?

        Once you import the SMRTcell data into SMRTportal, you can start with the basic filtering (RS_subread) protocol. This will filter out the adapters and give you all subreads. Depending on how the run was done (length of movie etc.) and the size of the inserts, it may be worth running the "RS_reads_of_insert" protocol next. This gives you the consensus reads (formerly called CCS). If you have an adequate amount of coverage (20x or more, with a good median read length of 3-4 kb or more) and this is a small genome (e.g. bacteria), then you can try the RS_HGAP assembly next.

        Watch this tutorial first for an overview of the process: http://aa314.gondor.co/webinar/secondary-analysis/
        Last edited by GenoMax; 08-30-2014, 11:21 AM.



        • #5
          Thanks a lot for the detailed answer.
          I have a large plant genome, not a bacterial one.
          Is there a special way to measure the coverage from the PacBio subreads?

          I have now imported all the cells and am running the RS_subread protocol, but the analysis directory also contains files like these:
          ..._s1_p0.2.subreads.fasta
          ..._s1_p0.1.subreads.fasta
          ..._s1_p0.3.subreads.fasta


          Does the existence of these subreads.fasta files mean that the company already ran this protocol for me?
          Last edited by Medhat; 01-15-2015, 01:38 AM.



          • #6
            Those files are part of the primary data produced on the instrument itself.

            Has the subread filtering finished for each SMRTcell? It should be reasonably quick. The analyzed data should end up under the /path_to/smrtanalysis/common/jobs/0xx/0xxxxx/data directory, where 0xxxxx is the job ID generated by SMRTportal. You can get at the fastq/fasta files from the web interface without having to dig through that directory hierarchy. They are generally called "filtered_subreads.fasta/q" for each "job_id". A good graphical overview report is produced for each job.

            How many cells do you have? With a plant genome your coverage is not going to be great unless you ran a lot of SMRTcells.

            In general you want the dimer/adapter % to be very small. Your mean read length should be long, and the productivity should be dominated by P1: look for the P1 % to be as high as possible compared with P0 (wells generating no sequence) or P2 (wells generating more than one sequence).

            You must have received some guidance on the quality of the data from the provider.
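The headline numbers in the filtering report (number of reads, number of bases, mean read length, N50) can also be recomputed from a filtered_subreads.fasta file as a cross-check. The following is a minimal Python sketch, not part of SMRTanalysis. The function names are illustrative, and N50 is taken here as the read length at which reads of that length or longer hold at least half of all bases.

```python
def read_lengths(lines):
    """Yield the sequence length of every record in FASTA-formatted lines.

    `lines` can be an open file handle or any iterable of strings.
    """
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the length L such that reads >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return 0  # no reads at all


def summarize(lengths):
    """Collect the same kind of summary metrics the filtering report shows."""
    lengths = list(lengths)
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }
```

A typical call would be `with open("filtered_subreads.fasta") as handle: print(summarize(read_lengths(handle)))`, where the file name is whatever you downloaded from SMRTportal. This only covers length statistics; productivity (P0/P1/P2) and adapter dimer percentages come from the report itself.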
            Last edited by GenoMax; 09-01-2014, 04:23 AM.



            • #7
              I have 18 cells.
              These are the results of RS_subread, using default parameters:

              Job Metric                  Value
              Adapter Dimers (0-10bp)     0.01%
              Short Inserts (11-100bp)    0.01%
              Number of Bases             7,604,261,565
              Number of Reads             1,218,435
              N50 Read Length             10,405
              Mean Read Length            6,241
              Mean Read Score             0.83

              The generated fasta and fastq files are also available in the data section of SMRT Portal. Should I download them to begin the next protocol, RS_reads_of_insert?



              • #8
                Those numbers look good.

                RS_reads_of_insert should be run with the original SMRTcell data. Since you have fairly long reads, the number of CCS passes you get may not be very high (with a long insert, the polymerase may only go around it a few times during the run).

                The fasta/fastq you generated in the filter step can be set aside for now.

                What do you ultimately want to do with this data? The answer to what to do next partly depends on that.



                • #9
                  OK, I have now started RS_reads_ofinsert_1.
                  The goal is an assembly that will also include Illumina reads.
                  After that I will begin comparative genomics with other isolates in the genome database.

                  Another question: how can you tell from the RS_subread report whether the reads are good or bad? Also, how is the Mean Read Score of 0.83 measured, and what are its upper and lower limits, i.e. what counts as good or bad?
                  For the assembly I also need to calculate the coverage. How can I do that with PacBio reads?

                  Thank you very much. I highly appreciate your time and help.
                  Last edited by Medhat; 06-03-2016, 10:54 AM.



                  • #10
                    Do you have an idea of the genome size of the plant you are working with? You can calculate the theoretical coverage from the number of bases in your dataset (7.6 Gb). How was this library prepared? Do you expect the representation to be random?

                    You would want to use the RS_CeleraAssembler protocol to include your Illumina reads with the PacBio data.

                    Dr. Hall from PacBio participates on this forum and perhaps he can chime in with additional suggestions.



                    • #11
                      The genome is about 2.4 Gb.

                      Meanwhile, here is the result of RS_reads_ofinsert_1:

                      Job Metric                    Value
                      Read Bases of Insert          688,951,419
                      Mean Read Length of Insert    2,289.0
                      Read Quality of Insert        97.18%
                      Mean Number of Passes         6.0
                      Last edited by Medhat; 09-03-2014, 04:54 AM.



                      • #12
                        How much Illumina data do you have?

                        You only have about 3x coverage with the total PacBio data, and with "reads_of_insert" that number falls well below 1x (roughly 0.3x). That is not unexpected, since you would need to run a ton of SMRTcells to get really deep coverage.

                        The CCS reads should be of good quality and if you blast a few of them at NCBI you should see good hits (just to confirm that you have the right sequence/plant) to either your plant or a close relative.
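To spot-check a handful of consensus reads with BLAST as suggested above, you first need to pull a few records out of the reads-of-insert FASTA. The following Python sketch is one simple way to do that. The function names are illustrative, and it loads all records into memory, which is an assumption that may not suit very large files.

```python
import random


def parse_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_reads(lines, count, seed=0):
    """Pick up to `count` random records; a fixed seed keeps the pick repeatable."""
    records = list(parse_fasta(lines))  # holds every record in memory
    rng = random.Random(seed)
    return rng.sample(records, min(count, len(records)))


def to_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "\n".join(f">{header}\n{sequence}" for header, sequence in records)
```

For example, `with open("reads_of_insert.fasta") as handle: print(to_fasta(sample_reads(handle, 5)))` would print five reads that can be pasted into the NCBI BLAST web form. The file name here is a placeholder for whatever your job produced.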



                        • #13
                          OK, that is great, but here are the numbers again.

                          RS_subread:
                          Number of Bases    7,604,261,565
                          Number of Reads    1,218,435

                          RS_reads_ofinsert_1:
                          Read Bases of Insert          688,951,419
                          Mean Read Length of Insert    2,289.0

                          How can the coverage be 3x for the first and only 1x for the second?



                          • #14
                            Isn't the genome you are working with ~2,400,000,000 bp? Take that into consideration with the total number of bases that you are getting from your stats (this is a x/y kind of calculation and the distribution of those bases is not going to be uniform across the genome).
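The x/y calculation mentioned above can be written out as a short Python sketch using the numbers posted in this thread. The genome size of 2.4 Gb is the approximate figure given earlier, and the result is only a theoretical average, since the bases will not be spread uniformly across the genome.

```python
def coverage(total_bases, genome_size):
    """Theoretical average coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE = 2_400_000_000    # ~2.4 Gb, approximate value from the thread
SUBREAD_BASES = 7_604_261_565  # RS_subread "Number of Bases"
ROI_BASES = 688_951_419        # reads of insert "Read Bases of Insert"

subread_cov = coverage(SUBREAD_BASES, GENOME_SIZE)
roi_cov = coverage(ROI_BASES, GENOME_SIZE)

print(f"subreads:        {subread_cov:.2f}x")
print(f"reads of insert: {roi_cov:.2f}x")
```

The subreads come out at roughly 3.2x and the reads of insert at roughly 0.3x. The two differ because reads of insert collapse multiple passes over the same molecule into one consensus read, so far fewer bases remain.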



                            • #15
                              Just catching up with the thread. For a hybrid (Illumina + PacBio) assembly, ~3x is not enough coverage. I would suggest assembling your Illumina data and using the PacBio CLR (Continuous Long Reads, not CCS) for gap filling with PBJelly.
                              http://www.plosone.org/article/info%...l.pone.0047768
                              https://github.com/PacificBioscience...Bio-Long-Reads
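To put "not enough coverage" into numbers, the following rough Python sketch estimates how many SMRT cells a given target coverage would take. It assumes every additional cell yields the same average number of subread bases as the 18 cells already run, which may not hold in practice. The 20x target is only the figure quoted earlier in the thread for HGAP on small genomes and is used here purely as an illustration.

```python
import math


def cells_needed(target_coverage, genome_size, bases_per_cell):
    """Number of SMRT cells needed to reach a target average coverage."""
    if bases_per_cell <= 0:
        raise ValueError("bases_per_cell must be positive")
    total_bases_needed = target_coverage * genome_size
    return math.ceil(total_bases_needed / bases_per_cell)


GENOME_SIZE = 2_400_000_000   # ~2.4 Gb, approximate value from the thread
BASES_SO_FAR = 7_604_261_565  # subread bases from the 18 cells
CELLS_SO_FAR = 18
TARGET_COVERAGE = 20          # illustrative target, not a recommendation

# Average yield per cell, assumed to stay constant for future cells.
bases_per_cell = BASES_SO_FAR / CELLS_SO_FAR

total_cells = cells_needed(TARGET_COVERAGE, GENOME_SIZE, bases_per_cell)
extra_cells = max(0, total_cells - CELLS_SO_FAR)

print(f"average yield per cell: {bases_per_cell / 1e6:.0f} Mb")
print(f"cells for {TARGET_COVERAGE}x: {total_cells} ({extra_cells} more than already run)")
```

Under these assumptions the estimate is 114 cells in total, i.e. 96 more than the 18 already sequenced, which is why gap filling an Illumina assembly is the more practical route here.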

