Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I calculate read score?

    The SMRT pipeline can use "read score" to filter out reads. "Read Score" seems to be defined as the percentage of bases that is expected to be correct in a read. It is a number between 0 to 1.

    But how do I calculate this metric? Can I do it from the filtered fastq file? Or do I need to read the bax.h5 files?

  • #2
    I don't really use the SMRT pipeline, but you can use reformat to filter the fastq files based on the expected error rate (phred-scaled). For example:

    reformat.sh in=subreads.fq out=filtered.fq maq=20


    ...will give you only the reads with an expected error rate of at most 1%.

    Comment


    • #3
      You can calculate a read score from the base QVs in the fastq, but it will not be exactly the same as the "Read Score" defined in the bax.h5.
      I would be careful using either the Read Score, or the QVs in the fastq for PacBio data, they are not well calibrated. Using either a 0.75, or 0.8 Read Score filters out a lot of junk reads, but I would not have a lot of confidence in it for distinguishing a 0.8 from a 0.82 read.

      Comment


      • #4
        I mostly agree, in terms of raw data, that the QVs are not particularly useful. But they are quite useful in consensus (CCS, reads of insert, or error-corrected) data.

        Comment


        • #5
          Originally posted by rhall View Post
          You can calculate a read score from the base QVs in the fastq, but it will not be exactly the same as the "Read Score" defined in the bax.h5.
          I would be careful using either the Read Score, or the QVs in the fastq for PacBio data, they are not well calibrated. Using either a 0.75, or 0.8 Read Score filters out a lot of junk reads, but I would not have a lot of confidence in it for distinguishing a 0.8 from a 0.82 read.
          Thanks for your reply. I once tried to increase the filter from 0.8 to 0.83 for a 80x E coli dataset but it made no differences in the assembly.

          I wanted to learn more about read score because I want to tweak parameters in my HGAP pipeline to improve the assembly but so far I tried changing minReadScore, xCoverage and minLongReadLength but nothing changed. The default filter filters out reads shorter than 500bp. Do you think increasing the cutoff can improve assembly?

          Comment


          • #6
            From a lot of experience with HGAP and bacterial assembly, the one thing I can say is that it is extremely robust. Changing parameter will not have much of an affect on the quality of an assembly. Time is much better spent diagnosing why the assembly did not give the expected results with the default parameters.
            The first things to check are the PreAssembled read N50, Preassembled coverage (# Preassembled bases / genome size) and the preassembled yield. An N50 < 5kb will result in problems for many bacterial assemblies as you will not be able to span common repeats. A yield of <0.70 indicates that something is going wrong with the PreAssembly. The PreAssembled coverage should be >15x.

            Comment


            • #7
              Originally posted by rhall View Post
              From a lot of experience with HGAP and bacterial assembly, the one thing I can say is that it is extremely robust. Changing parameter will not have much of an affect on the quality of an assembly. Time is much better spent diagnosing why the assembly did not give the expected results with the default parameters.
              The first things to check are the PreAssembled read N50, Preassembled coverage (# Preassembled bases / genome size) and the preassembled yield. An N50 < 5kb will result in problems for many bacterial assemblies as you will not be able to span common repeats. A yield of <0.70 indicates that something is going wrong with the PreAssembly. The PreAssembled coverage should be >15x.
              Thanks a lot for your reply. My Pre-assembler report for P6C4 is:

              Polymerase Read Bases 779850648 Length Cutoff 21586
              Seed Bases 138056215 Pre-Assembled bases 44327787
              Pre-Assembled Yield 0.32108505220138045 Pre-Assembled Reads 4630
              Pre-Assembled Reads Length 9574 Pre-Assembled N50 16443

              Apparently, yield is 0.32<0.7 and Preassembled coverage is 9.63x<15x. How do I fix this? I basically use the example HGAP3 params.xml included in the package but I changed the genomeSize to 4600000.

              Comment


              • #8
                The P5C3 stat looks much better and satisfy all the criteria you mentioned:

                Polymerase Read Bases 373874428 Length Cutoff 13715
                Seed Bases 139256931 Pre-Assembled bases 113989293
                Pre-Assembled Yield 0.8185538212098039 Pre-Assembled Reads 8887
                Pre-Assembled Reads Length 12826 Pre-Assembled N50 15326

                I think improving Pre-Assembled bases can help my P6C4 but how can I do that?

                Comment


                • #9
                  Something looks amiss with the P6C4 assembly / data, how is the loading for the cell, % P0, P1 and P2? Can you post the subread histogram. Can you provide any more information on how the libraries were prepped?

                  Comment


                  • #10


                    I was using the data I downloaded from pacbio. It is a 20kb size selected library. Does that mean I should filter out reads significantly longer than 20kb, e.g. filter out >30kb reads?

                    Comment


                    • #11
                      Originally posted by ymc View Post
                      https://github.com/PacificBioscience...ary-with-P6-C4

                      I was using the data I downloaded from pacbio. It is a 20kb size selected library. Does that mean I should filter out reads significantly longer than 20kb, e.g. filter out >30kb reads?
                      No. Those are the most valuable reads. Size-selection is not perfect, and the goal in this case is mainly to aim for 20kbp+, anyway - not to prevent longer reads from being sequenced.

                      Comment


                      • #12
                        I just looked at the data with the latest version of SMRT Analysis 2.3.0 - patch 4 and it looks like the subreadlength distribution is such that the automatic calculation of the seed read length is being too aggressive. The resulting yield of preassembled reads is a little low to get a good assembly. If you un-check "Compute Minimum Seed Read Length" and set the seed length to 15000 the result is a single circular contig with overlapping ends that can be circularized.
                        Note, although the ecoli is K12 it is not base for base the same as the published reference, some of the variations have been validated. You may want to check any differences between the assembly and the reference, in a lot of cases the assembly is correct it is just that the strain being sequences has diverged a little from the reference.

                        Comment


                        • #13
                          Originally posted by rhall View Post
                          I just looked at the data with the latest version of SMRT Analysis 2.3.0 - patch 4 and it looks like the subreadlength distribution is such that the automatic calculation of the seed read length is being too aggressive. The resulting yield of preassembled reads is a little low to get a good assembly. If you un-check "Compute Minimum Seed Read Length" and set the seed length to 15000 the result is a single circular contig with overlapping ends that can be circularized.
                          Note, although the ecoli is K12 it is not base for base the same as the published reference, some of the variations have been validated. You may want to check any differences between the assembly and the reference, in a lot of cases the assembly is correct it is just that the strain being sequences has diverged a little from the reference.
                          Thank you for your reply. I will give that a try.

                          I suppose "Pre-assembled Yield" is not that important but rather "Preassembled Bases" should be >25x of genomeSize, correct?

                          Do you know for the quiver step in the HGAP pipeline, would the reads I filtered out at the P_Filter step be used for the quiver step? When I run quiver after PBcR, I always use all the reads to polish even if the input fastq to PBcR is what's left after filtering. It would be nice to know HGAP uses all the reads to correct but not just filtered reads.

                          Comment


                          • #14
                            I found that I can improve preassembled yield of P6C4 just by changing targetChunks to 1 and splitBestn to the same value as totalBestn which is 24.

                            Polymerase Read Bases 779850648 Length Cutoff 21553
                            Seed Bases 139264087 Pre-Assembled bases 50571157
                            Pre-Assembled Yield 0.36313135776346994 Pre-Assembled Reads 6001
                            Pre-Assembled Reads Length 8427 Pre-Assembled N50 15684

                            But it ended up with broken contigs
                            unitig_0|quiver 3838171
                            unitig_1|quiver 828945
                            unitig_2|quiver 33788

                            Why was that? What does it do?

                            Comment


                            • #15
                              The alignment and quiver correction will use the post filter reads. P_Filter only happens once, and all downstream analysis use these results.
                              The targetChunks will have a small effect on the results, but not in any significant way. It relates to the splitting up of the first alignment of all reads to the seed reads, in order to optimize computation. The defaults should be reasonable for bacterial assemblies, if you increase the size of genome you are assembling the number of targetChunks will have to be increased. The splitBestn is the number of alignments allowed for each read within the alignment chunk, the totalBestn is the total across all chunks.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X