**rhall** · 04-22-2015, 09:22 AM

Substitutions are very rare in PacBio data; the predominant errors are indels. So long as the minor variants are SNPs, it should be straightforward to detect heteroduplexes from the CCS QV scores.
I believe this has been tested before. I'll see if I can dig up any data on the magnitude of the effect on QV. It's not completely straightforward, given that the CCS QVs are not perfectly calibrated.

**rhall** · 04-22-2015, 02:49 PM

Having looked at some data that was generated from known heteroduplexes, the only reliable way to detect mismatches using the QV is by focusing specifically on the Substitution QV, not the base QV that is calculated for the fastq file. To get access to this you will have to work from the ccs.h5 files which include all the QV values for every base.
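
As a rough illustration of flagging on the Substitution QV, here is a minimal sketch. It assumes the per-base Substitution QVs for one CCS read have already been read out of the ccs.h5 file into a plain list of integers (the extraction itself is not shown). The function name and the threshold of 20 are placeholders for illustration, not calibrated values.

```python
def flag_low_substitution_qv(sub_qvs, threshold=20):
    """Return 0-based positions whose Substitution QV is below `threshold`.

    `sub_qvs` is assumed to be a list of per-base Substitution QVs for a
    single CCS read; `threshold` is an arbitrary, uncalibrated cutoff.
    """
    return [i for i, qv in enumerate(sub_qvs) if qv < threshold]


# Made-up QV values, purely to show the call.
example_qvs = [45, 47, 12, 46, 44, 9, 48]
print(flag_low_substitution_qv(example_qvs))
```

Since the CCS QVs are not perfectly calibrated, a sensible cutoff would probably have to be chosen empirically, for example from reads of a known heteroduplex control.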

**verheytb** · 04-23-2015, 09:03 AM

Originally posted by rhall

Having looked at some data that was generated from known heteroduplexes, the only reliable way to detect mismatches using the QV is by focusing specifically on the Substitution QV, not the base QV that is calculated for the fastq file. To get access to this you will have to work from the ccs.h5 files which include all the QV values for every base.

Thanks, that's really helpful! There will also be the possibility of indel heteroduplexes. Since I'm filtering my data for 10 pass reads and higher, can I rely on the indel QVs at all?

**rhall** · 04-23-2015, 10:20 AM

Unfortunately, the indel QV at 10 passes will likely generate a lot of false positives. In particular, you will see lots of low QVs for indels in homopolymer regions.
One option for very high pass CCS reads would be to flag positions based on QV, then calculate the consensus for the forward and reverse strands independently and compare the results.
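
A minimal sketch of the comparison step, under some loud assumptions: the two strand-specific consensus sequences already exist, both are written in the same orientation, and both have the same length with shared coordinates (for example because they were called against the same reference). The function name is made up for illustration.

```python
def candidate_heteroduplex_sites(fwd_consensus, rev_consensus, flagged_positions):
    """Report flagged positions where the two strand consensuses disagree.

    Assumes both sequences share one coordinate system (same orientation,
    same length, no gaps). Returns a list of (position, fwd_base, rev_base).
    """
    if len(fwd_consensus) != len(rev_consensus):
        raise ValueError("consensus sequences must share coordinates")
    sites = []
    for pos in sorted(set(flagged_positions)):
        if 0 <= pos < len(fwd_consensus):
            fwd_base, rev_base = fwd_consensus[pos], rev_consensus[pos]
            if fwd_base != rev_base:
                sites.append((pos, fwd_base, rev_base))
    return sites


# Toy sequences: positions 2 and 5 were flagged, only position 5 differs.
print(candidate_heteroduplex_sites("ACGTACGT", "ACGTATGT", [2, 5]))
```

Restricting the comparison to flagged positions keeps random consensus errors elsewhere in the read from being reported as heteroduplex sites.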

**verheytb** · 04-26-2015, 02:16 PM

I like the idea of generating strand-specific CCS reads.

Is there an easy way to get the forward and the reverse subreads? I see in the bas.h5 reference guide that each subread has information for each pass, including direction, but pbcore.io doesn't seem to have any documented way to access it.

Also, what is the best way to generate the circular consensus from the set of subreads from a particular strand? I can't find documentation on how the P_ReadsOfInsert does it.

**rhall** · 04-28-2015, 07:41 AM

Unfortunately, it's easier said than done.
I don't see any way to generate a quality-aware consensus for the forward and reverse strands using either the CCS consensus or the Quiver code.
One method would be to extract the forward and reverse sequences from a filtered_subreads.fasta file generated using standard filtering, align them against a common reference using blasr (the alignment does not have to be high quality; you could simply use the first subread as a reference), then call consensus using pbdagcon. The problem is that pbdagcon was really developed for speed and does not use the rich quality values that are used in CCS and Quiver consensus generation. I'm therefore not sure whether the differences will be detectable above noise.
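
One possible way to separate the subreads before alignment, sketched under assumptions: subread names follow the usual `<movie>/<holeNumber>/<start>_<end>` pattern, and consecutive passes from one ZMW alternate between the two strands. The alternation only holds if no adapter was missed, and the parity alone does not say which group is the forward strand; that would still have to come from the alignment. The function name is hypothetical.

```python
from collections import defaultdict


def split_subreads_by_pass_parity(subread_names):
    """Group subread names per ZMW into alternating passes.

    Assumes names look like '<movie>/<hole>/<start>_<end>' (anything after
    the first whitespace is ignored) and that consecutive passes alternate
    strand, which breaks if an adapter was missed.
    """
    by_zmw = defaultdict(list)
    for name in subread_names:
        movie, hole, coords = name.split()[0].split("/")[:3]
        start, end = (int(x) for x in coords.split("_"))
        by_zmw[(movie, int(hole))].append((start, end, name))

    groups = {}
    for zmw, subreads in by_zmw.items():
        subreads.sort()  # order passes by start coordinate within the ZMW
        names = [n for _, _, n in subreads]
        groups[zmw] = {"even": names[0::2], "odd": names[1::2]}
    return groups


# Made-up subread names for two ZMWs.
names = ["m1/7/900_1700", "m1/7/0_850", "m1/7/1750_2600", "m1/8/0_500"]
print(split_subreads_by_pass_parity(names))
```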

**rhall** · 04-30-2015, 01:22 PM

OK, so it is possible to generate a high-quality strand-specific consensus, but this behavior seems to be broken after SMRT Analysis 2.3.0 patch2. I'm not sure exactly when it was broken, or how this relates to the GitHub versions of the tools. Assuming SMRT Analysis 2.3.0, align all the subreads to a reference using a standard pipeline, then use the resulting cmp.h5:

Code:

cmph5tools.py select --where "(Movie=='<movieName>') & (HoleNumber==<ZMW>) & (Strand==0)" --outFile <ZMW>_0.cmp.h5 aligned_reads.cmp.h5
cmph5tools.py sort <ZMW>_0.cmp.h5
quiver --referenceFilename <reference fasta> -o <output gff and/or fasta> <ZMW>_0.cmp.h5
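
To run the same three steps for both strands of one ZMW, the commands could be generated with a small helper like the sketch below. It only builds argument lists from the flags shown above and does not execute anything. The use of `Strand==1` for the opposite strand and the `<ZMW>_<strand>.fasta` output names are assumptions, and the function name is made up.

```python
def strand_consensus_commands(movie, zmw, reference_fasta,
                              cmp_h5="aligned_reads.cmp.h5"):
    """Build select/sort/quiver argument lists for both strands of one ZMW.

    Only flags from the commands above are used. Strand==1 and the output
    file naming are assumptions; nothing is executed here.
    """
    commands = []
    for strand in (0, 1):
        subset = f"{zmw}_{strand}.cmp.h5"
        where = f"(Movie=='{movie}') & (HoleNumber=={zmw}) & (Strand=={strand})"
        commands.append(["cmph5tools.py", "select", "--where", where,
                         "--outFile", subset, cmp_h5])
        commands.append(["cmph5tools.py", "sort", subset])
        commands.append(["quiver", "--referenceFilename", reference_fasta,
                         "-o", f"{zmw}_{strand}.fasta", subset])
    return commands


# Hypothetical movie name, hole number and reference path.
for argv in strand_consensus_commands("m1", 42, "ref.fasta"):
    print(argv)
```

Each list can be passed as-is to `subprocess.run`, which avoids shell quoting problems with the `--where` expression.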

**verheytb** · 04-30-2015, 04:04 PM

I will give that a try! Thanks very much.

Identifying heteroduplexes in PacBio CCS reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News