Seqanswers Leaderboard Ad

**rhall** · 06-05-2013, 07:29 AM

pbh5tools has the functinality you need:

https://github.com/PacificBiosciences/pbh5tools

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

Build software better, together

https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

Within SMRT portal a way to achieve the same filter, if you have a fixed amplicon size, is to limit the read length to 5x the expected insert size.

**metheuse** · 06-05-2013, 08:04 AM

Originally posted by rhall View Post

pbh5tools has the functinality you need:

Build software better, together

https://github.com/PacificBiosciences/pbh5tools

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

Build software better, together

https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

Within SMRT portal a way to achieve the same filter, if you have a fixed amplicon size, is to limit the read length to 5x the expected insert size.

Thanks a lot! This is what I was looking for!
Just several more simple questions:

For a CCS pacbio run, what does this file "*_s*_p0.bas.h5" contain? I know "*_s*_p0.fasta" has the long reads, and "*_s*_p0.ccs.fasta" has the subreads.

If I use pbh5tools and I want to output NumberOfPasses, the input should be "*_s*_p9.bas.h5", so the read type of it should be "CCS", right? I'm just confused by how this "bas.h5" can tell how many subreads are from a long read? Does it have information from both long read and subread?

Sorry I'm fresh new for pacbio, so please forgive my stupid questions.

**rhall** · 06-05-2013, 08:22 AM

The "*_s*_p0.fasta" and "*_s*_p0.ccs.fasta" files are generated from the "*_s*_p0.bas.h5" file.

The standard file format for PacBio data is the bas.h5 (bax.h5 for RSII), it includes a much richer set of data than both the fasta files. It contains all the sequencing information, the entire read from each ZMW, kinetic information for methylation analysis and the rich QVs needed for some PacBio specific pipelines (Quiver consensus calling, HGAP assembly).
The pdf here:

Build software better, together

https://github.com/PacificBiosciences/SMRT-Analysis/wiki/Official-Documentation#file-formats---release-v20

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

has the full spec of the bas.h5 file.

More useful info on PacBio basics:

Build software better, together

https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Bioinformatics-Workshop

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

In particular the Primary analysis one goes into bas.h5 details:

gondor.co - gondor Resources and Information.

http://aa314.gondor.co/webinar/primary-analysis/

gondor.co is your first and best source for all of the information you’re looking for. From general topics to more of what you would expect to find here, gondor.co has it all. We hope you find what you are searching for!

**metheuse** · 06-05-2013, 10:48 AM

Originally posted by rhall View Post

The "*_s*_p0.fasta" and "*_s*_p0.ccs.fasta" files are generated from the "*_s*_p0.bas.h5" file.

The standard file format for PacBio data is the bas.h5 (bax.h5 for RSII), it includes a much richer set of data than both the fasta files. It contains all the sequencing information, the entire read from each ZMW, kinetic information for methylation analysis and the rich QVs needed for some PacBio specific pipelines (Quiver consensus calling, HGAP assembly).
The pdf here:

Build software better, together

https://github.com/PacificBiosciences/SMRT-Analysis/wiki/Official-Documentation#file-formats---release-v20

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

has the full spec of the bas.h5 file.

More useful info on PacBio basics:

Build software better, together

https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Bioinformatics-Workshop

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

In particular the Primary analysis one goes into bas.h5 details:
http://aa314.gondor.co/webinar/primary-analysis/

Thanks a lot!
I asked those questions because I was kind of confused when seeing this parameter in pbh5tools:

Code:

  --readType READTYPE   read type (CCS or Raw) [Raw]

My data is CCS, so this parameter should be "CCS". Does "Raw" actually mean other types of data?

**rhall** · 06-05-2013, 11:06 AM

I see the confusion, there is no difference between a CCS run and a non CCS run. The only sequencing variables are movie time and chemistry. The bas.h5 includes the raw read sequence and the CCS sequence (if more then 3 passes have been made of the inserted DNA between the adapters).

**metheuse** · 06-05-2013, 11:17 AM

Originally posted by rhall View Post

I see the confusion, there is no difference between a CCS run and a non CCS run. The only sequencing variables are movie time and chemistry. The bas.h5 includes the raw read sequence and the CCS sequence (if more then 3 passes have been made of the inserted DNA between the adapters).

Thanks so much!

**zacf** · 06-05-2013, 11:22 AM

Hi metheuse,

By any chance, are you in Ann Arbor? I am also new to PacBio and am trying to do the exact same thing, funny this SeqAnswers thread exists just yesterday. I'm a post-doc at UM, so if you are local and want to work together at this shoot me an email at [email protected].

**metheuse** · 08-29-2013, 12:00 PM

Hi, I have further questions on this thread.
The default smrt analysis pipeline keeps the long reads that have at least 3 passes of CCS. For my project, I want to get >=6 passes.
I know I can filter them by the polymerase read length (eg. >= 6xCCS + 7xadpater). But the length of each amplicon differs, so I want to filter each long read based on its own CCS length.
I know bash5tools.py can filter the reads by number of passes, but it only outputs fasta, fastq, or data format result. After filtering, I want to use blasr to map the reads. Blasr has the option to map CCS first, and then map the subreads to the window that CCS maps, so I think I'd better use bas.h5 instead of fastq as the input.
So the question is, how to filter CCS reads by minimum number of passes and output the filtered data as bas.h5?
Thanks.

**rhall** · 08-29-2013, 12:24 PM

Interesting use case, unfortunately not something that is straightforward. I can think of an around about way:
Generate a list of the read ids that have >=6 passes using bash5tools.py:
https://github.com/PacificBioscience.../doc/index.rst
Parse into whitelist format:
https://github.com/PacificBioscience...filtering-step
Then run filtering with the whitelist to produce a region file:

Code:

filterPlsH5 input.fofn –filter=’ReadWhitelist=<file containing list of included reads>’

This will produce a reads.rgn.h5 and filtered_regions.fofn file. You can then run blasr:

Code:

blasr input.fofn reference.fasta –regionTable filtered_regions.fofn ...

Hope that helps
Richard.

**metheuse** · 08-29-2013, 01:02 PM

Thanks a lot, Richard! This is exactly what I need!
One question: should I use bas.h5 or pls.h5 for the input to filterPls.h5? My understanding is that alignment only involves bases, right?
I just tried using both pls.h5 and bas.h5 for filterpls.h5. They both gave me the rgn.h5 and .fofn outputs. But they both said this:
No handlers could be found for logger "pbpy.io.MovieHDF5IO"
Can this be a problem?
Thanks!

**metheuse** · 08-29-2013, 01:07 PM

By the way, how do you know all of this? I searched online really hard but couldn't find much helpful information. The manuals of most smrtanalysis related programs don't have many details.

**rhall** · 08-29-2013, 01:18 PM

I work for PacBio. We don't always do a great job of documenting non standard workflows, we are attempting to address this with the devnet / github content.
filterplsH5.py is so named for historic reasons and works perfectly well with bas.h5 files. the error you are seeing should not be a problem, it is due to logging and not the actual filtering.

**metheuse** · 08-29-2013, 04:12 PM

I see. Thanks a lot for your help, Richard.
For blasr, I should let it output sam format, then convert it to h5 for quiver, right?

My project is to find low freq variants by PacBio's CCS method. So the pipeline is:
filter certain numbers of passes by bash5tools -> mapping by blasr -> samtoh5 -> variant calling by quiver

Do you have any resource that I can read for parameter settings in each step? Especially, in blasr and quiver, there are a lot parameters to control the mapping quality and stringency, variant quality, etc.

For example,

1. the "-best" parameter that controls how many top alignments to report, the default is 10, but I'd better set it to 1, right?

2. For quiver, I'm reading the FAQ at: https://github.com/PacificBioscience.../QuiverFAQ.rst
This is a helpful page and it gives some recommendation for some parameter settings. But if you know more than it, please let me know, especially for the following parameters:

--coverage COVERAGE, -X COVERAGE
A designation of the maximum coverage level to be used
for analysis. Exact interpretation is algorithm-
specific.

--mapQvThreshold MAPQVTHRESHOLD, -m MAPQVTHRESHOLD

--variantConfidenceThreshold VARIANTCONFIDENCETHRESHOLD, -q VARIANTCONFIDENCETHRESHOLD

--variantCoverageThreshold VARIANTCOVERAGETHRESHOLD, -x VARIANTCOVERAGETHRESHOLD

--referenceChunkOverlap REFERENCECHUNKOVERLAP

--noEvidenceConsensusCall {nocall,reference,lowercasereference}
The consensus base that will be output for sites with
no effective coverage.
etc.

Sorry that I have too many to ask. This is my first time to use quiver.
Thanks so much.

**rhall** · 08-29-2013, 04:51 PM

blasr -> quiver is not as simple as converting to cmp.h5 from sam, the cmp.h5 for use in quiver has to have a lot of extra information. blasr has always been considered an underlying program, in standard PacBio pipelines compareSequences.py is used as a wrapper for blasr. A new program pbalign: https://github.com/PacificBioscience.../doc/howto.rst
Takes all the headaches out of aligning data for use with quiver. maxHits, (similar to bestn, but blasr has two parameters that interplay nCandidates & bestn) should be set to 1.
For quiver I would just stick to defaults as a first pass:
variantCaller.py --algorithm=quiver <cmp.h5> -r <reference> -o <gff> -o <consensus.fasta> -o <consensus.fastq>

One question, why this complex workflow, are you mapping to something that is highly repetitive? blasr and quiver are designed to work with normal read data. Also quiver will only call haploid variants.

Topics	Statistics	Last Post
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 16 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 20 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, 04-29-2024, 10:49 AM	0 responses 25 views 0 likes	Last Post by seqadmin 04-29-2024, 10:49 AM
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM

Seqanswers Leaderboard Ad

Announcement

How to screen sequences for a certain fold coverage?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News