The default smrt pipeline only outputs CCS reads that are >=3x coverage. But I would like to require that CCS reads have >=5X coverage. I'm wondering if anyone know how to modify the SMRT portal to achieve that goal (or if there is other software/script for this purpose). Thanks.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
pbh5tools has the functinality you need:
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
Within SMRT portal a way to achieve the same filter, if you have a fixed amplicon size, is to limit the read length to 5x the expected insert size.
-
Originally posted by rhall View Postpbh5tools has the functinality you need:
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
Within SMRT portal a way to achieve the same filter, if you have a fixed amplicon size, is to limit the read length to 5x the expected insert size.
Just several more simple questions:
For a CCS pacbio run, what does this file "*_s*_p0.bas.h5" contain? I know "*_s*_p0.fasta" has the long reads, and "*_s*_p0.ccs.fasta" has the subreads.
If I use pbh5tools and I want to output NumberOfPasses, the input should be "*_s*_p9.bas.h5", so the read type of it should be "CCS", right? I'm just confused by how this "bas.h5" can tell how many subreads are from a long read? Does it have information from both long read and subread?
Sorry I'm fresh new for pacbio, so please forgive my stupid questions.
Comment
-
The "*_s*_p0.fasta" and "*_s*_p0.ccs.fasta" files are generated from the "*_s*_p0.bas.h5" file.
The standard file format for PacBio data is the bas.h5 (bax.h5 for RSII), it includes a much richer set of data than both the fasta files. It contains all the sequencing information, the entire read from each ZMW, kinetic information for methylation analysis and the rich QVs needed for some PacBio specific pipelines (Quiver consensus calling, HGAP assembly).
The pdf here:
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
has the full spec of the bas.h5 file.
More useful info on PacBio basics:
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
In particular the Primary analysis one goes into bas.h5 details:
Comment
-
Originally posted by rhall View PostThe "*_s*_p0.fasta" and "*_s*_p0.ccs.fasta" files are generated from the "*_s*_p0.bas.h5" file.
The standard file format for PacBio data is the bas.h5 (bax.h5 for RSII), it includes a much richer set of data than both the fasta files. It contains all the sequencing information, the entire read from each ZMW, kinetic information for methylation analysis and the rich QVs needed for some PacBio specific pipelines (Quiver consensus calling, HGAP assembly).
The pdf here:
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
has the full spec of the bas.h5 file.
More useful info on PacBio basics:
GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.
In particular the Primary analysis one goes into bas.h5 details:
http://aa314.gondor.co/webinar/primary-analysis/
Thanks a lot!
I asked those questions because I was kind of confused when seeing this parameter in pbh5tools:
Code:--readType READTYPE read type (CCS or Raw) [Raw]
Comment
-
I see the confusion, there is no difference between a CCS run and a non CCS run. The only sequencing variables are movie time and chemistry. The bas.h5 includes the raw read sequence and the CCS sequence (if more then 3 passes have been made of the inserted DNA between the adapters).
Comment
-
Originally posted by rhall View PostI see the confusion, there is no difference between a CCS run and a non CCS run. The only sequencing variables are movie time and chemistry. The bas.h5 includes the raw read sequence and the CCS sequence (if more then 3 passes have been made of the inserted DNA between the adapters).
Comment
-
Hi metheuse,
By any chance, are you in Ann Arbor? I am also new to PacBio and am trying to do the exact same thing, funny this SeqAnswers thread exists just yesterday. I'm a post-doc at UM, so if you are local and want to work together at this shoot me an email at [email protected].
Comment
-
Hi, I have further questions on this thread.
The default smrt analysis pipeline keeps the long reads that have at least 3 passes of CCS. For my project, I want to get >=6 passes.
I know I can filter them by the polymerase read length (eg. >= 6xCCS + 7xadpater). But the length of each amplicon differs, so I want to filter each long read based on its own CCS length.
I know bash5tools.py can filter the reads by number of passes, but it only outputs fasta, fastq, or data format result. After filtering, I want to use blasr to map the reads. Blasr has the option to map CCS first, and then map the subreads to the window that CCS maps, so I think I'd better use bas.h5 instead of fastq as the input.
So the question is, how to filter CCS reads by minimum number of passes and output the filtered data as bas.h5?
Thanks.
Comment
-
Interesting use case, unfortunately not something that is straightforward. I can think of an around about way:
Generate a list of the read ids that have >=6 passes using bash5tools.py:
https://github.com/PacificBioscience.../doc/index.rst
Parse into whitelist format:
https://github.com/PacificBioscience...filtering-step
Then run filtering with the whitelist to produce a region file:
Code:filterPlsH5 input.fofn –filter=’ReadWhitelist=<file containing list of included reads>’
Code:blasr input.fofn reference.fasta –regionTable filtered_regions.fofn ...
Richard.
Comment
-
Thanks a lot, Richard! This is exactly what I need!
One question: should I use bas.h5 or pls.h5 for the input to filterPls.h5? My understanding is that alignment only involves bases, right?
I just tried using both pls.h5 and bas.h5 for filterpls.h5. They both gave me the rgn.h5 and .fofn outputs. But they both said this:
No handlers could be found for logger "pbpy.io.MovieHDF5IO"
Can this be a problem?
Thanks!
Comment
-
I work for PacBio. We don't always do a great job of documenting non standard workflows, we are attempting to address this with the devnet / github content.
filterplsH5.py is so named for historic reasons and works perfectly well with bas.h5 files. the error you are seeing should not be a problem, it is due to logging and not the actual filtering.
Comment
-
I see. Thanks a lot for your help, Richard.
For blasr, I should let it output sam format, then convert it to h5 for quiver, right?
My project is to find low freq variants by PacBio's CCS method. So the pipeline is:
filter certain numbers of passes by bash5tools -> mapping by blasr -> samtoh5 -> variant calling by quiver
Do you have any resource that I can read for parameter settings in each step? Especially, in blasr and quiver, there are a lot parameters to control the mapping quality and stringency, variant quality, etc.
For example,
1. the "-best" parameter that controls how many top alignments to report, the default is 10, but I'd better set it to 1, right?
2. For quiver, I'm reading the FAQ at: https://github.com/PacificBioscience.../QuiverFAQ.rst
This is a helpful page and it gives some recommendation for some parameter settings. But if you know more than it, please let me know, especially for the following parameters:
--coverage COVERAGE, -X COVERAGE
A designation of the maximum coverage level to be used
for analysis. Exact interpretation is algorithm-
specific.
--mapQvThreshold MAPQVTHRESHOLD, -m MAPQVTHRESHOLD
--variantConfidenceThreshold VARIANTCONFIDENCETHRESHOLD, -q VARIANTCONFIDENCETHRESHOLD
--variantCoverageThreshold VARIANTCOVERAGETHRESHOLD, -x VARIANTCOVERAGETHRESHOLD
--referenceChunkOverlap REFERENCECHUNKOVERLAP
--noEvidenceConsensusCall {nocall,reference,lowercasereference}
The consensus base that will be output for sites with
no effective coverage.
etc.
Sorry that I have too many to ask. This is my first time to use quiver.
Thanks so much.
Comment
-
blasr -> quiver is not as simple as converting to cmp.h5 from sam, the cmp.h5 for use in quiver has to have a lot of extra information. blasr has always been considered an underlying program, in standard PacBio pipelines compareSequences.py is used as a wrapper for blasr. A new program pbalign: https://github.com/PacificBioscience.../doc/howto.rst
Takes all the headaches out of aligning data for use with quiver. maxHits, (similar to bestn, but blasr has two parameters that interplay nCandidates & bestn) should be set to 1.
For quiver I would just stick to defaults as a first pass:
variantCaller.py --algorithm=quiver <cmp.h5> -r <reference> -o <gff> -o <consensus.fasta> -o <consensus.fastq>
One question, why this complex workflow, are you mapping to something that is highly repetitive? blasr and quiver are designed to work with normal read data. Also quiver will only call haploid variants.
Comment
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 05-02-2024, 08:06 AM
|
0 responses
16 views
0 likes
|
Last Post
by seqadmin
05-02-2024, 08:06 AM
|
||
Started by seqadmin, 04-30-2024, 12:17 PM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
04-30-2024, 12:17 PM
|
||
Started by seqadmin, 04-29-2024, 10:49 AM
|
0 responses
25 views
0 likes
|
Last Post
by seqadmin
04-29-2024, 10:49 AM
|
||
Started by seqadmin, 04-25-2024, 11:49 AM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
04-25-2024, 11:49 AM
|
Comment