Blasr/Quiver - Best Practice for Current Versions?

gconcepcion replied

11-29-2018, 02:27 PM
Originally posted by ro2006 View Post

Hi everybody,
I hope my question is not too naive, but I'm quite new to the De Novo assembly. I did a De Novo assembly starting from PacBio reads using Canu.
Now I should polish the assembly and perform variant calling using arrow. My question is: in order to run pbalign as mentioned above, what's your reference.fasta file from the Canu output? Would this be the Contigs.fasta file?
Thanks in advance for any help

Yes, the contigs.fasta that is output from Canu is what you want to `polish` so you will select that as your `reference.fasta` to which you map your raw reads using pbalign. Then variantCaller can subsequently call consensus for each base in your reference.fasta based on the raw read coverage
Leave a comment:
ro2006 replied

11-20-2018, 06:01 AM
Originally posted by rhall View Post

Basic workflow using the data you have and the latest versions of software installed using pitchfork:
bax.h5 -> bax2bam -> reads.bam
reads.bam + reference.fasta -> pbalign -> aligned_reads.bam
aligned_reads.bam -> variant_caller (quiver or arrow) -> consensus.fasta

Hi everybody,
I hope my question is not too naive, but I'm quite new to the De Novo assembly. I did a De Novo assembly starting from PacBio reads using Canu.
Now I should polish the assembly and perform variant calling using arrow. My question is: in order to run pbalign as mentioned above, what's your reference.fasta file from the Canu output? Would this be the Contigs.fasta file?
Thanks in advance for any help
Leave a comment:
jpummil replied

01-21-2017, 06:57 PM
Apologies, attached is Quast report comparing base Canu assembly, Base + Quiver polishing, and Quiver + Pilon....

Note, still using --glimmer for gene finding in Quast. Need to switch over to GeneMark-ES (soon).
Attached Files

quast-report.txt (3.2 KB, 48 views)
Leave a comment:
jpummil replied

01-20-2017, 08:02 AM
Results of using Pilon AFTER having used Quiver....

So, Pilon just completed after 40+ hours of computation....another small increase in file size from what Quiver provided...

-rw-rw-r-- 1 jpummil jpummil 121958163 Jan 20 08:50 Q1133.012017.GS170M.Canu-Base.fasta

-rw-rw-r-- 1 jpummil jpummil 122467877 Jan 20 02:16 Q1133.122116.GS170M.Quiver.fasta (+ 509,714 bytes)

-rw-rw-r-- 1 jpummil jpummil 122649681 Jan 20 02:16 Q1133.012017.GS170M.Pilon.fasta (+ 181,804 bytes)

Still need to do a Quast assessment, maybe sometime today...
Leave a comment:
jpummil replied

01-18-2017, 07:08 AM
So, I still need to write up a brief process workflow to describe what I've doe thus far, but thought I'd update anyone following this thread on another step I heard about thru the grapevine (people at PAG this week sending me email).

APPARENTLY....many are performing an additional step after the Quiver polishing and running the updated assembly thru BROAD's Pilon tool. Of course, you have to create a new alignment file with the original bax.h5 files (converted to .bams with bax2bam utility) and the Quiver polished assembly....

It's running now, so I'll report back (bad or good) when it completes and I can run a quick comparison of the three with Quast.

It should be said that I have been just using default settings with most of the tools, so probably some incremental gains to be had by tweaking settings IF you know your data well. That said, it seems developers do a pretty nice job of setting defaults these days!
Leave a comment:
jpummil replied

12-21-2016, 09:15 AM
SUCCESS!

Not a huge improvement on the original assembly (using Canu)...but improved gene's found using Quast (comparison attached).

I used default parameters in Quiver, not sure tweaking would've changed a whole lot.

I'm "officially" on vacation 'til Jan 9th, but will try and toss together a short "How-To" based on the steps I followed (and those tips graciously supplied by rhall and gconcepcion!) so that others just starting to work with PacBio data and tools can do this as well.
Attached Files

report.txt (2.0 KB, 47 views)
Leave a comment:
jpummil replied

12-20-2016, 11:43 AM
Originally posted by gconcepcion View Post

$ ls *.subreads.bam >input.fofn
$ pbalign --nproc 32 input.fofn Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam

pbalign takes individual sequence files as input, or a File of File Names (*.fofn)

From pbalign --help:
"Input can be a fasta, pls.h5, bas.h5 or ccs.h5 file or a fofn (file of file names). Output can be in SAM or BAM format."

Thanks gconcepcion! That definitely did the trick. Ran into a /tmp space issue next. THINK I have it solved by supplying the --tmpDir option in the command line and setting it to the /local_scratch on the compute node as well...almost a terabyte available there. Will know very soon ;-)

Onwards!
Leave a comment:
gconcepcion replied

12-19-2016, 04:31 PM
Originally posted by jpummil View Post

Used bax2bam to convert all of the bax.h5 files to .bam format...no problem.

When I try to use pbalign on the 10ea .bam files, it complains after it appears to read and accept the first three files:

pbalign: error: unrecognized arguments: movie_04.subreads.bam movie_05.subreads.bam movie_06.subreads.bam movie_07.subre
ads.bam movie_08.subreads.bam movie_09.subreads.bam movie_10.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligne
d.bam

PBalign comand as follows:
pbalign --nproc 32 *.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam

*.subreads.bam corresponds to the following filenames:
movie_01.subreads.bam
thru
movie_10.subreads.bam

I don't see anything obvious in the pbalign --help details about running multiple .bam files for alignment.

My initial tests using just a single lane worked as expected. It's when I get to the task of running 10ea lanes thru pbalign that things go south.

I'll keep looking for docs, etc...but any tips are appreciated. It just seems as if the PacBio help documents are a bit lacking on details...

$ ls *.subreads.bam >input.fofn
$ pbalign --nproc 32 input.fofn Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam

pbalign takes individual sequence files as input, or a File of File Names (*.fofn)

From pbalign --help:
"Input can be a fasta, pls.h5, bas.h5 or ccs.h5 file or a fofn (file of file names). Output can be in SAM or BAM format."

Last edited by gconcepcion; 12-19-2016, 04:34 PM.
Leave a comment:
jpummil replied

12-19-2016, 04:09 PM
Used bax2bam to convert all of the bax.h5 files to .bam format...no problem.

When I try to use pbalign on the 10ea .bam files, it complains after it appears to read and accept the first three files:

pbalign: error: unrecognized arguments: movie_04.subreads.bam movie_05.subreads.bam movie_06.subreads.bam movie_07.subre
ads.bam movie_08.subreads.bam movie_09.subreads.bam movie_10.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligne
d.bam

PBalign comand as follows:
pbalign --nproc 32 *.subreads.bam Q1133.092116.GS170M.contigs.fasta Q1133.aligned.bam

*.subreads.bam corresponds to the following filenames:
movie_01.subreads.bam
thru
movie_10.subreads.bam

I don't see anything obvious in the pbalign --help details about running multiple .bam files for alignment.

My initial tests using just a single lane worked as expected. It's when I get to the task of running 10ea lanes thru pbalign that things go south.

I'll keep looking for docs, etc...but any tips are appreciated. It just seems as if the PacBio help documents are a bit lacking on details...
Leave a comment:
rhall replied

12-16-2016, 11:19 AM
FYI, the scraps file includes all the sequence not in the HQ (high quality) region, and the adapter sequences. The old bax format saved everything with indexes into the usable data. The subreads.bam contains all the usable data.
Leave a comment:
jpummil replied

12-16-2016, 10:57 AM
Originally posted by rhall View Post

Yes the bax2bam should be ran per movie (3 files). PBalign can then be ran using all the data.

Thanks rhall! Was just about to post an update when you posted this ;-)

-rw-r--r-- 1 jpummil jpummil 2.3G Dec 16 12:06 m160611_230902_00125_c101003722550000001823226809161614_s1_p0.1.bax.h5

-rw-r--r-- 1 jpummil jpummil 2.2G Dec 16 12:06 m160611_230902_00125_c101003722550000001823226809161614_s1_p0.2.bax.h5

-rw-r--r-- 1 jpummil jpummil 2.2G Dec 16 12:06 m160611_230902_00125_c101003722550000001823226809161614_s1_p0.3.bax.h5

BECOMES (using bax2bam):

-rw-rw-r-- 1 jpummil jpummil 1.3G Dec 16 12:18 movie.scraps.bam

-rw-rw-r-- 1 jpummil jpummil 2.3G Dec 16 12:18 movie.subreads.bam

Now running PBalign (blasr aligner) of the one lane of converted data to create "Q1133.aligned.bam".

I'll then see about a Quiver run of that file just to validate the process. Afterwards, I'll go back and tidy things up to process all ten lanes of data, put thru PBalign, then Quiver.
Leave a comment:
rhall replied

12-16-2016, 10:52 AM
Yes the bax2bam should be ran per movie (3 files). PBalign can then be ran using all the data.
Leave a comment:
jpummil replied

12-15-2016, 05:39 PM
Hmmm...apparently some misunderstanding about "movies" on my part.

Using the following bax2bam command:
bax2bam Q1133.bax.h5/*.bax.h5 -o movie --subread

ERROR: multiple movies detected:
ERROR: m160611_230902_00125_c101003722550000001823226809161614_s1_p0
ERROR: m160612_032816_00125_c101003722550000001823226809161615_s1_p0
ERROR: m160612_074931_00125_c101003722550000001823226809161616_s1_p0
ERROR: m160614_131714_00125_c101010592550000001823230710211662_s1_p0
ERROR: m160614_194151_00125_c101010592550000001823230710211663_s1_p0
ERROR: m160615_020159_00125_c101010592550000001823230710211664_s1_p0
ERROR: m160615_062522_00125_c101010592550000001823230710211665_s1_p0
ERROR: m160615_104711_00125_c101010592550000001823230710211666_s1_p0
ERROR: m160615_150655_00125_c101010592550000001823230710211667_s1_p0
ERROR: m160615_213630_00125_c101009882550000001823230710211620_s1_p0

Contents of my bax.h5 directory:

[jpummil@tres-l1 Q1133.bax.h5]$ ls
m160611_230902_00125_c101003722550000001823226809161614_s1_p0.1.bax.h5
m160611_230902_00125_c101003722550000001823226809161614_s1_p0.2.bax.h5
m160611_230902_00125_c101003722550000001823226809161614_s1_p0.3.bax.h5
m160612_032816_00125_c101003722550000001823226809161615_s1_p0.1.bax.h5
m160612_032816_00125_c101003722550000001823226809161615_s1_p0.2.bax.h5
m160612_032816_00125_c101003722550000001823226809161615_s1_p0.3.bax.h5
m160612_074931_00125_c101003722550000001823226809161616_s1_p0.1.bax.h5
m160612_074931_00125_c101003722550000001823226809161616_s1_p0.2.bax.h5
m160612_074931_00125_c101003722550000001823226809161616_s1_p0.3.bax.h5
m160614_131714_00125_c101010592550000001823230710211662_s1_p0.1.bax.h5
m160614_131714_00125_c101010592550000001823230710211662_s1_p0.2.bax.h5
m160614_131714_00125_c101010592550000001823230710211662_s1_p0.3.bax.h5
m160614_194151_00125_c101010592550000001823230710211663_s1_p0.1.bax.h5
m160614_194151_00125_c101010592550000001823230710211663_s1_p0.2.bax.h5
m160614_194151_00125_c101010592550000001823230710211663_s1_p0.3.bax.h5
m160615_020159_00125_c101010592550000001823230710211664_s1_p0.1.bax.h5
m160615_020159_00125_c101010592550000001823230710211664_s1_p0.2.bax.h5
m160615_020159_00125_c101010592550000001823230710211664_s1_p0.3.bax.h5
m160615_062522_00125_c101010592550000001823230710211665_s1_p0.1.bax.h5
m160615_062522_00125_c101010592550000001823230710211665_s1_p0.2.bax.h5
m160615_062522_00125_c101010592550000001823230710211665_s1_p0.3.bax.h5
m160615_104711_00125_c101010592550000001823230710211666_s1_p0.1.bax.h5
m160615_104711_00125_c101010592550000001823230710211666_s1_p0.2.bax.h5
m160615_104711_00125_c101010592550000001823230710211666_s1_p0.3.bax.h5
m160615_150655_00125_c101010592550000001823230710211667_s1_p0.1.bax.h5
m160615_150655_00125_c101010592550000001823230710211667_s1_p0.2.bax.h5
m160615_150655_00125_c101010592550000001823230710211667_s1_p0.3.bax.h5
m160615_213630_00125_c101009882550000001823230710211620_s1_p0.1.bax.h5
m160615_213630_00125_c101009882550000001823230710211620_s1_p0.2.bax.h5
m160615_213630_00125_c101009882550000001823230710211620_s1_p0.3.bax.h5

So, does one just process the three files per group into a single .bam file, do the next, etc...then repeat the PBalign step per each .bam created?
Leave a comment:
jpummil replied

12-14-2016, 10:49 AM
Originally posted by rhall View Post

Just let me know if you need any more details on a particular step. If it's a particularly large dataset http://pacificbiosciences.github.io/...coretools.html has info on how to do a lightweight chunking without using the full workflow engine.

Genome size is about 170Mb....and I have ~65GB of bax.h5 files. BUT...I also have a pretty robust linux server to run it on.
Leave a comment:
rhall replied

12-14-2016, 10:47 AM
Just let me know if you need any more details on a particular step. If it's a particularly large dataset http://pacificbiosciences.github.io/...coretools.html has info on how to do a lightweight chunking without using the full workflow engine.
Leave a comment:

Previous 1 2 template Next

Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM
Strategies for Sequencing Challenging Samples

by seqadmin

Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
- Channel: Articles
03-22-2024, 06:39 AM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News