**GenoMax** · 08-30-2014, 09:33 AM

Do you have access to the analysis reports that are generated by the SMRTanalysis package (simple filtering/"reads of insert" analyses are good ones to start with)? You could ask your sequence provider for them.

If you have the *.h5 and metadata.xml files for the SMRTcells then you could generate them yourself using the SMRTanalysis/SMRTPortal software. If you are not IT savvy then it is not a simple task to get SMRTanalysis set up correctly (fair warning).

There is information/wiki available here: https://github.com/PacificBioscience...-Training/wiki

Also video tutorials/webinars: https://github.com/PacificBioscience...atics-Workshop

**Medhat** · 08-30-2014, 10:13 AM

I already installed SMRTPortal on my PC and I have the .h5 files, but which protocol should I use to generate the data?

**GenoMax** · 08-30-2014, 10:57 AM

Originally posted by Medhat:

I already installed SMRTPortal on my PC and I have the .h5 files, but which protocol should I use to generate the data?

Once you import the SMRTcell data into SMRTportal you can start with the basic filtering (RS_subread) protocol. This will filter the adapters and give you all subreads. Depending on how this run was done (length of movie etc.) and the size of the inserts, it may be worth running the "RS_reads_of_insert" protocol next. This would give you the consensus reads (used to be called CCS). If you have an adequate amount of coverage (20x or more with a good median read length, 3-4kb or more) and this is a small genome (e.g. bacteria) then you can try the RS_HGAP assembly next.

Watch this tutorial first for an overview of the process: http://aa314.gondor.co/webinar/secondary-analysis/

**Medhat** · 09-01-2014, 12:31 AM

Thanks a lot for the detailed answer.
I have a big plant genome, not a bacterial one.
Is there a special way to measure the coverage from the PacBio subreads?

OK, I have now imported all the cells and I am running the RS_subread protocol, but in the analysis directory there are also some files like these:
..._s1_p0.2.subreads.fasta
..._s1_p0.1.subreads.fasta
..._s1_p0.3.subreads.fasta

Does the existence of these subreads.fasta files mean that the company already ran this protocol for me?

**GenoMax** · 09-01-2014, 04:20 AM

Those files are part of the primary data produced on the instrument itself.

Has the subread filtering finished for each SMRTcell? It should be reasonably quick. The analyzed data should end up under the /path_to/smrtanalysis/common/jobs/0xx/0xxxxx/data directory. The 0xxxxx number is the job ID that is generated by SMRTportal. You can get at the fastq/fasta files from the web interface (without having to dig through the directory hierarchy above). They are generally called "filtered_subreads.fasta/q" for each "job_id". A good graphical overview report is produced for each job.
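If you want to double-check the numbers in the report, the basic length statistics can be recomputed from the subread FASTA. The following is only a minimal sketch, not part of SMRTanalysis: the function names `fasta_lengths`, `n50` and `summarize` are made up for this example, and it assumes a plain (uncompressed) FASTA file.

```python
from typing import Dict, Iterable, Iterator, List


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if running >= half:
            return n
    return 0


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Read count, total bases, mean length and N50 for a list of read lengths."""
    total = sum(lengths)
    mean = total / len(lengths) if lengths else 0.0
    return {"reads": len(lengths), "bases": total, "mean_length": mean, "n50": n50(lengths)}


# Tiny in-memory example; the records here are invented for illustration.
demo = [">read1", "ACGTAC", ">read2", "ACGT", "ACGT", ">read3", "AC"]
print(summarize(list(fasta_lengths(demo))))
```

For a real job you would pass an open file handle instead of the `demo` list, for example the filtered_subreads.fasta downloaded from the web interface.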

How many cells do you have? With a plant genome your coverage is not going to be great unless you ran a lot of SMRTcells.

In general you want the dimer/adapter % to be very small. Your mean read length should be long, and the productivity (look for the % number for P1) should be as high as possible compared to the P0 (no sequence) or P2 (more than one sequence) wells.

You must have received some guidance on the quality of the data from the provider.

**Medhat** · 09-01-2014, 05:08 AM

I have 18 cells.
Results of RS_subread using default parameters:

| Job Metric | Value |
| --- | --- |
| Adapter Dimers (0-10bp) | 0.01% |
| Short Inserts (11-100bp) | 0.01% |
| Number of Bases | 7,604,261,565 |
| Number of Reads | 1,218,435 |
| N50 Read Length | 10,405 |
| Mean Read Length | 6,241 |
| Mean Read Score | 0.83 |

The generated fasta and fastq files are in the data section of SMRT Portal. Should I download them before starting the next protocol, RS_reads_of_insert?

**GenoMax** · 09-01-2014, 05:51 AM

Those numbers look good.

RS_reads_of_insert should be run with the original SMRTcell data. Since you have fairly long reads, the number of CCS passes you get may not be very high (if you have a long insert the polymerase may only go around it a few times during the run).
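A rough back-of-envelope way to see this: the number of passes is about the polymerase read length divided by the insert length. The sketch below is only an approximation under that assumption. The function name `expected_passes` and the 12,000 bp polymerase read length are hypothetical values chosen for illustration, and adapter length is ignored unless you supply it.

```python
def expected_passes(polymerase_read_length: float,
                    insert_length: float,
                    adapter_length: float = 0.0) -> float:
    """Approximate number of times the polymerase traverses the insert.

    Assumes one continuous read around the circular template; this is a
    simplification, not how SMRTanalysis counts passes.
    """
    if insert_length <= 0:
        raise ValueError("insert_length must be positive")
    return polymerase_read_length / (insert_length + adapter_length)


# Hypothetical 12,000 bp polymerase read on a short and a long insert.
print(expected_passes(12_000, 2_000))   # many passes on a short insert
print(expected_passes(12_000, 10_000))  # barely more than one pass on a long insert
```

With the same read length, a five times longer insert gets five times fewer passes, which is why long-insert libraries yield few consensus reads.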

The fasta/fastq you generated in filter step can be set aside for now.

What is it that you finally want to do with this data? The answer to what to do next partly depends on that.

**Medhat** · 09-02-2014, 01:27 AM

OK, I have now started RS_reads_ofinsert_1.
The target is to do an assembly that will include Illumina reads.
After that I will begin comparative genomics with other isolates in the genome database.

Another question: how can you tell from the RS_subread report whether I have good or bad reads? Also, the Mean Read Score is 0.83. How is it measured, and what are the upper and lower limits, meaning what counts as good or bad?
And for the assembly I need to calculate the coverage. How can I do that with PacBio reads?

Thank you very much, I highly appreciate your time and help.

**GenoMax** · 09-03-2014, 04:00 AM

Do you have an idea of the size of the genome of the plant you are working with? I suppose you can calculate theoretical coverage based on the number of bases you have in your dataset (7.6 Gb). How was this library prepared? Do you expect the representation to be random?

You would want to use the RS_CeleraAssembler protocol to include your Illumina reads with the PacBio data.

Dr. Hall from PacBio participates on this forum and perhaps he can chime in with additional suggestions.

**Medhat** · 09-03-2014, 04:25 AM

The genome is about 2.4 Gb.

Meanwhile, the results of RS_reads_ofinsert_1:

| Job Metric | Value |
| --- | --- |
| Read Bases of Insert | 688,951,419 |
| Mean Read Length of Insert | 2,289.0 |
| Read Quality of Insert | 97.18% |
| Mean Number of Passes | 6.0 |

**GenoMax** · 09-03-2014, 05:23 AM

How much Illumina data do you have?

You only have about 3x coverage with total PacBio data but with "reads_of_insert" that number falls below 1x. Not unexpected since you would need to run a ton of SMRTcells to get really deep coverage.
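To get a feel for how many SMRTcells "a ton" is, you can scale from the per-cell yield of this run. This is a rough sketch only: it assumes every additional cell yields the same number of bases as the average of the 18 cells here, and the 20x target is just an illustrative figure, not a recommendation for this genome. The function name `cells_needed` is made up for the example.

```python
import math


def cells_needed(target_coverage: float, genome_size: int, bases_per_cell: float) -> int:
    """SMRTcells required to reach a target theoretical coverage."""
    return math.ceil(target_coverage * genome_size / bases_per_cell)


GENOME_SIZE = 2_400_000_000            # ~2.4 Gb, as stated in the thread
BASES_PER_CELL = 7_604_261_565 / 18    # average subread yield per cell in this run

# Illustrative target of 20x (an assumption for this example).
print(cells_needed(20, GENOME_SIZE, BASES_PER_CELL))
```

Under those assumptions the answer is on the order of a hundred cells or more, compared with the 18 that were run.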

The CCS reads should be of good quality and if you blast a few of them at NCBI you should see good hits (just to confirm that you have the right sequence/plant) to either your plant or a close relative.

**Medhat** · 09-04-2014, 12:41 AM

OK, that is great, but here:

RS_subread
Number of Bases: 7,604,261,565
Number of Reads: 1,218,435

and

RS_reads_ofinsert_1
Read Bases of Insert: 688,951,419
Mean Read Length of Insert: 2,289.0

How can the coverage be 3x in the first and only 1x in the second?

**GenoMax** · 09-04-2014, 03:15 AM

Isn't the genome you are working with ~2,400,000,000 bp? Take that into consideration with the total number of bases that you are getting from your stats (this is a x/y kind of calculation and the distribution of those bases is not going to be uniform across the genome).
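The x/y calculation can be written out as below. It is a minimal sketch of theoretical (average) coverage only, using the numbers posted in this thread; it says nothing about how evenly the reads are distributed. The function name `theoretical_coverage` is made up for the example.

```python
def theoretical_coverage(total_bases: int, genome_size: int) -> float:
    """Average fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


GENOME_SIZE = 2_400_000_000  # ~2.4 Gb plant genome

subread_cov = theoretical_coverage(7_604_261_565, GENOME_SIZE)  # RS_subread bases
roi_cov = theoretical_coverage(688_951_419, GENOME_SIZE)        # reads-of-insert bases

print(f"subreads:        {subread_cov:.2f}x")   # about 3.17x
print(f"reads of insert: {roi_cov:.2f}x")       # about 0.29x
```

The reads-of-insert total is much smaller because multiple passes over the same insert are collapsed into a single consensus read.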

**rhall** · 09-04-2014, 10:52 AM

Just catching up with the thread. For a hybrid (Illumina + PacBio) assembly, ~3x is not enough coverage. I would suggest assembling your Illumina data and using the PacBio CLR (Continuous Long Reads, not CCS) for gap filling with PBJelly.
http://www.plosone.org/article/info%...l.pone.0047768
https://github.com/PacificBioscience...Bio-Long-Reads

how to measure the quality of PacBio data