Seqanswers Leaderboard Ad

**GenoMax** · 05-23-2013, 03:25 AM

How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

Repeat masker may work too http://www.repeatmasker.org/

You would have to edit the masked sequence out before doing assembly.

**krobison** · 05-24-2013, 10:06 AM

If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).

You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).

**Champi** · 05-24-2013, 09:57 PM

Originally posted by krobison View Post

If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).

Thank you very much krobison. I was thinking on writing a perl script to do that when found non-masked sequences at both sides of masked one in a read, splitting that read into reads a and b. However, my level of Perl is (below) very beginner

, and it will take time, although I think it's a good exercise. I'm working with fastq, what I suppose will make the thing double complicated, but will try.

Originally posted by krobison View Post

You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).

Will do for sure! Thank you for letting me know

Cheers

**mchaisso** · 05-25-2013, 10:41 AM

Hi, we address the same problem in our lab.

There are two solutions, one if you are doing things by hand, the other if you are going to incorporate this into a production pipeline and use SMRTPipe to handle assembly for you. Both need access to the original bas.h5 file.

For the first method
# Create a file of the usable subreads:
# this program is included in the blasr distribution under pbihdfutils
pls2fasta m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 reads.fasta -trimByRegion

# Next map the reads to the vector using blasr. allowing for multiple hits
blasr reads.fasta vector.fasta -m 4 -bestn 10 -out alignments.tovector.m4

# use your own filtering program to cut out the masked sequences
filterSequences.py reads.fasta alignments.tovector.m4 reads.split.fasta

... do the rest of your assembly with the split file.

If you want to use smrtpipe to do all the heavy lifting, you need to do some modifications to the original bas.h5 file (ideally a copy of it, and more ideally a file that encodes a single dataset from this file).

So, a little, ok, a long, background about PacBio sequences if you are new to it.

A pacbio "read" will contain 3 parts: a low quality prefix, a usable middle portion, and a low quality suffix. The reason read is quoted is because the middle portion may contain one or more passes over the sequencing template which are often called subreads so you may consider a read to be the consensus of the subreads, or the subreads themselves. The low quality prefix and suffix exist because there is noise at the beginning and ending of a movie that are recorded and interpreted as bases and called.

The length of the low quality prefix/suffix are different for every read, and may even be 1000s of bases. However, the basecalling is VERY accurate at detecting just what are the low quality prefix and suffix sequences, and there is an annotation of this described next. And don't worry, the statistics on all the 10kb reads are after subtracting off the prefix and suffix, so the numbers are not inflated.

Tucked in the bas.h5 file is a dataset called a RegionTable:
>h5ls -r data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
/ Group
/PulseData Group
/PulseData/BaseCalls Group
/PulseData/BaseCalls/Basecall Dataset {468462122/Inf}
/PulseData/BaseCalls/DeletionQV Dataset {468462122/Inf}
...
/PulseData/Regions

This table contains a description of the segments of a read, namely the coordinates of the start and end of the high quality middle, and the coordinates of the adapter and insert for every ZMW in the dataset.

Once you get over the headache of hdf5 coding, you can do nice things like make a dataset self documenting. In this instance, the description of the columns is part of the dataset:
ATTRIBUTE "ColumnNames" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) }
DATA {
(0): "HoleNumber", "Region type index", "Region start in bases",
(3): "Region end in bases", "Region score"
}
}

The region types are enumerated, which is also described in the dataset:

ATTRIBUTE "RegionTypes" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( H5S_UNLIMITED ) }
DATA {
(0): "Adapter", "Insert", "HQRegion"
}
}

Looking at the first few lines of the region table yields this:

h5dump -d /PulseData/Regions data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
HDF5 "data/Analysis_Results/m130517_113100_42134_c10052345255000000182308330928
1340_s1_p0.bas.h5" {
DATASET "/PulseData/Regions" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 197012, 5 ) / ( H5S_UNLIMITED, 5 ) }
DATA {
(0,0): 0, 1, 0, 426, -1,
(1,0): 0, 2, 0, 0, 0,
(2,0): 1, 1, 0, 507, -1,
(3,0): 1, 2, 0, 0, 0,
(4,0): 2, 1, 0, 523, -1,
(5,0): 2, 2, 0, 0, 0,
(6,0): 3, 1, 0, 354, -1,
(7,0): 3, 2, 0, 0, 0,
(8,0): 4, 1, 0, 15159, -1,
(9,0): 4, 1, 15211, 22459, -1,
(10,0): 4, 0, 15159, 15211, 750,
(11,0): 4, 2, 17273, 22458, 894,

You can see reads 0-3 have no usable sequence, which is expected due to the Poisson loading of the smrtcell.

Read 4 has a high-quality segment from 17273 - 22458. Paradoxically, it is possible for the low quality prefix and annotated insert regions to overlap.

Ok, now finally to the vector screening. This was mainly set up by our bioinformatics specialist with a little feedback from me. We use blasr to map the original fasta reads to the vector sequence, allowing for multiple hits:
blasr m130517_113100_42134_c100523452550000001823083309281340_s1_p0.fasta vector.fa -m 4 -bestn 10 -out alignment.m4 -header

The next step requires some python coding. HDF is extremely painful to use in C/C++, but pretty easy just a bit slow to use in python. If you modify the rows in the RegionTable to reflect the masked off sequence and write this back out, you can feed the modified bas.h5 files into smrtpipe, and assembly away, sans-vector.

Finally, an even better solution is to write out an entirely new file that contains only the region table m130517_113100_42134_c100523452550000001823083309281340_s1_p0.rgn.h5. Smrtpipe allows one to ignore the region table included in the bas.h5 file, and use the custom region table in this file, for scenarios just like this masking one here.

This post now sums up more text than I've ever posted on seqanswers since starting. Enjoy.

-mark

**Champi** · 05-26-2013, 04:33 PM

@mchaisso, that is an overwhelming amount of information

Since I don't have access to the original bas.h5 files (the provider gave me already filtered .fastq files), I will go for the first solution as I said just in my previous post: align the BAC vector sequence to the (already corrected) reads using SSAHA2 (although I think I could use blasr as you say) and then use the information from the output to split the reads with an in-house script and proceed with the resulting split-reads file.

If I get the original bas.h5 files (which I will ask the provider for), I will try also the second option that you propose.

Thank you very very much!!!

**skycreative** · 05-27-2013, 01:59 AM

I like this way, and if there is a seq contain the vector sequence in the middle,
it is also can be splited as paired reads for assembly.

Originally posted by GenoMax View Post

How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

Repeat masker may work too http://www.repeatmasker.org/

You would have to edit the masked sequence out before doing assembly.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 23 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

BAC vector sequece masking for de novo assembly using PacBio C2

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News