Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • skycreative
    replied
    I like this way, and if there is a seq contain the vector sequence in the middle,
    it is also can be splited as paired reads for assembly.
    Originally posted by GenoMax View Post
    How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

    Repeat masker may work too http://www.repeatmasker.org/

    You would have to edit the masked sequence out before doing assembly.

    Leave a comment:


  • Champi
    replied
    @mchaisso, that is an overwhelming amount of information

    Since I don't have access to the original bas.h5 files (the provider gave me already filtered .fastq files), I will go for the first solution as I said just in my previous post: align the BAC vector sequence to the (already corrected) reads using SSAHA2 (although I think I could use blasr as you say) and then use the information from the output to split the reads with an in-house script and proceed with the resulting split-reads file.

    If I get the original bas.h5 files (which I will ask the provider for), I will try also the second option that you propose.

    Thank you very very much!!!

    Leave a comment:


  • mchaisso
    replied
    Hi, we address the same problem in our lab.

    There are two solutions, one if you are doing things by hand, the other if you are going to incorporate this into a production pipeline and use SMRTPipe to handle assembly for you. Both need access to the original bas.h5 file.

    For the first method
    # Create a file of the usable subreads:
    # this program is included in the blasr distribution under pbihdfutils
    pls2fasta m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 reads.fasta -trimByRegion

    # Next map the reads to the vector using blasr. allowing for multiple hits
    blasr reads.fasta vector.fasta -m 4 -bestn 10 -out alignments.tovector.m4

    # use your own filtering program to cut out the masked sequences
    filterSequences.py reads.fasta alignments.tovector.m4 reads.split.fasta

    ... do the rest of your assembly with the split file.

    If you want to use smrtpipe to do all the heavy lifting, you need to do some modifications to the original bas.h5 file (ideally a copy of it, and more ideally a file that encodes a single dataset from this file).

    So, a little, ok, a long, background about PacBio sequences if you are new to it.

    A pacbio "read" will contain 3 parts: a low quality prefix, a usable middle portion, and a low quality suffix. The reason read is quoted is because the middle portion may contain one or more passes over the sequencing template which are often called subreads so you may consider a read to be the consensus of the subreads, or the subreads themselves. The low quality prefix and suffix exist because there is noise at the beginning and ending of a movie that are recorded and interpreted as bases and called.

    The length of the low quality prefix/suffix are different for every read, and may even be 1000s of bases. However, the basecalling is VERY accurate at detecting just what are the low quality prefix and suffix sequences, and there is an annotation of this described next. And don't worry, the statistics on all the 10kb reads are after subtracting off the prefix and suffix, so the numbers are not inflated.

    Tucked in the bas.h5 file is a dataset called a RegionTable:
    >h5ls -r data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
    / Group
    /PulseData Group
    /PulseData/BaseCalls Group
    /PulseData/BaseCalls/Basecall Dataset {468462122/Inf}
    /PulseData/BaseCalls/DeletionQV Dataset {468462122/Inf}
    ...
    /PulseData/Regions

    This table contains a description of the segments of a read, namely the coordinates of the start and end of the high quality middle, and the coordinates of the adapter and insert for every ZMW in the dataset.

    Once you get over the headache of hdf5 coding, you can do nice things like make a dataset self documenting. In this instance, the description of the columns is part of the dataset:
    ATTRIBUTE "ColumnNames" {
    DATATYPE H5T_STRING {
    STRSIZE H5T_VARIABLE;
    STRPAD H5T_STR_NULLTERM;
    CSET H5T_CSET_ASCII;
    CTYPE H5T_C_S1;
    }
    DATASPACE SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) }
    DATA {
    (0): "HoleNumber", "Region type index", "Region start in bases",
    (3): "Region end in bases", "Region score"
    }
    }

    The region types are enumerated, which is also described in the dataset:

    ATTRIBUTE "RegionTypes" {
    DATATYPE H5T_STRING {
    STRSIZE H5T_VARIABLE;
    STRPAD H5T_STR_NULLTERM;
    CSET H5T_CSET_ASCII;
    CTYPE H5T_C_S1;
    }
    DATASPACE SIMPLE { ( 3 ) / ( H5S_UNLIMITED ) }
    DATA {
    (0): "Adapter", "Insert", "HQRegion"
    }
    }

    Looking at the first few lines of the region table yields this:

    h5dump -d /PulseData/Regions data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
    HDF5 "data/Analysis_Results/m130517_113100_42134_c10052345255000000182308330928
    1340_s1_p0.bas.h5" {
    DATASET "/PulseData/Regions" {
    DATATYPE H5T_STD_I32LE
    DATASPACE SIMPLE { ( 197012, 5 ) / ( H5S_UNLIMITED, 5 ) }
    DATA {
    (0,0): 0, 1, 0, 426, -1,
    (1,0): 0, 2, 0, 0, 0,
    (2,0): 1, 1, 0, 507, -1,
    (3,0): 1, 2, 0, 0, 0,
    (4,0): 2, 1, 0, 523, -1,
    (5,0): 2, 2, 0, 0, 0,
    (6,0): 3, 1, 0, 354, -1,
    (7,0): 3, 2, 0, 0, 0,
    (8,0): 4, 1, 0, 15159, -1,
    (9,0): 4, 1, 15211, 22459, -1,
    (10,0): 4, 0, 15159, 15211, 750,
    (11,0): 4, 2, 17273, 22458, 894,

    You can see reads 0-3 have no usable sequence, which is expected due to the Poisson loading of the smrtcell.

    Read 4 has a high-quality segment from 17273 - 22458. Paradoxically, it is possible for the low quality prefix and annotated insert regions to overlap.


    Ok, now finally to the vector screening. This was mainly set up by our bioinformatics specialist with a little feedback from me. We use blasr to map the original fasta reads to the vector sequence, allowing for multiple hits:
    blasr m130517_113100_42134_c100523452550000001823083309281340_s1_p0.fasta vector.fa -m 4 -bestn 10 -out alignment.m4 -header

    The next step requires some python coding. HDF is extremely painful to use in C/C++, but pretty easy just a bit slow to use in python. If you modify the rows in the RegionTable to reflect the masked off sequence and write this back out, you can feed the modified bas.h5 files into smrtpipe, and assembly away, sans-vector.

    Finally, an even better solution is to write out an entirely new file that contains only the region table m130517_113100_42134_c100523452550000001823083309281340_s1_p0.rgn.h5. Smrtpipe allows one to ignore the region table included in the bas.h5 file, and use the custom region table in this file, for scenarios just like this masking one here.


    This post now sums up more text than I've ever posted on seqanswers since starting. Enjoy.

    -mark

    Leave a comment:


  • Champi
    replied
    Originally posted by krobison View Post
    If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).
    Thank you very much krobison. I was thinking on writing a perl script to do that when found non-masked sequences at both sides of masked one in a read, splitting that read into reads a and b. However, my level of Perl is (below) very beginner , and it will take time, although I think it's a good exercise. I'm working with fastq, what I suppose will make the thing double complicated, but will try.

    Originally posted by krobison View Post
    You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).
    Will do for sure! Thank you for letting me know

    Cheers

    Leave a comment:


  • krobison
    replied
    If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).

    You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).

    Leave a comment:


  • GenoMax
    replied
    How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

    Repeat masker may work too http://www.repeatmasker.org/

    You would have to edit the masked sequence out before doing assembly.
    Last edited by GenoMax; 05-23-2013, 03:27 AM.

    Leave a comment:


  • BAC vector sequece masking for de novo assembly using PacBio C2

    Dear all,

    I have PacBio reads from a pool of BACs (eventually, 100X coverage taking into account an average of 120 Kb of insert size in the BAC), and am having problems to find a decent solution to mask the vector sequences. The vector is 8 Kb long, and given the fact that a considerable proportion of my reads are longer than 8 Kb, that means that I will have in many cases the vector sequence in the middle, and the sequences I am interested in at the ends of the reads...

    My idea is to use SSAHA2 as recommended in the MIRA manual, and then try to assemble with MIRA (all this after correcting the reads using the PacBioToCA pipeline, which I got to run it without any problem). However, how would MIRA use a read with a sequence masked in the middle? Would it use the two extremes as independent reads (wanted effect)? Or would join the two extremes (unwanted effect)?

    I have tried to assemble without masking the vector, but since it's the same vector for all the BACs, I am getting problems and to many contigs (including quimaeras...)

    Thanks a lot!
    Last edited by Champi; 05-23-2013, 04:37 AM.

Latest Articles

Collapse

  • seqadmin
    An Introduction to the Technologies Transforming Precision Medicine
    by seqadmin


    In recent years, precision medicine has become a major focus for researchers and healthcare professionals. This approach offers personalized treatment and wellness plans by utilizing insights from each person's unique biology and lifestyle to deliver more effective care. Its advancement relies on innovative technologies that enable a deeper understanding of individual variability. In a joint documentary with our colleagues at Biocompare, we examined the foundational principles of precision...
    01-27-2025, 07:46 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 09:30 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-05-2025, 10:34 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-03-2025, 09:07 AM
0 responses
27 views
0 likes
Last Post seqadmin  
Started by seqadmin, 01-31-2025, 08:31 AM
0 responses
35 views
0 likes
Last Post seqadmin  
Working...
X