Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BAC vector sequece masking for de novo assembly using PacBio C2

    Dear all,

    I have PacBio reads from a pool of BACs (eventually, 100X coverage taking into account an average of 120 Kb of insert size in the BAC), and am having problems to find a decent solution to mask the vector sequences. The vector is 8 Kb long, and given the fact that a considerable proportion of my reads are longer than 8 Kb, that means that I will have in many cases the vector sequence in the middle, and the sequences I am interested in at the ends of the reads...

    My idea is to use SSAHA2 as recommended in the MIRA manual, and then try to assemble with MIRA (all this after correcting the reads using the PacBioToCA pipeline, which I got to run it without any problem). However, how would MIRA use a read with a sequence masked in the middle? Would it use the two extremes as independent reads (wanted effect)? Or would join the two extremes (unwanted effect)?

    I have tried to assemble without masking the vector, but since it's the same vector for all the BACs, I am getting problems and to many contigs (including quimaeras...)

    Thanks a lot!
    Last edited by Champi; 05-23-2013, 04:37 AM.

  • #2
    How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

    Repeat masker may work too http://www.repeatmasker.org/

    You would have to edit the masked sequence out before doing assembly.
    Last edited by GenoMax; 05-23-2013, 03:27 AM.

    Comment


    • #3
      If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).

      You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).

      Comment


      • #4
        Originally posted by krobison View Post
        If I were faced with this situation (which I may be very soon) I would write a quick little program to slice the reads up when masked in the middle (I started to write it, then wasn't sure if you wanted FASTA or FASTQ splitting; both quite doable).
        Thank you very much krobison. I was thinking on writing a perl script to do that when found non-masked sequences at both sides of masked one in a read, splitting that read into reads a and b. However, my level of Perl is (below) very beginner , and it will take time, although I think it's a good exercise. I'm working with fastq, what I suppose will make the thing double complicated, but will try.

        Originally posted by krobison View Post
        You might also post this to the MIRA mailing list; Bastien likes knowing how scientists are using MIRA & he sometimes knows some cool tricks that are thinly documented (or not yet released).
        Will do for sure! Thank you for letting me know

        Cheers

        Comment


        • #5
          Hi, we address the same problem in our lab.

          There are two solutions, one if you are doing things by hand, the other if you are going to incorporate this into a production pipeline and use SMRTPipe to handle assembly for you. Both need access to the original bas.h5 file.

          For the first method
          # Create a file of the usable subreads:
          # this program is included in the blasr distribution under pbihdfutils
          pls2fasta m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5 reads.fasta -trimByRegion

          # Next map the reads to the vector using blasr. allowing for multiple hits
          blasr reads.fasta vector.fasta -m 4 -bestn 10 -out alignments.tovector.m4

          # use your own filtering program to cut out the masked sequences
          filterSequences.py reads.fasta alignments.tovector.m4 reads.split.fasta

          ... do the rest of your assembly with the split file.

          If you want to use smrtpipe to do all the heavy lifting, you need to do some modifications to the original bas.h5 file (ideally a copy of it, and more ideally a file that encodes a single dataset from this file).

          So, a little, ok, a long, background about PacBio sequences if you are new to it.

          A pacbio "read" will contain 3 parts: a low quality prefix, a usable middle portion, and a low quality suffix. The reason read is quoted is because the middle portion may contain one or more passes over the sequencing template which are often called subreads so you may consider a read to be the consensus of the subreads, or the subreads themselves. The low quality prefix and suffix exist because there is noise at the beginning and ending of a movie that are recorded and interpreted as bases and called.

          The length of the low quality prefix/suffix are different for every read, and may even be 1000s of bases. However, the basecalling is VERY accurate at detecting just what are the low quality prefix and suffix sequences, and there is an annotation of this described next. And don't worry, the statistics on all the 10kb reads are after subtracting off the prefix and suffix, so the numbers are not inflated.

          Tucked in the bas.h5 file is a dataset called a RegionTable:
          >h5ls -r data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
          / Group
          /PulseData Group
          /PulseData/BaseCalls Group
          /PulseData/BaseCalls/Basecall Dataset {468462122/Inf}
          /PulseData/BaseCalls/DeletionQV Dataset {468462122/Inf}
          ...
          /PulseData/Regions

          This table contains a description of the segments of a read, namely the coordinates of the start and end of the high quality middle, and the coordinates of the adapter and insert for every ZMW in the dataset.

          Once you get over the headache of hdf5 coding, you can do nice things like make a dataset self documenting. In this instance, the description of the columns is part of the dataset:
          ATTRIBUTE "ColumnNames" {
          DATATYPE H5T_STRING {
          STRSIZE H5T_VARIABLE;
          STRPAD H5T_STR_NULLTERM;
          CSET H5T_CSET_ASCII;
          CTYPE H5T_C_S1;
          }
          DATASPACE SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) }
          DATA {
          (0): "HoleNumber", "Region type index", "Region start in bases",
          (3): "Region end in bases", "Region score"
          }
          }

          The region types are enumerated, which is also described in the dataset:

          ATTRIBUTE "RegionTypes" {
          DATATYPE H5T_STRING {
          STRSIZE H5T_VARIABLE;
          STRPAD H5T_STR_NULLTERM;
          CSET H5T_CSET_ASCII;
          CTYPE H5T_C_S1;
          }
          DATASPACE SIMPLE { ( 3 ) / ( H5S_UNLIMITED ) }
          DATA {
          (0): "Adapter", "Insert", "HQRegion"
          }
          }

          Looking at the first few lines of the region table yields this:

          h5dump -d /PulseData/Regions data/Analysis_Results/m130517_113100_42134_c100523452550000001823083309281340_s1_p0.bas.h5
          HDF5 "data/Analysis_Results/m130517_113100_42134_c10052345255000000182308330928
          1340_s1_p0.bas.h5" {
          DATASET "/PulseData/Regions" {
          DATATYPE H5T_STD_I32LE
          DATASPACE SIMPLE { ( 197012, 5 ) / ( H5S_UNLIMITED, 5 ) }
          DATA {
          (0,0): 0, 1, 0, 426, -1,
          (1,0): 0, 2, 0, 0, 0,
          (2,0): 1, 1, 0, 507, -1,
          (3,0): 1, 2, 0, 0, 0,
          (4,0): 2, 1, 0, 523, -1,
          (5,0): 2, 2, 0, 0, 0,
          (6,0): 3, 1, 0, 354, -1,
          (7,0): 3, 2, 0, 0, 0,
          (8,0): 4, 1, 0, 15159, -1,
          (9,0): 4, 1, 15211, 22459, -1,
          (10,0): 4, 0, 15159, 15211, 750,
          (11,0): 4, 2, 17273, 22458, 894,

          You can see reads 0-3 have no usable sequence, which is expected due to the Poisson loading of the smrtcell.

          Read 4 has a high-quality segment from 17273 - 22458. Paradoxically, it is possible for the low quality prefix and annotated insert regions to overlap.


          Ok, now finally to the vector screening. This was mainly set up by our bioinformatics specialist with a little feedback from me. We use blasr to map the original fasta reads to the vector sequence, allowing for multiple hits:
          blasr m130517_113100_42134_c100523452550000001823083309281340_s1_p0.fasta vector.fa -m 4 -bestn 10 -out alignment.m4 -header

          The next step requires some python coding. HDF is extremely painful to use in C/C++, but pretty easy just a bit slow to use in python. If you modify the rows in the RegionTable to reflect the masked off sequence and write this back out, you can feed the modified bas.h5 files into smrtpipe, and assembly away, sans-vector.

          Finally, an even better solution is to write out an entirely new file that contains only the region table m130517_113100_42134_c100523452550000001823083309281340_s1_p0.rgn.h5. Smrtpipe allows one to ignore the region table included in the bas.h5 file, and use the custom region table in this file, for scenarios just like this masking one here.


          This post now sums up more text than I've ever posted on seqanswers since starting. Enjoy.

          -mark

          Comment


          • #6
            @mchaisso, that is an overwhelming amount of information

            Since I don't have access to the original bas.h5 files (the provider gave me already filtered .fastq files), I will go for the first solution as I said just in my previous post: align the BAC vector sequence to the (already corrected) reads using SSAHA2 (although I think I could use blasr as you say) and then use the information from the output to split the reads with an in-house script and proceed with the resulting split-reads file.

            If I get the original bas.h5 files (which I will ask the provider for), I will try also the second option that you propose.

            Thank you very very much!!!

            Comment


            • #7
              I like this way, and if there is a seq contain the vector sequence in the middle,
              it is also can be splited as paired reads for assembly.
              Originally posted by GenoMax View Post
              How about using "crossmatch" from Phred/Phrap to mask the vector. It is part of the Phred/Phrap suite (http://www.phrap.org/).

              Repeat masker may work too http://www.repeatmasker.org/

              You would have to edit the masked sequence out before doing assembly.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              18 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              22 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              17 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Working...
              X