Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filtering PacBio reads

    Hello

    I am dealing with a rather common problem which gives me real headaches with PacBio data:

    How can you filter your reads prior assembly ?

    Note: I am not talking about subread filtering, quality filtering and so on, but filtering for "contamination".

    E.g. Assembling a genome and filtering known metagenomes, or filtering only mitochondrial reads and assembling its genome from a sequenced organism ,.....

    So, what I am looking for is a possibility to filter the *.h5 files and only use certain ones for the assembly. I guess one would have to parse the files, but with the h5 format that is a little bit beyond my scope.
    For some modules (e.g. Allora) FASTA can be provided, but for other ones not (e.g. HGAP).

    Anybody ideas ?

  • #2
    Have a look here: https://github.com/PacificBioscience...filtering-step. Does that answer your question?

    Comment


    • #3
      On the first glance, it seems exactly what I was looking for - thanks !
      Will test it over night

      Comment


      • #4
        Great, thanks for the link. The tutorial is even working (!).

        Comment


        • #5
          Hello,

          I'm following the whitelist tutorial and have encountered an error where I have to use their python script to generate the whitelist.txt.

          I'm using my own PacBio data as opposed to the data provided, but I don't think that it should make a difference.

          When I run the python script, this is the error I encounter:

          Traceback (most recent call last):
          File "whitelist.py", line 20, in <module>
          hole_num = alignment_id_parts[1]
          IndexError: list index out of range

          Which I believe means that when the list is generated from splitting the first column at the '/', it is not generating any more than one list element.

          Is anybody else encountering this issue?

          Regards,
          Nick

          Comment


          • #6
            Can you post an example of your alignment output?
            Code:
            head <align output file>
            The script isn't very robust, any deviation in the blasr output will cause issues.
            Thanks.

            Comment


            • #7
              I got it working but had to change around my script a bit. I had to change the headers to look like:

              Code:
              qName/qHole/qAlign tName qStrand tStrand score percentSimilarity tStart tEnd tLength qStart qEnd qLength nCells
              m130726_012611_42158_c100524682550000001823080609281382_s1_p0/72/2066_4077 gi|556503834|ref|NC_000913.3| 0 0 -828 91.0891 366290 366482 4641652 2118 2314 4346 4145
              Because it wasn't properly splitting at the space separations.

              I changed the code to just include my one movie file:

              Code:
              import sys
              
              # NOTE: this script assumes a single movie in the blasr alignment
              
              blacklist_alignment = sys.argv[1]
              num_holes = int(sys.argv[2])
              
              blacklist = set()
              movie_prefix = 'm130726_012611_42158_c100524682550000001823080609281382_s1_p0'
              for line in open(blacklist_alignment):
                  if line.startswith('qname'):
                      continue
              
                  cols = line.split()
              
                  alignment_id_parts = cols[0].split('/')
              
                  hole_num = alignment_id_parts[1]
              
                  blacklist.add(hole_num)
              
              for i in range(num_holes):
                  i = str(i)
                  if i not in blacklist:
                      print '/'.join([movie_prefix, i])
              I got my whitelist.txt output to look like such:

              Code:
              m130726_012611_42158_c100524682550000001823080609281382_s1_p0/0
              m130726_012611_42158_c100524682550000001823080609281382_s1_p0/1
              m130726_012611_42158_c100524682550000001823080609281382_s1_p0/2
              m130726_012611_42158_c100524682550000001823080609281382_s1_p0/3
              m130726_012611_42158_c100524682550000001823080609281382_s1_p0/4
              
              ...
              Does including the following parameter in the settings.xml file:

              Code:
               <param name="whiteList" label="Read IDs to whitelist">
                          <value>/home/USERNAME/BACassembly/whitelist.txt</value>
                          </param>
              Automatically let Smrtpipe know that I want to include those reads as on a whitelist?
              In other words, does Smrtpipe have a whitelist function built into it? I would have expected more code to be required.

              Comment


              • #8
                Yes SMRT Pipe should filter out the reads if passed that parameter. Technically an extra filter is passed to the filtering task, you can check workflow/P_Filter/filter_?of?.sh to make sure the ReadWhilelist parameter is present:
                Code:
                filterPlsH5.py --debug --filter='MinReadScore=0.80,MinSRL=500,MinRL=100,ReadWhitelist=/home/USERNAME/BACassembly/whitelist.txt ......'

                Comment


                • #9
                  The whitelisting tutorial online is a good one, but it doesn't specify the format the whitelist would need to be if you wanted to run multiple SMRT cells with their own whitelisted reads.

                  What format would the whitelist file need to be in, in order to run say 8 smrt cells all of which are decontaminated?


                  "...Note whitelisting in this tutorial assumes that you are working on one cell of data at a time (1x bas.h5 file, or 3x bax.h5 files), it could be expanded to run on multiple cells with some changes to the code."

                  Comment


                  • #10
                    The whitelist file is in the same format, e.g.
                    m130726_012611_42158_c100524682550000001823080609281382_s1_p0/0

                    The cell/file name would just be different. The note in the tutorial is because the tutorial whitelists the reads that do not map to ecoli, therefore the script needs to take the 'inverse' of the reads that map to ecoli, this is only really simple for a single cell.
                    To make the script work on multiple cells it would need to be expanded to keep track of all the cells, the alternative is to generate the whitelist for each cell, then join them together to get a whitelist for running a job on all the cells.

                    Comment


                    • #11
                      Awesome, thank you for your reply.

                      So basically a whitelist encompassing more than one cell could look like:

                      <cell_1>_s1_p0/0
                      <cell_1>_s1_p0/2
                      <cell_1>_s1_p0/5
                      ....
                      <cell_1>_s1_p0/81740
                      <cell_2>_s1_p0/1
                      <cell_2>_s1_p0/2
                      <cell_2>_s1_p0/6

                      Such that subsequent whitelists are basically appended to the previous one?

                      Regards,
                      Nick

                      Comment


                      • #12
                        Exactly

                        Comment


                        • #13
                          New link for the whitelist tutorial ?

                          Originally posted by flxlex View Post
                          Have a look here: https://github.com/PacificBioscience...filtering-step. Does that answer your question?
                          Hi all,

                          It seems that the previous link doesn't redirect to the correct page anymore (at least for me).

                          Nonetheless, I found the whitelist tutorial at this location:

                          GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.


                          Can the whitelist procedure be implemented in SMRT Portal ? It is a very standard filtering step which would benefit to be run in a user-friendly way.

                          Seb.

                          Comment


                          • #14
                            Indeed, it can be implemented but it requires some XML coding. If you just follow the tutorial, and don't do the last part of it (the SMRTpipe part), and make your own custom protocol using the whitelist filtering parameter, you can specify the path to your whitelist in smrtportal as a parameter in the filtering step and it will use the whitelist to filter your reads

                            Comment


                            • #15
                              Ok, thanks.

                              But I need to practice more bas.h5 manipulation and protocol creation first, as I'm very new with PacBio data.

                              Actually I don't have my own data yet, so I'm practicing with tutorials data. By the way, do you know where can I find publicly available cosmid sequencing data under the PacBio format ?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X