Announcement

Collapse
No announcement yet.

Identify adapter sequences in pacbio reads

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify adapter sequences in pacbio reads

    Hello

    I want to filter my pacbio reads to ensure that all apater sequences have been correctly removed. I wish to avoid the situation where there is a failure to correctly identify the adapter sequence, leading to a read consisting of the insert sequence in forward orientation, then the adapter sequence, then the insert sequence in reverse orientation.

    Is there a generic adapter sequence used construct the templates which I can screen my reads for?

    Regards

    Brian

  • #2
    This is the "smartbell" adapter:

    ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT

    However, it's hard to remove because of the high rate of indels. You should use PacBio's adapter-removal tools from SmartPortal since they are tuned for the error profile.

    Comment


    • #3
      Thanks Brian,

      Can the adapter removal tools be run independently of SMRTportal or are they embedded in a specific module?

      regards

      Brian

      Comment


      • #4
        Most of the tools have command line versions, but they have lots of dependencies. Our module system automates the dependency loading, so I don't know, sorry.

        If you deal with "filtered subreads" then adapters are already supposed to be removed (assuming that's a universal name and not our local convention).

        We did have a case where adapters were not being removed by some version of SMRTportal, though, so I made my own tool to remove them, which you can try out if you like, at least to verify they are not present.

        Usage:
        java -ea -Xmx1g -cp /path/to/bbmap/current/ pacbio.RemoveAdapters2 in=raw.fq out=masked.fq

        This will display statistics about how many adapters were detected, and the output file will have adapter sequences masked with "X" symbols. It can accept fasta or fastq but not H5. It's available here:
        http://sourceforge.net/projects/bbmap/

        Comment


        • #5
          Thanks again Brian. I will give it a try and let you know

          regards

          Brian

          Comment


          • #6
            Hi Brian,

            Your tool seems to work well. adapter sequences were identified in ~0.3% of the subreads. However, no adapters were identified in the preassembled reads. I assume the adpaters are filtered out during consensus calling at the preassembly stage?

            I appreciate the help

            Comment


            • #7
              Originally posted by coldturkey View Post
              Hi Brian,

              Your tool seems to work well. adapter sequences were identified in ~0.3% of the subreads. However, no adapters were identified in the preassembled reads. I assume the adpaters are filtered out during consensus calling at the preassembly stage?
              Adapters shouldn't show up in assemblies or self-corrected reads since they should always be at a random place in the read. The big problem with them is not that they trickle into assemblies, but that they cause false hairpins in assemblies if they are not removed.

              BTW, thanks for trying my program. Bear in mind that it does not actually split reads at adapters, it just marks them with X's where they should be split. Anyway, if you're seeing 0.3% of reads with adapters, then the adapters were already removed (before, they should be tens of percent with adapters). Whether or not the removal was complete enough, I suppose, could be tested by whether you discover artifacts in you assembly, like self-complimentary contigs with only one read spanning the hairpin point.

              Comment


              • #8
                If you have SMRTAnalysis installed filter_artifacts.py will find reads with missed / missing adapters. It aligns a read to itself and looks for palindromes (reverse complement blasr alignment > a threshold).
                This is not ran in standard filtering as there are real biological sequences that would get filtered out. Standard adapter finding is done at the primary analysis stage and looks for the adapter sequence rather than palindromic subread, so it sometimes misses adapters, it is also possible to get a SMRT bell construct without an adapter on one side (missing adapter, rare).
                As stated in the previous post, non of these subreads should get through correction, but it is always possible to find a few examples in the filtered subreads.

                Comment


                • #9
                  Thanks Richard,

                  Perhaps it is because I have not yet encountered any of the assembly artefacts described below but I am having difficulty visualising them.

                  Would I be correct in saying that the types of assembly artefact described by Brian will result in contig breaks and what might appear to be an inverted repeat or will they always assembly into separate contigs

                  Regards

                  Brian

                  Comment


                  • #10
                    Yes, you would likely see a contig ending in an inverted repeat. Anytime you do see this I would be suspicious. If the inverted repeat is in the middle of a contig, with unique sequence on both sides, it is difficult to imagine how that would arise from an aberrant read and would likely be real.

                    Comment


                    • #11
                      Thanks Richard,

                      Brian

                      Comment


                      • #12
                        I was just looking at my source code and I noticed that if you include the flag "split", RemoveAdapters2 will in fact split reads into subreads at the adapter locations rather than just masking them with X, which is much more useful. Sorry for not mentioning that before

                        Comment


                        • #13
                          Originally posted by rhall View Post
                          If you have SMRTAnalysis installed filter_artifacts.py will find reads with missed / missing adapters. It aligns a read to itself and looks for palindromes (reverse complement blasr alignment > a threshold).
                          This is not ran in standard filtering as there are real biological sequences that would get filtered out. Standard adapter finding is done at the primary analysis stage and looks for the adapter sequence rather than palindromic subread, so it sometimes misses adapters, it is also possible to get a SMRT bell construct without an adapter on one side (missing adapter, rare).
                          As stated in the previous post, non of these subreads should get through correction, but it is always possible to find a few examples in the filtered subreads.
                          Thank you for this potentially helpful tool. In the usage for filter_artifacts.py it appears that the two positional arguments needed are pls_fofn and rgn_fofn

                          I can locate the rgn_fofn - I assume it is in " data/filtered_regions.fofn" (this "data" dir was generated by running HGAP3).

                          But I cannot locate the "pls_fofn ". Is this output from HGAP3 or does it need to be generated by some other tool?

                          thank you for your help. I have an assembly with a rather high rate of palindromic subreads causing assembly breaks.

                          Comment


                          • #14
                            pls_fofn is just a fofn of the bax.h5 files, in a SMRT pipe job folder the ./input.fofn file.

                            Comment


                            • #15
                              Originally posted by rhall View Post
                              pls_fofn is just a fofn of the bax.h5 files, in a SMRT pipe job folder the ./input.fofn file.
                              Thank you for the quick reply. I never realized that pls_fofn was equivalent to the input.fofn file.

                              I was able to run filter_artifacts.py and get the output .rgn.h5 files (though I did not see a log file generated with this command:

                              filter_artifacts.py bax.fofn data/filtered_regions.fofn --outputDir filtered_regions_dir --outputFofn filtered_regions_out.fofn --debug --logFile filtered_regions_out.log

                              but that is fine. Now having the flagged artifact palindrome reads (I assume contained in the .rgn.h5 files) my next step is to generate a "whitelist" file which contains all reads that I wish to assemble (HGAP3) except these palindromic artifact reads. Is there a tool (say something comparable to bash5tools.py which reads .bas.h5 files) where I can get read names (or even fasta, from which I can then get read names) from the .rgn.h5 files?

                              Thank you again for your help. The incidence of these palindrome artifacts reads is high enough that it is generating a very fragmented assembly (with a lot of redundant sequence) where we would expect a single chromosome given the read lengths and known repeat sizes in the genome (the filtered_longreads.fasta.cutoff is longer than the biggest repeat in the genome). We are getting 87 contigs when we would a single contig.

                              Comment

                              Working...
                              X