Announcement

Collapse
No announcement yet.

Identify adapter sequences in pacbio reads

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • rhall
    replied
    The "--outputSummary" option can be used to generate a list of ZMWs with an associated score, and boolean for passing filter.

    Is the data from an amplified sample? Default HGAP parameters should filter out all stochastic artifacts, I would be concerned about simply filtering out systematic artifacts without understanding where they are coming from. They are not normally observed in sheared genomic DNA sequencing.

    Leave a comment:


  • tshea
    replied
    Originally posted by rhall View Post
    pls_fofn is just a fofn of the bax.h5 files, in a SMRT pipe job folder the ./input.fofn file.
    Thank you for the quick reply. I never realized that pls_fofn was equivalent to the input.fofn file.

    I was able to run filter_artifacts.py and get the output .rgn.h5 files (though I did not see a log file generated with this command:

    filter_artifacts.py bax.fofn data/filtered_regions.fofn --outputDir filtered_regions_dir --outputFofn filtered_regions_out.fofn --debug --logFile filtered_regions_out.log

    but that is fine. Now having the flagged artifact palindrome reads (I assume contained in the .rgn.h5 files) my next step is to generate a "whitelist" file which contains all reads that I wish to assemble (HGAP3) except these palindromic artifact reads. Is there a tool (say something comparable to bash5tools.py which reads .bas.h5 files) where I can get read names (or even fasta, from which I can then get read names) from the .rgn.h5 files?

    Thank you again for your help. The incidence of these palindrome artifacts reads is high enough that it is generating a very fragmented assembly (with a lot of redundant sequence) where we would expect a single chromosome given the read lengths and known repeat sizes in the genome (the filtered_longreads.fasta.cutoff is longer than the biggest repeat in the genome). We are getting 87 contigs when we would a single contig.

    Leave a comment:


  • rhall
    replied
    pls_fofn is just a fofn of the bax.h5 files, in a SMRT pipe job folder the ./input.fofn file.

    Leave a comment:


  • tshea
    replied
    Originally posted by rhall View Post
    If you have SMRTAnalysis installed filter_artifacts.py will find reads with missed / missing adapters. It aligns a read to itself and looks for palindromes (reverse complement blasr alignment > a threshold).
    This is not ran in standard filtering as there are real biological sequences that would get filtered out. Standard adapter finding is done at the primary analysis stage and looks for the adapter sequence rather than palindromic subread, so it sometimes misses adapters, it is also possible to get a SMRT bell construct without an adapter on one side (missing adapter, rare).
    As stated in the previous post, non of these subreads should get through correction, but it is always possible to find a few examples in the filtered subreads.
    Thank you for this potentially helpful tool. In the usage for filter_artifacts.py it appears that the two positional arguments needed are pls_fofn and rgn_fofn

    I can locate the rgn_fofn - I assume it is in " data/filtered_regions.fofn" (this "data" dir was generated by running HGAP3).

    But I cannot locate the "pls_fofn ". Is this output from HGAP3 or does it need to be generated by some other tool?

    thank you for your help. I have an assembly with a rather high rate of palindromic subreads causing assembly breaks.

    Leave a comment:


  • Brian Bushnell
    replied
    I was just looking at my source code and I noticed that if you include the flag "split", RemoveAdapters2 will in fact split reads into subreads at the adapter locations rather than just masking them with X, which is much more useful. Sorry for not mentioning that before

    Leave a comment:


  • coldturkey
    replied
    Thanks Richard,

    Brian

    Leave a comment:


  • rhall
    replied
    Yes, you would likely see a contig ending in an inverted repeat. Anytime you do see this I would be suspicious. If the inverted repeat is in the middle of a contig, with unique sequence on both sides, it is difficult to imagine how that would arise from an aberrant read and would likely be real.

    Leave a comment:


  • coldturkey
    replied
    Thanks Richard,

    Perhaps it is because I have not yet encountered any of the assembly artefacts described below but I am having difficulty visualising them.

    Would I be correct in saying that the types of assembly artefact described by Brian will result in contig breaks and what might appear to be an inverted repeat or will they always assembly into separate contigs

    Regards

    Brian

    Leave a comment:


  • rhall
    replied
    If you have SMRTAnalysis installed filter_artifacts.py will find reads with missed / missing adapters. It aligns a read to itself and looks for palindromes (reverse complement blasr alignment > a threshold).
    This is not ran in standard filtering as there are real biological sequences that would get filtered out. Standard adapter finding is done at the primary analysis stage and looks for the adapter sequence rather than palindromic subread, so it sometimes misses adapters, it is also possible to get a SMRT bell construct without an adapter on one side (missing adapter, rare).
    As stated in the previous post, non of these subreads should get through correction, but it is always possible to find a few examples in the filtered subreads.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by coldturkey View Post
    Hi Brian,

    Your tool seems to work well. adapter sequences were identified in ~0.3% of the subreads. However, no adapters were identified in the preassembled reads. I assume the adpaters are filtered out during consensus calling at the preassembly stage?
    Adapters shouldn't show up in assemblies or self-corrected reads since they should always be at a random place in the read. The big problem with them is not that they trickle into assemblies, but that they cause false hairpins in assemblies if they are not removed.

    BTW, thanks for trying my program. Bear in mind that it does not actually split reads at adapters, it just marks them with X's where they should be split. Anyway, if you're seeing 0.3% of reads with adapters, then the adapters were already removed (before, they should be tens of percent with adapters). Whether or not the removal was complete enough, I suppose, could be tested by whether you discover artifacts in you assembly, like self-complimentary contigs with only one read spanning the hairpin point.

    Leave a comment:


  • coldturkey
    replied
    Hi Brian,

    Your tool seems to work well. adapter sequences were identified in ~0.3% of the subreads. However, no adapters were identified in the preassembled reads. I assume the adpaters are filtered out during consensus calling at the preassembly stage?

    I appreciate the help

    Leave a comment:


  • coldturkey
    replied
    Thanks again Brian. I will give it a try and let you know

    regards

    Brian

    Leave a comment:


  • Brian Bushnell
    replied
    Most of the tools have command line versions, but they have lots of dependencies. Our module system automates the dependency loading, so I don't know, sorry.

    If you deal with "filtered subreads" then adapters are already supposed to be removed (assuming that's a universal name and not our local convention).

    We did have a case where adapters were not being removed by some version of SMRTportal, though, so I made my own tool to remove them, which you can try out if you like, at least to verify they are not present.

    Usage:
    java -ea -Xmx1g -cp /path/to/bbmap/current/ pacbio.RemoveAdapters2 in=raw.fq out=masked.fq

    This will display statistics about how many adapters were detected, and the output file will have adapter sequences masked with "X" symbols. It can accept fasta or fastq but not H5. It's available here:
    http://sourceforge.net/projects/bbmap/

    Leave a comment:


  • coldturkey
    replied
    Thanks Brian,

    Can the adapter removal tools be run independently of SMRTportal or are they embedded in a specific module?

    regards

    Brian

    Leave a comment:


  • Brian Bushnell
    replied
    This is the "smartbell" adapter:

    ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT

    However, it's hard to remove because of the high rate of indels. You should use PacBio's adapter-removal tools from SmartPortal since they are tuned for the error profile.

    Leave a comment:

Working...
X