Seqanswers Leaderboard Ad

**Brian Bushnell** · 03-17-2014, 03:13 PM

This is the "smartbell" adapter:

ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT

However, it's hard to remove because of the high rate of indels. You should use PacBio's adapter-removal tools from SmartPortal since they are tuned for the error profile.

**coldturkey** · 03-17-2014, 03:33 PM

Thanks Brian,

Can the adapter removal tools be run independently of SMRTportal or are they embedded in a specific module?

regards

Brian

**Brian Bushnell** · 03-17-2014, 03:53 PM

Most of the tools have command line versions, but they have lots of dependencies. Our module system automates the dependency loading, so I don't know, sorry.

If you deal with "filtered subreads" then adapters are already supposed to be removed (assuming that's a universal name and not our local convention).

We did have a case where adapters were not being removed by some version of SMRTportal, though, so I made my own tool to remove them, which you can try out if you like, at least to verify they are not present.

Usage:
java -ea -Xmx1g -cp /path/to/bbmap/current/ pacbio.RemoveAdapters2 in=raw.fq out=masked.fq

This will display statistics about how many adapters were detected, and the output file will have adapter sequences masked with "X" symbols. It can accept fasta or fastq but not H5. It's available here:

BBMap

http://sourceforge.net/projects/bbmap/

Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).

**coldturkey** · 03-17-2014, 04:14 PM

Thanks again Brian. I will give it a try and let you know

regards

Brian

**coldturkey** · 03-17-2014, 05:16 PM

Hi Brian,

Your tool seems to work well. adapter sequences were identified in ~0.3% of the subreads. However, no adapters were identified in the preassembled reads. I assume the adpaters are filtered out during consensus calling at the preassembly stage?

I appreciate the help

**Brian Bushnell** · 03-17-2014, 07:45 PM

Originally posted by coldturkey View Post

Hi Brian,

Your tool seems to work well. adapter sequences were identified in ~0.3% of the subreads. However, no adapters were identified in the preassembled reads. I assume the adpaters are filtered out during consensus calling at the preassembly stage?

Adapters shouldn't show up in assemblies or self-corrected reads since they should always be at a random place in the read. The big problem with them is not that they trickle into assemblies, but that they cause false hairpins in assemblies if they are not removed.

BTW, thanks for trying my program. Bear in mind that it does not actually split reads at adapters, it just marks them with X's where they should be split. Anyway, if you're seeing 0.3% of reads with adapters, then the adapters were already removed (before, they should be tens of percent with adapters). Whether or not the removal was complete enough, I suppose, could be tested by whether you discover artifacts in you assembly, like self-complimentary contigs with only one read spanning the hairpin point.

**rhall** · 03-18-2014, 07:18 AM

If you have SMRTAnalysis installed filter_artifacts.py will find reads with missed / missing adapters. It aligns a read to itself and looks for palindromes (reverse complement blasr alignment > a threshold).
This is not ran in standard filtering as there are real biological sequences that would get filtered out. Standard adapter finding is done at the primary analysis stage and looks for the adapter sequence rather than palindromic subread, so it sometimes misses adapters, it is also possible to get a SMRT bell construct without an adapter on one side (missing adapter, rare).
As stated in the previous post, non of these subreads should get through correction, but it is always possible to find a few examples in the filtered subreads.

**coldturkey** · 03-18-2014, 02:47 PM

Thanks Richard,

Perhaps it is because I have not yet encountered any of the assembly artefacts described below but I am having difficulty visualising them.

Would I be correct in saying that the types of assembly artefact described by Brian will result in contig breaks and what might appear to be an inverted repeat or will they always assembly into separate contigs

Regards

Brian

**rhall** · 03-18-2014, 03:01 PM

Yes, you would likely see a contig ending in an inverted repeat. Anytime you do see this I would be suspicious. If the inverted repeat is in the middle of a contig, with unique sequence on both sides, it is difficult to imagine how that would arise from an aberrant read and would likely be real.

**coldturkey** · 03-18-2014, 03:10 PM

Thanks Richard,

Brian

**Brian Bushnell** · 03-20-2014, 10:24 AM

I was just looking at my source code and I noticed that if you include the flag "split", RemoveAdapters2 will in fact split reads into subreads at the adapter locations rather than just masking them with X, which is much more useful. Sorry for not mentioning that before

**tshea** · 06-22-2015, 01:13 PM

Originally posted by rhall View Post

If you have SMRTAnalysis installed filter_artifacts.py will find reads with missed / missing adapters. It aligns a read to itself and looks for palindromes (reverse complement blasr alignment > a threshold).
This is not ran in standard filtering as there are real biological sequences that would get filtered out. Standard adapter finding is done at the primary analysis stage and looks for the adapter sequence rather than palindromic subread, so it sometimes misses adapters, it is also possible to get a SMRT bell construct without an adapter on one side (missing adapter, rare).
As stated in the previous post, non of these subreads should get through correction, but it is always possible to find a few examples in the filtered subreads.

Thank you for this potentially helpful tool. In the usage for filter_artifacts.py it appears that the two positional arguments needed are pls_fofn and rgn_fofn

I can locate the rgn_fofn - I assume it is in " data/filtered_regions.fofn" (this "data" dir was generated by running HGAP3).

But I cannot locate the "pls_fofn ". Is this output from HGAP3 or does it need to be generated by some other tool?

thank you for your help. I have an assembly with a rather high rate of palindromic subreads causing assembly breaks.

**rhall** · 06-22-2015, 01:30 PM

pls_fofn is just a fofn of the bax.h5 files, in a SMRT pipe job folder the ./input.fofn file.

**tshea** · 06-23-2015, 09:50 AM

Originally posted by rhall View Post

pls_fofn is just a fofn of the bax.h5 files, in a SMRT pipe job folder the ./input.fofn file.

Thank you for the quick reply. I never realized that pls_fofn was equivalent to the input.fofn file.

I was able to run filter_artifacts.py and get the output .rgn.h5 files (though I did not see a log file generated with this command:

filter_artifacts.py bax.fofn data/filtered_regions.fofn --outputDir filtered_regions_dir --outputFofn filtered_regions_out.fofn --debug --logFile filtered_regions_out.log

but that is fine. Now having the flagged artifact palindrome reads (I assume contained in the .rgn.h5 files) my next step is to generate a "whitelist" file which contains all reads that I wish to assemble (HGAP3) except these palindromic artifact reads. Is there a tool (say something comparable to bash5tools.py which reads .bas.h5 files) where I can get read names (or even fasta, from which I can then get read names) from the .rgn.h5 files?

Thank you again for your help. The incidence of these palindrome artifacts reads is high enough that it is generating a very fragmented assembly (with a lot of redundant sequence) where we would expect a single chromosome given the read lengths and known repeat sizes in the genome (the filtered_longreads.fasta.cutoff is longer than the biggest repeat in the genome). We are getting 87 contigs when we would a single contig.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 23 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Identify adapter sequences in pacbio reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News