**Brian Bushnell** · 05-06-2015, 04:15 PM

I wrote a program, CrossBlock, to do automatic de-cross-contamination of assemblies from cross-contaminated samples. However, it is designed for the low levels of contamination - generally, 2.5% or lower - that you might get from barcode misassignment or slight impurities in adapter batches, not the 50% level of physically combining two libraries. It should theoretically work with up to around 10% cross-contamination, though.

It's pretty easy to use. First, interleave all of your fastq reads so you have one fastq file per library, and give each file a unique name. Then give each fasta file (one assembly per library) a unique name also. Then make two text files, one named "readlist.txt" and one named "reflist.txt". Each should have one path per line, pointing to a fastq file (in readlist.txt) or a fasta file (in reflist.txt), with the libraries listed in the same order in both files. For example:

readlist.txt:

Code:

x.fq.gz
y.fastq.gz
z.fastq.gz

reflist.txt:

Code:

x.fa
y.fa
z.fa
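If you have many libraries, the two list files can be generated rather than typed by hand. Below is a minimal Python sketch; the helper name `write_list_files` and its arguments are made up for illustration and are not part of CrossBlock or BBTools. It only enforces what is described above: unique file names, and the same library order in both lists.

```python
from pathlib import Path


def write_list_files(pairs, read_list="readlist.txt", ref_list="reflist.txt"):
    """Write readlist.txt and reflist.txt from (fastq, fasta) pairs.

    `pairs` holds one (interleaved fastq path, assembly fasta path) tuple
    per library. Writing both files from the same sequence guarantees
    that line N of each file refers to the same library.
    """
    reads = [str(fastq) for fastq, _ in pairs]
    refs = [str(fasta) for _, fasta in pairs]

    # Each fastq and each fasta is supposed to have a unique name.
    for label, paths in (("fastq", reads), ("fasta", refs)):
        names = [Path(p).name for p in paths]
        if len(set(names)) != len(names):
            raise ValueError(f"duplicate {label} file name in list")

    # One path per line, same order in both files.
    Path(read_list).write_text("\n".join(reads) + "\n")
    Path(ref_list).write_text("\n".join(refs) + "\n")
    return reads, refs
```

Calling `write_list_files([("x.fq.gz", "x.fa"), ("y.fastq.gz", "y.fa"), ("z.fastq.gz", "z.fa")])` would produce the two example files shown above.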

Then run this:

Code:

crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=clean/ log=dclog.txt
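If you want to drive this from a script, a sanity check on the list files before launching can save a failed run. The following is only a sketch: the function names are hypothetical, and the only CrossBlock parameters used are the four from the command above (`readnamefile`, `refnamefile`, `out`, `log`). It assumes `crossblock.sh` is on your PATH.

```python
import subprocess


def count_paths(list_file):
    """Count non-empty lines (paths) in a list file."""
    with open(list_file) as handle:
        return sum(1 for line in handle if line.strip())


def crossblock_command(read_list, ref_list, out_dir="clean/", log="dclog.txt"):
    """Build the crossblock.sh argument list after checking the lists line up."""
    n_reads = count_paths(read_list)
    n_refs = count_paths(ref_list)
    # One fastq per assembly is required, listed in the same order.
    if n_reads != n_refs:
        raise ValueError(
            f"{read_list} has {n_reads} paths but {ref_list} has {n_refs}"
        )
    return [
        "crossblock.sh",
        f"readnamefile={read_list}",
        f"refnamefile={ref_list}",
        f"out={out_dir}",
        f"log={log}",
    ]


def run_crossblock(read_list, ref_list, out_dir="clean/", log="dclog.txt"):
    """Run CrossBlock; raises CalledProcessError on a non-zero exit code."""
    cmd = crossblock_command(read_list, ref_list, out_dir, log)
    subprocess.run(cmd, check=True)
```

Note that this check only compares line counts; it cannot tell whether the order of the two lists actually matches, so that is still up to you.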

The result should be clean assemblies. It won't clean the reads for you; you'd have to subsequently use BBSplit for that, against the clean assemblies. Whether CrossBlock will work for strains or subspecies is hard to say. It should have no trouble with different species, but will start to have problems once the genomes exceed 97% identity to each other.

As for BBSplit, and your first question -

Can you clarify as to whether the assemblies were better before or after splitting and reassembling? It looks like you had quite good assemblies initially, then you ran BBSplit, reassembled, and ended up with highly fragmented assemblies. Is that correct?

So, my recommendation may change in light of your answer, but my current view is that you should use BBSplit just between your different assemblies (ideally, after decontamination). If you have, say, E. coli Strain 1 in your assemblies and you also include the type-genome reference for E. coli Strain 2, even though it is not one of your assemblies, it will force the reads to split between two very similar things and could cause major fragmentation.

Since you do suspect some degree of heterogeneity in the organisms, you may want to run BBSplit with increased sensitivity, using the additional flags "minhits=1 minratio=0.56 maxindel=100", which will give sensitivity similar to BBMap. BBSplit's default sensitivity is low, which can also increase fragmentation in the presence of assembly errors or strain variation.
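As a rough illustration, here is how the higher-sensitivity BBSplit call might be assembled in Python. Only the three sensitivity flags come from this post; the `in=`, `ref=` and `basename=` parameters, the output pattern `split_%.fq`, and the helper name are assumptions added for the example, so check them against the BBSplit help text for your version before relying on them.

```python
def bbsplit_command(reads, assemblies, basename="split_%.fq"):
    """Build a bbsplit.sh argument list with the increased-sensitivity flags.

    `assemblies` should contain only your own (ideally decontaminated)
    assemblies, not extra references for closely related strains.
    """
    if not assemblies:
        raise ValueError("at least one assembly is required")
    return [
        "bbsplit.sh",
        f"in={reads}",
        # Comma-separated list of references to split the reads between.
        "ref=" + ",".join(assemblies),
        # BBSplit replaces % with the name of each reference.
        f"basename={basename}",
        # Sensitivity flags from the post, roughly matching BBMap.
        "minhits=1",
        "minratio=0.56",
        "maxindel=100",
    ]
```

For example, `bbsplit_command("x.fq.gz", ["clean/x.fa", "clean/y.fa"])` returns a list you could pass to `subprocess.run(..., check=True)`; the `clean/` paths here are just placeholders for wherever your cleaned assemblies end up.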

And overall, since you asked, the presence of strain variation is expected to give more fragmented assemblies, not nice big contigs like you got in your initial assemblies.

Genome analysis - reference genomes and co-contamination