  • Elsie
    replied
    Thank you so much for the ppt - that is very helpful.
    This tool will be incredibly useful in our troubleshooting toolkit - thank you so much not only for making it available, but for putting in the effort to explain it; it is greatly appreciated.



  • Brian Bushnell
    replied
    Hi Elsie,

    I've attached the presentation I used to describe how it works.

    The columns in the results file are like this:

    assembly contig contam length avgFold reads percentCovered avgFold0 reads0 normRatio

    They mean:

    assembly: Which assembly this contig came from.
    contig: The contig name.
    contam: Whether or not it was flagged as a contaminant.
    length: Length of the contig.
    avgFold: Average coverage of the contig by normalized reads.
    reads: How many normalized reads mapped to the contig.
    percentCovered: Percent of the bases that are covered by mapped normalized reads.
    avgFold0: Average coverage by reads prior to normalization.
    reads0: Number of reads mapping to the contig prior to normalization.
    normRatio: Ratio of coverage before and after normalization.

    Contigs that have very low coverage after normalization, but did not have very low coverage prior to normalization, are considered contaminants. However, the program is designed to be very sensitive and to ensure no contaminated contigs slip through, so all contigs shorter than 500bp, or with fewer than a certain number of total reads mapped to them, are also classified as contaminants and removed: once a contig is shorter than ~500bp or has very few mapped reads, the statistical basis of the approach weakens and it is hard to determine confidently whether the contig is a contaminant. So, to absolutely ensure all contaminant contigs are removed, these get removed too. You can override this behavior by adjusting these flags:

    minc=3.5 Min average coverage to retain scaffold.
    minp=20 Min percent coverage to retain scaffold.
    minr=18 Min mapped reads to retain scaffold.
    minl=500 Min length to retain scaffold.
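
    As a rough illustration of how these defaults map onto the results columns (column positions follow the header listed above; whether the file carries a header line, and its exact formatting, are assumptions on my part), the contigs that pass the default filters could be pulled out with something like:

    Code:
    # keep contigs with length >= 500, avgFold >= 3.5, reads >= 18, percentCovered >= 20
    awk 'NR>1 && $4>=500 && $5>=3.5 && $6>=18 && $7>=20' results.txt > retained.txt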

    Specifically, you could set "minl=0 minr=0" to bypass the filters that classify a contig as a contaminant just because it is very short or has very few reads mapped to it, which may be a good idea for transcriptomics. Note that the tool was developed and optimized for genome assemblies. It will work fine on transcriptomes in terms of ensuring there is no cross-contamination, but the defaults should probably be adjusted or you'll lose all your transcripts shorter than 500bp.
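
    For example, a transcriptome-oriented run (reusing the readnamefile/refnamefile list inputs from the Crossblock usage shown further down this thread) might look like:

    Code:
    crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=decontaminated minl=0 minr=0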

    P.S. For detecting cross-contamination at the read level, we don't use Crossblock; rather, we use Seal. First, run "fuse.sh" on each assembly to fuse it into a single contig, and rename that contig based on the library (which simplifies stats reporting). Then concatenate them into a single file (though that's actually optional). For example:

    Code:
    fuse.sh in=assembly1.fa out=stdout.fa | rename.sh in=stdin.fa out=renamed1.fa prefix=assembly1 prefixonly
    ...
    cat renamed*.fa > all.fa
    Then run Seal on each set of reads individually:

    seal.sh in=library1.fq ref=all.fa stats=sealstats1.txt mkf=0.4 ambig=toss

    Then summarize the results:

    summarizeseal.sh sealstats*.txt out=summary.txt

    That gives you the read-level cross-contamination in ppm. It underestimates it because "ambig=toss" ignores the situation where a contig assembled in both the proper library and the contaminating library, but that is the best choice when you may have multiple libraries of the same species. "ambig=all" or "ambig=random", on the other hand, will overestimate cross-contamination in that case, but they are fine if you are sure you don't have any closely related organisms multiplexed together.
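
    Putting that P.S. workflow together for several libraries, a rough loop might look like this (the numeric library naming is just a placeholder; adjust to your own file names):

    Code:
    # Fuse each assembly into one contig and prefix its name with the library
    for i in 1 2 3; do
      fuse.sh in=assembly${i}.fa out=stdout.fa | rename.sh in=stdin.fa out=renamed${i}.fa prefix=assembly${i} prefixonly
    done
    cat renamed*.fa > all.fa
    # Run Seal on each library's reads against the combined reference
    for i in 1 2 3; do
      seal.sh in=library${i}.fq ref=all.fa stats=sealstats${i}.txt mkf=0.4 ambig=toss
    done
    # Summarize the per-library cross-contamination (ppm)
    summarizeseal.sh sealstats*.txt out=summary.txt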
    Last edited by Brian Bushnell; 11-16-2015, 10:18 AM.



  • Elsie
    replied
    Hi Brian,

    Is it possible to obtain some further information on this program, e.g. how exactly it works and what the columns of the results.txt file mean?

    Many thanks.



  • Elsie
    replied
    java -Xmx20m -version
    java version "1.7.0_67"
    Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
    Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)


    No java error and a complete list of files, thanks Brian.



  • Elsie
    replied
    java -Xmx20m -version

    java version "1.6.0_34"
    OpenJDK Runtime Environment (IcedTea6 1.13.6) (rhel-1.13.6.1.el6_6-x86_64)
    OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

    We do have other versions of Java - different programs are pedantic about the version of Java they run. I'll try a different version and get back to you.
    thanks.



  • Brian Bushnell
    replied
    Hi Elsie,

    Thanks for the report! Would you mind running this command:

    java -Xmx20m -version

    ...and posting the result?

    Thanks!

    P.S. I am not currently confident that your results are correct. You may need to rerun with a different version of Java; that error really should not have happened and I'm not sure what the effects would be.
    Last edited by Brian Bushnell; 11-12-2015, 04:00 PM.



  • Elsie
    replied
    Works a treat.
    All my transcripts still remain, which validates my hypothesis that there is no (or no substantial) cross-talk.
    Got one Java error running Crossblock:

    Rename/Merge Phase Start
    Input is being processed as paired
    Writing interleaved.
    Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/ProcessBuilder$Redirect
    at fileIO.ReadWrite.getOutputStreamFromProcess(ReadWrite.java:619)
    at fileIO.ReadWrite.getPigzStream(ReadWrite.java:550)
    at fileIO.ReadWrite.getGZipOutputStream(ReadWrite.java:521)
    at fileIO.ReadWrite.getOutputStream(ReadWrite.java:382)
    at fileIO.ReadWrite.getOutputStream(ReadWrite.java:346)
    at stream.ReadStreamWriter.<init>(ReadStreamWriter.java:69)
    at stream.ReadStreamByteWriter.<init>(ReadStreamByteWriter.java:17)
    at stream.ConcurrentGenericReadOutputStream.<init>(ConcurrentGenericReadOutputStream.java:39)
    at stream.ConcurrentReadOutputStream.getStream(ConcurrentReadOutputStream.java:53)
    at stream.ConcurrentReadOutputStream.getStream(ConcurrentReadOutputStream.java:27)
    at jgi.DecontaminateByNormalization.renameAndMerge(DecontaminateByNormalization.java:298)
    at jgi.DecontaminateByNormalization.process(DecontaminateByNormalization.java:234)
    at jgi.DecontaminateByNormalization.main(DecontaminateByNormalization.java:49)
    Caused by: java.lang.ClassNotFoundException: java.lang.ProcessBuilder$Redirect
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    ... 13 more


    But results were still written to the decontaminated folder.
    Thanks Brian, this is excellent.



  • Elsie
    replied
    Brian you are a genius.
    I'll give this a go and let you know what the results look like, thank you.



  • Brian Bushnell
    replied
    So, I have a different program for this, called CrossBlock; it's specifically for cross-contamination between multiple organisms (and I specifically designed it for multiplexed single cell bacterial libraries). The usage is like this:

    First, preprocess (adapter-trim, etc.) all libraries, and if they are paired in two files, interleave them so you have exactly one file of reads per library (we'll call this "preprocessed"). Then assemble the preprocessed reads into one assembly per library. For single cells, I recommend SPAdes, though if it fails in some cases you can use Tadpole for those - you need an assembly for every library, but it does not have to be optimal.
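
    As an aside, if the pairs are in two files, the interleaving step can be done with BBTools' reformat.sh (file names here are placeholders):

    Code:
    reformat.sh in1=lib1_R1.fq.gz in2=lib1_R2.fq.gz out=lib1.fq.gz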

    Now you have one assembly fasta file and one reads fastq file (preprocessed) per library.

    Prepare two text files, one with the paths to the read files (one per library), and one with the paths to the assemblies, in the same order, one line per file. For example:

    readlist.txt:
    Code:
    lib1.fq.gz
    lib2.fq.gz
    lib3.fq.gz
    reflist.txt:
    Code:
    lib1.fasta
    lib2.fasta
    lib3.fasta
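    With many libraries, both list files can also be generated with a quick loop instead of by hand (the numeric naming is a placeholder); the important part is that the two files list the libraries in the same order:

    Code:
    for i in 1 2 3; do
      echo lib${i}.fq.gz >> readlist.txt
      echo lib${i}.fasta >> reflist.txt
    done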
    Now run CrossBlock:
    crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=decontaminated

    The end result will be clean contig files for each library, without cross-contaminant contigs. If you want clean reads, you can then map the reads to the clean contigs and keep only the ones that map, then optionally reassemble to get your final assemblies.
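
    One way to do that optional read-level cleanup is to map each library back to its cleaned contigs with BBMap and keep only the mapped reads; the clean-contig file name below is a placeholder for whatever Crossblock writes into the output directory:

    Code:
    # keep only reads that map to the decontaminated contigs (file names assumed)
    bbmap.sh in=lib1.fq.gz ref=lib1_clean.fasta outm=lib1_mapped.fq.gz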



  • Elsie
    replied
    Ah, no reference genomes unfortunately - part of the problem.



  • GenoMax
    replied
    If you have reference genomes available, you could split the reads into organism-specific files using BBSplit.
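
    For reference, a minimal BBSplit invocation might look like the sketch below (reference and read file names are placeholders); the % in basename is replaced with each reference's name, giving one output file per organism:

    Code:
    bbsplit.sh in=reads.fq ref=organismA.fa,organismB.fa basename=out_%.fq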



  • Elsie
    replied
    Multiplexing ~20 samples of different, complex organisms on the MiSeq. No red flags, but it would be nice to know if there are any good methods for eliminating/investigating possible cross-talk.
    thanks.



  • Brian Bushnell
    replied
    Hi Elsie,

    This will only remove PCR duplicates between samples, not all crosstalk. Can you describe your experimental setup? Are you multiplexing different organisms, or multiple individuals of the same organism, or...?



  • Elsie
    replied
    Thanks Brian, GenoMax and sarvidsson for the helpful comments. Do you feel that this methodology would remove all possible cross-talk?
    thanks.



  • Brian Bushnell
    replied
    Originally posted by GenoMax View Post
    See this: http://seqanswers.com/forums/showthread.php?t=39270

    Post #3.

    I think you should be able to provide multiple files as input (separate them by a comma). Brian can confirm.
    Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.

    However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this:

    1) Run Dedupe on each sample individually, so that there is at most one copy of each read per sample.
    2) Run Dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so those are all that will be removed (see the sketch below).
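
    A sketch of those two passes for a pair of samples (file names are placeholders; note that the second pass pools everything into a single output file, so you would need to re-split by sample afterwards if per-sample files are needed):

    Code:
    # Pass 1: remove within-sample duplicates, one run per sample
    dedupe.sh in=sample1.fq out=sample1_dd.fq
    dedupe.sh in=sample2.fq out=sample2_dd.fq
    # Pass 2: across samples, drop every copy of any read still duplicated
    dedupe.sh in=sample1_dd.fq,sample2_dd.fq out=combined_unique.fq uniqueonly=t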

