**Brian Bushnell** · 12-10-2014, 03:14 PM

I have a program for plotting library uniqueness as you go through the reads. The graphs look like the attached uniqueness.png.

It works by pulling kmers from each input read, testing whether each kmer has been seen before, and then storing it in a table.

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.
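
A minimal Python sketch of the idea described above, not the actual CalcUniqueness code. It assumes paired reads given as plain strings, uses the first kmer of read 1 for "first", and uses the first kmers of both reads as the "pair" key. The real tool picks a fixed position chosen for a low error rate, and the function name and defaults here are only illustrative.

```python
def uniqueness_curve(pairs, k=25, interval=25000, cumulative=False):
    """Yield (reads_seen, pct_first_unique, pct_pair_unique) once per interval.

    pairs: iterable of (read1, read2) sequence strings.
    Simplifications (assumptions, not the tool's behavior): no reverse-complement
    handling, no quality filtering, pair key built from the first kmer of each read.
    """
    seen_first = set()   # table of first kmers already observed
    seen_pair = set()    # table of combined read1/read2 kmers already observed
    reads = 0            # reads processed so far
    window = 0           # reads counted in the current reporting window
    new_first = 0
    new_pair = 0
    for r1, r2 in pairs:
        if len(r1) < k or len(r2) < k:
            continue  # too short to hold a full kmer
        reads += 1
        window += 1
        first = r1[:k]
        if first not in seen_first:
            seen_first.add(first)
            new_first += 1
        pair = (r1[:k], r2[:k])
        if pair not in seen_pair:
            seen_pair.add(pair)
            new_pair += 1
        if reads % interval == 0:
            yield reads, 100.0 * new_first / window, 100.0 * new_pair / window
            if not cumulative:
                # Noncumulative mode: each point describes only the last interval.
                window = new_first = new_pair = 0


# Tiny demonstration with 4-base "reads" and a point every 2 reads.
demo = [("AAAA", "CCCC"), ("AAAA", "GGGG"), ("AAAA", "CCCC"), ("TTTT", "CCCC")]
noncumulative = list(uniqueness_curve(demo, k=4, interval=2))
cumulative = list(uniqueness_curve(demo, k=4, interval=2, cumulative=True))
```

In the demo, the second noncumulative point reports 50% for both metrics because one of the last two reads was new, while the cumulative "pair" value stays higher (75%) because it averages over all four reads.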

Sample command line:

bbcountunique.sh in=reads.fq out=histogram.txt

Note that the lines are not perfectly smooth; the little peaks are caused by high-error tiles. But it's still useful in that it allows assessment of a library that lacks a reference.

Attached Files

uniqueness.png (31.6 KB, 1042 views)

**luc** · 12-10-2014, 05:55 PM

Thanks Brian! Once more!

**Brian Bushnell** · 12-10-2014, 08:10 PM

> Originally posted by luc:
>
> Thanks Brian! Once more!

You're welcome

**Fernas** · 12-12-2014, 07:40 AM

Thanks very much Brian.

I downloaded BBMap, and when I tried to run bbcountunique.sh I got the following error: Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/CalcUniqueness

I downloaded a new version of the JRE, but I am wondering how to use it with the code. Do I need to modify the code or the PATH variable? Do I need to download another file?

**Brian Bushnell** · 12-12-2014, 10:47 AM

> Originally posted by Fernas:
>
> Thanks very much Brian.
>
> I downloaded BBMap, and when I tried to run bbcountunique.sh I got the following error: Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/CalcUniqueness
>
> I downloaded a new version of the JRE, but I am wondering how to use it with the code. Do I need to modify the code or the PATH variable? Do I need to download another file?

Assuming you downloaded and installed a Java 7 JRE, the program should just work. If you are still getting the same error, then the JRE was not installed correctly; you can resolve that by editing your PATH variable to remove the path to the old java executable (on our system, it is "/usr/common/usg/languages/java/jdk/oracle/1.7.0_51_x86_64/bin/") and put in the path to the new java executable. Alternatively, you can edit the shell script "bbcountunique.sh". Line 80, 5 lines from the bottom, is:

local CMD="java $EA $z -cp $CP jgi.CalcUniqueness $@"

If your new java executable is at, say, "/usr/jdk/oracle/1.7.0_51_x86_64/bin/java", then you would just change that line to:

local CMD="/usr/jdk/oracle/1.7.0_51_x86_64/bin/java $EA $z -cp $CP jgi.CalcUniqueness $@"

If you can't figure out how to install the new jre or don't have the proper permissions, then run "java -version" and copy and paste the output into this thread.
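
For anyone who wants to check this automatically: an UnsupportedClassVersionError generally means the classes were compiled for a newer Java than the one actually being run. The sketch below is a hypothetical helper, not part of BBMap. It pulls the major version out of the first line printed by "java -version", assuming the usual `version "..."` format.

```python
import re


def java_major_version(version_line):
    """Return the major Java version from a 'java -version' line.

    Handles both the old scheme ("1.7.0_51" means Java 7) and the
    newer scheme ("11.0.2" means Java 11).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if match is None:
        raise ValueError("unrecognized version line: %r" % version_line)
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))  # old "1.x" numbering
    return first


# Note: "java -version" writes to stderr, so when capturing it, e.g.
#   subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
# read stderr rather than stdout.
old = java_major_version('java version "1.6.0_45"')
needed = java_major_version('java version "1.7.0_51"')
```

If the value reported for the java on your PATH is below 7, the old executable is still the one being picked up.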

**Fernas** · 12-13-2014, 01:23 AM

It works now.
Thank you very much indeed, Brian!

**arash82** · 02-11-2015, 09:36 PM

> Originally posted by Brian Bushnell:
>
> In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.
>
> In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.

First, thanks Brian for this tool... I am trying to use it on pilot data to determine how to multiplex my samples.

To my question: I am not entirely sure I understand how to interpret the results.

In default mode I get a curve that plateaus around 35% between 30-50M reads. It doesn't seem to move towards zero. I'd like to interpret this as meaning there is no point in sequencing more than 30M reads, but that wouldn't be consistent with your statement. It would appear that I keep getting 30% new sequences forever!?

Could you clarify, or am I doing something wrong? And how should you interpret the cumulative mode?

Thanks,
Arash

PS. I have three columns. Could you also clarify what the rand column means?

**Brian Bushnell** · 02-12-2015, 06:44 PM

Hi Arash,

For each read, the first kmer is created and a kmer from a random location is created. Each of these kmers is looked up in a table to determine if it has been seen before. There is a separate table for first kmers and for random kmers; if you are using paired reads, there are also separate tables for read 1 and read 2. If the kmer has not been seen before, that read is considered "unique" for that metric and the kmer is stored. Otherwise the read is considered "non-unique". Every 25000 reads (by default) a row is printed showing the unique rate. In cumulative mode (which I personally never use!) the numbers in a row apply to all reads (so you can never reach zero!); in noncumulative mode, the number applies to only the last 25000 reads (so you will reach 0% uniqueness as soon as you get a batch of 25000 consecutive reads that have all been seen before).

"First" column is the percent of reads in which the first kmer has never been seen.
"Rand" column is the percent of reads in which a specific randomly-selected kmer has never been seen.
"Pair" column uses a hash of a specific kmer in read 1 and read 2 that has a fixed position, chosen to have a minimal error rate. In other words, it is the percent of read pairs that have never been seen.

I wrote this tool, and I like it, but I designed it largely to other people's specifications so some of the defaults are a bit odd in my opinion, like the "rand" columns - I typically ignore those!

If you run in noncumulative mode, which I recommend, then you will gain no benefit from additional sequencing once the "pair" column approaches zero (for paired reads) or once the "first" column approaches zero (for single-ended reads). With paired reads, "first" will approach zero way before "pair", and once that happens, you are no longer generating unique reads, just reads that you have seen before but with new insert sizes. In general, there is no reason to sequence further once "first" approaches zero in non-cumulative mode!
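
To turn that rule of thumb into a number, here is a rough sketch that scans the noncumulative output for the first row where a chosen column drops below a threshold. The file layout is an assumption on my part: tab-delimited, a header line starting with "#" that names the columns, and the read count in the first column. The function name, the 5% threshold, and the sample rows are all made up for illustration.

```python
def first_point_below(lines, column="first", threshold=5.0):
    """Return the read count of the first row where `column` < threshold.

    lines: iterable of text lines from the histogram output (assumed layout:
    tab-delimited, '#'-prefixed header naming the columns, read count first).
    Returns None if the column never drops below the threshold.
    """
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Remember the column names from the header line.
            header = [name.strip().lower() for name in line.lstrip("#").split("\t")]
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        fields = line.split("\t")
        value = float(fields[header.index(column)])
        if value < threshold:
            return int(fields[0])
    return None


# Hypothetical rows, only to show the call; real values will differ.
sample = [
    "#count\tfirst\trand\tpair",
    "25000\t98.1\t97.9\t99.9",
    "50000\t40.0\t41.2\t80.5",
    "75000\t3.2\t3.5\t22.0",
]
stop_first = first_point_below(sample, column="first")
stop_pair = first_point_below(sample, column="pair")
```

With these made-up rows, "first" crosses the 5% threshold at 75000 reads while "pair" never does, which matches the point that "first" approaches zero well before "pair".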

However, this tool relies on high data quality. If you have low quality data with substitution errors, or very short inserts such that adapters occur in the middle of the reads, the tool will overestimate uniqueness and never reach zero. For example - if 30% of your reads have an error in the first K bases (K is by default 25), then rather than asymptotically approaching 0% uniqueness, it will approach 30% uniqueness, because kmers with errors in them will usually never have been seen before with that specific error. Mapping-based approaches do not have this problem. So, in practice, this program is ideal for high quality data, but mapping is better for low-quality data. All the little spikes in the picture I posted above are due to a bunch of reads that, for whatever reason (like a bubble in the flow cell), had low quality; if the reads were all error-free, the line would be perfectly smooth.
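
As a back-of-the-envelope check on that plateau, here is a small sketch. It assumes errors are independent and equally likely at every base, which is a simplification since real error rates vary along the read. Under that assumption the fraction of reads with at least one error in the first K bases, and therefore the approximate floor of the "first" curve, is 1 - (1 - e)^K. The function names are illustrative.

```python
def uniqueness_floor(per_base_error, k=25):
    """Approximate fraction of reads with >= 1 error in their first k bases.

    Assumes independent, uniform per-base errors (a simplification).
    """
    return 1.0 - (1.0 - per_base_error) ** k


def per_base_error_for_floor(floor, k=25):
    """Inverse of uniqueness_floor: per-base error rate giving that floor."""
    return 1.0 - (1.0 - floor) ** (1.0 / k)


floor_at_1pct = uniqueness_floor(0.01)          # roughly 0.22 for k=25
error_for_30pct = per_base_error_for_floor(0.30)
```

So even a 1% per-base error rate would keep the curve from dropping much below about 22% with K=25, under these assumptions.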

In summary:

1) Don't use cumulative mode for determining how much to sequence; it's only for calculating the total number of unique reads in a dataset.
2) Ignore the rand column.
3) This tool only provides useful information from decent-quality data; for very low quality data (either a high error rate [under Q15], or very short insert sizes) you need to use mapping.
4) You don't need to sequence more once the "first" column approaches zero. How close it needs to get depends on your budget and needs; at 50% uniqueness, with even coverage and 100bp reads, you would have around 100x average coverage.
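
One way to see where a figure like "around 100x" can come from is sketched below. It assumes read starts land uniformly at random on S distinct start sites per genome base, so that after N reads on a genome of size G the chance that the next read's first kmer is new is about exp(-N / (S·G)). Solving for coverage gives S · L · ln(1/u) for read length L and uniqueness u. Whether S is 1 or 2 depends on how the two strands are counted, which is not stated in this thread, so both are shown.

```python
import math


def coverage_at_uniqueness(unique_fraction, read_length=100, start_sites_per_base=1.0):
    """Estimate average coverage when the 'first' uniqueness equals unique_fraction.

    Model (an assumption, not taken from the tool): uniform random read starts,
    so uniqueness u = exp(-N / (S * G)) and coverage = N * L / G = S * L * ln(1/u).
    """
    reads_per_base = start_sites_per_base * math.log(1.0 / unique_fraction)
    return reads_per_base * read_length


low = coverage_at_uniqueness(0.5, read_length=100, start_sites_per_base=1.0)
high = coverage_at_uniqueness(0.5, read_length=100, start_sites_per_base=2.0)
```

At 50% uniqueness with 100bp reads this gives roughly 69x to 139x, which brackets the figure quoted above.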

In some situations, like RNA-seq, single-cell, or metagenomes, in which the sequences have an exponential coverage distribution, you will NEVER reach zero.

-Brian

**arash82** · 02-13-2015, 08:26 AM

Dear Brian,

Thanks for the extensive response and clarification on how the program works. Very much appreciated.

I kind of forgot to mention that I am using it on RNA-seq data from a HiSeq 2500. I currently don't have access to the mapped file, but I'll try it on them as soon as I can.

The thing is I am using the program (right now at least) just to determine if I am sequencing deep enough or if I can multiplex further. I don't need a perfect curve, just an estimate. Was thinking maybe to trim and then run, but shouldn't gain much from that...

Thanks,
Arash

PS. Was also thinking that the spikes are nice in a way as a quality indication. I have instances of much higher spikes in my data.

**Brian Bushnell** · 02-13-2015, 10:28 AM

> Originally posted by arash82:
>
> I kind of forgot to mention that I am using it on RNA-seq data from a HiSeq 2500. I currently don't have access to the mapped file, but I'll try it on them as soon as I can.

Just to clarify, CalcUniqueness does not make any use of mapping information, but it's possible to do a similar analysis with mapping information instead of kmers and there are probably programs that do so.

> The thing is I am using the program (right now at least) just to determine if I am sequencing deep enough or if I can multiplex further. I don't need a perfect curve, just an estimate. Was thinking maybe to trim and then run, but shouldn't gain much from that...

If you want to get an advantage from trimming, you'd have to do fixed-length trimming on the left (like, removing the first 5 bases). Quality-trimming the right end won't affect the graphs (other than the "rand" column) unless the reads end up shorter than a kmer, and variable-length trimming on the left end would wreck them because the kmers would no longer start in the same place for previously identical reads. Quality-filtering and adapter-trimming might help, though:

bbduk.sh in=reads.fq out=clean.fq maq=15 ktrim=r k=25 mink=11 hdist=1 tpe tbo ref=truseq.fa.gz minlen=40

Here the "maq=15" will throw away reads with average quality below 15 (in other words, an expected error rate of over 1/30 or so), and reads trimmed shorter than 40bp after adapter removal will also be discarded. These may not be optimal settings for actual RNA-seq analysis (since requiring a high average quality can bias quantification), but it should clean up the data a bit to allow generation of more accurate saturation curves.
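
For reference, the "1/30 or so" figure follows from the standard Phred definition, Q = -10·log10(p). A small sketch of the conversion (the function names are just illustrative):

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10.0 * math.log10(p)


q15_error = phred_to_error(15)        # about 0.0316
q15_one_in = 1.0 / q15_error          # about 1 error in 31.6 bases
```

So Q15 corresponds to about one error per 32 bases, and an error rate of exactly 1/30 is a little under Q15.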

**boulund** · 01-24-2017, 01:57 AM

> Originally posted by Brian Bushnell:
>
> In some situations, like RNA-seq, single-cell, or metagenomes, in which the sequences have an exponential coverage distribution, you will NEVER reach zero.

But could this approach still be used with e.g. metagenomics data to get some kind of feeling for whether the sequencing depth is deep enough? I guess what I'm really asking is whether you think it would be reasonable to still expect it to decrease (even if it doesn't reach zero, but instead bottoms out somewhere higher)?

**Brian Bushnell** · 01-24-2017, 09:27 AM

It depends on your goals. You can assemble and recover a lot from the higher-depth fraction of most samples. If you can assemble the genes that make up 90% of the DNA by mass in an environment, perhaps that's good enough to determine, for your purposes, what the community looks like and what it does.

**boulund** · 01-24-2017, 11:54 PM

Thanks for your really swift reply Brian!

Sorry, I'm not being very clear...
I'm really wondering whether bbcountunique is still useful somehow as a tool for quantifying the saturation of a metagenomic sample.

**Brian Bushnell** · 01-25-2017, 10:15 AM

We generally use that tool for determining how good a library preparation method was for an isolate of finite size. For a metagenome, by telling you what percent of the reads are unique as you continue to sequence, you can at least get an idea that... for every $1 I spend on additional sequence, $0.99 is spent on things I've already seen. But actually determining the total size of the metagenome from this kind of data is an open research area, and it's not clear to me if the "total size of a metagenome" is meaningful in the wild. So, I think the answer is that it's a little useful, but not a complete answer.

How to plot the Saturation curves to assess the sequencing depth?
