How to plot the Saturation curves to assess the sequencing depth?

callumjcparr replied

02-03-2019, 07:58 AM
Originally posted by Brian Bushnell View Post

Hi Tom,

perfect_prob is the average probability of a read being error-free within that interval. It's related to the avg_quality, but calculated independently. Possibly, it would make more sense for me to do this just for the kmer being used to track uniqueness rather than the whole read, but it's easiest this way. The reason I provide it is because low-quality regions in the fastq file will show inflated uniqueness, when uniqueness is tracked using this method.

It looks like you're down to about 70% uniqueness for each individual read, which would be at least ~100x coverage for 150-bp reads... that coverage estimate is weighted by the high-coverage genomes, though.

It's hard to say whether or not to sequence more based on this plot alone. You're obviously still generating more unique reads, but they might simply be giving more coverage to areas you can already assemble well. I think the best course is to assemble and see if you end up with a lot of short, low-coverage contigs (in addition to the high-coverage contigs that you will clearly generate)... in which case you do need to sequence more

would this tool be suitable for understanding saturation for long-reads (i.e. PacBio, ONT)? Or is there smoother other tool.

Now I am subsample various read numbers and mapping them to see tail off in number of unique genes discovered. But of course this takes time, and your Kmer uniqueness looks quick.
Leave a comment:
TomHarrop replied

04-25-2017, 06:41 PM
Thanks for the suggestions.

It's a TruSeq PCR-free library with an insert size around 470 bp according to the BioAnalyser. I did remove adaptors and contaminants with BBDuk2 (adapters.fa and phix174_ill files that ship with bbtools) but that only removed 0.05% of bases, so maybe I should be look again at the adaptor sequences.

It's a diploid insect. I extracted the DNA from a single, whole individual so there may be some [gut] flora in there... or yes, a repetitive genome, but let's hope not.

PacBio is too expensive for this project and we can't get enough DNA, but I'm looking into getting a MinION for gap closing.
Leave a comment:
Brian Bushnell replied

04-25-2017, 06:13 PM
Well, it's a very rough estimate, but...

If 70% of reads are unique, then assuming an even distribution, 30% of the start sites are taken. Meaning there is one read for every 1/0.3 = 3.33 bases. For 150bp reads, that would indicate coverage of 150bp/3.33 = 45x. But since read 1 and read 2 are tracked independently, I doubled it to 90x. Then, since errors artificially inflate uniqueness calculation using this method, and given the % perfect profile, I guessed that maybe I should increase it by ~10%, so I arrived at ~100x coverage, but possibly more if the reads were lower-quality than they seemed based on the mapq.

But, those estimates were based on 150bp reads... for 100bp reads the estimate would have been 66x+, which is not too far off from 55x. I initially thought this was a metagenome because of the sharp decrease in uniqueness at the very beginning of the file, but perhaps you just have a highly repetitive genome, or lots of duplicate reads. Was this library PCR-amplified? And did you trim adapters and remove phiX (if you spiked it in) prior to running the program? Also, is this a Nextera library; or, what method did you use for fragmentation? It's unusual for a PCR-free isolate to have such a sharp decrease in uniqueness at the beginning; that indicates there is some sequence that is extremely abundant in the library. Notably, the drop is not present in the paired uniqueness, which is completely linear. I'm not entirely sure what this means.

At any rate, for an isolate, it looks like you've sequenced enough (for a diploid/haploid). Sometimes you can get a better assembly with more coverage, though, up to around 100x. And you certainly can't beat longer reads!

Last edited by Brian Bushnell; 04-25-2017, 06:39 PM.
Leave a comment:
TomHarrop replied

04-25-2017, 05:50 PM
Great, thanks. How did you estimate the average coverage from the % uniqueness? I know I can do it more accurately with BBNorm (which says read depth median at 55x) but I'm curious how you did it from that plot. These are 100 b reads, but it's about 50 Gb of sequencing from a genome we are expecting to be ~600 Mbp, so you are not far off if coverage were even.
Leave a comment:
Brian Bushnell replied

04-25-2017, 04:24 PM
Hi Tom,

perfect_prob is the average probability of a read being error-free within that interval. It's related to the avg_quality, but calculated independently. Possibly, it would make more sense for me to do this just for the kmer being used to track uniqueness rather than the whole read, but it's easiest this way. The reason I provide it is because low-quality regions in the fastq file will show inflated uniqueness, when uniqueness is tracked using this method.

It looks like you're down to about 70% uniqueness for each individual read, which would be at least ~100x coverage for 150-bp reads... that coverage estimate is weighted by the high-coverage genomes, though.

It's hard to say whether or not to sequence more based on this plot alone. You're obviously still generating more unique reads, but they might simply be giving more coverage to areas you can already assemble well. I think the best course is to assemble and see if you end up with a lot of short, low-coverage contigs (in addition to the high-coverage contigs that you will clearly generate)... in which case you do need to sequence more
Leave a comment:
TomHarrop replied

04-25-2017, 02:59 PM
Hi Brian,

Thanks for this tool.

What is the perfect_prob column in the results? I can't see it in the docs. Is it the "probability of correctness" for k-mers (reads?) in that bin based on avg_quality?

Also, if your percentage uniqueness for read 1 is only approaching 60% after ~250 M reads, would you keep sequencing?

Cheers,

Tom
Attached Files

uniqueness_histogram.pdf (303.7 KB, 68 views)
Leave a comment:
Brian Bushnell replied

01-25-2017, 10:15 AM
We generally use that tool for determining how good a library preparation method was for an isolate of finite size. For a metagenome, by telling you what percent of the reads are unique as you continue to sequence, you can at least get an idea that... for every $1 I spend on additional sequence, $0.99 is spent on things I've already seen. But actually determining the total size of the metagenome from this kind of data is an open research area, and it's not clear to me if the "total size of a metagenome" is meaningful in the wild. So, I think the answer is that it's a little useful, but not a complete answer.
Leave a comment:
boulund replied

01-24-2017, 11:54 PM
Thanks for your really swift reply Brian!

Sorry, I'm not being very clear...
I'm really wondering whether bbcountunique is still useful somehow as a tool for quantifying the saturation of a metagenomic sample.
Leave a comment:
Brian Bushnell replied

01-24-2017, 09:27 AM
It depends on your goals. You can assemble and recover a lot from the higher-depth fraction of most samples. If you can assemble the genes that make up 90% of the DNA by mass in an environment, perhaps that's good enough to determine, for your purposes, what the community looks like and what it does.
Leave a comment:
boulund replied

01-24-2017, 01:57 AM
Originally posted by Brian Bushnell View Post

In some situations, like RNA-seq, single-cell, or metagenomes, in which the sequences have an exponential coverage distribution, you will NEVER reach zero.

But could this approach still be used with e.g. metagenomics data to get some kind of feeling for if the sequencing depth is deep enough? I guess what I'm really asking is whether you think it would be reasonable to still expect it decrease (even if it doesn't reach zero, but instead bottoms out somewhere higher)?
Leave a comment:
Brian Bushnell replied

02-13-2015, 10:28 AM
Originally posted by arash82 View Post

I kind of forgot to mention that I am using it on RNA-seq data from a HiSeq 2500. I currently don't have access to the mapped file, but I'll try it on them as soon as I can.

Just to clarify, CalcUniqueness does not make any use of mapping information, but it's possible to do a similar analysis with mapping information instead of kmers and there are probably programs that do so.

The thing is I am using the program (right now at least) just to determine if I am sequencing deep enough or if I can multiplex further. I don't need a perfect curve, just an estimate. Was thinking maybe to trim and then run, but shouldn't gain much from that...

If you want to get an advantage from trimming, you'd have to do fixed-length trimming on the left (like, removing the first 5 bases). Quality-trimming the right end won't affect the graphs (other than the "rand" column) unless the reads end up shorter than a kmer, and variable-length trimming on the left end would wreck them because the kmers would no longer start in the same place for previously identical reads. Quality-filtering and adapter-trimming might help, though:

bbduk.sh in=reads.fq out=clean.fq maq=15 ktrim=r k=25 mink=11 hdist=1 tpe tbo ref=truseq.fa.gz minlen=40

Here the "maq=15" will throw away reads with average quality below 15 (in other words, an expected error rate of over 1/30 or so), and reads trimmed shorter than 40bp after adapter removal will also be discarded. These may not be optimal settings for actual RNA-seq analysis (since requiring a high average quality can bias quantification), but it should clean up the data a bit to allow generation of more accurate saturation curves.
Leave a comment:
arash82 replied

02-13-2015, 08:26 AM
Dear Brian,

Thanks for the extensive response and clarification on how the program works. Very much appreceated.

I kind of forgot to mention that I am using it on RNA-seq data from a HiSeq 2500. I currently don't have access to the mapped file, but I'll try it on them as soon as I can.

The thing is I am using the program (right now at least) just to determine if I am sequencing deep enough or if I can multiplex furthere. I don't need a perfect curve, just an estimate. Was thinking maybe to trim and then run, but shouldn't gain much from that...

Thanks,
Arash

PS. Was also thinking that the spikes are nice in a way as a quility indiciation. I have instances of much higher spikes in my data.
Leave a comment:
Brian Bushnell replied

02-12-2015, 06:44 PM
Hi Arash,

For each read, the first kmer is created and a kmer from a random location is created. Each of these kmers is looked up in a table to determine if it has been seen before. There is a separate table for first kmers and for random kmers; if you are using paired reads, there are also separate tables for read 1 and read 2. If the kmer has not been seen before, that read is considered "unique" for that metric and the kmer is stored. Otherwise the read is considered "non-unique". Every 25000 reads (by default) a row is printed showing the unique rate. In cumulative mode (which I personally never use!) the numbers in a row apply to all reads (so you can never reach zero!); in noncumulative mode, the number applies to only the last 25000 reads (so you will reach 0% uniqueness as soon as you get a batch of 25000 consecutive reads that have all been seen before).

"First" column is the percent of reads in which the first kmer has never been seen.
"Rand" column is the percent of reads in which a specific randomly-selected kmer has never been seen.
"Pair" column uses a hash of a specific kmer in read 1 and read 2 that has a fixed position, chosen to have a minimal error rate. Meaning that it reflects the number of unique pairs that have been seen before.

I wrote this tool, and I like it, but I designed it largely to other people's specifications so some of the defaults are a bit odd in my opinion, like the "rand" columns - I typically ignore those!

If you run in noncumulative mode, which I recommend, then you will gain no benefit from additional sequencing once the "pair" column approaches zero (for paired reads) or once the "first" column approaches zero (for single-ended reads). With paired reads, "first" will approach zero way before "pair", and once that happens, you are no longer generating unique reads, just reads that you have seen before but with new insert sizes. In general, there is no reason to sequence further once "first" approaches zero in non-cumulative mode!

However, this tool relies on high data quality. If you have low quality data with substitution errors, or very short inserts such that adapters occur in the middle of the reads, the tool will overestimate uniqueness and never reach zero. For example - if 30% of your reads have an error in the first K bases (K is by default 25), then rather than asymptotically approaching 0% uniqueness, it will approach 30% uniqueness, because kmers with errors in them will usually never have been seen before with that specific error. Mapping-based approaches do not have this problem. So, in practice, this program is ideal for high quality data, but mapping is better for low-quality data. All the little spikes in the picture I posted above are due to a bunch of reads that, for whatever reason (like a bubble in the flow cell), had low quality; if the reads were all error-free, the line would be perfectly smooth.

In summary:

1) Don't use cumulative mode for determining how much to sequence; it's only for calculating the total number of unique reads in a dataset.
2) Ignore the rand column.
3) This tool only provides useful information from decent-quality data; for very low quality data (either a high error rate [under Q15], or very short insert sizes) you need to use mapping.
4) You don't need to sequence more once the "first" column approaches zero. How close it approaches depends on your budget and needs; at 50% uniqueness, with even coverage and 100bp reads, you would have a around 100x average coverage.

In some situations, like RNA-seq, single-cell, or metagenomes, in which the sequences have an exponential coverage distribution, you will NEVER reach zero.

-Brian
Leave a comment:
arash82 replied

02-11-2015, 09:36 PM
Originally posted by Brian Bushnell View Post

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.

First, thanks Brian for this tool... I am trying to use it on pilot data to determine how to multiplex my samples.

To my question, I am not entire sure I understand how to interpret the results.

In default mode I get a curve that plateaus around 35% between 30-50M reads. It doesn't seem to move towards zero. I'd like to interpret this that there is no point in sequencing more than 30M reads, but that wouldn't be correct to your statement. It would appear that I keep getting 30% new sequences like forever!?

Could you clarify, or am I doing something wrong? And how should you interpret the cumulative mode?

Thanks,
Arash

PS. I have three columns. Could you also clarify what the rand column means?

Last edited by arash82; 02-11-2015, 09:38 PM. Reason: PS
Leave a comment:
Fernas replied

12-13-2014, 01:23 AM
It works now.
Thank you very much indeed Brian!.
Leave a comment:

Previous 1 2 template Next

Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM
Strategies for Sequencing Challenging Samples

by seqadmin

Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
- Channel: Articles
03-22-2024, 06:39 AM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News