Duplicate reads ("same start" reads) in 454 FLX/Titanium shotgun runs

pmiguel replied

08-28-2009, 06:12 AM
Originally posted by greigite

OK, I understand now- very good point and thanks for elaborating. Is there another way you suggest that might work to differentiate between fragments sharing the same 5' end and PCR duplicates?

If standard shotgun techniques are being used then it should be possible to distinguish between PCR duplication and shearing bias.

If shearing bias is the issue then one expects equal numbers of reads with the shear site against the PA adaptor as against the PB adaptor. This can only be seen in amplicons short enough to be read across into the PB adaptor. But if you are mapping against a reference genome or clustering you would also expect to see an equal number of reads mapping immediately before the shear site (on the opposite strand) as after the "fragile" shear site. If you do not, then you would suspect PCR duplication.

Phillip
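
A minimal sketch of the strand-balance check described above, assuming you have already counted, for one suspected shear site, the reads that start immediately after it on one strand and the reads that start immediately before it on the opposite strand. The exact binomial test and the function names are illustrative choices, not something specified in the post; the post only says the two counts should be roughly equal under shearing bias.

```python
from math import comb

def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5."""
    if n == 0:
        return 1.0  # no reads, no evidence either way
    observed = comb(n, k)
    # Sum the probability of every outcome that is no more likely than the observed one.
    tail = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return tail / 2 ** n

def shear_site_balance(after_site: int, before_site_opposite: int) -> float:
    """p-value for the hypothesis that both sides of the site are hit equally often.

    A small value means the stack is one-sided, which under the reasoning
    above points toward PCR duplication rather than a fragile shear site.
    """
    total = after_site + before_site_opposite
    return two_sided_binomial_p(after_site, total)

# Hypothetical counts: 10 reads stacked after the site, none on the opposite strand.
print(shear_site_balance(10, 0))
```

With few reads at a site the test has little power, so a balanced-looking small stack does not rule out duplication.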
greigite replied

08-20-2009, 07:39 PM
OK, I understand now- very good point and thanks for elaborating. Is there another way you suggest that might work to differentiate between fragments sharing the same 5' end and PCR duplicates? Perhaps there is no way to do it when the shotgun fragments are drawn from a large population of an organism with very little variation, such that even two fragments which happen to share the same 5' end are 100% identical. I guess one has to assume the relative probabilities of PCR duplication versus shearing at the same point, and this probably scales with the coverage (e.g. for lower coverage regions if two reads start at the same point they are more likely to be PCR duplicates).

Originally posted by kmcarr

Yes, you've got it. Your question (at least as I understood it) was whether or not you could differentiate between independent fragments which happen to share the same 5' end and PCR duplicates by the length of the read obtained. No, you can't. Using 454 cycle sequencing, if two reads start at the same position they will end up being exactly the same length as each other, whether they were originally independent fragments or PCR generated duplicates. This statement relies on an assumption of 454 library preparation: that the fragment size of your library is longer than the expected read length. That is the only point I was making about original fragment sizes.
kmcarr replied

08-20-2009, 07:17 PM
Yes, you've got it. Your question (at least as I understood it) was whether or not you could differentiate between independent fragments which happen to share the same 5' end and PCR duplicates by the length of the read obtained. No, you can't. Using 454 cycle sequencing, if two reads start at the same position they will end up being exactly the same length as each other, whether they were originally independent fragments or PCR generated duplicates. This statement relies on an assumption of 454 library preparation: that the fragment size of your library is longer than the expected read length. That is the only point I was making about original fragment sizes.
greigite replied

08-20-2009, 04:41 PM
Originally posted by kmcarr

This rule would only work if you are sequencing the entire fragment; that is to say the sequence is reaching the 3' adapter and thus you can determine the exact size of the original fragment. This should not be the case. The average fragmentation size should be sufficiently large (500 - 800 bp) such that the 200 cycles of sequencing on the FLX Titanium never reach the 3' adapter. Therefore you would have no way of knowing for a given read what its underlying fragment size is.

I think I'm missing something here. If I understood your earlier post correctly (quoted below), shotgun fragments that are randomly sheared at exactly the same location on one end should also have the same sequence length, regardless of the size of the original fragment. Why would it be necessary to know the underlying fragment size?

Originally posted by kmcarr

<snip>
If I have 10 fragments all originating from the 5' end of 10 copies of the same cDNA but all of varying lengths between 500-800 nt, and the 454 adapters are ligated in the same orientation, then I should get exactly the same sequence from all of them. The sequence will start at the 5' end and will stop when the machine has completed its 42 cycles, regardless of how long the inserted fragments are. This is what I mean by "if two reads start.....".
kmcarr replied

08-20-2009, 10:19 AM
Originally posted by greigite

Could we implement a rule along the lines of that if reads start in the same position, but are different in length by more than 1% (could get incorporation errors changing the length of duplicate reads) then they are not duplicates?

This rule would only work if you are sequencing the entire fragment; that is to say the sequence is reaching the 3' adapter and thus you can determine the exact size of the original fragment. This should not be the case. The average fragmentation size should be sufficiently large (500 - 800 bp) such that the 200 cycles of sequencing on the FLX Titanium never reach the 3' adapter. Therefore you would have no way of knowing for a given read what its underlying fragment size is.
greigite replied

08-20-2009, 09:24 AM
Could you give a citation? I'm not familiar with the Turner et al paper you mention.
cgb replied

08-19-2009, 09:17 PM
Might be your library prep. See the Turner et al. paper.
greigite replied

08-19-2009, 03:19 PM
stacking by chance alone?

Thought I'd bring this topic back up again to see if anyone can offer some additional advice. We are seeing this stacking effect in our shotgun library (reads with the same start). However, we have a dominant organism (~75% of the sample), which leads to extremely high read depth in some regions (>700X). Couldn't we get reads starting in the same position by chance alone at such a high depth? Naively, let's look at a 500 bp region with 1000x coverage. Say one new read starts every 5 bp in the region, meaning that there are 100 distinct read start positions. 1000x coverage / 100 read starts = 10x coverage per read start by chance alone. How can this be differentiated from the duplicate read effect generated by emPCR? By read length or identity over the whole read? Could we implement a rule along the lines of that if reads start in the same position, but are different in length by more than 1% (could get incorporation errors changing the length of duplicate reads) then they are not duplicates?
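
The chance expectation in this paragraph can be put on a slightly firmer footing. A rough sketch, assuming fragment starts are uniformly random (no shearing bias), that both strands are sampled equally, and a hypothetical mean read length: the number of reads starting at one position on one strand is then approximately Poisson with mean coverage / (2 × read length). The function names and the 250 nt read length are illustrative assumptions, not values from the thread.

```python
import math

def expected_starts_per_site(coverage: float, read_length: float) -> float:
    """Mean number of reads starting at one position on one strand."""
    return coverage / (2.0 * read_length)

def prob_shared_start(mean_starts: float) -> float:
    """Poisson probability that two or more reads start at the same site."""
    return 1.0 - math.exp(-mean_starts) * (1.0 + mean_starts)

# 1000x coverage as in the example above; 250 nt read length is an assumption.
lam = expected_starts_per_site(coverage=1000, read_length=250)
print(f"mean starts per site and strand: {lam:.2f}")
print(f"P(>=2 reads share a start): {prob_shared_start(lam):.3f}")
```

Under these assumptions, same-start reads are expected by chance alone at this depth, so start position by itself cannot identify emPCR duplicates in very deep regions.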

There's also an interesting twist in some cases. In one instance, a bunch of reads start in the same location with a homopolymer run (say TTTT). Some reads have 3 T's, some have 2 T's, some have 4 T's. Should we interpret this as being sequencing error alone?
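
Homopolymer length miscalls are a well-known pyrosequencing error mode, so one way to probe this question is to compare the stacked reads after collapsing every homopolymer run to a single base. The sketch below is only an illustration with made-up sequences and hypothetical function names; reads that agree after collapsing differ only in run lengths, which is consistent with (but does not prove) sequencing error.

```python
from itertools import groupby

def collapse_homopolymers(seq: str) -> str:
    """Replace each run of identical bases with a single base (TTTTAC -> TAC)."""
    return "".join(base for base, _run in groupby(seq))

def same_after_collapse(read_a: str, read_b: str) -> bool:
    """True if the two reads agree over their shared length once runs are collapsed."""
    a = collapse_homopolymers(read_a)
    b = collapse_homopolymers(read_b)
    shared = min(len(a), len(b))
    return a[:shared] == b[:shared]

# Made-up reads that start at the same site with 4, 3 and 2 T's.
stack = ["TTTTACGGA", "TTTACGGA", "TTACGGA"]
print(all(same_after_collapse(stack[0], r) for r in stack[1:]))
```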
anar replied

04-16-2009, 03:53 PM
Originally posted by jnfass

Note that newbler (gsAssembler) and gsMapper account for this by default; I don't know if they collapse identical reads and then treat them as one read, or if they collapse them to one, but add to the base qualities because of the technical replication, but in any case the code is "aware" of this issue.

Can anyone provide insight on how exactly newbler and gsMapper "account for" stacking reads? Or know where this is documented? This could be crucial for certain sequencing designs/applications...
jnfass replied

04-01-2009, 11:11 AM
I'll chime in to say that I've heard (through a colleague, who heard from someone else, etc.) that this is indeed an artifact of the emulsion PCR, where either (like kmcarr's explanation) droplets contained multiple beads but one piece of DNA, or DNA escapes from droplets during the PCR and colonizes empty beads ... in any case, same read start and stop, and base calls.

Note that newbler (gsAssembler) and gsMapper account for this by default; I don't know if they collapse identical reads and then treat them as one read, or if they collapse them to one, but add to the base qualities because of the technical replication, but in any case the code is "aware" of this issue. Doesn't help if you're not exclusively using the 454 pipeline, though. I've used CD-HIT to cluster near 100% identical reads with 1 or 2 overhanging bases ...
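
For readers outside the 454 pipeline, the idea of collapsing identical reads that differ only by a base or two of overhang can be sketched in a few lines. This is not CD-HIT and not what newbler does internally (the thread does not say how newbler handles it); it is a simple greedy grouping in which a read joins a cluster when it is an exact prefix of the cluster's longest read and is at most `max_overhang` bases shorter. The function name and parameter are hypothetical, and mismatches inside the read are not tolerated.

```python
def collapse_prefix_duplicates(reads, max_overhang=2):
    """Group reads that share a start and differ only by a short 3' overhang.

    Returns a list of clusters; the first read in each cluster is the longest
    and acts as its representative. Quadratic in the number of clusters, so
    only suitable for small read sets.
    """
    clusters = []
    for read in sorted(reads, key=len, reverse=True):
        for cluster in clusters:
            representative = cluster[0]
            close_in_length = len(representative) - len(read) <= max_overhang
            if close_in_length and representative.startswith(read):
                cluster.append(read)
                break
        else:
            # No existing cluster accepted the read, so it starts a new one.
            clusters.append([read])
    return clusters

# Made-up reads: three near-identical copies, one shorter read, one unrelated read.
example = ["ACGTACGT", "ACGTACG", "ACGTAC", "ACGTA", "TTTT"]
for cluster in collapse_prefix_duplicates(example):
    print(cluster)
```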
[c]oma replied

04-01-2009, 06:21 AM
Thank you all for your replies. It is reassuring to see others are seeing the same things we are, so it doesn't seem to be something we are doing wrong. But I still don't really like it...
jpp replied

04-01-2009, 01:09 AM
[c]oma,
We have seen the same problem in our runs, with similar numbers. We think there is an inherent problem related to the emPCR. We asked the provider's technical support and, after many inquiries, they told us the normal range is around 18%. By the way, we have not received any advice on how to reduce it.
behoward replied

03-30-2009, 03:32 PM
thanks, that's very helpful information. I will go back and check if the short pile of reads correlates with the ends of known gene models...
kmcarr replied

03-30-2009, 02:09 PM
Yes, the cDNA is nebulized and the average fragment size should be 500-800 bp. (There is an issue that nebulization won't break dsDNA less than ~700 bp so that is another complication when dealing with cDNA; the shearing is not as "random" as it is with genomic DNA.) There is a size selection to remove fragments < 300 bp so the vast majority of the library sequences should be much longer than the expected 100 nt (for GS 20) sequence length.

If I have 10 fragments all originating from the 5' end of 10 copies of the same cDNA but all of varying lengths between 500-800 nt, and the 454 adapters are ligated in the same orientation, then I should get exactly the same sequence from all of them. The sequence will start at the 5' end and will stop when the machine has completed its 42 cycles, regardless of how long the inserted fragments are. This is what I mean by "if two reads start.....".
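
The point in this paragraph can be illustrated with a toy simulation. It is a deliberately simplified sketch: the reference is random sequence, the read is modelled as the first `read_length` bases of the fragment (real 454 read length in bases depends on the flow order and the sequence, but identical sequence gives identical flows), and all names and values are assumptions made for the example.

```python
import random

def simulate_read(reference: str, start: int, fragment_length: int, read_length: int) -> str:
    """Return the read obtained from a fragment that begins at `start`.

    Sequencing stops after `read_length` bases or at the fragment end,
    whichever comes first.
    """
    fragment = reference[start:start + fragment_length]
    return fragment[:read_length]

rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(1000))

# Ten fragments with the same 5' end but different lengths between 500 and 800 nt.
reads = {simulate_read(reference, 0, rng.randint(500, 800), 100) for _ in range(10)}
print(len(reads))  # number of distinct reads obtained from the ten fragments
```

Because every fragment is longer than the read, the fragment length never influences the read, which is why read length cannot separate independent fragments from PCR duplicates.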

I would have to look at the data to be sure but I think that the bimodal distribution is an artifact of the cDNA preparation method and the read trimming process. Reads originating from either the 5' or 3' ends of cDNAs would include the SMART kit adapters which would then be clipped off by our trimming pipeline. The most commonly trimmed size from the 5' end is ~30nt. Reads not including these adapters would be closer to their full read length (~100nt) after trimming.
behoward replied

03-30-2009, 01:13 PM
thanks!

Hi kmcarr,

Thanks for the response. One thing I don't quite understand is when you say "if two reads start at exactly the same point and there are no missed or extra incorporations then they will end at exactly the same point." For this (RNA-Seq) data set, my understanding is that they did a nebulization step to randomly shear the cDNA. Is it possible that there could be differing length fragments at the same start position for this reason? Or should all the fragments be much larger than 100 or so nucleotides? I didn't see any explicit mention of a size selection step.

Also, I have been looking at the read length distribution and there is one peak at around 70 nt or so, and one peak at around 100 nt. I normalized by transcript so that I only count a random 3 reads per transcript. This way genes like rubisco won't have an unfair contribution. What do you think would cause this bimodal distribution? It's important that I understand this because we are trying to model these distributions in an analysis method we are working on.

Brian
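
The per-transcript normalization described in this post (counting only a random 3 reads per transcript before looking at the length distribution) could be sketched as follows. The input format, function names, and bin width are assumptions for illustration; the post does not describe how the sampling was actually implemented.

```python
import random
from collections import defaultdict

def sample_lengths_per_transcript(reads, per_transcript=3, seed=0):
    """Pick at most `per_transcript` random reads from each transcript.

    `reads` is an iterable of (transcript_id, read_length) pairs.
    Returns the lengths of the sampled reads.
    """
    rng = random.Random(seed)
    by_transcript = defaultdict(list)
    for transcript_id, read_length in reads:
        by_transcript[transcript_id].append(read_length)
    sampled = []
    for lengths in by_transcript.values():
        k = min(per_transcript, len(lengths))
        sampled.extend(rng.sample(lengths, k))
    return sampled

def length_histogram(lengths, bin_width=10):
    """Count read lengths in bins of `bin_width` nt, keyed by the bin's lower edge."""
    counts = defaultdict(int)
    for n in lengths:
        counts[(n // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up data: one heavily sequenced transcript and one rare transcript.
toy_reads = [("rubisco", 100)] * 50 + [("geneA", 70), ("geneA", 72)]
print(length_histogram(sample_lengths_per_transcript(toy_reads)))
```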