Hi guys, I have a question about the source of the multiple identical reads that are generated during Solexa sequencing. In our current runs we get around 12 million clusters, which are then filtered (apparently by a read "purity" threshold as well as by their alignment to the corresponding genome, but I'm not 100% sure about this) down to around 6 million reads aligning to unique sites in the studied genome. A further filtering step then removes reads containing more than 2 mismatches, as well as multiple identical reads. When we look at the fraction of these (for me unexpected) multiple identical reads, we find that they are more frequent than the reads with mismatches, yet I don't understand where they come from. Since the fragmentation process for ChIP assays is completely random, it seems quite unlikely to me to get fragments with the same tips (I mean the DNA ends that are sequenced).
Have you seen a similar problem, and do you know the source of these multiple identical reads? Furthermore, we noticed by accident that if the initial number of clusters is lower (around 7 million), the fraction of multiple identical reads drops significantly, although for the moment we don't know whether this is pure coincidence.
Thanks for your hints
-
Interesting question. In ChIP-seq, we often see "odd" stuff, including biases toward clearly unexpected regions of the genome. That often means large "peaks" in centromeres, or just large stacks of duplicates.
However, while we don't know the sources of all of this "odd" stuff, we can account for most of it with good controls. (I doubt that the fragmentation is completely random, though, regardless of which method you use...)
If you're looking for other sources: many groups do a PCR step on their DNA before sequencing, which might preferentially amplify some fragments. And of course, you are isolating DNA from a large population of cells, so it's possible that you're just getting a lot of pulled-down material from a whole collection of cells where that signal is strong.
Anyhow, how your pipeline handles the reads also makes a difference. You don't specify the aligner or the filtering techniques being used, which makes it really hard to get to the bottom of what you're seeing.
Good luck making sense of your data!
The more you know, the more you know you don't know. -Aristotle
-
It seems we have a similar problem.
I have several identical reads, and of course they map to the same position.
When I analyze 454 data, I keep one copy and remove the others, because the duplication is likely caused by some technical problem.
But for Solexa data, I don't know of any reason that would justify removing them.
-
During library construction (454, Illumina, etc.) almost all protocols have a PCR amplification stage, if only to get enough material to sequence. Unless you are expecting them, I would remove any exactly identical sequence reads if they are going to affect downstream analysis. Removing reads may sound like a bad thing, but we have found that the bias caused by keeping replicated reads can be huge (and muddies an already muddy pool!). So although it is conservative and may remove useful data, without any way to prove the reads come from independent sources I would always remove them. You might consider barcoding your library when you amplify (easy to do); at least that way, any identical but independently produced sequences will be separable.
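As a rough illustration of the removal step, and of what barcoding buys you, here is a minimal Python sketch. The `Read` record, its fields, and the example reads are all made up for illustration; a real pipeline would take this information from its aligner output, and this is not meant as the filtering any particular tool performs.

```python
from collections import namedtuple

# Hypothetical minimal read record (assumption: reads are already aligned).
Read = namedtuple("Read", ["chrom", "start", "strand", "seq", "barcode"])

def collapse_duplicates(reads, use_barcode=False):
    """Keep the first read seen for each (position, strand, sequence) key.

    With use_barcode=True the barcode becomes part of the key, so identical
    sequences carrying different barcodes are kept as separate reads.
    """
    seen = set()
    kept = []
    for r in reads:
        key = (r.chrom, r.start, r.strand, r.seq)
        if use_barcode:
            key += (r.barcode,)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

reads = [
    Read("chr1", 100, "+", "ACGTACGT", "AAA"),
    Read("chr1", 100, "+", "ACGTACGT", "AAA"),  # exact copy, same barcode
    Read("chr1", 100, "+", "ACGTACGT", "CCG"),  # same sequence, other barcode
    Read("chr2", 500, "-", "TTGACCTA", "AAA"),
]

strict = collapse_duplicates(reads)                     # ignores barcodes
barcoded = collapse_duplicates(reads, use_barcode=True) # barcode-aware
print(len(strict), len(barcoded))
```

Without barcodes the third read is thrown away together with the true PCR copy; with barcodes it survives, which is the point of barcoding before amplification.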
Sorry for the long post...
-
ieuanclay is correct. The duplication is caused by the library prep steps. We've found that lowering the number of PCR cycles, or doing a two-stage PCR instead, gives less duplication. So basically you get so much sequence that you end up sequencing two PCR copies of the same original fragment.
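To make the depth argument concrete, here is a back-of-the-envelope sketch in Python. It assumes the amplified library contains a fixed number of distinct fragments, each equally likely to form a cluster. That is a strong simplification (PCR bias makes real libraries less uniform), and the `library_size` value is purely hypothetical. Under that assumption the expected number of distinct fragments seen in N reads from a library of M fragments is roughly M(1 - exp(-N/M)).

```python
import math

def expected_duplicate_fraction(n_reads, library_size):
    """Expected fraction of reads that repeat an already-sampled fragment,
    assuming every fragment in the library is equally likely to be read."""
    expected_unique = library_size * (1.0 - math.exp(-n_reads / library_size))
    return 1.0 - expected_unique / n_reads

library_size = 10_000_000  # hypothetical number of distinct fragments
for n_reads in (7_000_000, 12_000_000):
    frac = expected_duplicate_fraction(n_reads, library_size)
    print(f"{n_reads:,} reads: about {frac:.0%} expected duplicates")
```

For a fixed library, the duplicate fraction in this model always goes up with the number of reads, so seeing fewer duplicates in a run with fewer clusters is at least consistent with the library itself being the limiting factor.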
It only works for paired-end sequencing, but I judge library diversity by looking at the number of identical paired-end reads (exactly the same start and end for the pair). Whether you want to remove them or not is up to you; for a low-diversity library they can cause spurious SNP calls and the like, depending on the algorithm and the PCR fidelity.
And the purity filter doesn't work on alignment, just on call quality. Think of it like trimming away the bad phred scores.
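The paired-end diversity check could be sketched like this in Python. It assumes each aligned pair has already been reduced to a (chromosome, start, end) tuple; the function name and the example coordinates are hypothetical.

```python
from collections import Counter

def pair_duplication_rate(pairs):
    """Fraction of read pairs whose (chrom, start, end) was already seen."""
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of a given pair counts as a duplicate.
    return (total - len(counts)) / total

pairs = [
    ("chr1", 100, 350),
    ("chr1", 100, 350),   # same start and end: counted as a duplicate
    ("chr1", 100, 360),   # same start, different end: distinct fragment
    ("chr3", 9000, 9240),
]
rate = pair_duplication_rate(pairs)
print(f"duplicate pair fraction: {rate:.2f}")
```

A low fraction suggests a diverse library; a high one suggests over-amplification or very deep sequencing relative to library size.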
-
duplicates in ChIPSeq
Hello,
I have exactly the same problem but only found this thread just now.
Please look at http://seqanswers.com/forums/showthread.php?t=2592
Many thanks for your help, it is much appreciated!!!
tec
-
multiple reads having the same sequence...
Hello all,
The problem with duplicate reads still keeps me busy.
Therefore we performed a TOPO cloning and resequencing check of the library.
Surprisingly, over 75% of the clones were unique, which doesn't match what we see in the sequencing run!
Does anyone have an idea?
Thanks! tec