repeat reads in Solexa

hingamp replied

02-23-2010, 07:23 AM
Many thanks liguow for the reference!
Homogeneous amplification is of course expected, in fact required. What is more of an issue is irregular amplification (for some reason one fragment is amplified 100 times more than the average), which messes up any downstream quantitative analysis that relies on read counts. The idea is that exact duplicate reads are very unlikely in the context of a large genome at a given sequencing depth. Again, exact duplicate removal is standard practice for many in the ChIP-Seq community to avoid spurious peak calls. MACS, for instance, which is widely used, has built-in duplicate removal:

Sometimes the same tag can be sequenced repeatedly, more times than expected from a random genome-wide tag distribution. Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation, and are likely to add noise to the final peak calls. Therefore, MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < 10^-5). For example, for the 3.9 million FoxA1 ChIP-Seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies.


http://genomebiology.com/2008/9/9/R137
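
To make the threshold in the MACS passage quoted above concrete, here is a rough Python sketch. It finds the smallest per-position tag count k such that seeing more than k tags at one position has a binomial p-value below 10^-5, assuming tags land uniformly at random. This is only one reading of the quoted description, not MACS's own code; the function name and the effective genome size of 2.7e9 bp are assumptions made for the example.

```python
import math


def max_tags_per_position(n_tags, genome_size, p_cutoff=1e-5):
    """Smallest k with P(X > k) < p_cutoff, where X ~ Binomial(n_tags, 1/genome_size).

    X models how many of n_tags land on one given position if tags are
    placed uniformly at random (an assumption of this sketch).
    """
    p = 1.0 / genome_size
    # P(X = 0), computed with log1p to stay accurate for very small p
    pmf = math.exp(n_tags * math.log1p(-p))
    cdf = pmf
    k = 0
    while k < n_tags and 1.0 - cdf >= p_cutoff:
        # Binomial recurrence: P(X = k+1) from P(X = k)
        pmf *= (n_tags - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return k


# 3.9 million tags on an assumed 2.7e9 bp effective genome
print(max_tags_per_position(3_900_000, 2_700_000_000))
```

With these assumed numbers the sketch allows one tag per position, in line with the FoxA1 example in the quote; at ten times the depth it would allow two.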
Boel replied

02-11-2010, 12:25 PM
artifact?

Would you not expect many identical reads in an RNA-Seq experiment where amplification has been conducted? If fragments of length 300 are selected and thereafter amplified (>=15 cycles or so), then I would suspect that many identical clusters will form on the flow cell. This is an effect of PCR amplification, sure, but I would not say that it is an artifact. Or am I missing something?
liguow replied

12-31-2009, 09:29 AM
exome sequencing

Originally posted by hingamp

Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it from pubmed or google scholar...

I tend to agree that removing exact duplicates is important, especially for any quantitative counting, I've seen huge suspect duplicate pileups that are difficult to explain other than being due to PCR artefacts...

It might be this paper:
Title: Targeted capture and massively parallel sequencing of 12 human exomes

Nature 461, 272-276 (10 September 2009) | doi:10.1038/nature08250; Received 5 June 2009; Accepted 29 June 2009; Published online 16 August 2009

Sarah B. Ng, Emily H. Turner, Peggy D. Robertson, Steven D. Flygare, Abigail W. Bigham, Choli Lee, Tristan Shaffer, Michelle Wong, Arindam Bhattacharjee, Evan E. Eichler, Michael Bamshad, Deborah A. Nickerson & Jay Shendure

In their particular case, duplicated reads should be filtered out, as their goal is to find mutations (SNPs). But for RNA-Seq, it's hard to say whether the duplicated reads are artifacts or a reflection of real biology.

Last edited by liguow; 12-31-2009, 09:53 AM.
hingamp replied

11-25-2009, 12:19 AM
Originally posted by vasvale

it is very important to do it also for single reads but it is not provided by Illumina
if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)

Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it from pubmed or google scholar...

I tend to agree that removing exact duplicates is important, especially for any quantitative counting, I've seen huge suspect duplicate pileups that are difficult to explain other than being due to PCR artefacts...
Thomas Doktor replied

11-12-2009, 09:14 PM
If your output alignment is in SAM/BAM format, SAMtools can remove duplicate reads for you.
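
For single-end reads, the general idea of position-based duplicate removal on SAM records can be sketched in Python as below. This is not what SAMtools does internally; it is a minimal illustration that treats reads with the same chromosome, strand and 5' coordinate as duplicates and keeps the one with the highest mapping quality. The function names are made up for the example, soft clipping is ignored, and header lines are simply dropped from the output.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def five_prime_key(fields):
    """Return (chrom, is_reverse, 5' position) for one split SAM record."""
    flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    is_reverse = bool(flag & 0x10)
    if is_reverse:
        # For reverse-strand reads the 5' end is the rightmost aligned base
        ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                      if op in REF_CONSUMING)
        pos = pos + ref_len - 1
    return chrom, is_reverse, pos


def dedup_single_end(sam_lines):
    """Keep one read (highest MAPQ) per chromosome/strand/5' position."""
    best = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue  # unmapped read
        key = five_prime_key(fields)
        mapq = int(fields[4])
        if key not in best or mapq > best[key][0]:
            best[key] = (mapq, line)
    return [line for _, line in best.values()]
```

Whether keeping the highest-MAPQ read is the right tie-break depends on the application; as discussed elsewhere in this thread, for RNA-Seq or very deep coverage removing such reads at all may not be appropriate.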
vasvale replied

10-16-2009, 09:32 AM
I'd need it for single-end reads for genomic DNA
don't know how to use MAQ
bioinfosm replied

10-16-2009, 09:23 AM
This is an interesting discussion. Here are the points I had

- How do you reach CASAVA with paired-end RNA-Seq data? GERALD has the options of eland_pair or eland_rna; how do you do eland paired rna?

- Vasvale, how do you run CASAVA for PE RNA-Seq data?

- Simon, do you do this filtering of repeats making up more than 10% of the data on just RNA-Seq, or on all kinds of sequencing?

- I know of TopHat, which also uses the paired-end info when doing PE RNA-Seq. Has anyone out there compared that with CASAVA->GenomeStudio results?
brasj replied

09-25-2009, 08:17 AM
Originally posted by vasvale

Casava does it only for paired-end reads

it is very important to do it also for single reads but it is not provided by Illumina
if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)

anybody knows a software that can do this?

@vasvale: I believe Maq can do this.

maq rmdup out.rmdup.map in.ori.map
Remove pairs with identical outer coordinates.

Check out http://maq.sourceforge.net/maq-manpage.shtml
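
The rule stated above ("remove pairs with identical outer coordinates") can be sketched in a few lines of Python. This is not Maq's implementation, just an illustration of the idea; the tuple layout and the choice to keep the pair with the highest combined mapping quality are assumptions made for the example.

```python
def dedup_pairs(pairs):
    """Keep one read pair per (chrom, leftmost start, rightmost end).

    pairs: iterable of (name, chrom, left_start, right_end, mapq_sum) tuples,
    a hypothetical layout chosen for this sketch.
    """
    best = {}
    for pair in pairs:
        name, chrom, left_start, right_end, mapq_sum = pair
        key = (chrom, left_start, right_end)
        # Assumption: among pairs with the same outer coordinates, keep the best-scoring one
        if key not in best or mapq_sum > best[key][4]:
            best[key] = pair
    return list(best.values())


pairs = [
    ("p1", "chr1", 100, 400, 60),
    ("p2", "chr1", 100, 400, 80),   # same outer coordinates as p1
    ("p3", "chr1", 100, 401, 60),   # differs at one end, so it is kept
]
print([p[0] for p in dedup_pairs(pairs)])
```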
simonandrews replied

09-18-2009, 12:25 AM
Originally posted by vasvale

if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)

Depending on the nature of your data, removing duplicates may or may not be the correct thing to do. If you're using short reads and have high coverage over a region then eventually you're going to start getting exact duplicates - there are only so many places you can put a 36bp read after all! Depending on your application, removing these could be the wrong thing to do and might bias your quantitation.

The approach I've tended to take is to filter out regions where the proportion of reads coming from exact overlaps is above a cutoff (e.g. 10%), and this seems to work pretty well to remove artefacts. This too will break down where you have lots of reads in a really short region, but it's scaled pretty well in the work we've done so far. I suppose eventually some observed/expected value calculated from the size of the region and the length and number of reads would be the best way to spot regions which have been affected by PCR or mapping artefacts.
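
A rough Python sketch of both ideas in the paragraph above: the observed proportion of exactly overlapping reads in a region, and the proportion one would expect by chance from the region size, read length and number of reads. The definitions are assumptions made for illustration, not the method actually used in the work described. "Proportion from exact overlaps" is taken here as the fraction of reads that repeat an already-seen (strand, start) pair, and the expectation assumes reads are placed uniformly at random on both strands.

```python
def observed_duplicate_fraction(read_starts):
    """Fraction of reads that repeat an already-seen (strand, start) pair."""
    if not read_starts:
        return 0.0
    return 1.0 - len(set(read_starts)) / len(read_starts)


def expected_duplicate_fraction(n_reads, region_len, read_len, strands=2):
    """Expected duplicate fraction if reads fall uniformly at random in the region."""
    slots = strands * (region_len - read_len + 1)  # possible (strand, start) placements
    if n_reads == 0 or slots <= 0:
        return 0.0
    # Expected number of distinct placements hit by n_reads random reads
    expected_distinct = slots * (1.0 - (1.0 - 1.0 / slots) ** n_reads)
    return 1.0 - expected_distinct / n_reads


def exceeds_cutoff(read_starts, cutoff=0.10):
    """Simple fixed-cutoff rule (10% by default)."""
    return observed_duplicate_fraction(read_starts) > cutoff


# Seven reads in a 1 kb region, five of them stacked on one position
reads = [("+", 100)] * 5 + [("+", 200), ("-", 300)]
print(observed_duplicate_fraction(reads))
print(expected_duplicate_fraction(len(reads), region_len=1000, read_len=36))
print(exceeds_cutoff(reads))
```

Comparing the two fractions also shows where a fixed cutoff breaks down: with 1000 reads of 36 bp in a 40 bp region there are only 10 possible placements, so nearly all reads are expected to be duplicates even without any PCR artefact.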
vasvale replied

09-17-2009, 05:37 PM
Casava does it only for paired-end reads

it is very important to do it also for single reads but it is not provided by Illumina
if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)

anybody knows a software that can do this?
basickler replied

03-12-2009, 12:27 PM
Yes, by default, CASAVA throws away duplicate reads, which can account for up to 10% or so of deep coverage sequencing runs (30X on the human genome). But ELAND, like Bowtie, just outputs alignments.
swbarnes2 replied

03-12-2009, 12:22 PM
Actually, at the Illumina user meeting, Illumina showed off some of their own software that is downstream of ELAND. I think they said that GenomeStudio, or maybe CASAVA, which feeds into GenomeStudio, will throw away paired reads if they are identical at both ends to the reads from other clusters, figuring (I think rightly) that if your sequence is the same at both ends, it's probably a PCR artifact. But aligners like SOAP or Bowtie just output the hits, one at a time. They don't remember where every hit is.
basickler replied

03-12-2009, 09:18 AM
Something to remember when using high-throughput sequencing is that you get so much sequence that duplicates from the PCR amplification step during library preparation can become an issue. This is much more of an issue with a lot of PCR cycles or very deep coverage of a particular library. For single-read sequence it's hard to tell what's a duplicate and what's not, but for paired-end sequence you'll see pretty obvious re-sequencing, depending on the quality of your library preparation.

But to answer your question, the default Eland/Phagealign aligners in the Illumina pipeline will keep all reads that map to the genome given their criteria, because for all intents and purposes the sequence was there and it's fine. If you don't use those algorithms, then it depends on which alignment algorithm you use, because the base calling step outputs sequence for all the clusters that pass the filters.
Cheers,

Brad
sci_guy replied

03-11-2009, 02:58 PM
How much starting material did you have and how amplified is the material? Overamplified material will align in a 'blocky' fashion no matter what the platform. However, if you're getting multiple alignments spread only 1bp apart then it is most likely high coverage and is fine.

In agreement with swbarnes2, counting the 'pile-up' of sequences is required for a number of applications: de novo sequencing, ChIP-Seq/BS-Seq, RNA-Seq, SNP discovery/allele determination, etc.
swbarnes2 replied

02-26-2009, 12:25 PM
If you mean, will the Solexa pipeline throw them out, the answer is no. PhiX is only 5 kb, so there are lots and lots of identical reads.

But I suppose every aligner is different, though throwing them out would be a little strange... What if you are trying to count coverage of something which has been sequenced very thoroughly? You'd need all the reads.