
**swbarnes2** · 02-26-2009, 12:25 PM

If you mean will the Solexa pipeline throw them out, the answer is no. PhiX is only about 5 kb, so there are lots and lots of identical reads.

But I suppose every aligner is different, though throwing them out would be a little strange... what if you are trying to count coverage of something which has been sequenced very thoroughly? You'd need all the reads.

**sci_guy** · 03-11-2009, 02:58 PM

How much starting material did you have and how amplified is the material? Overamplified material will align in a 'blocky' fashion no matter what the platform. However, if you're getting multiple alignments spread only 1bp apart then it is most likely high coverage and is fine.

In agreement with swbarnes2, counting the 'pile-up' of sequences is required for a number of applications: de novo sequencing, ChIP-Seq/BS-Seq, RNA-Seq, SNP discovery/allele determination, etc.

**basickler** · 03-12-2009, 09:18 AM

Something to remember when using high-throughput sequencing is that you get so much sequence that duplicates from the PCR amplification step during library preparation can become an issue. This is much more of an issue with a lot of PCR cycles or very deep coverage of a particular library. For single-read sequence it's hard to tell what's a duplicate and what's not, but for paired-end sequence you'll see pretty obvious re-sequencing, depending on the quality of your library preparation.

But to answer your question, the default ELAND/Phagealign aligners in the Illumina pipeline will keep all reads that map to the genome given its criteria, because for all intents and purposes the sequence was there and it's fine. If you don't use those algorithms, then it depends on which alignment algorithm you use, because the base calling step outputs sequence for all the clusters that pass the filters.
Cheers,

Brad

**swbarnes2** · 03-12-2009, 12:22 PM

Actually, at the Illumina user meeting, Illumina showed off some of their own software that is downstream of ELAND, and I think they said that GenomeStudio, or maybe CASAVA, which feeds into GenomeStudio, will throw away paired reads if they are identical at both ends to the reads from other clusters, figuring (I think rightly) that if your sequence is the same at both ends, it's probably a PCR artifact. But if you are looking at something like SOAP or Bowtie, they just output the hits, one at a time. They don't remember where every hit is.

**basickler** · 03-12-2009, 12:27 PM

Yes, by default CASAVA throws away duplicate reads, which can account for up to 10% or so of deep coverage sequencing runs (30X on the human genome). But ELAND, like Bowtie, just outputs alignments.

**vasvale** · 09-17-2009, 05:37 PM

CASAVA does it only for paired-end reads.

It is very important to do it for single reads as well, but Illumina does not provide this.
If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).

Does anybody know of software that can do this?

**simonandrews** · 09-18-2009, 12:25 AM

Originally posted by vasvale:

If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).

Depending on the nature of your data, removing duplicates may or may not be the correct thing to do. If you're using short reads and have high coverage over a region then eventually you're going to start getting exact duplicates - there are only so many places you can put a 36bp read after all! Depending on your application, removing these could be the wrong thing to do and might bias your quantitation.
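To put a rough number on "only so many places", here is a minimal Python sketch. It assumes reads start uniformly at random and independently, which real libraries will not do exactly; the function name and the example figures (100,000 reads on a 5,386 bp genome, both strands) are illustrative assumptions, not values from this thread.

```python
def expected_duplicates(n_reads: int, n_positions: int) -> float:
    """Expected number of reads whose start coincides with an earlier read.

    Assumes every read picks one of n_positions start sites uniformly at
    random, independently of the others (an idealisation).
    """
    # Expected number of distinct start sites hit by at least one read.
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    # Every read beyond the first at a site counts as a duplicate.
    return n_reads - expected_distinct


# Illustrative numbers only: a small genome, counting both strands as start sites.
positions = 2 * 5386
n_reads = 100_000
fraction = expected_duplicates(n_reads, positions) / n_reads
print(f"expected duplicate fraction: {fraction:.2f}")
```

With far more reads than start sites, most reads are expected to be exact duplicates even with no PCR bias at all, which is why blanket removal can distort counts on small or very deeply covered targets.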

The approach I've tended to take is to filter out regions where the proportion of reads coming from exact overlaps is above a cutoff (e.g. 10%), and this seems to work pretty well to remove artefacts. This too will break down where you have lots of reads in a really short region, but it's scaled pretty well in the work we've done so far. I suppose eventually some observed/expected value calculated from the size of the region and the length and number of reads would be the best way to spot regions which have been affected by PCR or mapping artefacts.
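A minimal Python sketch of that kind of region filter follows. It is one possible reading of "proportion of reads coming from exact overlaps", not the actual implementation: the function names, the (strand, start) read representation and the example regions are all assumptions made for illustration.

```python
from collections import Counter


def duplicate_fraction(starts):
    """Fraction of reads whose (strand, start) was already seen in the region."""
    if not starts:
        return 0.0
    counts = Counter(starts)
    # For each start site, every read after the first is an exact overlap.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(starts)


def flag_regions(regions, cutoff=0.10):
    """Return names of regions whose duplicate fraction exceeds the cutoff."""
    return [name for name, starts in regions.items()
            if duplicate_fraction(starts) > cutoff]


# Hypothetical regions: each read is reduced to (strand, start position).
regions = {
    "region_a": [("+", 100), ("+", 105), ("-", 140), ("+", 180)],
    "region_b": [("+", 500)] * 8 + [("+", 510), ("-", 530)],
}
print(flag_regions(regions))
```

A fixed cutoff like this will also flag short regions with genuinely high coverage, which is the breakdown mentioned above; comparing against an expected value for the region size would be the natural refinement.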

**brasj** · 09-25-2009, 08:17 AM

Originally posted by vasvale:

CASAVA does it only for paired-end reads.

It is very important to do it for single reads as well, but Illumina does not provide this.
If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).

Does anybody know of software that can do this?

@vasvale: I believe Maq can do this.

maq rmdup out.rmdup.map in.ori.map
Remove pairs with identical outer coordinates.

Check out http://maq.sourceforge.net/maq-manpage.shtml
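For anyone wondering what "identical outer coordinates" means in practice, here is a rough Python sketch of the idea. It is not Maq's code: the dictionary fields are hypothetical, and keeping the pair with the highest mapping quality is an assumption of this sketch, since the thread does not say which pair is retained.

```python
def remove_pair_duplicates(pairs):
    """Keep one read pair per (chrom, outer start, outer end).

    Assumption of this sketch: when several pairs share outer coordinates,
    the one with the highest mapping quality is kept.
    """
    best = {}
    for pair in pairs:
        key = (
            pair["chrom"],
            min(pair["start1"], pair["start2"]),  # leftmost base of the fragment
            max(pair["end1"], pair["end2"]),      # rightmost base of the fragment
        )
        if key not in best or pair["mapq"] > best[key]["mapq"]:
            best[key] = pair
    return list(best.values())


# Hypothetical pairs: p1 and p2 span the same fragment, p3 is shifted by 1 bp.
pairs = [
    {"name": "p1", "chrom": "chr1", "start1": 100, "end1": 135,
     "start2": 300, "end2": 335, "mapq": 40},
    {"name": "p2", "chrom": "chr1", "start1": 100, "end1": 135,
     "start2": 300, "end2": 335, "mapq": 60},
    {"name": "p3", "chrom": "chr1", "start1": 101, "end1": 136,
     "start2": 300, "end2": 335, "mapq": 30},
]
kept = remove_pair_duplicates(pairs)
print([p["name"] for p in kept])
```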

**bioinfosm** · 10-16-2009, 09:23 AM

This is an interesting discussion. Here are the points I had:

- How do you reach CASAVA with paired-end RNA-Seq data? GERALD has the options of eland_pair or eland_rna; how do you do ELAND paired RNA?

- Vasvale, how do you run CASAVA for PE RNA-Seq data?

- Simon, do you do this filtering of repeats making up more than 10% of the data on just RNA-Seq, or on all kinds of sequencing?

- I know of TopHat, which uses the paired-end info as well when doing PE RNA-Seq. Has anyone out there compared that with CASAVA->GenomeStudio results?

**vasvale** · 10-16-2009, 09:32 AM

I'd need it for single-end reads from genomic DNA.
I don't know how to use Maq.

**Thomas Doktor** · 11-12-2009, 09:14 PM

If your output alignment is in the SAM/BAM format, SAMtools can remove duplicate reads for you.
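For single-end reads, the underlying idea can be sketched in a few lines of Python. This is not SAMtools' implementation and does not parse SAM/BAM: the read fields are hypothetical, and both the duplicate key (chromosome, strand, 5' position) and the choice to keep the highest mapping quality are assumptions of this sketch.

```python
def remove_single_end_duplicates(reads):
    """Keep one read per (chrom, strand, 5' position).

    Assumptions of this sketch: reads are dicts with chrom, strand, start,
    end and mapq fields, and the highest-mapq read wins among duplicates.
    """
    best = {}
    for read in reads:
        # On the reverse strand the 5' end of the read is its rightmost base.
        five_prime = read["end"] if read["strand"] == "-" else read["start"]
        key = (read["chrom"], read["strand"], five_prime)
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


# Hypothetical reads: r1/r2 duplicate each other, r3/r4 share a reverse-strand 5' end.
reads = [
    {"name": "r1", "chrom": "chr1", "strand": "+", "start": 100, "end": 135, "mapq": 20},
    {"name": "r2", "chrom": "chr1", "strand": "+", "start": 100, "end": 135, "mapq": 37},
    {"name": "r3", "chrom": "chr1", "strand": "-", "start": 100, "end": 135, "mapq": 30},
    {"name": "r4", "chrom": "chr1", "strand": "-", "start": 90, "end": 135, "mapq": 10},
]
kept = remove_single_end_duplicates(reads)
print([r["name"] for r in kept])
```

As noted earlier in the thread, single-end duplicates are much harder to tell apart from genuinely independent reads than paired-end ones, so a filter like this will also discard real coverage in deeply sequenced regions.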

**hingamp** · 11-25-2009, 12:19 AM

Originally posted by vasvale:

It is very important to do it for single reads as well, but Illumina does not provide this.
If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).

Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it in PubMed or Google Scholar...

I tend to agree that removing exact duplicates is important, especially for any quantitative counting. I've seen huge, suspect duplicate pileups that are difficult to explain other than as PCR artefacts...

**liguow** · 12-31-2009, 09:29 AM

exome sequencing

Originally posted by hingamp:

Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it in PubMed or Google Scholar...

I tend to agree that removing exact duplicates is important, especially for any quantitative counting. I've seen huge, suspect duplicate pileups that are difficult to explain other than as PCR artefacts...

It might be this paper:
Title: Targeted capture and massively parallel sequencing of 12 human exomes

Nature 461, 272-276 (10 September 2009) | doi:10.1038/nature08250; Received 5 June 2009; Accepted 29 June 2009; Published online 16 August 2009

Sarah B. Ng, Emily H. Turner, Peggy D. Robertson, Steven D. Flygare, Abigail W. Bigham, Choli Lee, Tristan Shaffer, Michelle Wong, Arindam Bhattacharjee, Evan E. Eichler, Michael Bamshad, Deborah A. Nickerson & Jay Shendure

In their particular case, duplicated reads should be filtered out, as their goal is to find mutations (SNPs). But for RNA-Seq, it's hard to say whether the duplicated reads are artifacts or a reflection of real biology.

**Boel** · 02-11-2010, 12:25 PM

artifact?

Would you not expect many identical reads in an RNA-Seq experiment where amplification has been conducted? If fragments of length 300 are selected and then amplified (>=15 cycles or so), I would suspect that many identical clusters will form on the flow cell. This is an effect of PCR amplification, sure, but I would not say that it is an artifact. Or am I missing something?


repeat reads in Solexa
