  • hingamp
    replied
    Many thanks liguow for the reference!
    Homogeneous amplification is of course expected, in fact required; what is more of an issue is irregular amplification (for some reason one fragment is amplified 100 times more than the average), which messes up any downstream quantitative analysis that relies on read counts. The idea is that exactly identical reads are very unlikely in the context of a large genome at a given sequencing depth. Again, exact duplicate removal is standard practice for many in the ChIP-Seq community to avoid spurious peak calls. MACS, for instance, which is widely used, has built-in duplicate removal:

    Sometimes the same tag can be sequenced repeatedly, more times than expected from a random genome-wide tag distribution. Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation, and are likely to add noise to the final peak calls. Therefore, MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < 10^-5). For example, for the 3.9 million FoxA1 ChIP-Seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies.
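
As a rough illustration of the rule quoted above, the sketch below finds the largest number of tags allowed at one position before the binomial tail probability drops under the 10^-5 cutoff. It is not MACS code: the function name `max_dup_tags`, the uniform one-over-genome-size placement probability, and the 2.7e9 effective human genome size are assumptions made for this example.

```python
import math


def max_dup_tags(n_tags, genome_size, pvalue=1e-5):
    """Largest tag count at one position not flagged as excess duplication.

    Assumes each tag lands on any of `genome_size` positions with equal
    probability (a simplification; real coverage is not uniform).
    """
    p = 1.0 / genome_size
    pmf = math.exp(n_tags * math.log1p(-p))  # P(X = 0)
    cdf = 0.0
    k = 0
    while True:
        cdf += pmf  # cdf is now P(X <= k)
        # Allow k tags once seeing more than k is rarer than the cutoff.
        if k >= 1 and 1.0 - cdf < pvalue:
            return k
        # Binomial recurrence: P(X = k + 1) from P(X = k).
        pmf *= (n_tags - k) / (k + 1) * p / (1.0 - p)
        k += 1


if __name__ == "__main__":
    # 3.9 million tags on an assumed 2.7e9 bp effective genome.
    print(max_dup_tags(3_900_000, 2.7e9))
```

Under these assumptions, 3.9 million tags give a limit of one tag per position, consistent with the FoxA1 example in the quote; deeper libraries allow more.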

  • Boel
    replied
    artifact?

    Would you not expect many identical reads in an RNA-Seq experiment where amplification has been conducted? If fragments of length 300 are selected and then amplified (>=15 cycles or so), I would expect many identical clusters to form on the flow cell. This is an effect of PCR amplification, sure, but I would not say that it is an artifact. Or am I missing something?

  • liguow
    replied
    exome sequencing

    Originally posted by hingamp View Post
    Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it from pubmed or google scholar...

    I tend to agree that removing exact duplicates is important, especially for any quantitative counting, I've seen huge suspect duplicate pileups that are difficult to explain other than being due to PCR artefacts...
    It might be this paper:
    Title: Targeted capture and massively parallel sequencing of 12 human exomes

    Nature 461, 272-276 (10 September 2009) | doi:10.1038/nature08250; Received 5 June 2009; Accepted 29 June 2009; Published online 16 August 2009

    Sarah B. Ng, Emily H. Turner, Peggy D. Robertson, Steven D. Flygare, Abigail W. Bigham, Choli Lee, Tristan Shaffer, Michelle Wong, Arindam Bhattacharjee, Evan E. Eichler, Michael Bamshad, Deborah A. Nickerson & Jay Shendure

    In their particular case, duplicated reads should be filtered out, as their goal is to find mutations (SNPs). But for RNA-Seq, it's hard to say whether the duplicated reads are artifacts or a reflection of real biology.
    Last edited by liguow; 12-31-2009, 09:53 AM.

  • hingamp
    replied
    Originally posted by vasvale View Post
    it is very important to do it also for single reads but it is not provided by Illumina
    if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)
    Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it from pubmed or google scholar...

    I tend to agree that removing exact duplicates is important, especially for any quantitative counting, I've seen huge suspect duplicate pileups that are difficult to explain other than being due to PCR artefacts...

  • Thomas Doktor
    replied
    If your output alignment is in the SAM/BAM format, SAMTools can remove duplicate reads for you.

  • vasvale
    replied
    I'd need it for single-end reads for genomic DNA
    don't know how to use MAQ

  • bioinfosm
    replied
    This is an interesting discussion. Here are the points I had

    - How do you reach CASAVA with paired-end RNA-Seq data? GERALD has the options of eland_pair or eland_rna; how do you do ELAND paired RNA?

    - Vasvale, how do you run CASAVA for PE RNA-Seq data?

    - Simon, do you apply this filtering of regions where duplicates make up more than 10% of the data to RNA-Seq only, or to all kinds of sequencing?

    - I know of TopHat, which also uses the paired-end information when doing PE RNA-Seq. Has anyone out there compared that with CASAVA->GenomeStudio results?

  • brasj
    replied
    Originally posted by vasvale View Post
    Casava does it only for paired-end reads

    it is very important to do it also for single reads but it is not provided by Illumina
    if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)

    anybody knows a software that can do this?
    @vasvale: I believe Maq can do this.

    maq rmdup out.rmdup.map in.ori.map
    Remove pairs with identical outer coordinates.

    Check out http://maq.sourceforge.net/maq-manpage.shtml
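
The "identical outer coordinates" idea can be sketched in a few lines. This is an illustration only, not Maq's implementation: the dictionary fields (`chrom`, `left`, `right`, `mapq`) and the choice to keep the pair with the highest mapping quality are assumptions made for the example.

```python
def remove_duplicate_pairs(pairs):
    """Keep one read pair per (chrom, left, right) outer-coordinate key.

    `pairs` is an iterable of dicts with hypothetical keys:
    chrom, left (leftmost mapped base), right (rightmost mapped base), mapq.
    When several pairs share a key, the one with the highest mapq is kept.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["left"], pair["right"])
        if key not in best or pair["mapq"] > best[key]["mapq"]:
            best[key] = pair
    # Dicts preserve insertion order, so output follows first appearance.
    return list(best.values())


if __name__ == "__main__":
    example = [
        {"chrom": "chr1", "left": 100, "right": 400, "mapq": 30},
        {"chrom": "chr1", "left": 100, "right": 400, "mapq": 60},
        {"chrom": "chr1", "left": 100, "right": 410, "mapq": 20},
    ]
    for kept in remove_duplicate_pairs(example):
        print(kept)
```

For single-end reads only one coordinate and the strand are available as a key, which is why chance collisions are much harder to tell apart from PCR duplicates.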

  • simonandrews
    replied
    Originally posted by vasvale View Post
    if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)
    Depending on the nature of your data, removing duplicates may or may not be the correct thing to do. If you're using short reads and have high coverage over a region then eventually you're going to start getting exact duplicates - there are only so many places you can put a 36bp read after all! Depending on your application, removing these could be the wrong thing to do and might bias your quantitation.

    The approach I've tended to take is to filter out regions where the proportion of reads coming from exact overlaps is above a cutoff (e.g. 10%), and this seems to work pretty well to remove artefacts. This too will break down where you have lots of reads in a really short region, but it has scaled pretty well in the work we've done so far. I suppose eventually some observed/expected value calculated from the size of the region and the length and number of reads would be the best way to spot regions which have been affected by PCR or mapping artefacts.
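
A minimal sketch of the region filter described above, under one possible reading of "proportion of reads coming from exact overlaps": the fraction of reads in a region that share their start position and strand with at least one other read. The function names and the input layout are hypothetical.

```python
from collections import Counter


def duplicate_fraction(read_positions):
    """Fraction of reads sharing (start, strand) with another read."""
    counts = Counter(read_positions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total


def filter_regions(regions, cutoff=0.10):
    """Drop regions whose duplicate fraction exceeds `cutoff`.

    `regions` maps a region name to a list of (start, strand) tuples.
    """
    return {
        name: reads
        for name, reads in regions.items()
        if duplicate_fraction(reads) <= cutoff
    }


if __name__ == "__main__":
    regions = {
        "clean": [(100 + 7 * i, "+") for i in range(10)],
        "pileup": [(500, "+")] * 8 + [(510, "+"), (520, "-")],
    }
    print(sorted(filter_regions(regions)))
```

A fixed cutoff like this ignores region length and read depth, which is the limitation the post itself points out for short, heavily covered regions.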

  • vasvale
    replied
    Casava does it only for paired-end reads

    it is very important to do it also for single reads but it is not provided by Illumina
    if you don't remove repeated reads you'll get more errors (see Nature Sarah B. Ng)

    anybody knows a software that can do this?

  • basickler
    replied
    Yes, by default CASAVA throws away duplicate reads, which can account for up to 10% or so of deep-coverage sequencing runs (30X on the human genome). But ELAND, like Bowtie, just outputs alignments.

  • swbarnes2
    replied
    Actually, at the Illumina user meeting, Illumina showed off some of their own software that runs downstream of ELAND. I think they said that GenomeStudio, or maybe CASAVA, which feeds into GenomeStudio, will throw away paired reads if they are identical at both ends to the reads from other clusters, figuring (I think rightly) that if your sequence is the same at both ends, it's probably a PCR artifact. But aligners like SOAP or Bowtie just output the hits, one at a time; they don't remember where every hit is.

  • basickler
    replied
    Something to remember when using high-throughput sequencing is that you get so much sequence that duplicates from the PCR amplification step during library preparation can become an issue. This is much more of an issue with a lot of PCR cycles or very deep coverage of a particular library. For single-read sequence it's hard to tell what's a duplicate and what's not, but for paired-end sequence you'll see pretty obvious re-sequencing, depending on the quality of your library preparation.

    But to answer your question, the default Eland/Phagealign aligners in the Illumina pipeline will keep all reads that map to the genome under their criteria, because for all intents and purposes the sequence was there and it's fine. If you don't use those algorithms, then it depends on which alignment algorithm you use, because the base-calling step outputs sequence for all the clusters that pass the filters.
    Cheers,

    Brad

  • sci_guy
    replied
    How much starting material did you have and how amplified is the material? Overamplified material will align in a 'blocky' fashion no matter what the platform. However, if you're getting multiple alignments spread only 1bp apart then it is most likely high coverage and is fine.

    In agreement with swbarnes2, counting the 'pile-up' of sequences is required for a number of applications: de novo sequencing, ChIP-Seq/BS-Seq, RNA-Seq, SNP discovery/allele determination, etc.

  • swbarnes2
    replied
    If you mean, will the Solexa pipeline throw them out, the answer is no. PhiX is only about 5 kb, so there are lots and lots of identical reads.

    But I suppose every aligner is different, though throwing them out would be a little strange... what if you are trying to count coverage of something which has been sequenced very thoroughly? You'd need all the reads.
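
To put a number on the small-genome point, the sketch below estimates what fraction of reads would be exact duplicates purely by chance if read starts were placed uniformly at random. The function name, the uniform-placement model, the linear (non-circular) genome, and the 36 bp read length are all assumptions made for this example.

```python
import math


def expected_duplicate_fraction(n_reads, genome_length, read_length, strands=2):
    """Expected fraction of reads that repeat an already-used start position.

    Assumes reads start uniformly at random on a linear genome, on either
    strand. This is the classic occupancy estimate, not a library-specific
    model of PCR duplication.
    """
    positions = strands * (genome_length - read_length + 1)
    # Expected number of distinct start positions hit by n_reads reads:
    # positions * (1 - (1 - 1/positions) ** n_reads), computed stably.
    expected_distinct = positions * -math.expm1(
        n_reads * math.log1p(-1.0 / positions)
    )
    return 1.0 - expected_distinct / n_reads


if __name__ == "__main__":
    # A million 36 bp reads on a ~5 kb genome versus a ~3 Gb genome.
    print(expected_duplicate_fraction(1_000_000, 5_000, 36))
    print(expected_duplicate_fraction(1_000_000, 3_000_000_000, 36))
```

Under these assumptions, a million 36 bp reads on a 5 kb genome come out as about 99% duplicates by chance alone, so discarding them would throw away nearly all of the coverage information.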
