Removing duplicates: is it really necessary?

  • #1

    I know samtools and Picard can remove duplicates. But is it really necessary? A duplicate could be a PCR artifact or a second read of the same fragment; there is no way to tell.

    Also, how do you define a duplicate? And why do both samtools and Picard take BAM files as input? In theory, you could remove duplicates from the raw data already. Is it because they only check the aligned location, not the actual read sequence?

  • #2
    "A duplicate could be PCR effect or reading same fragment twice, there is no way to tell."

    Yes, this is the heart of the matter... The degree to which you expect duplicate fragments is highly dependent on the depth of the library and the type of library being sequenced (whole genome, exome, transcriptome, ChIP-Seq, etc.). The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. Furthermore, you gain the computational benefit of reducing the number of reads to be processed in downstream steps. However, removing duplicates comes at the cost of setting a hard cap (and a rather low one at that) on the dynamic range of measurement you are going to produce. This has been discussed in various places on this forum (not a complete list):

    Redundant reads are removed from ChIP-seq, what about RNA-seq

    Duplicate reads removal

    Threshold for duplicate removal

    Heavy read stacking on the solexa platform

    Should identical reads be reduced to a single count?

    Removing duplicate reads from multigig .csfasta

    Source of duplicate reads and possible ways of reducing



    • #3
      "Also how do you define a duplicate?"

      As you suggest, this can be done in at least two ways. Both have caveats.

      If you define duplicates as reads with identical sequence you may underestimate the presence of duplicates in your library. Read errors present in your library will result in reads with apparently different sequences that could have actually come from PCR amplifications of the same cDNA fragment. Accurate removal of duplicates is therefore dependent on a low error rate in your library. One advantage of this approach is that it reduces the pool of reads before mapping and thereby reduces alignment time.

      Using mapping positions of the reads avoids (mostly) the influence of library sequencing error rate described above. However, two reads that have identical mapping positions do not necessarily represent the same underlying cDNA fragment, especially if you consider only the outer mapping coordinates (a common strategy). Reads derived from transcripts and then mapped to the genome might appear to have the same outer coordinates but differ internally (corresponding to alternative isoforms with different exon boundaries, skipped exons, etc.). Furthermore, two reads may have completely identical mapping coordinates but still correspond to distinct cDNA fragments. Imagine two identically mapping reads that differ by a single base corresponding to a polymorphism: cDNA fragments may be derived from both the maternal and paternal alleles of a diploid genome. These are distinct fragments containing biologically pertinent information. If you removed duplicates purely on the basis of mapping coordinates, you would lose some of this information.

      As you can see there is no simple answer to this question. Identifying true duplicate fragments is not black-and-white. This perhaps argues for not trying to remove them in the first place. But on the other hand, the presence of duplicates that do represent PCR amplification bias can cause problems for downstream analysis...

      In general, having longer reads and paired-end reads as opposed to single-end reads can help in correctly identifying duplicates. But there is still no way to tell for sure whether a particular read corresponds to an amplification bias or not. In library types where you expect approximately uniform coverage of the target space (e.g. a whole genome library aligned to a reference genome) you can watch for areas of unusually bad read stacking and correct for it. But even with this you have to be careful, because your genome is not the reference genome, and genuine copy number variations can result in biologically meaningful deviations from uniform coverage that you may actually be interested in...

      The type of experiment and analysis being conducted may be influential in deciding how to approach the issue. If you provide more details of your particular library type and analysis goals, someone may be able to provide specific advice.

      In expression analysis, I do not remove duplicates but I do attempt to quantify library complexity for every library and watch out for libraries that seem to be outliers in this regard. You can view an example of this here:
      ALEXA-seq (see the figures under: 'summary of library complexity - estimated by tag redundancy per million reads and compared to other libraries')

      Ultimately, the best way to deal with duplicates that correspond to PCR amplification bias is to reduce their generation in the first place. Avoid using small amounts of input material (i.e. a molecular bottleneck) and keep amplification to an absolute minimum.
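      To make the two definitions concrete, here is a toy sketch (all reads and values are invented for illustration, not real data) contrasting sequence-based and coordinate-based dedup on four hand-made reads:

```python
# Toy reads: (sequence, chrom, 5'-position, strand). All values invented.
reads = [
    ("ACGTACGT", "chr1", 100, "+"),
    ("ACGTACGT", "chr1", 100, "+"),  # exact copy: both methods call it a duplicate
    ("ACGTACGA", "chr1", 100, "+"),  # same coordinates, one base differs (SNP? error?)
    ("ACGTACGT", "chr1", 200, "+"),  # same sequence, different locus
]

def dedup_by_sequence(reads):
    """Keep the first read for each distinct sequence (pre-alignment dedup)."""
    seen, kept = set(), []
    for r in reads:
        if r[0] not in seen:
            seen.add(r[0])
            kept.append(r)
    return kept

def dedup_by_coordinates(reads):
    """Keep the first read for each distinct (chrom, 5'-position, strand)."""
    seen, kept = set(), []
    for r in reads:
        if r[1:] not in seen:
            seen.add(r[1:])
            kept.append(r)
    return kept

# Sequence-based dedup wrongly collapses the read at position 200 into the
# one at 100; coordinate-based dedup wrongly discards the possible SNP allele.
print(dedup_by_sequence(reads))
print(dedup_by_coordinates(reads))
```

      Both calls keep two reads, but different ones, which is exactly the pair of caveats described above.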



      • #4
        I have mitochondrial single-end reads from an Illumina sequencer. The samples were pooled and PCR-amplified using mitochondria-specific primers. However, since the mitochondrial genome is very short, I end up with average coverage over 2000X. I found a ton of duplicates, so I am wondering whether I should remove them or not.



        • #5
          In our RNA-Seq data we tend to get very uneven coverage of expressed genes. Removing duplicates would eliminate much of our signal from those genes, and all our read counts would be dramatically lower. For SNP calling that wouldn't be as much of a problem, but for differential expression analysis it complicates matters. The differential expression statistics can account for some of the variation introduced by PCR duplicates, and we've chosen to live with whatever variation is left over. If our coverage were mostly evenly distributed and/or we were doing SNP calling, we might have made a different decision.



          • #6
            Thanks malachig for a great description of the problem.

            I also want to say more about the dangers of marking duplicates. I think those who recommend it are being far too cavalier about it.

            For example, there was a recent whole-exome paper where they did both single-end and paired-end sequencing. They marked duplicates in both, and found ~28% and ~8% duplicates respectively. In fact, when they treated the paired ends as single ends (removing the connection between paired ends), the 8% jumped back up to 28%. That is a massive amount of data loss. So those who have been throwing away duplicates, particularly single-end and based solely on coordinates, are losing a lot of data.

            I would worry about the removal of duplicates just as much as the keeping of duplicates.

            Removing duplicates does help save time during alignments... but...

            When it comes to SNP/indel detection, you might not want to mark duplicates until *after* alignment. The reason is that the programs that do local multiple sequence realignment around indels (e.g. GATK) will not look at the duplicates, and the lower read counts might then not trigger the realignment needed for the region. So in my opinion, it is better to align, realign indels, and then mark duplicates.
            Last edited by NGSfan; 09-16-2010, 01:19 AM.



            • #7
              Originally posted by NGSfan View Post
              For example, there was a recent whole-exome paper where they did both single-end and paired-end sequencing. They marked duplicates in both, and found ~28% and ~8% duplicates respectively. In fact, when they treated the paired ends as single ends (removing the connection between paired ends), the 8% jumped back up to 28%. That is a massive amount of data loss. So those who have been throwing away duplicates, particularly single-end and based solely on coordinates, are losing a lot of data.
              Even that 8% may be an overestimate of the true PCR duplicates. I recently attended the Illumina Midwest Users Group meeting and an Illumina scientist presented some data on duplicate identification (it may have been the data you referred to, since the percentages sound about the same). However, they went a step further in distinguishing fragment duplicates from PCR duplicates. They prepared their paired-end libraries using Illumina MID-tagged-style adapters, but instead of a finite set of known MID sequences, the adapters were constructed with random bases where the barcode would be. Now for each cluster they had three data points to compare: reads 1 and 2 with their respective alignment positions on the reference genome, plus the random 6 bp sequence in the MID position. A read would need to match all three of these to be called a PCR duplicate. When they added these random tags they found that the number of identified duplicates dropped from 8% to ~1%.
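              The effect of the random tag can be sketched with a small simulation (the numbers and layout here are invented, not the Illumina data): distinct molecules sometimes collide on mapping coordinates alone, but almost never on coordinates plus a random 6 bp tag:

```python
import random

random.seed(42)

def random_tag(n=6):
    """A random 6 bp stretch in the MID/barcode position."""
    return "".join(random.choice("ACGT") for _ in range(n))

# 1,000 distinct molecules; fragment ends drawn from a small coordinate space
# so that some distinct molecules coincidentally share both end coordinates.
molecules = [(random.randrange(100), random.randrange(100), random_tag())
             for _ in range(1000)]

# Add 10 true PCR duplicates: exact copies, random tag included.
reads = molecules + random.sample(molecules, 10)

pos_dups = len(reads) - len({(s, e) for s, e, t in reads})
tag_dups = len(reads) - len({(s, e, t) for s, e, t in reads})

# Coordinate-only dedup also counts the coincidental collisions as duplicates;
# adding the tag brings the estimate back down near the 10 true PCR copies.
print("coordinate-only duplicate calls:", pos_dups)
print("coordinate+tag duplicate calls: ", tag_dups)
```

              In this toy setup the coordinate-only count is several times the true number of PCR copies, mirroring the 8% vs ~1% drop described above.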



              • #8
                wow... kmcarr, what a great trick. Thanks for sharing that. That really answers the question: how many real PCR duplicates am I getting? From that info you could play around with PCR cycle numbers and figure out how many cycles to drop.



                • #9
                  Originally posted by NGSfan View Post

                  For example, there was a recent whole-exome paper where they did both single-end and paired-end sequencing. They marked duplicates in both, and found ~28% and ~8% duplicates respectively. In fact, when they treated the paired ends as single ends (removing the connection between paired ends), the 8% jumped back up to 28%. That is a massive amount of data loss. So those who have been throwing away duplicates, particularly single-end and based solely on coordinates, are losing a lot of data.
                  Can you point out which paper it is please?



                  • #10
                    Hi foxyg,

                    I've done a very similar experiment, but hadn't considered the issue of removing duplicates before. I pooled PCR-amplified mtDNA from ~415 individuals and ran this DNA on two lanes, doing single-end 36-mer reads. I achieved about 6000x coverage with the combined dataset.

                    I'm interested in what you plan on getting out of your alignments. I want to estimate the allele frequencies of SNPs by counting the relative abundance of each allele at each position in the alignment. My opinion is that with such high coverage, it is inevitable that you will have many reads mapping to the exact same location, and you want to preserve all of these reads if you want to estimate allele frequencies in a pool.

                    I'm also interested in looking for evidence of small-scale indels and large deletions, but have been having difficulty making progress on this. I've done most of my analysis with maq, which doesn't do gapped alignments, and I have no paired-end data. I'm looking into indel detection with bwa, but having trouble migrating over to the newer file formats.

                    Let me know if you want to discuss more details of our projects, as they seem to be quite similar.
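                    For what it's worth, the counting step itself is simple once you have per-site base calls; a toy sketch with made-up numbers (not maq/bwa output):

```python
from collections import Counter

# Toy pileup at one mtDNA site in a pool: the base called by each overlapping
# read. Duplicates are kept, since in a pooled sample the read counts
# themselves are the allele-frequency signal.
bases_at_site = ["A"] * 5400 + ["G"] * 600  # 6000x coverage at this site

counts = Counter(bases_at_site)
depth = sum(counts.values())
freqs = {base: n / depth for base, n in counts.items()}
print(freqs)  # {'A': 0.9, 'G': 0.1}
```

                    The caveat from the earlier posts applies: deduplicating before this count would collapse the stack and destroy the frequency estimate.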



                    • #11
                      In Illumina sequencing, there are typically two types of duplicates: PCR duplicates and optical duplicates. Optical duplicates are sequences that in fact come from a single cluster but are identified by the software as multiple adjacent clusters. They can be identified without alignment: you just check the sequence and the cluster coordinates on the image.

                      PCR duplicates are usually identified after alignment. Although there are alignment-free approaches, they are less powerful when alignment is available. Dedup typically works by identifying read pairs having identical 5'-end coordinates (3'-end coordinates are not considered). The chance of two 5'-end coordinates being identical depends on the coverage, but is usually small. You can calculate the theoretical false dedup rate using my formula (0.28*m/s/L, where m is the number of read pairs, s is the standard deviation of the insert size distribution and L is the length of the genome; a sharp insert size distribution (small s) unfortunately increases the false dedup rate). If the empirical duplicate rate is much higher than this 0.28*m/s/L, they are mostly true PCR duplicates.

                      Dedup for single-end reads has a high false positive rate given deep data. This is fine for SNP calling, because ~40X coverage is enough for SNP calling. However, it is dangerous to dedup SE reads for RNA-seq or ChIP-seq, where the read count matters. As mrawlins has suggested, it would be better to account for duplicates in your read counting model than to run a dedup program.

                      The rate of PCR duplicates is 0.5*m/N (m is the number of sequenced reads and N is the number of DNA molecules before amplification). It is not greatly affected by the number of PCR cycles. The key to reducing PCR duplicates is to get enough DNA (a large N). Also, the dup rate is proportional to m: the more you sequence, the higher the dup rate.
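                      Plugging illustrative numbers into the two formulas above (the example values are invented; only the formulas come from this post):

```python
def false_dedup_rate(m, s, L):
    """Theoretical rate of wrongly collapsing distinct pairs: 0.28*m/s/L."""
    return 0.28 * m / s / L

def pcr_dup_rate(m, N):
    """Expected PCR duplicate rate: 0.5*m/N."""
    return 0.5 * m / N

m = 100e6   # read pairs sequenced (invented example value)
s = 20.0    # standard deviation of the insert size distribution, bp (invented)
L = 3.1e9   # genome length, roughly human
N = 1e9     # distinct DNA molecules before amplification (invented)

print(f"theoretical false dedup rate: {false_dedup_rate(m, s, L):.4%}")  # ~0.05%
print(f"expected PCR duplicate rate:  {pcr_dup_rate(m, N):.2%}")         # 5.00%
```

                      With these numbers an observed duplicate rate far above 0.05% would point to true PCR duplicates, and shrinking s (a tighter insert size distribution) inflates the false dedup rate, matching the caveat above.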



                      • #12
                        Originally posted by foxyg View Post
                        Can you point out which paper it is please?
                        The paper is the Bainbridge 2010 paper in Genome Biology:

                        http://genomebiology.com/2010/11/6/R62

                        Table 3 is the relevant one.

                        Page 3 describes the treatment of PE as SE in more detail:

                        PE sequencing, in contrast, allows us to use information about both the start and the end of the capture-DNA fragment in order to determine whether the data are derived from independent DNA molecules. Thus, the increased information content of PE data allows us to reduce the misidentification of duplicate reads.

                        To test these theories we also analyzed the PE data as if they were generated from a single-ended frag library. This caused the on-target alignment rate to drop slightly to 73% and the duplicate rate to nearly quadruple to 27.6%, virtually identical to the Illumina frag library duplicate rate.



                        • #13
                          It has just occurred to me that if PCR duplicates are unbiased (although in fact they are), we may not need to dedup, because the read count is then also unbiased. Duplicates affect the variance but not the mean. Dedup matters more for variant discovery: although the read count is unbiased, the larger variance will lead to more false calls.

                          Am I about right?
                          Last edited by lh3; 10-15-2010, 09:11 AM.
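                          This point can be checked with a toy simulation (an entirely invented model, not lh3's code: every fragment gains one extra PCR copy with the same probability, i.e. perfectly unbiased duplication):

```python
import random
from statistics import mean, pvariance

random.seed(1)

def with_duplicates(true_count, dup_prob):
    """Each true fragment yields one read, plus one PCR copy with prob dup_prob."""
    return sum(1 + (random.random() < dup_prob) for _ in range(true_count))

# Two loci with a true 2:1 fragment ratio, "sequenced" 500 times over.
counts_a = [with_duplicates(2000, 0.3) for _ in range(500)]
counts_b = [with_duplicates(1000, 0.3) for _ in range(500)]

# The mean counts keep the 2:1 ratio (duplication is unbiased)...
print(round(mean(counts_a) / mean(counts_b), 2))  # 2.0
# ...but the counts now scatter around 2600 instead of being exactly 2000.
print(pvariance(counts_a) > 0)  # True
```

                          Under this idealized model the ratio between loci is preserved and only the spread grows; the replies below note that real PCR amplification is not this even.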



                          • #14
                            Originally posted by lh3 View Post
                            It has just occurred to me that if PCR duplicates are unbiased (although in fact they are), we may not need to dedup, because the read count is then also unbiased. Duplicates affect the variance but not the mean. Dedup matters more for variant discovery: although the read count is unbiased, the larger variance will lead to more false calls.

                            Am I about right?
                            I'm probably going to write something wrong, but for ChIP-seq (and RNA-seq) the sample is biased (due to enrichment) and I believe this bias is propagated through PCR (and read counts)... the rich get richer, and the most represented sequences are more likely to be resequenced...
                            Ok, I stop here before I harm my reliability in this community :-)



                            • #15
                              Thanks, dawe. I had not thought about this bit. But naively, if PCR is unbiased (i.e. every sequence is amplified for the same number of cycles), it should preserve the ratio of read counts between different loci. The problem is the enlarged variance, but perhaps the variance caused by duplicates is still small in comparison to the variance in gene expression.

                              I am not familiar with ChIP-seq/RNA-seq. Probably I am saying something wrong; my intuition is always unreliable. I just want to know the true answer...

