Hello all,
I'm currently analyzing single-end 50 bp RNA-seq data that was sequenced at an outside facility. I've got a very naive question, since I'm relatively new to all this.
The facility provided me with what they call raw reads, which still contain sequencing adaptors etc. In addition, I also have the pre-processed "clean" reads. The details of the "cleaning", as they described it, are as follows:
1. Remove reads that contain adaptor sequences.
2. Remove reads in which the percentage of unknown bases (N) is greater than 10%.
3. Remove low-quality reads: if more than 50% of the bases in a read have a quality value ≤ 5, the read is defined as low quality.
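For my own understanding, here is roughly how I picture those three filters in code. This is only a sketch under my assumptions (uncompressed Phred+33 FASTQ, a standard Illumina adaptor prefix, placeholder file names), not the facility's actual pipeline:

```python
# Sketch of the three filters described above (assumptions: plain FASTQ,
# Phred+33 qualities, standard Illumina adaptor; not the facility's tool).

ADAPTOR = "AGATCGGAAGAGC"  # common Illumina adaptor prefix (assumption)

def passes_filters(seq: str, qual: str) -> bool:
    # 1. Drop reads containing the adaptor sequence.
    if ADAPTOR in seq:
        return False
    # 2. Drop reads with more than 10% unknown bases (N).
    if seq.upper().count("N") / len(seq) > 0.10:
        return False
    # 3. Drop reads where more than 50% of bases have quality <= 5
    #    (Phred+33: quality q is encoded as chr(q + 33)).
    low = sum(1 for c in qual if ord(c) - 33 <= 5)
    if low / len(qual) > 0.50:
        return False
    return True

def filter_fastq(path_in: str, path_out: str) -> None:
    with open(path_in) as fin, open(path_out, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            header, seq, plus, qual = (line.rstrip("\n") for line in record)
            if passes_filters(seq, qual):
                fout.write("\n".join((header, seq, plus, qual)) + "\n")

filter_fastq("raw_reads.fastq", "clean_reads.fastq")  # placeholder names
```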
I've already used the clean reads for alignment and other downstream analyses, but I wanted to be sure, so I went ahead and ran FastQC on the "clean" FASTQ files. It flags the sequence duplication levels as high (roughly >66% on average for each of my samples).
I think this is because the "cleaning" process enriches the FASTQ files for higher-quality reads, but could it instead be due to an error during library preparation, or something else? Does it even make sense to QC these processed FASTQ files?
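For reference, a crude way to re-count exact duplicates outside FastQC would be something like the following. FastQC's own estimator subsamples and extrapolates, so the numbers won't match exactly, and the file name is again a placeholder:

```python
# Crude sanity check: fraction of reads whose exact sequence occurs more
# than once. FastQC's duplication estimate is more sophisticated, so treat
# this only as a rough cross-check.

from collections import Counter

def duplicate_fraction(path: str) -> float:
    counts = Counter()
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence lines in a 4-line FASTQ record
                counts[line.rstrip("\n")] += 1
    total = sum(counts.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / total if total else 0.0

print(f"duplicated reads: {duplicate_fraction('clean_reads.fastq'):.1%}")
```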
Ege