
**luc** · 09-28-2012, 03:06 PM

Could you give us some details about your protocols and the structure of the adapters that you are using? How do you get randomized sequences at the beginning of your reads? In any case, I would suggest running FastQC or a similar program on your data to check for any quality problems.

**tjs7** · 10-01-2012, 10:38 AM

The core facility I collaborate with ran FastQC for me after I posted this, and it showed that quality scores were above 30 for bases 1-55, with the exception of base 5, which had a very low score. The explanation from the core facility's computer analyst was that having a C in every read at position 5 is probably confusing the machine. Further analysis showed that C was correctly called about 35% of the time, but the other ~65% of the time the machine called the 5th base as N.

During filtering, we were requiring that our reads have a C in the 5th position, thus we were throwing out a large portion of the data. By simply eliminating that requirement, we were able to include most reads in our data set, and most reads appear to have the correct structure.
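For anyone wanting a middle ground between the strict filter and dropping the check entirely, here is a minimal sketch that accepts either the expected base or an N at the fixed position. This is not our actual pipeline: the function name `passes_structure_filter` and its parameters are made up for illustration, and accepting N (rather than any base) is an assumption.

```python
def passes_structure_filter(read, fixed_pos=5, fixed_base="C", allow_n=True):
    """Return True if the read has the expected base at a 1-based position.

    With allow_n=True, an uncalled base (N) at that position is also accepted.
    Reads shorter than fixed_pos are rejected.
    """
    if len(read) < fixed_pos:
        return False
    base = read[fixed_pos - 1].upper()
    if base == fixed_base:
        return True
    return allow_n and base == "N"


# Toy reads (hypothetical): C at position 5, N at position 5, T at position 5, too short.
reads = ["ACGTCAAGG", "TTGANCCGT", "GGCATTTAA", "ACG"]

strict = [r for r in reads if passes_structure_filter(r, allow_n=False)]
relaxed = [r for r in reads if passes_structure_filter(r)]
```

With the strict setting only the first toy read survives; with the relaxed setting the read with N at position 5 is kept as well.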

I have no explanation why this occurred, since libraries of essentially the same structure were sequenced a year ago and bases were called correctly. It could be a particular software update or machine update. If anyone needs specifics (like software version, etc.) I am sure I could get them.

Thanks

**luc** · 10-01-2012, 06:10 PM

Hi,

Good that you figured that out.
Having an identical base at one position in all clusters is obviously not a good premise, as you have noted. Such problems are to be expected, and you might have been merely lucky when doing your first sequencing run. Further, I guess the HiSeq system has gotten considerably better over the last year, meaning we are getting a lot more reads on average. Perhaps denser clusters lead to more problems in parts of the sequence lacking complexity?

I would have some more questions. Why would you need your 4 degenerate bases to determine PCR duplicates? Are you analyzing a small genome? I would assume that for eukaryotic genomes the first 30 bases (or perhaps better something like bases 12-40) are diverse enough for a good removal of PCR duplicates, especially for paired end data. At least that is our working assumption.

How did you generate the 4 degenerate bases at the beginning of the read? That sounds interesting.
What is the resulting base composition of your sequenced first 4 bases?

**tjs7** · 10-02-2012, 04:10 AM

Our library prep strategy has two variables that help identify PCR duplicates. First, our reads are designed to be of various lengths. Second, the RT primer we use has the 4 degenerate bases, which end up at the start of our reads (essentially 256 possible RT primers in the mix).

Doing a probability calculation, this comes out to thousands of possible combinations of read lengths and 4 degenerate base "codes" for a given genomic location. Thus, if we have multiple reads mapping to the exact same genomic coordinates and having the same 4 base "code," we treat those as PCR duplicates and collapse those reads into 1 read.
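The collapsing step can be sketched roughly as follows. This is only an illustration under assumptions: the `Read` fields, the function name `collapse_duplicates`, and the choice to keep the first read seen per key are hypothetical, and the number of distinct read lengths used in the arithmetic is a placeholder, not a figure from our protocol.

```python
from collections import namedtuple

# Hypothetical read record: mapped coordinates plus the 4-base code from the RT primer.
Read = namedtuple("Read", ["name", "chrom", "strand", "start", "end", "code"])

N_CODES = 4 ** 4          # 256 possible 4-base codes
N_LENGTHS = 20            # placeholder assumption for the number of distinct read lengths
N_COMBINATIONS = N_CODES * N_LENGTHS  # combinations per genomic start position


def collapse_duplicates(reads):
    """Keep one read per (chrom, strand, start, end, code) key.

    Reads sharing coordinates AND code are treated as PCR duplicates;
    the first one encountered is kept.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.strand, read.start, read.end, read.code)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", "+", 100, 150, "ACGT"),
    Read("r2", "chr1", "+", 100, 150, "ACGT"),  # same coordinates and code: duplicate
    Read("r3", "chr1", "+", 100, 150, "TTGA"),  # same coordinates, different code: kept
    Read("r4", "chr1", "+", 100, 140, "ACGT"),  # different length: kept
]
unique = collapse_duplicates(reads)
```

Because the end coordinate is part of the key, read length is taken into account automatically.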

In practice, this works well for all but the most highly expressed genes. Those relatively few genes are so highly expressed in the tissue we study, and yield so many reads, that each combination of length, sequence, and 4-base code is repeated multiple times. We are willing to accept this to limit PCR duplicates throughout the majority of the dataset.

The base composition of the first 4 bases ended up as 25% A, 25% G, 15% C, and 35% T. Not a perfect 25% each, but OK for our purposes, which are qualitative comparative analyses.
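A composition like this can be tallied with a few lines of Python. This is a minimal sketch, not the tool we used: the function name `code_composition` is hypothetical, and it pools all four positions together rather than reporting each position separately.

```python
from collections import Counter


def code_composition(reads, code_len=4):
    """Return the fraction of each base (A, C, G, T, N) over the first code_len bases."""
    counts = Counter()
    for read in reads:
        counts.update(read[:code_len].upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in "ACGTN"}


# Toy reads (hypothetical): the first four bases are ACGT and TTTT.
composition = code_composition(["ACGTAAAA", "TTTTCCCC"])
```

On real data the input would be the read sequences pulled from the FASTQ file before the degenerate bases are trimmed.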

**luc** · 10-02-2012, 12:47 PM

Thanks a lot for the details on your protocol! Very interesting.

**jparsons** · 10-03-2012, 05:31 AM

I'd be interested to know how often you get reads that look like PCR dupes without the random RT primer but have different degenerate bases. In other words, are 90% of the "duplicate" reads really duplicates, or is it more like 9%?
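One way to get at that number is sketched below, under assumptions: the input format (tuples of chrom, start, end, code) and the function name `duplicate_counts` are made up for illustration. It counts how many reads would be removed using coordinates alone versus coordinates plus the 4-base code.

```python
from collections import defaultdict


def duplicate_counts(reads):
    """reads: iterable of (chrom, start, end, code) tuples.

    Returns (coord_dups, code_dups):
      coord_dups = reads removed if identical coordinates alone define a duplicate
      code_dups  = reads removed if coordinates AND code must both match
    """
    codes_by_coords = defaultdict(list)
    for chrom, start, end, code in reads:
        codes_by_coords[(chrom, start, end)].append(code)

    coord_dups = 0
    code_dups = 0
    for codes in codes_by_coords.values():
        coord_dups += len(codes) - 1
        code_dups += len(codes) - len(set(codes))
    return coord_dups, code_dups


reads = [
    ("chr1", 100, 150, "ACGT"),
    ("chr1", 100, 150, "ACGT"),
    ("chr1", 100, 150, "TTGA"),
    ("chr1", 200, 240, "GGCA"),
]
coord_dups, code_dups = duplicate_counts(reads)
# Share of coordinate-level "duplicates" that also share the code.
fraction_true = code_dups / coord_dups if coord_dups else 0.0
```

In this toy input, two reads look like duplicates by coordinates alone, but only one of them also shares the code.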

Why am I losing up to 5 bases at start of reads?
