Hi,
We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two families of reads: reads that align "randomly", and reads that form (sometimes huge) pileups of perfectly identical reads. In ChIP-seq, identical reads are suspected PCR artefacts and are counted only once. Should 100% identical reads (including base calls that conflict with the reference sequence) likewise be counted only once in RNA-seq?
This approach was used, for instance, in PNAS, December 23, 2008, vol. 105, no. 51, 20179-20184:
"Genomic Mapping of the Sequence Tags.
Position-specific base compositions were made by compiling all uniquely aligned reads. The first base of every sequence tag was discarded because of nearly random utilization at the beginning of all sequences. To eliminate redundancies created by PCR amplification, all tags with identical sequences were considered single reads. After removal of adaptor sequences from the reads, the reads were compressed to a nonredundant list of unique sequence tags, which were then mapped to the human genome (hg17) with MosaikAligner (29), using a maximum of 2 mismatches over 95% alignment of the tag (34 nt) and a hash size of 15."
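To make the collapsing step in the quoted methods concrete, here is a minimal Python sketch of how I understand it, not the paper's actual code. It assumes the reads are already adaptor-trimmed and held in memory as plain strings (SOLiD color-space reads would simply be treated as opaque strings here); the function name and the example reads are made up for illustration.

```python
from collections import Counter

def collapse_identical_reads(reads):
    """Collapse exact-duplicate reads (hypothetical helper).

    Returns the distinct sequences in first-seen order, plus a Counter
    recording how many times each sequence occurred before collapsing.
    """
    counts = Counter(reads)
    # Counter is a dict subclass, so iteration follows insertion order
    # (guaranteed from Python 3.7 onwards).
    unique_reads = list(counts)
    return unique_reads, counts

# Made-up example reads, for illustration only.
reads = ["ACGTAC", "ACGTAC", "TTGACA", "ACGTAC", "GGCATT"]
unique_reads, counts = collapse_identical_reads(reads)
```

Only `unique_reads` would then be passed to the aligner, while `counts` keeps the original multiplicities in case one wants to compare collapsed and uncollapsed expression estimates afterwards.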