  • Should identical reads be reduced to a single count?

    Hi,

    We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two read families: reads that align "randomly", and reads that form (sometimes huge) perfectly identical pileups... In the same way that identical reads are suspected PCR artefacts and counted only once in ChIP-seq, should 100% identical reads (including base calls in conflict with the reference sequence) be counted only once in RNA-seq?

    This was used for instance in PNAS December 23, 2008 vol. 105 no. 51 20179-20184:

    "Genomic Mapping of the Sequence Tags.

    Position-specific base compositions were made by compiling all uniquely aligned reads. The first base of every sequence tag was discarded because of nearly random utilization at the beginning of all sequences. To eliminate redundancies created by PCR amplification, all tags with identical sequences were considered single reads. After removal of adaptor sequences from the reads, the reads were compressed to a nonredundant list of unique sequence tags, which were then mapped to the human genome (hg17) with MosaikAligner (29), using a maximum of 2 mismatches over 95% alignment of the tag (34 nt) and a hash size of 15."
    Last edited by hingamp; 11-23-2009, 12:08 AM.

  • #2
    In the hope of stimulating a little discussion on this data cleaning issue, here is a picture of SOLiD reads mapped to a reference genome (whole transcriptome protocol, total RNA). Over and above the 'apparently random' reads near the top that stagger regularly across the sequence, notice a few 'columns' of identical reads. Are these reads PCR amplification artefacts, in which case multiple identical reads should be counted only once for gene expression quantification? Please ignore the sequence mismatches that accumulate near the read ends (the colored bases in the reads differ from the reference); I'll start a distinct thread on that subject.
    Attached Files

    Comment


    • #3
      Originally posted by hingamp View Post
      Hi,

      We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two read families: reads that align "randomly", and reads that form (sometimes huge) perfectly identical pileups... In the same way that identical reads are suspected PCR artefacts and counted only once in ChIP-seq, should 100% identical reads (including base calls in conflict with the reference sequence) be counted only once in RNA-seq?
      Note that one of the main purposes of RNA-seq on a SOLiD will be to count the number of transcripts in an RNA sample. For this purpose one would not want to discard all reads with identical starts: they are probably indicative of the representation of that RNA in the sample.

      They might also represent PCR biases, but there is a more likely culprit: the RNase III used to fragment the RNA prior to adaptor ligation. RNase III is a double-strand-specific RNase. At the most recent SOLiD User's Conference, Ambion reps said that they had found conditions under which it will also cleave single-stranded RNA, the implication being that these conditions are used for RNA fragmentation in the SOLiD Whole Transcriptome kit. Nevertheless this enzyme is a likely source of start bias, cleaving double-stranded RNA with abandon and then begrudgingly slicing up a little single-stranded RNA as well.

      Note that, while not ideal, start-bias is not the same as a count-bias. For a count, you just need to unambiguously identify a given read as belonging to a transcript from a certain gene. Where it starts, for this purpose, is not critical.

      Nevertheless, if you wanted to, you could use another method of RNA fragmentation that would be less likely to have start biases. A presenter at the meeting had tried heat, and found it to be less biased. But unless you want to delve deep into the issue, it might be better to avoid tinkering. I think that heat probably results in 5'-hydroxyl, 2'-3'-cyclic phosphate RNA fragment termini. These would likely not be good substrates for the adaptor ligation step. To convert these fragments into the 5'-phosphate, 3'-hydroxyl ended fragments that the ligase will want, you would need to do end repair. T4 PNK, by itself, might do the trick. Or not.

      --
      Phillip

      Comment


      • #4
        Is there any tool to remove exact duplicate reads? It should remove all the reads with the same exact sequence, keeping only the one with the best quality values, shouldn't it?

        Comment


        • #5
          Originally posted by javijevi View Post
          Is there any tool to remove exact duplicate reads? It should remove all the reads with the same exact sequence, keeping only the one with the best quality values, shouldn't it?
          What I can say is that the samtools rmdup does not fit the bill. Maybe Picard (Java) does, but we've not tried it (we've basically uncompressed the BAM files & located duplicates in the SAM file using brute force RAM-greedy scripts).

          Comment


          • #6
            Originally posted by hingamp View Post
            What I can say is that the samtools rmdup does not fit the bill. Maybe Picard (Java) does, but we've not tried it (we've basically uncompressed the BAM files & located duplicates in the SAM file using brute force RAM-greedy scripts).
            Is it possible to get these scripts? I've also written one myself, but I'm a biologist and not very experienced in optimizing memory usage in my code, so my script takes days to finish and, more importantly, does not select the duplicate with the best quality values.

            Comment


            • #7
              I've just asked my colleague who used that approach, but it seems the script was not kept. From what I remember it just parsed the SAM file and built an in-memory hash (Perl) using chromosome coordinates as the key. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast but used huge RAM (we're now equipped with ~64 GB machines to run just that kind of silly script, with no optimization whatsoever).
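
              The approach described above (hash keyed on chromosome coordinates, keeping the best-quality read) can be sketched in a few lines. This is a minimal Python reconstruction, not the original Perl script; the function name `dedup_sam` is made up here, and the column indices assume standard tab-separated SAM records:

              ```python
              def dedup_sam(lines):
                  """Keep one read per (RNAME, POS) key, preferring the highest
                  MAPQ (column 5 of a SAM record). Header lines pass through."""
                  header, best = [], {}
                  for line in lines:
                      if line.startswith("@"):          # header line
                          header.append(line)
                          continue
                      fields = line.rstrip("\n").split("\t")
                      key = (fields[2], fields[3])      # RNAME, POS
                      mapq = int(fields[4])             # MAPQ
                      if key not in best or mapq > best[key][0]:
                          best[key] = (mapq, line)      # keep best-quality read
                  return header + [rec for _, rec in best.values()]
              ```

              As in the Perl version, the hash grows with the number of distinct mapping positions, so on a whole-genome SAM this still needs a lot of RAM.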

              Comment


              • #8
                Originally posted by hingamp View Post
                I've just asked my colleague who used that approach, but it seems the script was not kept. From what I remember it just parsed the SAM file and built an in-memory hash (Perl) using chromosome coordinates as the key. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast but used huge RAM (we're now equipped with ~64 GB machines to run just that kind of silly script, with no optimization whatsoever).
                I see. My script does the same, but I think it is impractical for huge datasets. For example, with a 43 GB SAM file (corresponding to the mapping of a 33 GB FASTQ file), I guess I need at least all 32 GB of RAM in my machine, and it even has to use the swap partition.

                Any idea from the more IT-skilled people?

                Comment


                • #9
                  I would suggest changing the data structure you guys use. Change it to a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory footprint while keeping the semantics of a hash table. Also, I would code it in C to stay closer to the hardware.
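
                  To make the trie suggestion concrete, here is a toy version, in Python rather than C just to show the idea; shared read prefixes are stored once, so highly duplicated reads cost almost no extra memory. A production version would use packed arrays and 2-bit base encoding:

                  ```python
                  class Trie:
                      """Minimal prefix trie over read sequences, built from
                      nested dicts; '$' marks the end of a stored sequence."""
                      def __init__(self):
                          self.root = {}

                      def insert(self, seq):
                          """Insert seq; return True if it was already present
                          (i.e. this read is an exact duplicate)."""
                          node = self.root
                          for base in seq:
                              node = node.setdefault(base, {})
                          seen = "$" in node
                          node["$"] = node.get("$", 0) + 1   # duplicate count
                          return seen
                  ```

                  Inserting "ACGT" twice flags the second copy as a duplicate, while "ACGA" reuses the "ACG" path and only adds one new node.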

                  What is the biological reason for not being able to use samtools/picard mark dups?
                  -drd

                  Comment


                  • #10
                    Originally posted by drio View Post
                    I would suggest changing the data structure you guys use. Change it to a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory footprint while keeping the semantics of a hash table.
                    Thanks a lot. I'll give it a try... if I manage to understand it...

                    Originally posted by drio View Post
                    Also, I would code it in C to
                    stay closer to the hardware.
                    Yes. Learning C is my first TODO.

                    Originally posted by drio View Post
                    What is the biological reason for not being able to use samtools/picard mark dups?
                    None in my case. I just wanted to get rid of duplicate reads from the very beginning, so that downstream calculations and files would be smaller. I will check it.

                    Comment


                    • #11
                      Originally posted by drio View Post
                      I would suggest changing the data structure you guys use. Change it to a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory footprint while keeping the semantics of a hash table. Also, I would code it in C to stay closer to the hardware.
                      +1 regarding coding in C; Perl plus huge RAM works, but it clearly feels "quick & dirty".
                      Originally posted by drio View Post
                      What is the biological reason for not being able to use samtools/picard mark dups?
                      Again, please check carefully as this dates back a while, but I seem to recall samtools did just what rmdup suggests: remove all duplicates (not keeping one read per duplicate cluster). I've not tried it, but Picard is supposed to treat duplicates better?

                      Comment


                      • #12
                        Originally posted by hingamp View Post
                        +1 regarding coding in C; Perl plus huge RAM works, but it clearly feels "quick & dirty".

                        Again, please check carefully as this dates back a while, but I seem to recall samtools did just what rmdup suggests: remove all duplicates (not keeping one read per duplicate cluster). I've not tried it, but Picard is supposed to treat duplicates better?
                        As a warning to users: do not use "samtools rmdup", as its author explicitly advises against it. Use Picard instead.

                        Comment


                        • #13
                          Originally posted by hingamp View Post
                          I've just asked my colleague who used that approach, but it seems the script was not kept. From what I remember it just parsed the SAM file and built an in-memory hash (Perl) using chromosome coordinates as the key. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast but used huge RAM (we're now equipped with ~64 GB machines to run just that kind of silly script, with no optimization whatsoever).
                          If the SAM file is sorted by reference position, then you can simply pass through the file once, detecting duplicates by keeping the previous line in memory and writing non-redundant reads to another file. This uses almost no RAM. (You can sort an unsorted SAM file from the Unix command line via sort -k 3,3 -k 4,4n alignments.sam > alignments_sorted.sam; note that this will also shuffle any @-header lines into the body, so set them aside first.)
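
                          That single-pass idea might look like this in Python (a sketch only: it keeps the first read per (RNAME, POS) key; preferring the best-quality read instead would require buffering each group of duplicates):

                          ```python
                          def stream_dedup(sorted_lines):
                              """Single pass over a position-sorted SAM: emit one
                              read per (RNAME, POS) key, using constant memory.
                              Duplicates are adjacent because the input is sorted."""
                              prev_key = None
                              for line in sorted_lines:
                                  if line.startswith("@"):   # pass headers through
                                      yield line
                                      continue
                                  f = line.split("\t")
                                  key = (f[2], f[3])         # RNAME, POS
                                  if key != prev_key:
                                      yield line
                                      prev_key = key
                          ```

                          Because it is a generator, it can be chained directly onto a file handle without ever loading the whole SAM into memory.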

                          Comment


                          • #14
                            I wonder whether this should be applied to Illumina-produced data as well? Should I filter out identical (repeated) reads before doing any analysis?

                            Thanks,

                            D.

                            Comment


                            • #15
                              I see this all the time with mammalian genomes regardless of protocol and regardless of platform. It's probably caused by sequencing errors that turn common non-uniquely-mapping reads into uniquely-mapping reads.

                              Don't remove duplicate reads because they're duplicates; that defeats the purpose of quantification. Filter them out by the posterior probability that their best alignment is correct. That's what the MAPQ field in SAM format is supposed to contain, although many mappers ignore the spec so you might have to calculate it yourself. If you do that, the "stacks" go away on their own.

                              Comment
