Seqanswers Leaderboard Ad

**d17** · 02-01-2011, 03:25 PM

According to the SAM format specification (http://samtools.sourceforge.net/SAM1.pdf) , read names must be the same for read pairs. Your read pairs do not seem to follow this rule, so it appears they are not being recognized as pairs by MarkDuplicates.

**shruti** · 07-22-2011, 10:07 AM

Hi,

I'm also facing the same problem here is case in question. The read names are same.

HWI-ST220_63:6:2108:11340:115090 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIHGIGJIJJIJHGIHGGJJJJJJJJIJJJIJJJIJGHHHHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:2108:11340:115090 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA HHJIJJIIIJJJJJJJJIHDJJJIJIJJJJJIIJJJJHHHHHDFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:1108:12937:121403 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIGJIHGJJJJIHGJJJIJJJJJJIJHJIJIJJJIJHHHFHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:1108:12937:121403 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA JJIIJIJIGJJGIIJJIGE;JJJJIIJJJJIIIJJJJHFHHHFFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

Any pointers would be most appreciated.

**Michael.James.Clark** · 07-22-2011, 10:28 AM

Need to include the command you used, etc.

**shruti** · 07-24-2011, 12:46 AM

I had 1st sorted the file
samtools sort BothEndsMapped.bam BothEndsMapped.sorted
and then rmdup..
samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam
I guess by default it does for paired end reads.

**DZhang** · 07-24-2011, 06:19 AM

Hi cbl,

Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have.

Thank you,
Douglas

https://www.contigexpress.com

**shruti** · 07-24-2011, 07:24 AM

Well I'm looking for cases of rearrangements hence, reads mapping in different chromosomes is important.
As far as removing duplicates is considered, its quite subjective to the need as well as coverage.
To start with being strict on the data I'm removing the duplicates. Cases of different reads showing same rearrangement would be say 90% chance that they are case of rearrangement (provided they are not into repeats/segmental duplicates). In case of low depth of coverage it would be quite difficult distinguish rearrangements due to the real data and PCR duplicates, so have to be little strict on that front.

**DZhang** · 07-24-2011, 07:43 AM

Hi shruti,

I understand your points. For me, I use a different approach in rearrangement analysis. I rely on a very tiny subset of the raw reads to infer the rearrangement. So at the beginning I use very nonrestrictive criteria to locate candidate reads that indicate potential rearrangement. Then I will dive into these reads. In the context of this post, including duplicate reads at the beginning does not hurt your chance finding/validating the rearrangement. When you have a small subset of candidate reads, you may look into duplication, most likely with custom scripts.

So my suggestion here is not to dwell too much on removing duplicated reads at the beginning as some general-purpose tools (e.g., samtools) may not be suitable for your specific needs.

Douglas

https://www.contigexpress.com

**shruti** · 07-24-2011, 08:06 AM

Well..yes I guess one can start with being lenient. Just curious, if you start with these reads, then how do you filter at the end. or do you also experimentally validate these cases. how much of a false positive do you observe.

**DZhang** · 07-24-2011, 10:45 AM

Hi shruti,

I usually use custom scripts to screen for 1) whether they map to the same chr. 2) how far apart and in which orientation 3) mapping quality and ambiguity 4) read counts (independent or PCR-duplication?). I am reluctant to put a hard number on false positive but I agree with you that it is critical to experimentally follow up and validate the models obtained from next-gen data.

Douglas

https://www.contigexpress.com

**Michael.James.Clark** · 07-24-2011, 02:45 PM

Originally posted by shruti View Post

I had 1st sorted the file
samtools sort BothEndsMapped.bam BothEndsMapped.sorted
and then rmdup..
samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam
I guess by default it does for paired end reads.

Use Picard MarkDuplicates instead and get back to us. Samtools rmdup is not recommended.

If you add more responses, copy and paste the exact command. Remove file names if they're sensitive.

Originally posted by DZhang View Post

Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have.

There has never been a demonstrated case where leaving duplicates in is advantageous.

In contrast, numerous analyses have demonstrated beneficial effects of removing duplicates.

Many discussions of this have been done on this site in the past. http://seqanswers.com/forums/showthread.php?t=6854 for example.

**DZhang** · 07-24-2011, 03:12 PM

Hi Michael.James.Clark,

I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons. I certainly recognize the benefits of duplicate removal, due to the limitations of current laboratory technology and bioinformatic methods, I would not put it as must-have as long as the user considers the impact of duplicate when interpreting the analysis results.

Douglas

https://www.contigexpress.com

**Michael.James.Clark** · 07-24-2011, 03:55 PM

Originally posted by DZhang View Post

I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons.

I think you ought to read that thread more carefully then, because that is not what I said, and I was not the only one to explain that duplicate removal benefits many downstream analyses.

(Oh, by the way, I chose that thread because it is in my history since I replied to it, but there are many others on the same topic that you could read up on that might interest you.)

I will clarify my statement for you: Were we talking about ChIP-seq or RNA-seq here, I might agree that duplicate removal is a questionable step to take. Certainly if we were talking about single-end data, I would say that it's probably not necessary (and, in fact, is likely throwing the baby out with the bathwater in many cases).

However, it seems this is paired WGS (or some other DNA-seq) with the goal of variant detection. In that case, de-dupping has been shown to be beneficial for variant calling (SNVs, indels, SVs, CNVs).

With regards to SVs, de-dupping is particularly useful for long mate-paired data. For simple PE data at adequate depth, it may have little impact. However, I've not seen anyone demonstrate it hurts SV detection in those cases, therefore why not do it? Especially if you'll be using that same data for SNV calling, where it has demonstrably significant effects.

**DZhang** · 07-24-2011, 04:21 PM

Hi Michael.James.Clark,

Thank you for your additional comment. I would not characterize my opinion on duplicate removal as "questionable" step. I completely recognize the benefits of doing it at the beginning of the pipeline, but would not take it as a mandatory step - "must-have". In this particular case, cbl had executed the duplicate removal. My suggestion to cbl was to go ahead with the downstream analysis instead of halting the project progress. The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal.

It of course has value to find a solution that can remove the duplicates that are mapped to diffferent chr. and I hope your suggestion of picard will resolve that.

Douglas

https://www.contigexpress.com

**Michael.James.Clark** · 07-24-2011, 04:32 PM

Originally posted by DZhang View Post

The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal.

Samtools rmdup and Picard MarkDuplicates are not different due to "judgment calls".

See here: http://sourceforge.net/apps/mediawik...tools_rmdup.3F

"samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates."

That's because of the way samtools rmdup works... it looks for successive reads mapping to the same spot after sorting without considering the pair. It's not because the authors intended it to miss those from what I understand--it's because way back when they wrote it, they didn't consider that case.

(By the way, that also explains why shruti is seeing interchromosomal duplicates in her results, and why I told her to use Picard rather than samtools in the first place.

)

Also, I'm not sure actually if samtools rmdup considers the second end at all. I think it does, but I think it simply doesn't work on interchromosomals. Either way, it's safest to just use Picard from what I've seen.

Topics	Statistics	Last Post
Mechanical Forces in DNA Transcription Uncovered by Clemson Researchers by seqadmin Started by seqadmin, 10-02-2024, 04:51 AM	0 responses 13 views 0 likes	Last Post by seqadmin 10-02-2024, 04:51 AM
New Epigenetic Clock Links Cheek Cells to Mortality Risk by seqadmin Started by seqadmin, 10-01-2024, 07:10 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-01-2024, 07:10 AM
AI-Powered Blood Test Shows Promise for Early Ovarian Cancer Detection by seqadmin Started by seqadmin, 09-30-2024, 08:33 AM	0 responses 25 views 0 likes	Last Post by seqadmin 09-30-2024, 08:33 AM
Stem Cell Research Suggests Human Cells May Enter Developmental Pause by seqadmin Started by seqadmin, 09-26-2024, 12:57 PM	0 responses 18 views 0 likes	Last Post by seqadmin 09-26-2024, 12:57 PM

Seqanswers Leaderboard Ad

Announcement

Problem removing duplicate reads? (samtools and picard)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News