Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • jyoti.sutar92
    replied
    Similar problem

    I was facing a similar problem with rmdup of picard and samtools.
    After performing 'rmdup', picard left `~23,000' reads whereas samtools left me with '~1800' reads only, which is a huge difference.
    Anyhoo, this thread explained a lot... So thank you everyone who has contributed.

    Leave a comment:


  • amsci8
    replied
    Thank you, syfo!

    Leave a comment:


  • syfo
    replied
    I think that MarkDuplicates is considering the bases that were "clipped" during the alignment process: those terminal parts of the sequences do belong to the read but are not reported as part of the alignments. This is not the same as the "trimming" process that removes low-quality and/or adapter sequences before the alignment. As you pointed out, those are definitely lost.

    Leave a comment:


  • amsci8
    replied
    Mark duplicates post fastq trimming?

    Hello, I am fairly new to bioinformatics and have a question about using Picard Mark Duplicates after trimming sequence reads.

    (I could not find the exact answer to my question (so I apologize if this a repeat question), but if it exists please link to the answer.)

    Does anyone know how Picard Mark Duplicates knows which bases were trimmed from a read? If you trim bases from the 5' end of the read, how does Picard know that? See quoted comment from xguo below.

    Note, I am using trimmomatic to trim low quality bases and adapters from paired end whole genome DNA re-sequencing data. I am aligning this data to a reference then using the GATK to call snp variants for population genetic analyses... so I need remove duplicates to ensure accurate genotype/snp calls.

    Thank you!
    Amanda


    Originally posted by xguo View Post
    The following is what I found from samtools-help list:

    "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
    have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

    Leave a comment:


  • Heisman
    replied
    Nevermind, figured out my problem. The problem was I'm an idiot. Our paired end sequence data comes back with read one ending in '#0/1' and read two ending in '#0/3', and the 3 has to be converted to a 2 for MarkDuplicates to work. I naturally used a command that changed every 3 to a 2...
    Last edited by Heisman; 09-20-2011, 05:03 PM.

    Leave a comment:


  • cjp
    replied
    Originally posted by swbarnes2 View Post
    Thanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.
    Those are probably unpaired reads - you can look in the metrics file to see how many unpaired and paired end reads were marked as duplicates.

    Before read names were changed to be the same:

    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
    A0600028 3172904 67528513 0 2607750 0 0 0.018865


    After making the name change:

    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
    A0600028 3172904 67528513 0 2607750 38116791 0 0.570364 34131325

    Chris

    Leave a comment:


  • Michael.James.Clark
    replied
    Originally posted by cjp View Post
    Hi there,

    I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

    Chris
    Indeed, that's likely what it is. Actually, if you check the SAM format spec sheet, paired reads are supposed to have identical names.



    See the example.

    Leave a comment:


  • swbarnes2
    replied
    Originally posted by cjp View Post
    Hi there,

    I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

    Chris
    Thanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.

    Leave a comment:


  • cjp
    replied
    Originally posted by swbarnes2 View Post
    I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

    java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

    java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

    The resultant file is only a little smaller than the original. So I checked some known duplicates;

    grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
    100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

    grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
    100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

    I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
    Hi there,

    I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

    Chris

    Leave a comment:


  • drio
    replied
    I do not know why one of those pairs was not flagged as PCR duplicate.

    Leave a comment:


  • swbarnes2
    replied
    Originally posted by drio View Post
    Picard looks for the mapping coordinates (and orientation), not the actual sequence.

    Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.
    Ah, so it's just ignoring those because they are repetitive in the genome. Okay, I guess that makes sense, though I'd rather have have them in there, but deduped, so I can tell the difference between a repetative part, and a deletion.

    But I have other things that align uniquely, and still aren't being trimmed.

    grep 2:78:8570:13184 full_genome_grepped_sorted_deduped.sam
    100729:2:78:8570:13184:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT hhghhhhhhhhhhghhhhhhchhhhghheafdfdfafcfcfcfafafafa X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
    100729:2:78:8570:13184:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghghhhhhhfhh X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

    grep 2:33:3309:20360 full_genome_grepped_sorted_deduped.sam
    100729:2:33:3309:20360:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT ggggggggggdffffgggggggfagggggffafcfafacab_b]b_fa]a X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
    100729:2:33:3309:20360:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhg_hhhhhhhhahhhhhhfhfhheggcf X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

    The coordinates for each cluster are identical, even the sequences are identical at both ends, and isn't a mapping quality of 37 fine? Why isn't one of these being removed by Picard?

    Leave a comment:


  • drio
    replied
    Originally posted by swbarnes2 View Post
    I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?
    Picard looks for the mapping coordinates (and orientation), not the actual sequence.

    Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.

    Leave a comment:


  • JohnK
    replied
    I'm not sure really. I actually use samtools now to remove the dups and I believe we're moving towards filtering out the singletons, cross-chromosome pairs, invalid insert sizes, and anything else under a QV of ten. Just for the sake of the most accurate data.

    Leave a comment:


  • swbarnes2
    replied
    Originally posted by JohnK View Post
    Hi,

    Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.
    Yeah, that's what I wrote a script for a while back, but I wanted to use something standard, something that other people can trust works.

    But Picard doesn't seem to be removing dupes within chromosomes. I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?

    Leave a comment:


  • JohnK
    replied
    Originally posted by swbarnes2 View Post
    I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

    java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

    java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

    The resultant file is only a little smaller than the original. So I checked some known duplicates;

    grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
    100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

    grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
    100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

    I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
    Hi,

    Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Exploring the Dynamics of the Tumor Microenvironment
    by seqadmin




    The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
    07-08-2024, 03:19 PM
  • seqadmin
    Exploring Human Diversity Through Large-Scale Omics
    by seqadmin


    In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
    06-25-2024, 06:43 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 07-16-2024, 05:49 AM
0 responses
24 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-15-2024, 06:53 AM
0 responses
31 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-10-2024, 07:30 AM
0 responses
40 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-03-2024, 09:45 AM
0 responses
205 views
0 likes
Last Post seqadmin  
Working...
X