Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    The example you are showing is not significant to determine if there was any problem with picard. You are showing the two ends of a pair. They are both in your sam file before and after running picard. That's the expected behavior.
    -drd

    Comment


    • #17
      Originally posted by Michael.James.Clark View Post
      There is a nice discussion of such things (including a post where Heng explains the answer to your question) as well as links to numerous other threads on the topic here: http://seqanswers.com/forums/showthread.php?t=6854

      Hope that helps.
      Referring to this topic, is duplicate removal really necessary for DNA sequencing?
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
      Last edited by bioinfosm; 11-22-2010, 12:59 PM. Reason: working link
      --
      bioinfosm

      Comment


      • #18
        Originally posted by swbarnes2 View Post
        I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

        java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

        java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

        The resultant file is only a little smaller than the original. So I checked some known duplicates;

        grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
        100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
        100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

        grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
        100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
        100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

        I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
        Hi,

        Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.

        Comment


        • #19
          Originally posted by JohnK View Post
          Hi,

          Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.
          Yeah, that's what I wrote a script for a while back, but I wanted to use something standard, something that other people can trust works.

          But Picard doesn't seem to be removing dupes within chromosomes. I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?

          Comment


          • #20
            I'm not sure really. I actually use samtools now to remove the dups and I believe we're moving towards filtering out the singletons, cross-chromosome pairs, invalid insert sizes, and anything else under a QV of ten. Just for the sake of the most accurate data.

            Comment


            • #21
              Originally posted by swbarnes2 View Post
              I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?
              Picard looks for the mapping coordinates (and orientation), not the actual sequence.

              Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.
              -drd

              Comment


              • #22
                Originally posted by drio View Post
                Picard looks for the mapping coordinates (and orientation), not the actual sequence.

                Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.
                Ah, so it's just ignoring those because they are repetitive in the genome. Okay, I guess that makes sense, though I'd rather have have them in there, but deduped, so I can tell the difference between a repetative part, and a deletion.

                But I have other things that align uniquely, and still aren't being trimmed.

                grep 2:78:8570:13184 full_genome_grepped_sorted_deduped.sam
                100729:2:78:8570:13184:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT hhghhhhhhhhhhghhhhhhchhhhghheafdfdfafcfcfcfafafafa X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
                100729:2:78:8570:13184:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghghhhhhhfhh X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

                grep 2:33:3309:20360 full_genome_grepped_sorted_deduped.sam
                100729:2:33:3309:20360:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT ggggggggggdffffgggggggfagggggffafcfafacab_b]b_fa]a X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
                100729:2:33:3309:20360:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhg_hhhhhhhhahhhhhhfhfhheggcf X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

                The coordinates for each cluster are identical, even the sequences are identical at both ends, and isn't a mapping quality of 37 fine? Why isn't one of these being removed by Picard?

                Comment


                • #23
                  I do not know why one of those pairs was not flagged as PCR duplicate.
                  -drd

                  Comment


                  • #24
                    Originally posted by swbarnes2 View Post
                    I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

                    java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

                    java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

                    The resultant file is only a little smaller than the original. So I checked some known duplicates;

                    grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
                    100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
                    100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

                    grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
                    100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
                    100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

                    I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
                    Hi there,

                    I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

                    Chris

                    Comment


                    • #25
                      Originally posted by cjp View Post
                      Hi there,

                      I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

                      Chris
                      Thanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.

                      Comment


                      • #26
                        Originally posted by cjp View Post
                        Hi there,

                        I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

                        Chris
                        Indeed, that's likely what it is. Actually, if you check the SAM format spec sheet, paired reads are supposed to have identical names.



                        See the example.
                        Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                        Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                        Projects: U87MG whole genome sequence [Website] [Paper]

                        Comment


                        • #27
                          Originally posted by swbarnes2 View Post
                          Thanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.
                          Those are probably unpaired reads - you can look in the metrics file to see how many unpaired and paired end reads were marked as duplicates.

                          Before read names were changed to be the same:

                          LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
                          A0600028 3172904 67528513 0 2607750 0 0 0.018865


                          After making the name change:

                          LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
                          A0600028 3172904 67528513 0 2607750 38116791 0 0.570364 34131325

                          Chris

                          Comment


                          • #28
                            Nevermind, figured out my problem. The problem was I'm an idiot. Our paired end sequence data comes back with read one ending in '#0/1' and read two ending in '#0/3', and the 3 has to be converted to a 2 for MarkDuplicates to work. I naturally used a command that changed every 3 to a 2...
                            Last edited by Heisman; 09-20-2011, 05:03 PM.

                            Comment


                            • #29
                              Mark duplicates post fastq trimming?

                              Hello, I am fairly new to bioinformatics and have a question about using Picard Mark Duplicates after trimming sequence reads.

                              (I could not find the exact answer to my question (so I apologize if this a repeat question), but if it exists please link to the answer.)

                              Does anyone know how Picard Mark Duplicates knows which bases were trimmed from a read? If you trim bases from the 5' end of the read, how does Picard know that? See quoted comment from xguo below.

                              Note, I am using trimmomatic to trim low quality bases and adapters from paired end whole genome DNA re-sequencing data. I am aligning this data to a reference then using the GATK to call snp variants for population genetic analyses... so I need remove duplicates to ensure accurate genotype/snp calls.

                              Thank you!
                              Amanda


                              Originally posted by xguo View Post
                              The following is what I found from samtools-help list:

                              "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
                              have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

                              Comment


                              • #30
                                I think that MarkDuplicates is considering the bases that were "clipped" during the alignment process: those terminal parts of the sequences do belong to the read but are not reported as part of the alignments. This is not the same as the "trimming" process that removes low-quality and/or adapter sequences before the alignment. As you pointed out, those are definitely lost.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                17 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                22 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                46 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X