Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • hlwright
    Member
    • Feb 2011
    • 30

    Picard Mark Duplicates help please

    I am tying to use Picard MarkDuplicates to remove my pcr duplicates from a human rna-seq bam file. The run was paired-end but I only have about 30% properly paired (that is another story).

    My command for picard was this:
    PHP Code:
    java -Xmx8 -jar /path/to/MarkDuplicates.jar INPUT=accepted_hits_sorted.bam OUTPUT=picard.bam METRICS_FILE=picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT 
    The picard_info.txt file suggests I have 52% duplicates in my data. However, when I convert the output file to SAM I can see (by eye) that there are duplicates still there:

    Original data:
    PHP Code:
    1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1863_1224_411_F3        99      chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    1939_1338_752_F3        355     chr1    13484   1       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1942_131_1549_F3        355     chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    486_302_1756_F3 99      chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
    ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
    ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
    `^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    2219_1966_235_F3        355     chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
    1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
    1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
    Post picard:
    PHP Code:
    1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1863_1224_411_F3        99      chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    1939_1338_752_F3        355     chr1    13484   1       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1942_131_1549_F3        355     chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    486_302_1756_F3 99      chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
    ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
    ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
    `^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    2219_1966_235_F3        355     chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
    1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
    1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
    I have also tried to set the REMOVE_DUPLICATES flag to false (I think this just flags them and leaves them in the new file rather than excluding them) but this gives me exactly the same result:
    PHP Code:
    1329_105_1480_F3        355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1863_1224_411_F3        99      chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    1939_1338_752_F3        355     chr1    13484   1       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
    1942_131_1549_F3        355     chr1    13484   3       50M     
    =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    2022_911_2004_F3        99      chr1    13484   3       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      OOTYX[TKCHXUHIUVONHDTUPTOTXQKMWQPTRB/DH?QJKNNZJJQQ      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
    486_302_1756_F3 99      chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    667_379_431_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
    ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1481_526_56_F3  355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1631_628_1988_F3        355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
    ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1934_635_52_F3  99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ZZ]\\[YZ^WVZHL_
    `^PQYVS<EYVQHNWWJ=1@V@9=9LSABRW2!22      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    2219_1966_235_F3        355     chr1    13528   0       50M     
    =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
    1329_105_1480_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
    1863_1224_411_F5-BC     147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
    1939_1338_752_F5-BC     403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0 
    On all three example SAM file extracts you can clearly see that five reads on chr1 at start position 13484 are duplicates, and are paired with reads starting at 13572, yet all have been left in and I can't see that any flag had changed.

    Am I doing something wrong? Please help!

    Thanks
    Helen
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    Originally posted by hlwright View Post

    On all three example SAM file extracts you can clearly see that five reads on chr1 at start position 13484 are duplicates, and are paired with reads starting at 13572, yet all have been left in and I can't see that any flag had changed.

    Am I doing something wrong? Please help!

    Thanks
    Helen
    I think I see the problem, and it's the same one I had when I first tried to use Picard's MarkDuplicates. Whatever program you used to make the .sam file understood that read 1329_105_1480_F3 and 1329_105_1480_F5-BC were paired, even though their names are different. Someone on this board said that Picard won't work unless the names are identical. So try fixing the names, and then running Picard. I use samtools rmdup, which has a known issue of not removing duplicates that span chromosomes, but it does work if the names aren't identical.

    Comment

    • hlwright
      Member
      • Feb 2011
      • 30

      #3
      Thank you for the reply, I will try your suggestion of changing the read names.

      I have tried samtools rmdup but it removes >90% of my reads (as duplicates). I was hoping for a 'second opinion' from Picard that was less shocking!

      Helen

      Comment

      • hlwright
        Member
        • Feb 2011
        • 30

        #4
        I have changed the read names as suggested and re-run picard, but am still not happy that it has worked properly.

        I also included an extra command in the command line READ_NAME_REGEX="[0-9]_[0-9]_[0-9]". An example of the output can be seen below:

        PHP Code:
        1329_105_1480   355     chr1    13484   1       50M     =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,@NPYG423BC553AC.2BPB;:7OH0.-=><1,I3!5=D<4)-OD=44      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
        1863_1224_411   99      chr1    13484   3       50M     
        =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      UUZ_[][VXY[XRNOYJFURZZULJZ_ZOQ[SRRTW@CPBJHJWU\FAMM      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
        1939_1338_752   355     chr1    13484   1       50M     
        =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      66?=3CEA:=LIE?CH8?F@J;?G4@QH7;KEHJIE1@@4=AD9=P@8<<      NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92080      XS:A:+  HI:i:0
        1942_131_1549   355     chr1    13484   3       50M     
        =       13572   123     CAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGG      ,,9:.A>-.34-./491+#&DA20)>?-,36,**8<%#++20..6I<-..      NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357483  XS:A:+  HI:i:0
        486_302_1756    99      chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      __ZZZ[ZQWUVV?I]`TEQWYX<;RYH==GXI;5F];!(EYE0CV\D7NN      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        667_379_431     355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      
        ``]_]][X^__WAIZ\K=RYYY??PRPORMLEB<DYM'*NYT;9R[E5@@      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        1481_526_56     355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      RRRNC==:HHC@3CG=:1<=BB2DMG>0:B70/.0@>!!/;K>/=B1)11      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        1631_628_1988   355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      ^^]
        ``_NFW]^[IO_^V39UWN=JTVH6BSYB8HSYOD@>LR;9QY6!00      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        2219_1966_235   355     chr1    13528   0       50M     =       13615   122     CGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACA      NNBAQRE9HJH6#CMNF5EFFA$0<AJ;9CE,!-?K0!22;M4+BC,%77      NM:i:0  NH:i:5  CC:Z:chr12      CP:i:92036      XS:A:+  HI:i:0
        1329_105_1480   403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ::74<@':CAOIH9AHFUWL:FXE6?KNOUVCCKK     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
        1863_1224_411   147     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     __]XRR?G_\_a^^]]bcccbba_[[accccbabb     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0
        1939_1338_752   403     chr1    13572   1       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     44.$5H+6GIJFQHHTW[VVTOQGEJOWYVSSVYY     NM:i:0  NH:i:4  CC:Z:chr12      CP:i:92007      XS:A:+  HI:i:0
        1942_131_1549   403     chr1    13572   3       35M     =       13484   -123    GCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCC     ))++//!#+,98-+9<2?DC9,./.6CBGO?21@@     NM:i:0  NH:i:2  CC:Z:chr2       CP:i:114357410  XS:A:+  HI:i:0 
        Picard has removed ~60% of my reads as duplicates, but you can see that some of the reads that have been left in the file are duplicates (or at least I think they look like duplicates). Read 1329_105_1480 for example appears to have 4 copies which have all mapped as pairs to the same location.

        I ran samtools rmdup on this Picard output and that removed a further 75% as duplicates! But why is picard not picking them up?

        My full Picard command is:
        HTML Code:
        java -Xmx10g -jar /path/to/MarkDuplicates.jar INPUT=filename.bam OUTPUT=output.bam METRICS_FILE=output.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT READ_NAME_REGEX="[0-9]_[0-9]_[0-9]"
        Thanks
        Helen

        Comment

        • polarise
          Member
          • Jan 2011
          • 13

          #5
          Might this work?

          This might be helpful. It suggests that the problem has to do with underscores in the names so the underscore has to be part of the regex.



          PK

          Comment

          • jflowers
            Member
            • Oct 2011
            • 42

            #6
            Your regular expression won't match your read names as you have them. The problem is that you need to include (1) capturing parentheses and (2) use + (to indicate one or more character matches).

            To match identifiers like yours (e.g., 1240_122_334) use:
            READ_NAME_REGEX="([0-9]+)_([0-9]+)_([0-9]+)"

            Also be sure that the three number in your read names are (in this order) tile/region, x coordinate, y coordinate per the Picard tools documentation

            Comment

            • jflowers
              Member
              • Oct 2011
              • 42

              #7
              One final thought, check the metrics file to determine how many of your reads are being detected as optical and PCR duplicates. If your regex is working, there should be > 0 optical duplicates

              Comment

              • lifeng.tian
                Member
                • Jul 2009
                • 16

                #8
                Helen,

                It seems like you are using an aligner that reports multiple alignments, right?

                One thing I noticed is that all these remaining reads are "not primary alignment" (I used http://picard.sourceforge.net/explain-flags.html to interpret the flag in col 2).
                Picard might skip these reads for dup detection.
                I've used samtools view -F 0x100 xx.bam to remove the non-primary-alignment.


                Lifeng

                Comment

                Latest Articles

                Collapse

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-09-2026, 11:58 AM
                0 responses
                16 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-05-2026, 10:09 AM
                0 responses
                26 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-04-2026, 08:59 AM
                0 responses
                37 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                61 views
                0 reactions
                Last Post SEQadmin2  
                Working...