Announcement

Collapse
No announcement yet.

Duplicate Reads

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Duplicate Reads

    Hi all,

    I did a search of this forum and wasn't able to find a solution to my problem. Point me in a direction if this has already been addressed.

    I'm getting the following error when running IndelRealigner:

    ERROR MESSAGE: Error caching SAM record
    HWI-ST632:9204NCACXX:2:1101:10620:83792, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are
    present.


    The command I'm running is:
    java -Xmx8g -Djava.io.tmpdir=/tmp/110809S10v1.2.9_zmzk -jar
    GenomeAnalysisTK-1.1-26-g9f3328d/GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37.fasta -targetIntervals 110809S10_realn.intervals
    -B:dbsnp,vcf dbsnp_134_b37.vcf -I 110809S10_tumor_dupRem.bam -o 110809S10_tumor_realign.bam


    I checked the dupRem.bam file for the HWI-ST632:9204NCACXX:2:1101:10620:83792 record and found the following lines:

    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16902900 0 77M835N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16902900 HI:i:0
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16902900 0 77M7112N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16902900 HI:i:1
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16902900 0 77M13417N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16909177 HI:i:2
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16909177 0 77M835N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16909177 HI:i:3
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16909177 0 77M7140N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16915482 HI:i:4
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16915482 0 77M835N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:147589993 HI:i:5
    HWI-ST632:9204NCACXX:2:1101:10620:83792 153 1 147589993 0 77M7112N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:147596274 HI:i:6
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 147596274 0 77M831N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:148343732 HI:i:7
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 148343732 0 77M831N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:[email protected]@B?5588(>5(,?A>[email protected]>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:[email protected] RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 HI:i:8


    Sorry for all the data, but I though it better to include everything. I ran picard MarkDuplicates in an earlier step in the pipeline but it didn't seem to get rid of the offending records. I tried running picard MarkDuplicates again on the dupRep.bam file to no avail. I'm relatively new at this game and appreciate any suggestions.

  • #2
    Your problem has nothing to do with duplicate reads so marking/removing duplicates won't address it. Duplicate reads mean different reads which align to the the identical location on the reference.

    The lines from your BAM file show the same read aligned to multiple locations. But there is an added twist in your case. Looking at the first three lines the reference coordinate is the same for the 5' end fo the alignment, but the CIGAR string (column 6) reveals the difference between them. The first 77 bases of the alignment start at position 16,902,900 of the reference but then there are three alternate alignments for the remaining 24 bases of the read, one at a distance of 835 nt, the second at 7,112 and the last at 13,417 nt downstream of the first part of the alignment. What you have a split read aligning to a tandemly repeated element in your reference. The first 77 bases anchor at the first repeat, then the last 24 bases align to copies 1, 2 and 3 of the repeat. The next two lines show the first 77 bases anchoring at repeat #2 with the last 24 aligning to #2 and #3 and then the final single alignment to repeat #3. This repeat appears again farther downstream in the reference but it seems that all of the alignments are not reported there.

    I don't have much experience with GATK but I would say it is misinterpreting lines 1, 2, 3 and then 4, 5 as "identical copies of the the same read", only considering the read names and start coordinate, ignoring the difference in the CIGAR string.

    Which aligner did you use and what were the parameters applied for reporting of multiple alignments?

    Comment


    • #3
      Thanks for the quick reply, I've aligned using tophat, the command is the following:

      /tophat -r75 -p 8 -o /tmp/110809S10v1.2.9_zmzk --rg-id 110809S10_tumor --rg-library 110809S10_tumor --rg-sample 110809S10_tumor hg19 110809S10_tumor1_1.fq.gz 110809S10_tumor1_2.fq.gz

      Comment


      • #4
        I'm experiencing the same issue. Does someone knows how to solve this? Do we need to modify some of the parameters for the bowtie aligner that tophat is using?

        Comment


        • #5
          You might try adding "--max-multihits 1" as a command line option to tophat, as that might remove duplicate reads.

          Comment


          • #6
            I tried this option. It seems to solve the problem but it removes more that 1/4 of the total reads, which is not so good. Instead of this, I filtered out the bam file with a mapping quality threshold. I will see what it does!

            Comment


            • #7
              Any one able to solve this problem ?

              Comment


              • #8
                Did you able to solve this issue ?

                Comment

                Working...
                X