Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Duplicate Reads

    Hi all,

    I did a search of this forum and wasn't able to find a solution to my problem. Point me in a direction if this has already been addressed.

    I'm getting the following error when running IndelRealigner:

    ERROR MESSAGE: Error caching SAM record
    HWI-ST632:9204NCACXX:2:1101:10620:83792, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are
    present.


    The command I'm running is:
    java -Xmx8g -Djava.io.tmpdir=/tmp/110809S10v1.2.9_zmzk -jar
    GenomeAnalysisTK-1.1-26-g9f3328d/GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37.fasta -targetIntervals 110809S10_realn.intervals
    -B:dbsnp,vcf dbsnp_134_b37.vcf -I 110809S10_tumor_dupRem.bam -o 110809S10_tumor_realign.bam


    I checked the dupRem.bam file for the HWI-ST632:9204NCACXX:2:1101:10620:83792 record and found the following lines:

    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16902900 0 77M835N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16902900 HI:i:0
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16902900 0 77M7112N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16902900 HI:i:1
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16902900 0 77M13417N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16909177 HI:i:2
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16909177 0 77M835N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16909177 HI:i:3
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16909177 0 77M7140N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:16915482 HI:i:4
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 16915482 0 77M835N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:147589993 HI:i:5
    HWI-ST632:9204NCACXX:2:1101:10620:83792 153 1 147589993 0 77M7112N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:147596274 HI:i:6
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 147596274 0 77M831N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 CC:Z:= CP:i:148343732 HI:i:7
    HWI-ST632:9204NCACXX:2:1101:10620:83792 409 1 148343732 0 77M831N24M * 0 0 CATCTCTCCCTTCCCGTAACTTCTCCCTTAACTGGGTCAGCTCTCGTTCCTGAGAGTGAACCAGGACTTTATATTGCCTGAGCCCCTCAGCTTGCTTGAGC @:0@958@B?5588(>5(,?A>BB=@>@@>;???;.).)==5'-'88=).)8B4B8??*009*9:0**@DD:?)FC1112))3+:EA++FC?FDDA:F@BB RG:Z:110809S10_tumor NM:i:1 XS:A:- NH:i:9 HI:i:8


    Sorry for all the data, but I though it better to include everything. I ran picard MarkDuplicates in an earlier step in the pipeline but it didn't seem to get rid of the offending records. I tried running picard MarkDuplicates again on the dupRep.bam file to no avail. I'm relatively new at this game and appreciate any suggestions.

  • #2
    Your problem has nothing to do with duplicate reads so marking/removing duplicates won't address it. Duplicate reads mean different reads which align to the the identical location on the reference.

    The lines from your BAM file show the same read aligned to multiple locations. But there is an added twist in your case. Looking at the first three lines the reference coordinate is the same for the 5' end fo the alignment, but the CIGAR string (column 6) reveals the difference between them. The first 77 bases of the alignment start at position 16,902,900 of the reference but then there are three alternate alignments for the remaining 24 bases of the read, one at a distance of 835 nt, the second at 7,112 and the last at 13,417 nt downstream of the first part of the alignment. What you have a split read aligning to a tandemly repeated element in your reference. The first 77 bases anchor at the first repeat, then the last 24 bases align to copies 1, 2 and 3 of the repeat. The next two lines show the first 77 bases anchoring at repeat #2 with the last 24 aligning to #2 and #3 and then the final single alignment to repeat #3. This repeat appears again farther downstream in the reference but it seems that all of the alignments are not reported there.

    I don't have much experience with GATK but I would say it is misinterpreting lines 1, 2, 3 and then 4, 5 as "identical copies of the the same read", only considering the read names and start coordinate, ignoring the difference in the CIGAR string.

    Which aligner did you use and what were the parameters applied for reporting of multiple alignments?

    Comment


    • #3
      Thanks for the quick reply, I've aligned using tophat, the command is the following:

      /tophat -r75 -p 8 -o /tmp/110809S10v1.2.9_zmzk --rg-id 110809S10_tumor --rg-library 110809S10_tumor --rg-sample 110809S10_tumor hg19 110809S10_tumor1_1.fq.gz 110809S10_tumor1_2.fq.gz

      Comment


      • #4
        I'm experiencing the same issue. Does someone knows how to solve this? Do we need to modify some of the parameters for the bowtie aligner that tophat is using?

        Comment


        • #5
          You might try adding "--max-multihits 1" as a command line option to tophat, as that might remove duplicate reads.

          Comment


          • #6
            I tried this option. It seems to solve the problem but it removes more that 1/4 of the total reads, which is not so good. Instead of this, I filtered out the bam file with a mapping quality threshold. I will see what it does!

            Comment


            • #7
              Any one able to solve this problem ?

              Comment


              • #8
                Did you able to solve this issue ?

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                51 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                67 views
                0 likes
                Last Post seqadmin  
                Working...
                X