  • david.tamborero
    Member
    • Feb 2011
    • 60

    error during GATK indel realigner

    Hello,

    I aligned exome data (paired-end reads) with bfast match + localalign + postprocess and then removed duplicates with Picard. When I run the local realignment, the GATK IndelRealigner step fails with the following error:

    Code:
    ##### ERROR MESSAGE: Error caching SAM record HWUSI-EAS1692_0001:3:55:5381:15775#0, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.
    This is what the records for that read look like in the BAM file:

    Code:
    HWUSI-EAS1692_0001:3:55:5381:15775#0	179	chr1	148354187	0	95M	=	148354236	49TAGCATCTTTCACAAAGCTCTCTGTGTTTGAGTACGCACCTTGATCCATAGGCTCACATTTGATCCCAACTGGCGGCTGCTTCTTGGCATTAACT	DGFBGGGGGGGGGGGGGGGGGGGBGFFGGGFGDGEGGAGFEGDGGGGFEGEEGBGGGGGEGDBDEDEEDBA??EEA?##################	XA:i:3	MD:Z:95	PG:Z:bfast	RG:Z:012_t_l1	IH:i:1	NH:i:11	HI:i:1	NM:i:0	MQ:i:0	AS:i:4750
    HWUSI-EAS1692_0001:3:55:5381:15775#0	81	chr1	148354236	0	95M	=	148568822	214586	AGGCTCACATTTGATCCCAACTGGCGGCTGCTTCTTGGCATTAACTTTGGATTCCCAACCAGTAAATCTTACCAAGATCTGAGTTTCTCCAGGTA	@AABAC<CA@>>>=4>=>=3DCDCFEDEGEECDF?DEFGFFEDCEDDDEEEDDGEGFGGGGGEGGGFGFGGDGGGGGGFGGFFGGGGGGGEGGGB	XA:i:3	MD:Z:95	PG:Z:bfastRG:Z:012_t_l2	IH:i:1	NH:i:11	HI:i:1	NM:i:0	MQ:i:0	AS:i:4750
    HWUSI-EAS1692_0001:3:55:5381:15775#0	115	chr1	148354236	0	95M	=	148354187	-49AGGCTCACATTTGATCCCAACTGGCGGCTGCTTCTTGGCATTAACTTTGGATTCCCAACCAGTAAATCTTACCAAGATCTGAGTTTCTCCAGGTA	@AABAC<CA@>>>=4>=>=3DCDCFEDEGEECDF?DEFGFFEDCEDDDEEEDDGEGFGGGGGEGGGFGFGGDGGGGGGFGGFFGGGGGGGEGGGB	XA:i:3	MD:Z:95	PG:Z:bfast	RG:Z:012_t_l1	IH:i:1	NH:i:11	HI:i:1	NM:i:0	MQ:i:0	AS:i:4750
    HWUSI-EAS1692_0001:3:55:5381:15775#0	161	chr1	148568822	0	95M	=	148354236	-214586	AGTTAATGCCAAGAAGCAGCCGCCAGTTGGGATCAAATGTGAGCCTATGGATCAAGGTGCGTACTCAAACACAGAGAGCTTTGTGAAAGATGCTA	##################?AEE??ABDEEDEDBDGEGGGGGBGEEGEFGGGGDGEFGAGGEGDGFGGGFFGBGGGGGGGGGGGGGGGGGGGBFGD	XA:i:3	MD:Z:95	PG:Z:bfastRG:Z:012_t_l2	IH:i:1	NH:i:11	HI:i:1	NM:i:0	MQ:i:0	AS:i:4750
    So I guess GATK is right. My questions are:

    - I ran bfast postprocess with the '-a 3 -z' arguments, so shouldn't it keep only a single alignment for each read?

    - In any case, can I somehow tell GATK to ignore these "conflicting" reads? I tried '--validation_strictness SILENT', but it still complains.

    I'm pretty stuck on this, so any help would be much appreciated. And merry Christmas, by the way!

    thanks,
    david
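
For readers following along, the FLAG column (second field) of the four records above can be decoded bit by bit. This sketch is not part of the original post: the bit values come from the SAM specification, while the helper name `decode_flag` is made up for illustration.

```python
# Bit values and meanings as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# FLAG values of the four records posted above, in order.
for flag in (179, 81, 115, 161):
    print(flag, decode_flag(flag))
```

Under this decoding, 81 and 115 are both marked as first in pair and 179 and 161 both as second in pair, so the same read name occurs twice for each end of the pair.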
  • Jon_Keats
    Senior Member
    • Mar 2010
    • 279

    #2
    Can you assign more specific read groups (flowcell/lane) to avoid the collisions? It looks like your read group is really a sample ID. It also looks like some of your PG and RG tags are run together instead of being separate fields.
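
As an illustration of what a more specific read group could look like, here is a minimal sketch that builds an `@RG` header line whose ID combines flowcell and lane. The tag names (ID, SM, LB, PU, PL) are from the SAM specification; the function name and every value passed to it (`sampleA`, `lib1`, `FC1`) are placeholders, not taken from this thread.

```python
def make_read_group(sample, library, flowcell, lane, platform="ILLUMINA"):
    """Build a SAM @RG header line that is unique per flowcell and lane."""
    unit = f"{flowcell}.{lane}"
    fields = [
        "@RG",
        f"ID:{unit}",      # read group ID, unique per flowcell/lane
        f"SM:{sample}",    # sample name, shared across lanes
        f"LB:{library}",   # library name
        f"PU:{unit}",      # platform unit
        f"PL:{platform}",  # sequencing platform
    ]
    return "\t".join(fields)


# Placeholder values: two lanes of the same sample get distinct read groups.
rg_lane1 = make_read_group("sampleA", "lib1", "FC1", 1)
rg_lane2 = make_read_group("sampleA", "lib1", "FC1", 2)
print(rg_lane1)
print(rg_lane2)
```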

    • david.tamborero
      Member
      • Feb 2011
      • 60

      #3
      Thank you very much for your answer, Jon.

      I should have pointed out that the reads are paired-end. According to the BAM flags, the first and third entries should correspond to the 2nd and 1st ends of lane_2, whereas the second and fourth entries should correspond to the 1st and 2nd ends of lane_1, respectively.

      I'm a newbie and may be wrong, but I don't think the read group values are the problem. Even though I assigned them in a somewhat dummy-like way, the remaining samples (which I processed the same way) ran without any errors.

      I am wondering whether the problem is that the sequencer has given the same ID to reads from two different lanes. Is that possible? I hope the above makes sense and that I am not missing the point of what you said.

      Many thanks.
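
One way to look into the lane question is to split the read name itself. The sketch below assumes the older Illumina naming layout `instrument:lane:tile:x:y#index`, which the ID in the first post appears to follow; the function name is hypothetical.

```python
def parse_illumina_name(name):
    """Split an old-style Illumina read name into its fields.

    Assumes the layout instrument:lane:tile:x:y#index (an assumption
    about how these reads were named, not something stated in the thread).
    """
    coords, _, index = name.partition("#")
    instrument, lane, tile, x, y = coords.split(":")
    return {
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "index": index,
    }


fields = parse_illumina_name("HWUSI-EAS1692_0001:3:55:5381:15775#0")
print(fields["instrument"], fields["lane"])
```

If that layout holds, the lane number is part of the name, so two lanes of a single run would not normally produce the same ID. A clash would more plausibly come from two runs sharing the same instrument/run prefix, or from the same reads entering the pipeline twice.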

      • kasthuri
        Member
        • Jun 2011
        • 36

        #4
        I found that some GATK errors went away after I "cleaned" the BAM files with:

        samtools view -F 0x04 -b in.bam > out.bam

        After this I sort, index, and mark duplicates with Picard before proceeding with GATK.

        -Kasthuri

        Added later: Did you merge the BAM files for the same sample run on different lanes?
        Last edited by kasthuri; 12-29-2011, 07:52 PM.
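
To make explicit what the `-F 0x04` filter above does: `samtools view -F` drops every record that has any of the given flag bits set, and 0x04 is the "read unmapped" bit. A small Python sketch of the same test, applied to the four FLAG values from the first post (the function name is illustrative only):

```python
UNMAPPED = 0x4  # SAM FLAG bit for "read unmapped"


def passes_exclude_filter(flag, exclude_mask):
    """Mimic `samtools view -F`: keep a record only if none of the
    bits in exclude_mask are set in its FLAG."""
    return (flag & exclude_mask) == 0


flags = [179, 81, 115, 161]  # FLAG values from the records in the first post
kept = [f for f in flags if passes_exclude_filter(f, UNMAPPED)]
print(kept)
```

All four records pass the filter, because none of them is flagged as unmapped.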

        • YunjieLiu
          Junior Member
          • Oct 2011
          • 2

          #5
          I also ran into a problem with the realigner.
          I ran bwa + realigner + indelgenotyper and got the message below during the indel genotyper step:
          ##### ERROR MESSAGE: Invalid command line: Argument window_size has a bad value: Read HWUSI-EAS1600R_0008:4:9:17021:5336#0: out of coverage window bounds. Probably window is too small, so increase the value of the window_size argument.
          ##### ERROR Read length=115; cigar=1M84D114M; start=128243463; end=128243661; window start (after trying to accomodate the read)=128243458; window end=128243657
          So I looked up the read in the BAM file produced by the realigner:
          HWUSI-EAS1600R_0008:4:9:17021:5336#0 99 chr7 128243463 70 1M84D114M
          When I checked the same read in the bwa output, I found:
          HWUSI-EAS1600R_0008:4:9:17021:5336#0 99 chr7 128243547 60 115M
          It seems the realigner introduced a wrong deletion into it.
          Has anyone else run into this error?
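
The coordinates in the error message can be re-derived from the two CIGAR strings. The sketch below uses the SAM convention for which operations consume read and reference bases; the function names are illustrative, and positions are treated as 1-based inclusive.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    read_len = 0
    ref_span = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":  # operations that consume read bases
            read_len += n
        if op in "MDN=X":  # operations that consume reference bases
            ref_span += n
    return read_len, ref_span


def alignment_end(start, cigar):
    """1-based inclusive end coordinate of an alignment."""
    return start + cigar_lengths(cigar)[1] - 1


print(cigar_lengths("1M84D114M"))             # read length, reference span
print(alignment_end(128243463, "1M84D114M"))  # record after realignment
print(alignment_end(128243547, "115M"))       # original bwa record
```

Both records end at 128243661, the end reported in the error. The realigned record covers 199 reference bases with a 115-base read, and its end lies beyond the reported window end of 128243657.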

          • david.tamborero
            Member
            • Feb 2011
            • 60

            #6
            kasthuri wrote:
                I found that some GATK errors went away after I "cleaned" the BAM files with:

                samtools view -F 0x04 -b in.bam > out.bam

                After this I sort, index, and mark duplicates with Picard before proceeding with GATK.

            In this case, filtering on the 0x04 flag does not help, since that flag marks unmapped reads. In my case, the same read is mapped to several (two) positions.

            kasthuri wrote:
                Added later: Did you merge the BAM files for the same sample run on different lanes?

            Yes, I merged two BAM files, each corresponding to a different lane. Each of these BAM files was obtained by bfast alignment and then tagged with its corresponding read group by picard_add_groups.

            What confuses me is why the same read_id appears in both lane_1.bam and lane_2.bam, since this read_id appears in the original raw read file lane_1.fastq but not in lane_2.fastq.
            Something must be wrong in my pipeline, but I've checked it a thousand times and everything seems fine (moreover, it occurs in only one of the many samples I have processed the same way).
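
The comparison described above, checking whether a read ID occurs in both lanes' raw FASTQ files, can be sketched as a set intersection. This assumes plain four-line FASTQ records; the in-memory file contents and the function name are invented for illustration, and with real data you would open lane_1.fastq and lane_2.fastq instead.

```python
import io


def fastq_read_ids(handle):
    """Collect read IDs from a FASTQ stream (assumes 4 lines per record)."""
    ids = set()
    for i, line in enumerate(handle):
        if i % 4 == 0:  # header line of each record
            # Drop the leading '@', any description, and a trailing /1 or /2.
            name = line[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            ids.add(name)
    return ids


# Tiny made-up FASTQ contents standing in for lane_1.fastq and lane_2.fastq.
lane_1 = io.StringIO("@readA/1\nACGT\n+\nIIII\n@readB/1\nACGT\n+\nIIII\n")
lane_2 = io.StringIO("@readB/1\nTTTT\n+\nIIII\n@readC/1\nTTTT\n+\nIIII\n")

shared = fastq_read_ids(lane_1) & fastq_read_ids(lane_2)
print(sorted(shared))
```

An empty result for the real files would support the observation that the duplication is introduced after the raw reads, somewhere in the alignment or merging steps.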

            • iSNÖ
              Junior Member
              • Jul 2011
              • 1

              #7
              Hi David,

              Did you manage to resolve this error eventually? I'm facing exactly the same thing, and in only one sample out of several. I was hoping there would be an easy way to fix it with Picard or samtools, but I simply haven't found one.

              Cheers,
              K
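
No Picard or samtools option is suggested in this thread, so the following is only a sketch of how the offending records could be located. It scans SAM text (for example, the output of `samtools view`) and reports read names that occur more than once for the same end of a pair. The function name and the truncated test records are illustrative; whether dropping or renaming such reads is appropriate depends on why they were duplicated.

```python
from collections import Counter


def find_name_collisions(sam_lines):
    """Return (read_name, end) keys that occur more than once.

    `end` is 1 for first-in-pair (FLAG 0x40), 2 for second-in-pair
    (FLAG 0x80), and 0 otherwise. Secondary (0x100) and supplementary
    (0x800) records are skipped, since they may legitimately repeat a name.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        cols = line.rstrip("\n").split("\t")
        name, flag = cols[0], int(cols[1])
        if flag & (0x100 | 0x800):
            continue
        end = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        counts[(name, end)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Name, FLAG, and position of the four records from the first post
# (remaining SAM columns omitted for brevity).
name = "HWUSI-EAS1692_0001:3:55:5381:15775#0"
records = [
    f"{name}\t179\tchr1\t148354187",
    f"{name}\t81\tchr1\t148354236",
    f"{name}\t115\tchr1\t148354236",
    f"{name}\t161\tchr1\t148568822",
]
print(find_name_collisions(records))
```

For the records in the first post, both the first and the second end of the pair are reported as colliding.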
