Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • blakeoft
    Member
    • Oct 2013
    • 79

    Questions about exome sequencing

    I've got exome data from 12 human samples that I'd like to analyze. For the most part, I've been following GATK's best practices for my pipeline, but I run into some problems and I can't seem to solve them by reading the forums here or at broadinstitute.org. I wanted to get all the way through the pipeline with one of the samples just to make sure that my script will work properly.

    The first real problem I had was when I used the BaseRecalibrator tool from GATK. From what I could tell, there was a problem (I think) with many of the CIGAR strings. There are about 1000 instances of deletions at the end of a read. Perhaps this is my the source of all my issues. Does this even make sense? It seems like a deletion should not occur at the beginning or end of a read, but I may be misunderstanding something. By reading the forums at broadinstitute, I learned that adding the "-rf BadCigar" option will make BaseRecalibrator ignore all cases of bad CIGAR strings. But then later on when I use HaplotypeCaller, I get an error with error message line:
    Code:
    ##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
    This reminded me of the 'bad' CIGAR strings that I had observed earlier.

    If this CIGAR string issue is my problem, can anyone help me fix it? I used bwa aln to align and then bwa sampe to generate the sam file. The odd CIGAR strings already appear in this sam file, so if deletions at the end of a read are bad, then the problem seems to be in the alignment/sampe stage. I can also post an abbreviated version of my script or the entire error message that I get from HaplotypeCaller if they might be helpful.

    Thank you,
    Blake
  • bruce01
    Senior Member
    • Mar 2011
    • 160

    #2
    If you post up the script and error it will be easier to help, not sure why this is happening as I haven't dealt with the specific error but having used GATK 'best prcaitce' pipelines I can say it requires a lot of attention to get it running correctly! If you could attach a small sample of your SAM with reads that are generating error that would be really helpful too. Also the Broad/GATK forum people are very helpful if you try contacting them directly.

    Comment

    • blakeoft
      Member
      • Oct 2013
      • 79

      #3
      Thank you for the reply. Here is an abbreviated version of my script so far (sampleL1_1 refers to the first mate on lane 1 of the paired end reads):

      Code:
      bwa aln -t 4 -f sampleL1_1.sai hg19 sampleL1_1.fastq
      bwa aln -t 4 -f sampleL1_2.sai hg19 sampleL1_2.fastq
      bwa sampe -f sampleL1.sam -r"RG@\tID:sampleL1\tLB:sampleL1\tSM:sampleL1\tPL:ILLUMINA" hg19 sampleL1_1.sai sampleL1_2.sai sampleL1_1.fastq sampleL1_2.fastq
      java -Xmx4g -Djava.io.tmpdir=/local -jar SortSam.jar SO=coordinate INPUT=sampleL1.sam OUTPUT=sampleL1.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
      java -Xmx4g -Djava.io.tmpdir=/local -jar MarkDuplicates.jar INPUT=sampleL1.bam OUTPUT=sampleL1.dedup.bam METRICS_FILE=sampleL1.dup.txt VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true
      samtools index sampleL1.dedup.bam
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -I sampleL1.dedup.bam -o sampleL1.intervals -known gold_standard.vcf -known 1000G_phase1.vcf
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T IndelRealigner -R hg19.fa -I sampleL1.dedup.bam -targetIntervals sample1.intervals -o sampleL1.dedup.realn.bam -known gold_standard.vcf -known 1000G_phase1.vcf
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T BaseRecalibrator -rf BadCigar -R hg19.fa -I sampleL1.dedup.realn.bam -knownSites dbsnp137.vcf -knownSites gold_standard.vcf -knownSites 1000G_phase1.vcf -o sampleL1.recal_data.table
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T PrintReads -R hg19.fa -I sampleL1.dedup.realn.bam -BQSR sampleL1.recal_data.table -o sampleL1.dedup.realn.recal.bam
      java -Xmx4g -Djava.io.tmpdir=/local -jar GenomeAnalysisTK.jar -T ReduceReads -R hg19.fa -I sampleL1.dedup.realn.recal.bam -o sampleL1.dedup.realn.recal.reduced.bam
      I got an error earlier on the BaseRecalibrator step saying something like "START (101) > (100) STOP -- this should never happen -- call Mauricio!" After searching the broadinstitute.org forums, I saw a few posts claiming that adding the -rf BadCigar would make the tools ignore the reads with bad CIGAR strings. It seems to have worked. I do the previous set of commands on three lanes and then picard mergesamfiles, index and use GATK's HaplotypeCaller:

      Code:
      java -Xmx4g -jar MergeSamFiles.jar TMP_DIR=/local INPUT=sampleL1.dedup.realn.recal.reduced.bam INPUT=sampleL2.dedup.realn.recal.reduced.bam INPUT=sampleL3.dedup.realn.recal.reduced.bam OUTPUT=sample.merged.bam
      samtools index sample.merged.bam
      java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I sample.merged.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample_raw.vcf
      As I previously mentioned, I get an error on the last step:

      Code:
      ##### ERROR ------------------------------------------------------------------------------------------
      ##### ERROR stack trace
      org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
              at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
              at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
              at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
              at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
              at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
              at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
              at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoodsAndCallIndependentEvents(GenotypingEngine.java:196)
              at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:411)
              at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:107)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:285)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:230)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:205)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:131)
              at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:28)
              at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
              at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
              at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
              at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
              at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
              at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
      ##### ERROR ------------------------------------------------------------------------------------------
      ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
      ##### ERROR
      ##### ERROR Please visit the wiki to see if this is a known problem
      ##### ERROR If not, please post the error, with stack trace, to the GATK forum
      ##### ERROR Visit our website and forum for extensive documentation and answers to
      ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
      ##### ERROR
      ##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
      ##### ERROR ------------------------------------------------------------------------------------------
      Perhaps this is all more information than I needed to provide as I'm hoping that my problems originate from bad CIGAR strings at the alignment step. Both of the errors that I have encountered that were not silly typos seem to be referring to these bad CIGAR strings. Here is one of the odd looking reads that I get in my original sam file:

      Code:
      GWZHISEQ01:231:D2BL9ACXX:1:1101:8158:10808      73      chr9    80793844        37      95M4I2M3D       =       80793844        0       TGATATTCTCGAGTTCCAAAACGGGGTCCCTGTGAAGGTGACCAACGTCAAGGATGGCACCACCCACCAGACCTCCTTGGTACCTTGGTACCTCCAGGATT    CCCFFFFFHHHHHIIJJJJJJJJJJJGIIJJJHIJJJJHHIJJJJJJHIJJJJJJJJIJJGHHFFFFDDDDDDDDDDDDDCCDDDDDDDDDDDDDDDD?CC   RG:Z:JC06_L1    XT:A:U  NM:i:7  SM:i:37 AM:i:0  X0:i:1  X1:i:0  XM:i:4  XO:i:1  XG:i:1  MD:Z:97^TTC0
      This is only the first one that I see, and there are around 500 in my first sample's lane. Does it make any sense at all to have a CIGAR string "95M4I2M3D"? It's also strange because the read has 101 bases and 95 + 4 + 2 + 3 is not 101.

      Comment

      • bruce01
        Senior Member
        • Mar 2011
        • 160

        #4
        Ok, my immediate response to issues like this is: what is the proportion of reads that have the error to total reads, i.e. can you just throw them out?! You say 500, I assume of millions, so I would get rid and see if your pipeline now runs. You can retain reads and see if they fall in same area, or share any other info which may direct you to the cause of the problem.

        As to the CIGAR, I am not an expert at all! But I agree, it does not make sense based on what CIGAR is meant to say about the read (104 bases when read is 101 as you state). Perhaps try the Picard ValidateSAMFile to test if there is anything really wrong, esp if you do go back and remove the 'weird' CIGAR-reads.

        Sorry I can't be of more help, good luck and post up what you did when solved as it will be useful+interesting to see.

        Comment

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #5
          D operations don't contribute to read length, so 95+4+2=101.

          Having said that, Both the CIGAR string and MD tag make no sense. Talking about a deletion after the end of a read is just strange. I would have guessed that the IndelRealigner just screwed this one up, but if this was how things are in the original SAM file, then it was actually the aligner. This looks like a bug in BWA.

          Comment

          • blakeoft
            Member
            • Oct 2013
            • 79

            #6
            Originally posted by dpryan View Post
            D operations don't contribute to read length, so 95+4+2=101.

            Having said that, Both the CIGAR string and MD tag make no sense. Talking about a deletion after the end of a read is just strange. I would have guessed that the IndelRealigner just screwed this one up, but if this was how things are in the original SAM file, then it was actually the aligner. This looks like a bug in BWA.
            Of course D's don't contribute to read length. Duh, what was I thinking? I think you're right. I aligned with bowtie and did
            Code:
            awk '$6 ~ /D$/ {print}' sampleL1_bowtie.sam
            and nothing was returned. I'll post what happens with my pipeline when I try aligning with bowtie and maybe some more QC.

            Comment

            • rbagnall
              Member
              • Jun 2010
              • 34

              #7
              A reported error in GATK



              Update your version from version 2.3-9-ge5ebf34 to the current

              Comment

              • dpryan
                Devon Ryan
                • Jul 2011
                • 3478

                #8
                Reading through the GATK forums, it seems that blakeoft is correct and weird CIGAR strings cause this (see, for example this on-going thread). Even if GATK gets around the weird CIGAR string, I would still consider that to be a BWA bug, since you really shouldn't have to run with "-rf BadCigar".

                Comment

                • blakeoft
                  Member
                  • Oct 2013
                  • 79

                  #9
                  I think my problem is that I was using bwa aln and not bwa mem. The latter produces alignments with no trailing D's with the same fastq input. But I think my main question (is it nonsensical for CIGAR strings to have D's at the end?) has been answered. Thanks to everyone.

                  Comment

                  • blakeoft
                    Member
                    • Oct 2013
                    • 79

                    #10
                    I have another problem, but I want to post on GATK's forums. That leads to another issue, which is that I don't know how to post there... There doesn't seem to be a way to ask a new question or post in an existing thread. Does anyone know how to ask a new question on that forum?

                    Comment

                    • rbagnall
                      Member
                      • Jun 2010
                      • 34

                      #11
                      First you need to register with GATK. At the top of the home page there is a 'Forum' option, and within there is the option to 'Register'.

                      Once registered and signed in, you can post a new question on the same 'Forum' page (top left), or you can add comments to a current thread by scrolling to the bottom of the thread and adding a comment.

                      Comment

                      Latest Articles

                      Collapse

                      • SEQadmin2
                        Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                        by SEQadmin2


                        I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                        Here are nine questions we think about, in roughly the order they matter, before...
                        06-18-2026, 07:11 AM
                      • SEQadmin2
                        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                        by SEQadmin2


                        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                        ...
                        06-02-2026, 10:05 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, 06-26-2026, 11:10 AM
                      0 responses
                      12 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-17-2026, 06:09 AM
                      0 responses
                      48 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-09-2026, 11:58 AM
                      0 responses
                      106 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-05-2026, 10:09 AM
                      0 responses
                      125 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...