No announcement yet.

bwa GATK4 read group issues

  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa GATK4 read group issues

    I am trying to run GATK4 HaplotypeCaller on bam files from single individuals generated with bwa mem on Ubuntu 18 with
    $gatk HaplotypeCaller -R Tse_SBAPGDGG_D.fa -I AHP2746_sorted_dedup.bam -ERC GVCF -O AHP2746.gvcf
    But I get the error:
    A USER ERROR has occurred: Argument --emit-ref-confidence has a bad value: Can only be used in single sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file.
    I assume this is a read group error.
    if I run
    samtools view -H AHP2746_sorted_dedup.bam | grep @RG
    I get nothing back.
    If I run
    samtools view -H AHP2746_sorted_dedup.bam
    I get a long list of headers and the bottom look like this:
    @SQ     LN:500  SN:scaffold_42026
    @SQ     SN:scaffold_42025       LN:500
    @SQ     SN:scaffold_42022       LN:500
    @SQ     SN:scaffold_42019       LN:500
    @SQ     SN:scaffold_42018       LN:500
    @SQ     LN:500  SN:scaffold_42017
    @SQ     SN:scaffold_42016       LN:500
    @SQ     SN:scaffold_42015       LN:500
    @SQ     SN:scaffold_42014       LN:500
    @SQ     SN:scaffold_42013       LN:500
    @SQ     SN:scaffold_42012       LN:500
    @SQ     SN:scaffold_42011       LN:500
    @SQ     SN:scaffold_42010       LN:500
    @SQ     SN:scaffold_42009       LN:500
    @SQ     SN:scaffold_42008       LN:500
    @SQ     SN:scaffold_42006       LN:500
    @SQ     SN:scaffold_42005       LN:500
    @SQ     SN:scaffold_42004       LN:500
    @SQ     SN:scaffold_42003       LN:500
    @SQ     SN:scaffold_42002       LN:500
    @SQ     SN:scaffold_42001       LN:500
    @SQ     SN:scaffold_42000       LN:500
    @SQ     SN:scaffold_41998       LN:500
    @SQ     SN:scaffold_41997       LN:500
    @SQ     SN:scaffold_41996       LN:500
    @SQ     SN:scaffold_41995       LN:500
    @PG     PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 32 Tse_SBAPGDGG_D AHP2746_L003_R1_001.fastq.gz AHP2746_L003_R2_001.fastq.gz ID:bwa
    @PG     CL:elprep filter /dev/stdin AHP2746_sorted_dedup.bam --filter-unmapped-reads --mark-duplicates --mark-optical-duplicates AHP2746_output.metrics --optical-duplicates-pixel-distance 100 --remove-duplicates --sorting-order coordinate    PP:bwa  ID:elprep 4.1.3 PN:elprep       VN:4.1.3        DS:
    I assume my bam files have no read group defined (I did not use the bwa mem -R option).
    I then tried running bwa mem with -R
    bwa mem -M -t 32 -R '@RG\tID:sample_1\tSM:AHP2746\tPL:illumina\tPU:lane1\tPI:420' Tse_SBAPGDGG_D AHP2746_L003_R1_001.fastq.gz AHP2746_L003_R2_001.fastq.gz > bwa-mem-Rtest_AHP2746.bam
    the analyses completed, but I am using elprep to dedupe and order the bam files and elprep throws the error "2019/07/23 10:07:46 gzip: invalid header in NewBGZFReader"
    my elprep command:
    elprep filter bwa-mem-Rtest_AHP2746.bam bwa-mem-Rtest_AHP2746-elprep.bam --mark-duplicates --mark-optical-duplicates output.metrics --remove-duplicates --sorting-order coordinate --nr-of-threads 32

    Anyone know what I am doing wrong here?

    Additional info:
    reference genome was generated and assembled with linked reads (10X) sequenced on a NovaSeq and Hi-C data.
    the resequencing data were sequenced on HiSeq400 and some on NovaSeq, but all have same issue.
    I am using the latest software versions available with conda installs.

  • #2

    Quite a while since you posted this so if you reached a solution that'd be great to hear.
    Either way I'm having the same problem and thought I'd add some things to the

    I am however not sure this is related to the RG tag. According to SAM format
    specifications, the BC tag is where sample ID is stored so my guess is that is has to do
    with this.

    However, in my case I haven't specified any BC tag at all and I wonder if this could be
    the problem.

    SAM format specs:


    • #3
      Okay I managed to solve my issue which I'll post here if someone else finds it useful in the future.

      RL; DR:

      Add an arbitrary RG tag.


      From reading at their github it seems gatk HaplotypeCaller will call on a utils function to assert the number of samples, which it does by counting the number of RG entries in the header. HaplotypCaller then exits with the error message if this count is not equal to 1, which seemingly also includes when it the utils function returns zero.


      Add an RG tag to your header and tag all reads with the RG tag. The fastest solution is to do this with picardtools AddOrReplaceReadGroups.

      Solution complication:

      When running gatk HaplotypeCaller on my RG tagged file I got an error saying the uncompression length was invalid. I ran picardtools ValidateSamFile which warned that the index file was older than the BAM file, and after rerunning the indexing everything seemingly worked great (using samtools index).


      NB: I'm running picardtools/gatk installed through conda, if you're using the .jar file directly just exchange picard/gatk with java --jar path/to/jar-file.jar

      picard AddOrReplaceReadGroups I=reads.bam O=reads.tag.bam RGID=4 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=20 
      samtools index reads.tag.bam
      gatk HaplotypeCaller I=reads.tag.bam -R ref.fa -ERC GVCF -O vars.vcf
      PS. I noticed the BC tag is part of the RG tag, so disregard that part of my prior answer.