At our place we use GATK (3-series) for germline SNV/INDEL calling in a clinical setting, and we are now considering how to move to the new GATK 4.
One of the benefits of GATK (often emphasized as a selling point by Broad) is its solution to the N+1 problem: when a new sample arrives, one can run GenotypeGVCFs on that sample together with a huge GVCF catalogue of previous samples, thus improving the accuracy of calling.
However, with GATK 4 this functionality has changed tremendously. It is now recommended to use a GenomicsDB object instead of a combined GVCF file and to use that as input to GenotypeGVCFs. In itself this is not a problem, but GenotypeGVCFs now only accepts one "-V" input. Thus, one cannot use both the large GenomicsDB and the GVCF file from a new sample.
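To make the workflow concrete, here is a rough sketch of the two GATK 4 steps as I understand them. It only builds the command lines and does not run GATK. The file names, workspace path, and interval are hypothetical placeholders, not our real setup, and a real run would probably need more options than shown here.

```python
from typing import List


def import_cmd(gvcfs: List[str], workspace: str, interval: str) -> List[str]:
    """Build a GenomicsDBImport command that loads per-sample GVCFs into a workspace."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    # GenomicsDBImport takes one -V per input GVCF.
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd


def genotype_cmd(reference: str, workspace: str, output: str) -> List[str]:
    """Build a GenotypeGVCFs command that reads the workspace via a gendb:// URI."""
    # Note the single -V: the workspace is the only variant input.
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-O", output,
    ]


if __name__ == "__main__":
    # Placeholder inputs, for illustration only.
    print(" ".join(import_cmd(["s1.g.vcf.gz", "s2.g.vcf.gz"], "cohort_db", "chr20")))
    print(" ".join(genotype_cmd("ref.fasta", "cohort_db", "cohort.vcf.gz")))
```

The second command is where the problem shows up: the only "-V" slot is already taken by the gendb:// workspace, so there is no place to pass the new sample's GVCF alongside it.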
Our first thought was to add the new GVCF file to the existing GenomicsDB, but that is not supported by the GenomicsDBImport tool. The only solution appears to be to create a new GenomicsDB object from scratch each time a new sample arrives, but that takes days (if not weeks) of computing and is just not feasible. It all seems very odd.
Has anybody here found a way of solving the N+1 problem in GATK 4?