Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calling invariant sites with GATK

    Hello,

    I would like to obtain a vcf with variant AND INVARIANT sites using GATK.
    I am not sure about the options to use in order to obtain invariant sites.
    I tried a few but still I have only variants.

    Anyone can help ?

    Thanks a lot !

    Muriel

  • #2
    Hi Muriel,

    What you want is to run the GATK's HaplotypeCaller in GVCF mode, with the arguments --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 added to your command line. Have a look at the documentation here: http://www.broadinstitute.org/gatk/g...-discovery-ovw and here: http://www.broadinstitute.org/gatk/g...rticle?id=3893

    These documents cover joint analysis of cohorts, but the basic principle is the same for single samples, you just skip the joint genotyping step.

    Comment


    • #3
      Hi,
      Thank you vdauwera for your answer.
      The thing is that when you do that, you can't have a bam containing multiple samples.
      Muriel

      Comment


      • #4
        If you want to use GATK's UnifiedGenotyper you just need to add: --output EMIT_ALL_SITES

        Comment


        • #5
          Thank you cariboudoug !
          Indeed, I tried this already but I was thinking that it's probably better to use HaplotypeCaller since this is the new version of UnifiedGenotyper...

          Comment


          • #6
            Originally posted by MurielGB View Post
            Hi,
            Thank you vdauwera for your answer.
            The thing is that when you do that, you can't have a bam containing multiple samples.
            Muriel
            Oh, if you have multiple samples, you can just call each in turn using the GVCF mode, then you do the joint genotyping to get a single multi-sample VCF with the results.

            If your problem is that you have all the samples in one bam, the simplest is to split it into per-sample bams, or you can pass the bam with a read filter to specify a single sample to run on in each run.

            Comment


            • #7
              Hi !

              I had a look on this new pipeline.

              I launched variant calling on a bam file containing only one sample with the options mentioned before :
              java -Xmx300g -jar /GenomeAnalysisTK/3.0.0/GenomeAnalysisTK.jar \
              -T HaplotypeCaller -R ref.fasta \
              -I ${INPUT}.bam \
              --genotyping_mode DISCOVERY \
              -stand_emit_conf 0 \
              -stand_call_conf 0 \
              -o ${INPUT}\_VC2.vcf \
              --emitRefConfidence GVCF \
              --variant_index_type LINEAR \
              --variant_index_parameter 128000 \
              -nct 16
              I still don't have all positions in the output vcd as you can see :
              KE340296.1 822 . C <NON_REF> . . END=823GTP:GQ:MIN_DP:PL 0/0:7:18:7:0,18,270
              KE340296.1 824 . C <NON_REF> . . END=824GTP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,219
              KE340296.1 825 . C <NON_REF> . . END=827GTP:GQ:MIN_DP:PL 0/0:7:18:7:0,18,270
              KE340296.1 828 . G <NON_REF> . . END=829GTP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,199
              KE340296.1 830 . T <NON_REF> . . END=854GTP:GQ:MIN_DP:PL 0/0:6:18:6:0,12,180
              KE340296.1 855 . C <NON_REF> . . END=855GTP:GQ:MIN_DP:PL 0/0:7:1:7:0,1,233
              KE340296.1 856 . T <NON_REF> . . END=890GTP:GQ:MIN_DP:PL 0/0:4:6:3:0,6,90
              KE340296.1 891 . G <NON_REF> . . END=905GTP:GQ:MIN_DP:PL 0/0:4:0:0:0,0,0
              KE340296.1 906 . T <NON_REF> . . END=944GTP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,37
              KE340296.1 945 . C <NON_REF> . . END=104GTP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
              Do you know why ?

              Muriel

              Comment


              • #8
                Sure, if you want a call for every single position (instead of averaged blocks, which are used to compress the information in order to minimize file size) you just change -ERC GVCF to -ERC BP_RESOLUTION.

                Comment


                • #9
                  Oh great !!
                  I was wondering what was the difference between these two options !
                  Thanks a lot for your help vdauwera !

                  Comment


                  • #10
                    Well... I am wondering if I won't have another problem using -ERC BP_RESOLUTION.
                    Because after that I obtain a vcf file per sample containing both variant and invariant sites, I want to run the joint analysis to have a vcf combining all my samples.

                    In the documentation regarding the GenotypeGVCF option that perform the joint analysis (here : https://www.broadinstitute.org/gatk/...typeGVCFs.html), it is written that : GenotypeGVCFs merges gVCF records that were produced as part of the "single sample discovery" pipeline using the '-ERC GVCF' mode of the Haplotype Caller.

                    So I am wondering if this is gonna work since I use -ERC BP_RESOLUTION rather than GVCF ?

                    Thanks a lot !

                    Muriel

                    Comment


                    • #11
                      That's completely fine actually, the BP_RESOLUTION gvcfs are also a valid input for the joint genotyping step. I'll update the documentation to be more accurate and inclusive. Feel free to post any further questions or comments about GATK tools on our support forum

                      Comment


                      • #12
                        So cooool !!!
                        Thanks a lot, I am very happy of this new tool, it's gonna allow population geneticists to do so much !

                        Comment


                        • #13
                          I have another problem, sorry vdauwera

                          I did variant calling on two samples individually using the option described and it works well, I have both variant and invariant sites.

                          However, the QUAL and INFO fields are missing in most cases :
                          KE332545.1 64 . G <NON_REF> . . . GT:ADP:GQ:PL 0/0:17,0:17:51:0,51,660
                          KE332545.1 65 . G <NON_REF> . . . GT:ADP:GQ:PL 0/0:17,0:17:51:0,51,655
                          KE332545.1 66 . G T,<NON_REF> 1.14 . BaseQRankSum=-2.602;ClippingRankSum=-0.903;DP=24;MLEAC=1,0;MLEAF=0.500,0.00;MQ=38.57;MQ0=0;MQRankSum=1.115;ReadPosRankSum=0.372 GT:ADP:GQ:PL:SB 0/1:14,4,0:18:23:23,0,449,64,460,524:10,4,3,1
                          KE332545.1 67 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:53:0,53,583
                          KE332545.1 68 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:54:0,54,659
                          KE332545.1 69 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:54:0,54,699
                          It seems that these two fields are present only in case of an alternative allele existing right ?
                          Why ? How I can I use my filtering pipeline for the invariant sites !

                          Then I performed the joint analysis and this is the same.
                          This is a shame because I really hoped I will also have these informations for invariant sites...
                          Is there something that I missed ?

                          Thank you,

                          Muriel

                          Comment


                          • #14
                            Hi Muriel

                            Were you able to find an answer to your question?

                            Thanks

                            Comment


                            • #15
                              We need the "QUAL" of invariant sites (homogeneous reference sites) from GATK. Can we just plug "0" into the formula -10 log_10 Prob (0 variant) since ALT is ". "?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM
                              • seqadmin
                                Multiomics Techniques Advancing Disease Research
                                by seqadmin


                                New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

                                A major leap in the field has
                                ...
                                02-08-2024, 06:33 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 02-28-2024, 06:12 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-23-2024, 04:11 PM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-21-2024, 08:52 AM
                              0 responses
                              77 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-20-2024, 08:57 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X