Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calling specific loci with UnifiedGenotyper

    Hi, I've been having a little trouble (and confusion) trying to get the right set of calls out of the GATK's UnifiedGenotyper; I'm doing analysis of of known and novel SNPs, and I'd like to get calls for both:
    - observed variants across my 50 samples (including novel ones)
    - known variant sites (even though all samples may be homozygous reference) for which there is sufficient read coverage, etc.
    The no-call vs homozygous reference difference is a really a big deal for my analysis. When I run with -out_mode EMIT_VARIANTS_ONLY, I only get calls for sites where my samples vary, and when I run with -out_mode EMIT_ALL_CONFIDENT_SITES, I get a huge .vcf file with tons of calls everywhere I had decent coverage. Is there a way to tell UnifiedGenotyper to call specific sites, e.g. the ones listed in the --dbsnp rod file? Or do I need to run with EMIT_ALL_CONFIDENT_SITES, and filter by hand (which I'd like to avoid)?

  • #2
    I wound up writing a Python script to filter the results of EMIT_ALL_CONFIDENT_SITES with the dbSNP .vcf file - as it's bundled with a bunch of my still-unfinished-helper-scripts I'm not going to post it publicly, but PM me if you would like a copy.

    Comment


    • #3
      -L "chr4:21942-34289"

      The -L flag in combination with EMIT_ALL_SITES should get what you want.

      Comment


      • #4
        Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
        Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.

        Comment


        • #5
          Originally posted by yasashiku View Post
          Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
          Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.
          Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?

          I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote:



          but this script might have more stringent filtering criteria than what you have in mind, so you might need to tweak it a little bit. Good luck!

          Comment


          • #6
            Originally posted by ymc View Post
            Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?
            Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.

            I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote...
            What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.

            Comment


            • #7
              Originally posted by yasashiku View Post
              Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.


              What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.
              I am not telling you tell do an imputation. I am telling you that the BEAGLE author wrote a script to convert 1000genomes data into his BEAGLE format. In the process, I think he is doing approximately what you want.

              Of course, some imputation can also be helpful if some of your sites have very poor coverage.

              Comment


              • #8
                hi yasashiku,

                I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

                Thanks
                Rama

                Comment


                • #9
                  Originally posted by rama View Post
                  hi yasashiku,

                  I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

                  Thanks
                  Rama
                  Hi Rama,
                  I've actually moved on to another research area entirely, so I'm probably not going to be a ton of help - I know some of these parameter names have probably already changed. My best advice is to try two calling passes; one that picks up things that vary in your data, and another EMIT_ALL_CONFIDENT_SITES run with -L someBedFile.bed, where someBedFile.bed has a ton of 1-bp-long regions for every known variant. Maybe some tool out there will convert the .vcf file containing known sites into a .bed file? Finally, merge the two .vcf files, and be warned that some of the lines will be redundant.

                  A cleaner idea would be to run once with EMIT_ALL_CONFIDENT_SITES if you have enough space on your hard drive (the result will be massive), and filter it immediately after using your own script. In case it's useful, I've posted the scripts I wrote back when I was playing with NGS data here. None of them will help you directly, but if you feel like hacking my python code, filterVCF.py might get you close. If you're uncomfortable writing/hacking scripts, Data Wrangler or OpenRefine (formerly Google Refine) might be able to help.

                  Comment


                  • #10
                    Hi yasashiku,

                    Thanks a bunch for your kind reply. I am going to try your 1st suggestion as this seem to be the most common according to the GATK forums.

                    Rama

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Latest Developments in Precision Medicine
                      by seqadmin



                      Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                      Somatic Genomics
                      “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                      05-24-2024, 01:16 PM
                    • seqadmin
                      Recent Advances in Sequencing Analysis Tools
                      by seqadmin


                      The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                      05-06-2024, 07:48 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 05-24-2024, 07:15 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-23-2024, 10:28 AM
                    0 responses
                    18 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-23-2024, 07:35 AM
                    0 responses
                    22 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-22-2024, 02:06 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X