Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calling specific loci with UnifiedGenotyper

    Hi, I've been having a little trouble (and confusion) trying to get the right set of calls out of the GATK's UnifiedGenotyper; I'm doing analysis of of known and novel SNPs, and I'd like to get calls for both:
    - observed variants across my 50 samples (including novel ones)
    - known variant sites (even though all samples may be homozygous reference) for which there is sufficient read coverage, etc.
    The no-call vs homozygous reference difference is a really a big deal for my analysis. When I run with -out_mode EMIT_VARIANTS_ONLY, I only get calls for sites where my samples vary, and when I run with -out_mode EMIT_ALL_CONFIDENT_SITES, I get a huge .vcf file with tons of calls everywhere I had decent coverage. Is there a way to tell UnifiedGenotyper to call specific sites, e.g. the ones listed in the --dbsnp rod file? Or do I need to run with EMIT_ALL_CONFIDENT_SITES, and filter by hand (which I'd like to avoid)?

  • #2
    I wound up writing a Python script to filter the results of EMIT_ALL_CONFIDENT_SITES with the dbSNP .vcf file - as it's bundled with a bunch of my still-unfinished-helper-scripts I'm not going to post it publicly, but PM me if you would like a copy.

    Comment


    • #3
      -L "chr4:21942-34289"

      The -L flag in combination with EMIT_ALL_SITES should get what you want.

      Comment


      • #4
        Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
        Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.

        Comment


        • #5
          Originally posted by yasashiku View Post
          Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
          Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.
          Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?

          I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote:



          but this script might have more stringent filtering criteria than what you have in mind, so you might need to tweak it a little bit. Good luck!

          Comment


          • #6
            Originally posted by ymc View Post
            Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?
            Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.

            I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote...
            What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.

            Comment


            • #7
              Originally posted by yasashiku View Post
              Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.


              What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.
              I am not telling you tell do an imputation. I am telling you that the BEAGLE author wrote a script to convert 1000genomes data into his BEAGLE format. In the process, I think he is doing approximately what you want.

              Of course, some imputation can also be helpful if some of your sites have very poor coverage.

              Comment


              • #8
                hi yasashiku,

                I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

                Thanks
                Rama

                Comment


                • #9
                  Originally posted by rama View Post
                  hi yasashiku,

                  I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

                  Thanks
                  Rama
                  Hi Rama,
                  I've actually moved on to another research area entirely, so I'm probably not going to be a ton of help - I know some of these parameter names have probably already changed. My best advice is to try two calling passes; one that picks up things that vary in your data, and another EMIT_ALL_CONFIDENT_SITES run with -L someBedFile.bed, where someBedFile.bed has a ton of 1-bp-long regions for every known variant. Maybe some tool out there will convert the .vcf file containing known sites into a .bed file? Finally, merge the two .vcf files, and be warned that some of the lines will be redundant.

                  A cleaner idea would be to run once with EMIT_ALL_CONFIDENT_SITES if you have enough space on your hard drive (the result will be massive), and filter it immediately after using your own script. In case it's useful, I've posted the scripts I wrote back when I was playing with NGS data here. None of them will help you directly, but if you feel like hacking my python code, filterVCF.py might get you close. If you're uncomfortable writing/hacking scripts, Data Wrangler or OpenRefine (formerly Google Refine) might be able to help.

                  Comment


                  • #10
                    Hi yasashiku,

                    Thanks a bunch for your kind reply. I am going to try your 1st suggestion as this seem to be the most common according to the GATK forums.

                    Rama

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Exploring the Dynamics of the Tumor Microenvironment
                      by seqadmin




                      The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                      07-08-2024, 03:19 PM
                    • seqadmin
                      Exploring Human Diversity Through Large-Scale Omics
                      by seqadmin


                      In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                      06-25-2024, 06:43 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 05:49 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-15-2024, 06:53 AM
                    0 responses
                    26 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-10-2024, 07:30 AM
                    0 responses
                    37 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-03-2024, 09:45 AM
                    0 responses
                    204 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X