Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • rama
    replied
    Hi yasashiku,

    Thanks a bunch for your kind reply. I am going to try your 1st suggestion as this seem to be the most common according to the GATK forums.

    Rama

    Leave a comment:


  • yasashiku
    replied
    Originally posted by rama View Post
    hi yasashiku,

    I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

    Thanks
    Rama
    Hi Rama,
    I've actually moved on to another research area entirely, so I'm probably not going to be a ton of help - I know some of these parameter names have probably already changed. My best advice is to try two calling passes; one that picks up things that vary in your data, and another EMIT_ALL_CONFIDENT_SITES run with -L someBedFile.bed, where someBedFile.bed has a ton of 1-bp-long regions for every known variant. Maybe some tool out there will convert the .vcf file containing known sites into a .bed file? Finally, merge the two .vcf files, and be warned that some of the lines will be redundant.

    A cleaner idea would be to run once with EMIT_ALL_CONFIDENT_SITES if you have enough space on your hard drive (the result will be massive), and filter it immediately after using your own script. In case it's useful, I've posted the scripts I wrote back when I was playing with NGS data here. None of them will help you directly, but if you feel like hacking my python code, filterVCF.py might get you close. If you're uncomfortable writing/hacking scripts, Data Wrangler or OpenRefine (formerly Google Refine) might be able to help.

    Leave a comment:


  • rama
    replied
    hi yasashiku,

    I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

    Thanks
    Rama

    Leave a comment:


  • ymc
    replied
    Originally posted by yasashiku View Post
    Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.


    What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.
    I am not telling you tell do an imputation. I am telling you that the BEAGLE author wrote a script to convert 1000genomes data into his BEAGLE format. In the process, I think he is doing approximately what you want.

    Of course, some imputation can also be helpful if some of your sites have very poor coverage.

    Leave a comment:


  • yasashiku
    replied
    Originally posted by ymc View Post
    Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?
    Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.

    I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote...
    What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.

    Leave a comment:


  • ymc
    replied
    Originally posted by yasashiku View Post
    Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
    Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.
    Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?

    I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote:



    but this script might have more stringent filtering criteria than what you have in mind, so you might need to tweak it a little bit. Good luck!

    Leave a comment:


  • yasashiku
    replied
    Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
    Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.

    Leave a comment:


  • ymc
    replied
    -L "chr4:21942-34289"

    The -L flag in combination with EMIT_ALL_SITES should get what you want.

    Leave a comment:


  • yasashiku
    replied
    I wound up writing a Python script to filter the results of EMIT_ALL_CONFIDENT_SITES with the dbSNP .vcf file - as it's bundled with a bunch of my still-unfinished-helper-scripts I'm not going to post it publicly, but PM me if you would like a copy.

    Leave a comment:


  • yasashiku
    started a topic Calling specific loci with UnifiedGenotyper

    Calling specific loci with UnifiedGenotyper

    Hi, I've been having a little trouble (and confusion) trying to get the right set of calls out of the GATK's UnifiedGenotyper; I'm doing analysis of of known and novel SNPs, and I'd like to get calls for both:
    - observed variants across my 50 samples (including novel ones)
    - known variant sites (even though all samples may be homozygous reference) for which there is sufficient read coverage, etc.
    The no-call vs homozygous reference difference is a really a big deal for my analysis. When I run with -out_mode EMIT_VARIANTS_ONLY, I only get calls for sites where my samples vary, and when I run with -out_mode EMIT_ALL_CONFIDENT_SITES, I get a huge .vcf file with tons of calls everywhere I had decent coverage. Is there a way to tell UnifiedGenotyper to call specific sites, e.g. the ones listed in the --dbsnp rod file? Or do I need to run with EMIT_ALL_CONFIDENT_SITES, and filter by hand (which I'd like to avoid)?

Latest Articles

Collapse

  • seqadmin
    Understanding Genetic Influence on Infectious Disease
    by seqadmin




    During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

    Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
    09-09-2024, 10:59 AM
  • seqadmin
    Addressing Off-Target Effects in CRISPR Technologies
    by seqadmin






    The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
    08-27-2024, 04:44 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 09-11-2024, 02:44 PM
0 responses
13 views
0 likes
Last Post seqadmin  
Started by seqadmin, 09-06-2024, 08:02 AM
0 responses
146 views
0 likes
Last Post seqadmin  
Started by seqadmin, 09-03-2024, 08:30 AM
0 responses
153 views
0 likes
Last Post seqadmin  
Started by seqadmin, 08-27-2024, 04:40 AM
0 responses
163 views
0 likes
Last Post seqadmin  
Working...
X