  • GATK with non-model organism (Help with making SNP VCF file)

    Hi

    Has anyone tried using GATK with a non-model organism? If so, would you kindly tell me the format of the VCF files you are using? I was reading a few posts on the GATK forum and got details about making a VCF file. Below is a snapshot of my VCF file.

    ##fileformat=VCFv4.0
    #CHROM POS ID REF ALT QUAL FILTER INFO
    1 11613 . C A . PASS .
    1 12971 . T G . PASS .
    1 13003 . T A . PASS .
    1 13032 . A G . PASS .

    GATK does make the Tribble index file, but at the end it throws an error saying: "The provided VCF file is malformed at approximately line number 56981: The VCF specification does not allow for whitespace in the INFO"

    The INFO column has ".", which I found others have used when making a VCF file. Do I have to validate it with a VCF validator? I do know VCFtools, but I think this file will not pass validation.
    I would appreciate any help, as I am stuck and need to complete this....
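
One way to narrow down an error like this is to scan the file for lines that are not tab-delimited or that carry stray whitespace in the INFO column. The sketch below is only a heuristic check, not a replacement for a real validator; it assumes the cause is spaces used in place of tabs or trailing whitespace, which the error message suggests but does not prove. The function name `find_vcf_format_problems` is made up for illustration.

```python
def find_vcf_format_problems(lines, min_columns=8):
    """Return (line_number, message) pairs for lines that look malformed.

    Heuristic only: checks tab delimiting and whitespace in INFO.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##"):
            continue  # meta-information lines may contain spaces
        if not line.strip():
            problems.append((number, "blank or whitespace-only line"))
            continue
        fields = line.split("\t")
        if len(fields) < min_columns:
            problems.append(
                (number,
                 f"expected at least {min_columns} tab-separated columns, "
                 f"found {len(fields)}")
            )
            continue
        if line.startswith("#"):
            continue  # the #CHROM header line only needs the column check
        info = fields[7]
        if any(ch.isspace() for ch in info):
            problems.append((number, "whitespace in INFO column"))
    return problems
```

To use it on a file, pass an open file handle, for example `find_vcf_format_problems(open("your.vcf"))`, and look at the entries reported near the line number GATK complains about.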

    Thanks

  • #2
    What are you trying to accomplish by writing a VCF file manually? What is your goal with GATK?


    • #3
      I need to call SNPs from a population to look at ASE. I have used GATK before, but with the changes in format and input files, it's tough to recreate it...


      • #4
        So first you can call SNPs without specifying a VCF with known mutations. Then you don't need to write a VCF file.


        • #5
          But I was looking into the protocol, and it advises against doing that unless you are an expert. Moreover, does it not help in the validation and recalibration steps?
          Last edited by newbietonextgen; 07-09-2012, 04:32 PM.


          • #6
            Well, you can use a known list of mutations during the VQSR step (variant quality score recalibration). But if you do not have a known list, it doesn't help to make one up.

            There are two suggestions I have (and others on SEQanswers might have more):

            1. If you only have one genome, you might try calling without a known SNP file, then filter the output VCF for just the highest-quality calls, and then re-call the SNPs using the filtered list as your known list. Not perfect, I know.

            2. If you have a population of genomes, you might likewise call without a known list, then take a set of high-quality calls found in many genomes from the population and make that your known set.
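
The two suggestions above can be sketched roughly as follows. This is a minimal illustration, not a GATK feature: the function names, the `min_qual=50.0` cutoff, and the `min_samples=2` cutoff are all assumptions chosen for the example, and the second function assumes one VCF per individual, which may not match how your calls were produced.

```python
from collections import Counter


def high_quality_records(vcf_lines, min_qual=50.0):
    """Yield header lines unchanged, plus records with QUAL >= min_qual.

    min_qual is an arbitrary illustrative cutoff, not a recommended value.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 6 or fields[5] == ".":
            continue  # no usable QUAL value, so the record is dropped
        if float(fields[5]) >= min_qual:
            yield line


def sites_shared_across_samples(per_sample_vcfs, min_samples=2):
    """Return sites (CHROM, POS, REF, ALT) seen in at least min_samples call sets.

    Each element of per_sample_vcfs is an iterable of VCF lines for one sample.
    The result is sorted with plain tuple ordering (CHROM compared as text).
    """
    counts = Counter()
    for vcf_lines in per_sample_vcfs:
        seen = set()
        for line in vcf_lines:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 5:
                continue  # skip lines too short to be records
            seen.add((fields[0], int(fields[1]), fields[3], fields[4]))
        counts.update(seen)  # each site counts once per sample
    return sorted(site for site, n in counts.items() if n >= min_samples)
```

A typical combination would be to run each sample's lines through `high_quality_records` first and then pass the filtered sets to `sites_shared_across_samples`, so that the known set only contains confident calls seen repeatedly.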

            Does anyone else have better ideas? You might also get on the GetSatisfaction page for GATK and see if they have a better suggestion.

            Hope this helps.


            • #7
              Thanks for the help. Yes, I do have two populations for which I need to call SNPs. I am sure your method works, but I will keep it as a last resort. It's such a shame that such a useful tool is limited by just one file, the VCF format file. I did post at the GetSatisfaction forum and was told to validate my VCF file. I am not sure how this is going to work, as most of the columns are empty... but I will try to validate my VCF. Any other suggestions, please...


              • #8
                I've been wrestling with this for a few days. You aren't actually limited to just VCF; you can use BED and others (http://gatkforums.broadinstitute.org...n/1349/tribble, see part 4), but I couldn't get a BED that was used in v1 to work in v2, so I made my own VCF.

                @adaptivegenome I tried doing as you suggest, but the whole point of realigning and recalibrating is to use known SNPs to inform the process (well, that is the point for me). I have called SNPs with mpileup->VarScan, with fairly strict definitions of what a SNP is, and got good results. The 'gold standard' of GATK seems to be called for when publishing... Also, it's a great tool theoretically.

                OK, so here is my workflow as a hint for @newbie:

                1. Make your VCF (I have "." for INFO and it works fine).
                2. vcf-sort your.vcf > your.sorted.vcf
                3. vcf-validator your.sorted.vcf
                4. Take your reference.fasta and remove any chromosomes/scaffolds not found in your VCF, otherwise it will throw an ERROR (again!).
                5. Sort the new FASTA exactly as your VCF is sorted (by chromosome/scaffold); no other order will do!
                6. igvtools index your.sorted.fasta > your.sorted.fasta.fai
                7. Try running it now; GATK should make your.sorted.dict for you and build the Tribble index.

                This seems to take a while (hence I am searching "tribble index" and am here =)
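
The FASTA steps in this workflow (dropping scaffolds that are absent from the VCF and putting the rest in the VCF's contig order) could be sketched like this. It is a rough illustration under stated assumptions: the function names are hypothetical, contig names are taken as the first word after ">" in each FASTA header, and the whole reference is held in memory, which may not suit a large genome.

```python
def contig_order_from_vcf(vcf_lines):
    """Return contig names in order of first appearance in the VCF records."""
    order = []
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.add(chrom)
            order.append(chrom)
    return order


def read_fasta(fasta_lines):
    """Map contig name -> list of sequence lines (name = first word of header)."""
    records = {}
    name = None
    for raw in fasta_lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return records


def reorder_fasta(fasta_lines, contig_order):
    """Return FASTA lines holding only contig_order contigs, in that order."""
    records = read_fasta(fasta_lines)
    missing = [c for c in contig_order if c not in records]
    if missing:
        raise KeyError(f"contigs in VCF but not in FASTA: {missing}")
    out = []
    for contig in contig_order:
        out.append(">" + contig)  # header is reduced to the bare contig name
        out.extend(records[contig])
    return out
```

The returned lines have no trailing newlines, so writing them out would be something like `handle.write("\n".join(lines) + "\n")`. The FASTA index then has to be rebuilt for the new file, as in step 6.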

                Good luck,

                Bruce.
