Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ashkot
    Member
    • Nov 2011
    • 59

    The VCF file interpretation

    Hi all, i am working with VCF files from NGS data analysis generated using open source tools.

    In the VCF files we have genotypes listed as 0/1,1/1 etc, in my case i almost never see a row that is 0/0.

    Can someone please help with this, is there something obvious that might be missing which generates the 0/0 genotypes calls?

    Using the 1000 Genomes data.

    Thanks,
    Ashwin
  • jflowers
    Member
    • Oct 2011
    • 42

    #2
    What software/command line did you use to generate your VCF?

    Comment

    • ashkot
      Member
      • Nov 2011
      • 59

      #3
      Hi there,
      I am using SAMTools and VCFUtils, the following are the chain of commands.

      samtools sort eg2.bam eg2.sorted

      samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

      bcftools view *.bcf | perl /usr/share/samtools/vcfutils.pl varFilter -D100 > *.vcf

      Comment

      • jflowers
        Member
        • Oct 2011
        • 42

        #4
        Your command line is requesting a variants-only VCF. So, sites with 0/0 (identical to the reference) shouldn't appear right?

        What happens if you use bcftools view -bcg? In this case, you are asking for all sites and you should see many 0/0?

        Comment

        • ashkot
          Member
          • Nov 2011
          • 59

          #5
          Hi there,
          I did run the command as you specified, but again i am not seeing a 0/0. I was working with sample HG115 from 1K Genomes and the reference assembly i used (from 1K genomes website) was hs37d5.fa

          Any ideas?

          Thanks.

          Comment

          • jflowers
            Member
            • Oct 2011
            • 42

            #6
            Hi ashkot,

            I think I mentioned this post by lh3 before, but he recommends not looking at the genotype calls in the vcf at all:



            I am running your command line on some of my data to see if I get the same as you

            Comment

            • jflowers
              Member
              • Oct 2011
              • 42

              #7
              ashkot,

              Out of curisiosity, can I ask what you are hoping to extract from the genotype calls? Are you trying to generate a consensus?

              Comment

              • ashkot
                Member
                • Nov 2011
                • 59

                #8
                jflowers,
                i am trying to determine the zygosity for variants. i have a database of variants and i want to find out if a person is homozygous wildtype, hetrozygous and homozygous risk allele for those variants.

                since in the vcf file i generated the calls are 0/1 or 1/1 i dont know if the absense of a variant means homozygous wildtype.

                i would prefer if such variants were also in the VCF.

                i am waiting to hear from you about the results that you get when you run the command-line.

                thanks.

                ashwin

                Comment

                • jflowers
                  Member
                  • Oct 2011
                  • 42

                  #9
                  Originally posted by ashkot View Post
                  Hi there,
                  I am using SAMTools and VCFUtils, the following are the chain of commands.

                  samtools sort eg2.bam eg2.sorted

                  samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

                  bcftools view *.bcf | perl /usr/share/samtools/vcfutils.pl varFilter -D100 > *.vcf
                  I see the problem. Looking at the source code for vcfutils.pl varFilter there is the following line of code:

                  next if ($t[4] eq '.'); # skip non-var sites

                  $t[4] contains data for the fifth column in the vcf, so even if you request all sites to be output using bcftools view -cg calling vcfutils.pl varFilter is filtering invariant sites (ie site with the same basecall as the reference).

                  One workaround woud be to avoid using vcfutils.pl varFilter, so why not try:

                  samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -cg - > *.vcf

                  If all you need to do is filter out read depth > 100 (-D 100) then this could easily be done without vcfutils. There are a number of vcf filtering tools available. Another solution would be to modify the vcfutils.pl varFilter to suit your needs, but I wouldn't do that unless you fully understand the implications of your modifications.

                  Comment

                  • jflowers
                    Member
                    • Oct 2011
                    • 42

                    #10
                    One final parting shot, given that Heng Li has indicated that there is a problem with the genotypes in the vcf (see example at http://biostar.stackexchange.com/que...quality-scores) I would hesitate before using the genotypes 0/0, 0/1, 1/1 by itself without considering the PL scores.

                    Comment

                    • ashkot
                      Member
                      • Nov 2011
                      • 59

                      #11
                      jflowers, i did exactly as you suggested and the program output a huge VCF file although there is not a single 0/0 row. I am not sure what could be going on here?

                      Any further ideas?

                      Thanks.

                      Comment

                      • jflowers
                        Member
                        • Oct 2011
                        • 42

                        #12
                        Good, at least now youre not filtering out the homozygous sites.

                        If you follow the example carefully from the biostar-exchange link I sent you, you will see that the genotypes (e.g., 0/0,0/1,1/1) are in fact not called correctly (I'm not sure why). Heng Li (samtools lead developer) to this post responded acknowledging this fact, so don't look at these genotypes.

                        Using the command line I suggested (without using vcfutils.pl varFilter) should have produced a vcf with many rows with reference allele column being A,T,G,or C and the alternate allele column is ".". Right? These sites are the homozygotes you are looking for.

                        Can you post say a few examples like this, then we can see what is going on.

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by SEQadmin2


                          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                          Here are nine questions we think about, in roughly the order they matter, before...
                          06-18-2026, 07:11 AM
                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, Yesterday, 11:10 AM
                        0 responses
                        8 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-17-2026, 06:09 AM
                        0 responses
                        43 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-09-2026, 11:58 AM
                        0 responses
                        104 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-05-2026, 10:09 AM
                        0 responses
                        125 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...