  • Overlap number discrepancy between VCFTools and BEDTools

    --------------------------------VCFTools-----------------------------------------------------
    $./vcf-compare in.vcf.gz dbsnp_132.hg18.vcf.gz > in.snp-dbsnp.summary
    $less in.snp-dbsnp.summary

    Number of sites found only in
    115 in.vcf.gz (0.8%)
    15101 dbsnp_132.hg18.vcf.gz (0.1%) in.vcf.gz (99.2%)
    26694223 dbsnp_132.hg18.vcf.gz (99.9%)

    Number of REF matches: 15052
    Number of ALT matches: 14449
    Number of REF mismatches: 49
    Number of ALT mismatches: 603
    Number of sites lost due to grouping (e.g. duplicate sites)
    2024327 (7.0%) .. read 28733651, reported 26709324 dbsnp_132.hg18.vcf.gz

    --------------------------------BEDTools-----------------------------------------------------
    $intersectBed -u -f 1 -a in.vcf -b dbsnp_132.hg18.vcf > in.snp-dbsnp.u.bed
    $wc -l in.snp-dbsnp.u.bed
    15092 in.snp-dbsnp.u.bed

    ***********************************************************************************************
    As you can see, the number of overlapping sites differs: 15101 from VCFTools but 15092 from BEDTools.

    I also used vcf-isec to produce the VCFTools version of the overlap VCF. I then used 'intersectBed' to intersect this VCF (VCFTools version) with in.snp-dbsnp.u.bed (BEDTools version) and recovered all 15092 overlaps.

    AFAIK, BEDTools computes the overlap using only the position information (i.e. it reports an overlap even when the base(s) differ). But I am not sure what VCFTools does to come up with the additional 9 overlaps (or why BEDTools is missing 9 overlaps).

    It would be great if someone could explain what VCFTools' vcf-isec is doing and give me some advice on how to interpret the above-mentioned discrepancy. Many thanks!
    Last edited by zxyeo; 07-25-2011, 07:48 PM.
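
The counts in the vcf-compare summary above can be cross-checked with simple arithmetic. The sketch below only re-adds numbers copied from that output; the variable names are made up for illustration. It suggests that the 15101 shared sites split into REF matches plus REF mismatches, and that ALT alleles are tallied only among the REF-matching sites.

```python
# Counts copied from the vcf-compare summary in the post above.
only_in_sample = 115
shared = 15101
only_in_dbsnp = 26694223
ref_match, ref_mismatch = 15052, 49
alt_match, alt_mismatch = 14449, 603
dbsnp_read, dbsnp_reported, dbsnp_lost = 28733651, 26709324, 2024327

sample_total = only_in_sample + shared  # all sites reported for in.vcf.gz

# Each entry is a relation we expect to hold if the summary is self-consistent.
checks = {
    "shared = REF matches + REF mismatches": shared == ref_match + ref_mismatch,
    "REF matches = ALT matches + ALT mismatches": ref_match == alt_match + alt_mismatch,
    "dbSNP reported = shared + dbSNP-only": dbsnp_reported == shared + only_in_dbsnp,
    "dbSNP read - reported = lost to grouping": dbsnp_read - dbsnp_reported == dbsnp_lost,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")

print(f"shared fraction of in.vcf.gz: {100 * shared / sample_total:.1f}%")
```

All four relations hold for these numbers, so the 15101 figure counts shared positions regardless of whether the alleles agree.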

  • #2
    Update: is BEDTools doing more than simply comparing the genomic position?

    I just looked at the additional 9 entries and I would like to update my interpretation:

    Let's start with one of the 9 entries (the other entries generally show the same pattern):
    chr2 79990299 . GAGC G 408.26 PASS AC=2;AF=1.00;AN=2;DP=12;FS=0.000;HRun=0;HaplotypeScore=63.0941;MQ=41.32;MQ0=0;QD=34.02;SB=-158.86;SF=0,1,1

    In 'dbsnp_132.hg18.vcf', a similar entry was found:
    chr2 79990299 rs10578220 G GAGC . PASS G5;G5A;GNO;NSF;REF;RSPOS=80136791;SAO=0;SCS=0;SLO;SSR=0;VC=INDEL;VP=050100001201030100000200;WGT=1;dbSNPBuildID=119

    i.e. my REF and ALT base(s) are somehow switched relative to the dbSNP entry.

    Interestingly, BEDTools managed to distinguish them (which means BEDTools might be using the base identity on top of the position information). I guess VCFTools might be considering only the position information (which is why this is detected as an overlap). Correct me if this is wrong.
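
Another possible reading of this entry, sketched under an assumption the thread does not confirm: if BEDTools derives each VCF record's interval from the length of its REF allele, then the 4 bp REF (GAGC) in in.vcf and the 1 bp REF (G) in dbSNP cover different spans. The `-f 1` option requires the overlap to cover 100% of the A record, so a 1 bp overlap on a 4 bp interval would be rejected without any comparison of bases. The helper names below are hypothetical.

```python
def vcf_interval(pos, ref):
    """Return a 0-based, half-open interval for a VCF record.

    ASSUMPTION: the record spans POS .. POS + len(REF) - 1 (1-based).
    """
    start = pos - 1
    return start, start + len(ref)


def overlap_fraction_of_a(a, b):
    """Fraction of interval a that is covered by interval b."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap / (a[1] - a[0])


# The two records quoted in the post: same POS, REF and ALT swapped.
sample = vcf_interval(79990299, "GAGC")  # record from in.vcf (the -a file)
dbsnp = vcf_interval(79990299, "G")      # record from dbsnp_132.hg18.vcf (the -b file)

frac = overlap_fraction_of_a(sample, dbsnp)
passes_f1 = frac >= 1.0  # mimics the "-f 1" threshold on the A record

print(f"sample interval: {sample}, dbSNP interval: {dbsnp}")
print(f"fraction of A overlapped: {frac:.2f}, reported with -f 1: {passes_f1}")
```

Under this assumption the record is dropped because of interval length rather than base identity. Rerunning intersectBed without `-f 1` would be one way to test whether the 9 missing entries reappear.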

    • #3
      vcf-compare

      Compares positions in two or more VCF files and outputs the numbers of positions contained in one but not the other files; two but not the other files, etc, which comes handy when generating Venn diagrams. The script also computes numbers such as nonreference discordance rates (including multiallelic sites), compares actual sequence (useful when comparing indels), etc.

      So vcf-compare is just considering the position information when it counts overlapping sites.
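
A position-only comparison of the kind described above can be sketched as follows. This is not the vcf-compare implementation, just a minimal illustration: `compare_by_position` is a hypothetical function, and apart from the chr2:79990299 record quoted earlier in the thread, the toy records are invented.

```python
def compare_by_position(a_records, b_records):
    """Compare two lists of (chrom, pos, ref, alt) tuples by position only.

    Records are keyed by (chrom, pos), so duplicate positions within one
    list collapse to a single entry in this sketch.
    """
    a = {(chrom, pos): (ref, alt) for chrom, pos, ref, alt in a_records}
    b = {(chrom, pos): (ref, alt) for chrom, pos, ref, alt in b_records}
    shared = a.keys() & b.keys()

    ref_match = sum(1 for k in shared if a[k][0] == b[k][0])
    # ALT alleles are compared only where REF already agrees.
    alt_match = sum(1 for k in shared if a[k] == b[k])
    return {
        "only_a": len(a.keys() - b.keys()),
        "only_b": len(b.keys() - a.keys()),
        "shared": len(shared),
        "ref_match": ref_match,
        "ref_mismatch": len(shared) - ref_match,
        "alt_match": alt_match,
    }


# First record in each list is from the thread; the rest are made-up toy data.
sample_records = [
    ("chr2", 79990299, "GAGC", "G"),
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "C", "G"),
]
dbsnp_records = [
    ("chr2", 79990299, "G", "GAGC"),
    ("chr1", 100, "A", "T"),
    ("chr1", 300, "T", "C"),
]

summary = compare_by_position(sample_records, dbsnp_records)
print(summary)
```

Here the chr2 record counts as shared because the position matches, and it shows up separately as a REF mismatch, which is the same pattern as the 49 REF mismatches in the original summary.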
