Header Leaderboard Ad


R, vcfr, extrcat.gt, bcftools filter



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • wbsimey
    Brian Knaus, on his vcfR Git site, helped me figure this out. Turns out if you want to include Hardy Weinberg in your population variant calling with bcftools, you must include some annotations with the -a option for bcftools mpileup. Any annotations not listed with -a will not be included in the 'gt' slot; they will be in the 'info' slot so "extract.gt" will ignore any annotations not in the 'gt' slot. Alternatively, you can try 'extract.info', but that did not pull DP stats for each individual for me.

    The bcftools commands that solved the above issue was:

    "bcftools mpileup -Ou -I -B -C 50 -a QS -a AD -a DP --threads $THREADS -f $REF $bam_files"
    call_cmd="bcftools call -mv -Oz --threads $THREADS -o $OUTFILE"

    Leave a comment:

  • wbsimey
    started a topic R, vcfr, extrcat.gt, bcftools filter

    R, vcfr, extrcat.gt, bcftools filter

    I have been unable to extract depth 'DP' info from my vcf files (generated with latest bcftools variant calling (SNPs)) after being converted to vcfr for R analyses.

    bcftools may break the vcf formatting. I cannot extract depth data in R packages, for e.g., all I get is 'NA' for each depth (DP). I can see the DP value in the vcf, but R packages cannot see it.

    VCF format after bcftools (DP values are clearly present):
    Chr10_nm_RagTag 187 . A T 4.82181 LowQual DP=38;VDB=0.407747;SGB=-0.0966945;RPB=0.0621765;MQB=0.984496;MQSB=0.618783;BQB=0.810528;MQ0F=0.157895;AC=2;AN=26;DP4=7,2,3,3;MQ=9 GT:PL:AD:QS ./.:0,0,0:0,0:0,0 0/0:0,3,4:1,0:4,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,

    but the often used function "extract.gt" misses the DP values and only returns items highlighted in green.

    > dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
    > dp[1:4,1:6]
    MZ195912 MZ195913 MZ195914 MZ195915 MZ195916 MZ195930
    Chr10_nm_RagTag_187 NA NA NA NA NA NA
    Chr10_nm_RagTag_188 NA NA NA NA NA NA
    Chr10_nm_RagTag_194 NA NA NA NA NA NA
    Chr10_nm_RagTag_195 NA NA NA NA NA NA

    I think this is because bcftools filtering placed DP info in the wrong field and when data are converted to the required R format, the info is lost. If I do a head on the converted data I see no DP info:

    [1] "***** Object of class 'vcfR' *****"
    [1] "***** Meta section *****"
    [1] "##fileformat=VCFv4.2"
    [1] "##FILTER=<ID=PASS,Description="All filters passed">"
    [1] "##bcftoolsVersion=1.12+htslib-1.12"
    [1] "##bcftoolsCommand=mpileup -Ou -I -B -C 50 -a QS -a AD --threads 32 - [Truncated]"
    [1] "##reference=file:///home/bsimison/Projects/Neotoma/96_LC-genomes/Nbr [Truncated]"
    [1] "##contig=<ID=Chr10_nm_RagTag,length=45123732>"
    [1] "First 6 rows."
    [1] "***** Fixed section *****"
    [1,] "Chr10_nm_RagTag" "187" NA "A" "T" "4.82181" "LowQual"
    [2,] "Chr10_nm_RagTag" "188" NA "G" "A" "42.0559" "LowQual"
    [3,] "Chr10_nm_RagTag" "194" NA "G" "C" "3.74867" "LowQual"
    [4,] "Chr10_nm_RagTag" "195" NA "A" "C" "5.61943" "LowQual"
    [5,] "Chr10_nm_RagTag" "488" NA "G" "C" "32.0871" "PASS"
    [6,] "Chr10_nm_RagTag" "518" NA "T" "C" "157.534" "PASS"
    [1] "***** Genotype section *****"
    FORMAT MZ195912 MZ195913
    [1,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "0/0:0,3,4:1,0:4,0"
    [2,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [3,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [4,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [5,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [6,] "GT:PL:AD:QS" "1/1:29,3,0:0,1:0,29" "./.:0,0,0:0,0:0,0"
    MZ195914 MZ195915 MZ195916
    [1,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [2,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [3,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [4,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [5,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [6,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [1] "First 6 columns only."
    [1] "Unique GT formats:"
    [1] "GT:PL:AD:QS"

    you can see that only "GT:PL:AD:QS" are captured, but not DP.
    Anybody know what's going on here?
    Last edited by wbsimey; 08-19-2021, 09:11 AM.