Header Leaderboard Ad

Collapse

R, vcfr, extrcat.gt, bcftools filter

Collapse

Announcement

Collapse

SEQanswers June Challenge Has Begun!

The competition has begun! We're giving away a $50 Amazon gift card to the member who answers the most questions on our site during the month. We want to encourage our community members to share their knowledge and help each other out by answering questions related to sequencing technologies, genomics, and bioinformatics. The competition is open to all members of the site, and the winner will be announced at the beginning of July. Best of luck!

For a list of the official rules, visit (https://www.seqanswers.com/forum/sit...wledge-and-win)
See more
See less
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • R, vcfr, extrcat.gt, bcftools filter

    Hello,
    I have been unable to extract depth 'DP' info from my vcf files (generated with latest bcftools variant calling (SNPs)) after being converted to vcfr for R analyses.

    bcftools may break the vcf formatting. I cannot extract depth data in R packages, for e.g., all I get is 'NA' for each depth (DP). I can see the DP value in the vcf, but R packages cannot see it.

    VCF format after bcftools (DP values are clearly present):
    Chr10_nm_RagTag 187 . A T 4.82181 LowQual DP=38;VDB=0.407747;SGB=-0.0966945;RPB=0.0621765;MQB=0.984496;MQSB=0.618783;BQB=0.810528;MQ0F=0.157895;AC=2;AN=26;DP4=7,2,3,3;MQ=9 GT:PL:AD:QS ./.:0,0,0:0,0:0,0 0/0:0,3,4:1,0:4,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,

    but the often used function "extract.gt" misses the DP values and only returns items highlighted in green.

    > dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
    > dp[1:4,1:6]
    MZ195912 MZ195913 MZ195914 MZ195915 MZ195916 MZ195930
    Chr10_nm_RagTag_187 NA NA NA NA NA NA
    Chr10_nm_RagTag_188 NA NA NA NA NA NA
    Chr10_nm_RagTag_194 NA NA NA NA NA NA
    Chr10_nm_RagTag_195 NA NA NA NA NA NA

    I think this is because bcftools filtering placed DP info in the wrong field and when data are converted to the required R format, the info is lost. If I do a head on the converted data I see no DP info:

    head(vcf)
    [1] "***** Object of class 'vcfR' *****"
    [1] "***** Meta section *****"
    [1] "##fileformat=VCFv4.2"
    [1] "##FILTER=<ID=PASS,Description="All filters passed">"
    [1] "##bcftoolsVersion=1.12+htslib-1.12"
    [1] "##bcftoolsCommand=mpileup -Ou -I -B -C 50 -a QS -a AD --threads 32 - [Truncated]"
    [1] "##reference=file:///home/bsimison/Projects/Neotoma/96_LC-genomes/Nbr [Truncated]"
    [1] "##contig=<ID=Chr10_nm_RagTag,length=45123732>"
    [1] "First 6 rows."
    [1]
    [1] "***** Fixed section *****"
    CHROM POS ID REF ALT QUAL FILTER
    [1,] "Chr10_nm_RagTag" "187" NA "A" "T" "4.82181" "LowQual"
    [2,] "Chr10_nm_RagTag" "188" NA "G" "A" "42.0559" "LowQual"
    [3,] "Chr10_nm_RagTag" "194" NA "G" "C" "3.74867" "LowQual"
    [4,] "Chr10_nm_RagTag" "195" NA "A" "C" "5.61943" "LowQual"
    [5,] "Chr10_nm_RagTag" "488" NA "G" "C" "32.0871" "PASS"
    [6,] "Chr10_nm_RagTag" "518" NA "T" "C" "157.534" "PASS"
    [1]
    [1] "***** Genotype section *****"
    FORMAT MZ195912 MZ195913
    [1,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "0/0:0,3,4:1,0:4,0"
    [2,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [3,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [4,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [5,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [6,] "GT:PL:AD:QS" "1/1:29,3,0:0,1:0,29" "./.:0,0,0:0,0:0,0"
    MZ195914 MZ195915 MZ195916
    [1,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [2,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [3,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [4,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [5,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [6,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0"
    [1] "First 6 columns only."
    [1]
    [1] "Unique GT formats:"
    [1] "GT:PL:AD:QS"
    [1]

    you can see that only "GT:PL:AD:QS" are captured, but not DP.
    Anybody know what's going on here?
    Last edited by wbsimey; 08-19-2021, 09:11 AM.

  • #2
    Brian Knaus, on his vcfR Git site, helped me figure this out. Turns out if you want to include Hardy Weinberg in your population variant calling with bcftools, you must include some annotations with the -a option for bcftools mpileup. Any annotations not listed with -a will not be included in the 'gt' slot; they will be in the 'info' slot so "extract.gt" will ignore any annotations not in the 'gt' slot. Alternatively, you can try 'extract.info', but that did not pull DP stats for each individual for me.

    The bcftools commands that solved the above issue was:

    "bcftools mpileup -Ou -I -B -C 50 -a QS -a AD -a DP --threads $THREADS -f $REF $bam_files"
    call_cmd="bcftools call -mv -Oz --threads $THREADS -o $OUTFILE"

    Comment

    Latest Articles

    Collapse

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Yesterday, 07:14 AM
    0 responses
    7 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 06-06-2023, 01:08 PM
    0 responses
    10 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 06-01-2023, 08:56 PM
    0 responses
    164 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 06-01-2023, 07:33 AM
    0 responses
    299 views
    0 likes
    Last Post seqadmin  
    Working...
    X