Varscan vcf output for indel
Hi,
I am also encountering issues with vcf output of indels. I have got indels that look like this:
1 984171 . CAG AG .
1 1588744 . AGCG GCG .
I checked the genome browser for the context of both mutations (http://genome.ucsc.edu/cgi-bin/hgTra...A984170-984180 and http://genome.ucsc.edu/cgi-bin/hgTra...588740-1588750). It seems that the first one is supposed to be a simple deletion of the first base and should look like this:
1 984170 . GC G .
The second one can be represented either by a block substitution that looks like this:
1 1588743 . AAG AG .
or a deletion (if you align the deletion to the left) that looks like this:
1 1588742 . GA G .
So I do not know whether I did something wrong or whether VarScan uses a different VCF output format for indels.
Please help me. Many many thanks.
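One way to check which representation is the left-aligned one is to apply the usual trim-and-left-shift normalization to each record. Below is a minimal Python sketch, not a definitive implementation: `normalize` and `TOY_REFERENCE` are hypothetical names, and the reference bases are simply read off the REF columns of the records quoted above rather than fetched from a genome file.

```python
# Toy reference: 1-based position -> base, inferred from the REF columns
# of the records in the post (an assumption, not a real genome lookup).
TOY_REFERENCE = {
    984170: "G", 984171: "C", 984172: "A", 984173: "G",
    1588742: "G", 1588743: "A", 1588744: "A",
    1588745: "G", 1588746: "C", 1588747: "G",
}


def normalize(pos, ref, alt, base_at=TOY_REFERENCE.__getitem__):
    """Return (pos, ref, alt) trimmed and shifted as far left as possible."""
    if ref == alt:
        return pos, ref, alt  # not a variant; nothing to normalize
    changed = True
    while changed:
        changed = False
        # Drop a base that both alleles share at the right end.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele needs the preceding reference base as an anchor.
        if not ref or not alt:
            pos -= 1
            anchor = base_at(pos)
            ref, alt = anchor + ref, anchor + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(normalize(984171, "CAG", "AG"))     # (984170, 'GC', 'G')
print(normalize(1588744, "AGCG", "GCG"))  # (1588742, 'GA', 'G')
```

Under these assumptions, the block-substitution form `1588743 AAG AG` normalizes to the same `1588742 GA G` record, so both alternatives describe one deletion. With a real genome, `base_at` would need to be replaced by a lookup into the reference FASTA.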
-
If you are using VarScan mpileup2snp or mpileup2indel, why does the QUAL column not have a number in it?
-
I'm also seeing the slashes with VarScan v2.3.6
I wrote this script to convert the slashes to commas:
Code:
import sys

if len(sys.argv) < 2:
    sys.exit("Usage: " + sys.argv[0] + " vcf_filename")

in_fname = sys.argv[1]
# Strip a trailing ".vcf" before appending the new suffix
out_fname = (in_fname[:-4] if in_fname.endswith(".vcf") else in_fname) + ".fixed.vcf"
print("Writing to: " + out_fname)

with open(in_fname) as src, open(out_fname, "w") as out:
    for line in src:
        if not line.strip() or line[0] == "#":
            out.write(line)  # copy header and blank lines unchanged
        else:
            fields = line.split("\t")
            # replace any slashes in the REF field with commas
            fields[3] = fields[3].replace("/", ",").replace("\\", ",")
            # replace any slashes in the ALT field with commas
            fields[4] = fields[4].replace("/", ",").replace("\\", ",")
            out.write("\t".join(fields))
python script.py file.vcf
Also, this version of the script just removes the vcf records with slashes:
Code:
import sys

if len(sys.argv) < 2:
    sys.exit("Usage: " + sys.argv[0] + " vcf_filename")

in_fname = sys.argv[1]
# Strip a trailing ".vcf" before appending the new suffix
out_fname = (in_fname[:-4] if in_fname.endswith(".vcf") else in_fname) + ".fixed.vcf"
print("Writing to: " + out_fname)

with open(in_fname) as src, open(out_fname, "w") as out:
    for line in src:
        if not line.strip() or line[0] == "#":
            out.write(line)  # copy header and blank lines unchanged
        else:
            fields = line.split("\t")
            alleles = fields[3] + fields[4]
            # keep only records with no slash in REF or ALT
            if "\\" not in alleles and "/" not in alleles:
                out.write("\t".join(fields))
Last edited by bw.; 02-05-2014, 02:16 PM. Reason: Turns out slashes also sometimes appear in the REF field, so added checks for that.
-
I am also facing the +/- issue in the VarScan indel notation. However, I do not use the VCF output; I prefer the regular tabular output of VarScan. Is there a way to change this indel notation so that it is compatible with ANNOVAR? I use VarScan 2.3.6.
-
solved indel vcf format with awk command
Here is an awk command that converts the indel records in your VCF into the correct format.
cat Original_VCF | awk 'BEGIN {OFS="\t"} NR <= 24' > FINAL_VCF && cat Original_VCF | awk 'BEGIN {OFS="\t"} NR >= 25 { if (length($4)>length($5)) {$5 = substr($4, 1, 1)}; print }' >> FINAL_VCF
It uses two awk commands because the second one would also alter the header if it were run on the whole file. The first awk command copies the header (assumed to be 24 lines). The second handles the records from line 25 onward: for each deletion (REF longer than ALT) it sets ALT to the first base of REF.
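For files whose header is not exactly 24 lines, the same ALT rewrite can be sketched in Python by keying on the leading `#` instead of a line count. This is only an illustration of the rule described above; `fix_deletion_alt` and `fix_vcf` are hypothetical names, and the demo record is a shortened version of the one posted in this thread.

```python
import io


def fix_deletion_alt(line):
    """Return a VCF line with ALT set to the first REF base when REF is longer than ALT."""
    if line.startswith("#") or not line.strip():
        return line  # header or blank line: pass through untouched
    fields = line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    if len(ref) > len(alt):
        fields[4] = ref[0]  # same rule as the awk command: ALT = first base of REF
    return "\t".join(fields) + "\n"


def fix_vcf(src, dst):
    """Copy every line from src to dst, rewriting deletion records on the way."""
    for line in src:
        dst.write(fix_deletion_alt(line))


# Small in-memory demonstration (illustrative header and one record)
demo = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t6529182\t.\tTTCC\tTCC\t.\tPASS\tADP=314\n"
)
out = io.StringIO()
fix_vcf(demo, out)
print(out.getvalue(), end="")
```

To run it on real files, `fix_vcf` can be called with two file objects opened on Original_VCF (for reading) and FINAL_VCF (for writing). Like the awk version, it leaves insertions and SNPs untouched.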
-
Originally posted by eeyun: As far as I can tell, it should be ref = TTCC and alt = T
-
We are having the same problem with 2.3.5.
<pre>#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 6529182 . TTCC TCC . PASS ADP=314;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:322:314:178:138:43.67%:1.1101E-50:34:31:88:90:70:68</pre>
-
Originally posted by tommivat: Olivia, +/- issue is fixed in the latest version 2.3.4.
-
Is it fixed in all VarScan tools?
Which one do you use?
I use VarScan.v2.3.4.jar mpileup2indel and I still get some "C +AAAG" or "G -AT" in my vcf output file.
Olivia
-
Olivia, +/- issue is fixed in the latest version 2.3.4.
However, the way variant alleles are coded is still unconventional. The VCF format uses a comma to separate alleles whereas VarScan uses a slash, so I hope this can be fixed in future releases:
Code:
A/C -> A,C
ACG/CG -> ACG,CG
Tommi
-
Hello all,
I realised that the missing qual field had been added in one of the last versions of VarScan. As I use it in a pipeline, I did not update it recently to avoid compatibility problems.
But after a few tests, it seems to me that insertions and deletions are still coded with + and - in the ref and alt columns, which does not match the VCF specification. For example, I think an insertion of a T after a C should be written as C in the ref field and CT in the alt field (and not as +T).
Regards,
Olivia
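The conversion described in this post can be sketched in a few lines of Python. This is a hedged illustration only: `plus_minus_to_vcf` is a hypothetical helper, and it assumes that the reference column holds the single base immediately before the indel, that the bases after `-` directly follow that base, and that POS stays unchanged.

```python
def plus_minus_to_vcf(ref_base, var):
    """Convert a reference base plus a '+SEQ' or '-SEQ' allele into a (REF, ALT) pair."""
    if var.startswith("+"):
        # Insertion: REF is the anchor base, ALT is the anchor base plus the inserted bases.
        return ref_base, ref_base + var[1:]
    if var.startswith("-"):
        # Deletion: REF is the anchor base plus the deleted bases, ALT is the anchor base.
        return ref_base + var[1:], ref_base
    # Anything else (e.g. a SNP allele) is passed through unchanged.
    return ref_base, var


print(plus_minus_to_vcf("C", "+T"))     # ('C', 'CT')
print(plus_minus_to_vcf("C", "+AAAG"))  # ('C', 'CAAAG')
print(plus_minus_to_vcf("G", "-AT"))    # ('GAT', 'G')
```

The last two calls use the "C +AAAG" and "G -AT" examples reported for mpileup2indel elsewhere in this thread.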
-
Hello Dan and others,
First, thanks for the great piece of software! It would spare us some work if somaticFilter supported .vcf files as well. I don't know if it is tedious to implement.
Another thing I wanted to ask, not related to VCF, concerns false-positive filtering (fpfilter.pl). I'm using bam-readcount to produce input for the script, but even if I do it chromosome by chromosome, the files are too big (>50G) and my computer (with 8 GB of memory) just gets jammed when running fpfilter.pl. Is there a way to modify the script to support piping? And please tell me if it already does.
br,
Tommi