Varscan VCF output bugs(?)

coco90417 replied

02-27-2014, 10:24 AM
Varscan vcf output for indel

Hi,

I am also encountering issues with vcf output of indels. I have got indels that look like this:

1 984171 . CAG AG .
1 1588744 . AGCG GCG .

I checked the genome browser for the context of both mutations (http://genome.ucsc.edu/cgi-bin/hgTra...A984170-984180 and http://genome.ucsc.edu/cgi-bin/hgTra...588740-1588750). It seems that the first one is supposed to be a simple deletion of the first base and should look like this:

1 984170 . GC G .

And the second one can be either represented by a block substitution that looks like this:

1 1588743 . AAG AG .

or a deletion (if you align the deletion to the left) that looks like this:

1 1588742 . GA G .

So I do not know whether I did something wrong or whether VarScan has a different VCF output format for indels.
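
For illustration, here is a minimal Python sketch of the re-anchoring described above, reading the REF/ALT pair literally as printed. The function name `anchor_indel` and its `prev_base` argument are hypothetical: the base preceding the record has to be looked up in the reference genome by the caller, and no left-alignment through repeats is attempted.

```python
def anchor_indel(pos, ref, alt, prev_base):
    """Trim the suffix shared by REF and ALT, then re-anchor on the preceding base.

    prev_base is the reference base at pos - 1 (an assumption: the caller
    must fetch it from the reference genome).
    """
    # Drop trailing bases that are identical in both alleles.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # VCF does not allow an empty allele, so pad with the preceding base.
    if not ref or not alt:
        pos, ref, alt = pos - 1, prev_base + ref, prev_base + alt
    return pos, ref, alt


# First record from the post, with G as the base at position 984170.
print(anchor_indel(984171, "CAG", "AG", "G"))
```

With `prev_base="A"`, the second record comes out as `AA` to `A` at 1588743; shifting it further left to `GA` to `G` at 1588742 would need the surrounding reference sequence.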

Please help me. Many many thanks.
IsmailM replied

02-11-2014, 10:35 AM
If you are using VarScan mpileup2snp or mpileup2indel, why does the QUAL column not have a number in it?

bw. replied

02-04-2014, 03:26 PM

I'm also seeing the slashes with VarScan v2.3.6
I wrote this script to convert the slashes to commas:

Code:

import sys

if len(sys.argv) < 2:
    sys.exit("Usage: " + sys.argv[0] + "  vcf_filename")

in_fname = sys.argv[1]
# Strip a trailing ".vcf" (if present) before appending the new suffix.
out_fname = (in_fname[:-4] if in_fname.endswith(".vcf") else in_fname) + ".fixed.vcf"
print("Writing to: " + out_fname)
with open(in_fname) as vcf, open(out_fname, "w") as out:
    for line in vcf:
        if line.startswith("#") or not line.strip():
            # Header and blank lines are copied through unchanged.
            out.write(line)
        else:
            fields = line.split("\t")
            fields[3] = fields[3].replace("/", ",").replace("\\", ",")   # convert slashes in the REF field to commas
            fields[4] = fields[4].replace("/", ",").replace("\\", ",")   # convert slashes in the ALT field to commas
            out.write("\t".join(fields))

To use, just copy-paste into a file (let's say script.py) and run:

python script.py file.vcf

Also, this version of the script just removes the vcf records with slashes:

Code:

import sys

if len(sys.argv) < 2:
    sys.exit("Usage: " + sys.argv[0] + "  vcf_filename")

in_fname = sys.argv[1]
# Strip a trailing ".vcf" (if present) before appending the new suffix.
out_fname = (in_fname[:-4] if in_fname.endswith(".vcf") else in_fname) + ".fixed.vcf"
print("Writing to: " + out_fname)
with open(in_fname) as vcf, open(out_fname, "w") as out:
    for line in vcf:
        if line.startswith("#") or not line.strip():
            # Header and blank lines are copied through unchanged.
            out.write(line)
        else:
            fields = line.split("\t")
            alleles = fields[3] + fields[4]   # REF and ALT together
            # Keep only records with no slash in either field.
            if "\\" not in alleles and "/" not in alleles:
                out.write(line)

Last edited by bw.; 02-05-2014, 02:16 PM. Reason: Turns out slashes also sometimes appear in the REF field, so added checks for that.


rnahar replied

12-23-2013, 08:33 PM
I am also facing the +/- issue in the VarScan indel notation. However, I do not use the VCF output; I prefer the regular tabular output of VarScan. Is there a way to change this indel notation so that it is compatible with ANNOVAR? I use VarScan 2.3.6.
IsmailM replied

07-25-2013, 02:17 PM
solved indel vcf format with awk command

Here is an awk command that can change your indel vcf format into the correct format.

cat Original_VCF | awk 'BEGIN {OFS="\t"} NR <= 24' > FINAL_VCF && cat Original_VCF | awk 'BEGIN {OFS="\t"} NR >= 25 { if (length($4)>length($5)) {$5 = substr($4, 1, 1)}; print }' >> FINAL_VCF

It uses two awk commands because the second command would alter the header if it were run on the whole file. The first awk command copies the header (assumed to be 24 lines); the second rewrites the indel records from line 25 onward into the correct indel format.
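
If the header is not exactly 24 lines, the same rule can be applied by testing for the leading "#" instead of counting lines. The following is only a sketch in Python, assuming tab-delimited records; the function name `fix_deletion_alt` is made up for this example and is not part of VarScan.

```python
def fix_deletion_alt(line):
    """Return a VCF line with ALT reduced to the first REF base for deletions."""
    if line.startswith("#"):
        return line  # header lines pass through untouched
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return line  # not a variant record; leave it alone
    ref, alt = fields[3], fields[4]
    if len(ref) > len(alt):
        fields[4] = ref[0]  # same rule as the awk command above
    return "\t".join(fields) + "\n"


# One record in the style reported in this thread (columns truncated).
print(fix_deletion_alt("chr1\t6529182\t.\tTTCC\tTCC\t.\tPASS\n"), end="")
```

Like the awk version, this only touches records where REF is longer than ALT, so insertions and SNPs are written back unchanged.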
eeyun replied

05-15-2013, 10:39 AM
Originally posted by eeyun:

As far as I can tell, it should be ref = TTCC and alt = T

Attachment included here to show the variant in question.
Attached Files

varscan problem.PNG (1.8 KB, 84 views)
eeyun replied

05-15-2013, 10:38 AM
Originally posted by eeyun:

We are having the same problem with 2.3.5

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 6529182 . TTCC TCC . PASS ADP=314;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:322:314:178:138:43.67%:1.1101E-50:34:31:88:90:70:68

As far as I can tell, it should be ref = TTCC and alt = T
eeyun replied

05-15-2013, 10:32 AM
We are having the same problem with 2.3.5

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 6529182 . TTCC TCC . PASS ADP=314;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:255:322:314:178:138:43.67%:1.1101E-50:34:31:88:90:70:68
sophiespo replied

05-15-2013, 02:55 AM
Originally posted by NestorNotabilis:

To add to Olivia's comment, I'm also using 2.3.4 and still getting the +/- issue when using mpileup2snp.

I am as well when using the somatic/processSomatic functions.

ANNOVAR doesn't like this. Can anyone help?
NestorNotabilis replied

02-05-2013, 03:38 AM
Originally posted by tommivat:

Olivia, +/- issue is fixed in the latest version 2.3.4.

To add to Olivia's comment, I'm also using 2.3.4 and still getting the +/- issue when using mpileup2snp.
tommivat replied

02-01-2013, 07:17 AM
Originally posted by oliviajm:

Is it fixed in all VarScan tools?
Which one do you use?

I use VarScan.v2.3.4.jar mpileup2indel and I still get some "C +AAAG" or "G -AT" in my vcf output file.

That explains it. I use somatic for tumor-normal pairs.

Tommi
oliviajm replied

02-01-2013, 07:11 AM
Is it fixed in all VarScan tools?

Which one do you use?

I use VarScan.v2.3.4.jar mpileup2indel and I still get some "C +AAAG" or "G -AT" in my vcf output file.

Olivia
tommivat replied

02-01-2013, 05:19 AM
Olivia, +/- issue is fixed in the latest version 2.3.4.

However, the way variant alleles are coded is still unconventional. The VCF format uses a comma to separate alleles, whereas VarScan uses a slash, so I hope this can be fixed in future releases:

Code:

A/C -> A,C
ACG/CG -> ACG,CG

br,
Tommi
oliviajm replied

02-01-2013, 05:00 AM
Hello all,

I realised that the missing qual field had been added in one of the last versions of VarScan. As I use it in a pipeline, I did not update it recently to avoid compatibility problems.

But after a few tests, it seems to me that insertions and deletions are still coded with + and - in the REF and ALT columns, which does not match the VCF specification. For example, I think an insertion of a T after a C should be written as C in the REF field and CT in the ALT field (and not as +T).
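
A minimal Python sketch of that conversion, for illustration only. It assumes the first value is the single reference base at POS and the second is VarScan's variant string; the function name `varscan_to_vcf_alleles` is hypothetical.

```python
def varscan_to_vcf_alleles(ref_base, var):
    """Convert VarScan's +/- indel notation into VCF REF and ALT alleles."""
    if var.startswith("+"):
        # Insertion: the inserted bases follow the reference base in ALT.
        return ref_base, ref_base + var[1:]
    if var.startswith("-"):
        # Deletion: the deleted bases follow the reference base in REF.
        return ref_base + var[1:], ref_base
    return ref_base, var  # anything without +/- is returned unchanged


print(varscan_to_vcf_alleles("C", "+T"))   # insertion of T after C
print(varscan_to_vcf_alleles("G", "-AT"))  # deletion of AT after G
```

POS stays the same in both cases, because the reference base is kept as the anchoring base of the record.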

Regards,

Olivia
tommivat replied

01-31-2013, 10:01 AM
Hello Dan and others,

First, thanks for the great piece of software! It would spare us some work if somaticFilter supported .vcf files as well. I don't know whether that would be tedious to implement.

Another thing I wanted to ask, not related to VCF, concerns false-positive filtering (fpfilter.pl). I'm using bam-readcount to produce input for the script, but even if I do it chromosome by chromosome, the files are too big (>50 GB) and my computer (with 8 GB of memory) just gets jammed when running fpfilter.pl. Is there a way to modify the script to support piping? And please tell me if it already does.

br,
Tommi
