Hi all,
It seems that many people have run into compatibility problems with GATK when their input files contain non-ACGT IUPAC codes.
I am currently working with a VCF file that contains thousands of non-ACGT nucleotide codes. I just want to replace these codes with "N" in the REF column, and I tried to do so with the following awk command:
awk '{gsub(/W|K|Y|R|S|M/,"N",$4); print}' Input.vcf > Input_Ns.vcf
However, GATK still could not use the output VCF, and the error message indicated that the file structure had changed:
" ERROR MESSAGE: Line **: there aren't enough columns for ..."
Indeed, in addition to replacing the non-ACGT codes with N in the REF column, every tab in the lines whose REF column was changed has been replaced with a single space (" "), which is what makes GATK fail.
Take the following as an example:
Input.vcf
1[\tab]2[\tab]3[\tab]4[\tab]5
A[\tab]G[\tab]C[\tab]G[\tab]T
A[\tab]T[\tab]C[\tab]S[\tab]Y
T[\tab]K[\tab]L[\tab]T[\tab]A
awk '{gsub(/W|K|Y|R|S|M/,"N",$4); print}' Input.vcf > Input_Ns.vcf
Input_Ns.vcf
1[\tab]2[\tab]3[\tab]4[\tab]5
A[\tab]G[\tab]C[\tab]G[\tab]T
A" "T" "C" "N" "Y
T[\tab]K[\tab]L[\tab]T[\tab]A
Why have all the tabs in the third line been replaced?
How can I achieve what I want?
Is there any tool that resolves the incompatibility between non-ACGT IUPAC codes and GATK?
I'm a newbie in programming and am looking forward to your replies.
Thanks a bunch!
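
A note on the likely cause, with a minimal sketch. When awk modifies a field such as $4, it rebuilds the whole line and joins the fields with the output field separator OFS, which defaults to a single space. Lines in which gsub changes nothing are printed as they were read, which would explain why only the edited lines lose their tabs. Setting both FS and OFS to a tab should keep the columns intact. The file names sample_in.tsv and sample_out.tsv are hypothetical, the rule that skips "#" header lines is an assumption about how a real VCF should be handled, and the character class adds the IUPAC codes B, D, H and V to the six in the original command.

```shell
# Build a small tab-separated sample that mirrors the example above
# (hypothetical file name).
printf '1\t2\t3\t4\t5\nA\tG\tC\tG\tT\nA\tT\tC\tS\tY\nT\tK\tL\tT\tA\n' > sample_in.tsv

# FS and OFS are both set to a tab, so a rebuilt line keeps its tabs.
# Lines starting with "#" (VCF header lines) are printed unchanged.
# Column 4 is REF; ambiguous IUPAC codes in it are replaced with N.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/  { print; next }
     { gsub(/[WKYRSMBDHV]/, "N", $4); print }' sample_in.tsv > sample_out.tsv
```

Only the fourth column is edited, so the K and L in the last sample line are left alone, as in the original command.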