Accessing .vcf.gz files on a Windows platform

xied75 replied

08-13-2012, 02:32 AM
Originally posted by bpb9

Dong, it appears that program is only for Windows users.

Yes, but the thread title mentioned 'Windows' there.

Anyway, don't use R to process text files, that's not what it was designed for.
bpb9 replied

08-10-2012, 09:58 AM
Dong, it appears that program is only for Windows users.

Kennels, your way was SO much faster! Thanks!
bpb9 replied

08-10-2012, 09:51 AM
Thanks for the tips. I will definitely try this and let you know how it goes. I ended up using R yesterday to do this, but it took about an hour just to read in the file. After that I was able to split the files by converting the txt file to a data frame and splitting it into smaller files based on the value in the first column (the chromosome number):

Code:
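# txtfilename holds the table that was read in earlier
colnames(txtfilename) <- c("CHR","POSITION","ID","Allele1Ref","Allele2Var","Ancestral")
dataframe <- data.frame(txtfilename)
# split into one data frame per chromosome (value of the CHR column)
apart <- split(dataframe, dataframe$CHR)
# write each piece to a file named <chromosome>.txt
invisible(lapply(apart, function(x)
  write.table(x, quote=FALSE, row.names=FALSE, col.names=TRUE, file = paste(x$CHR[1], ".txt", sep = ""))))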

This way I have one file named after each chromosome.

The way you guys suggested is probably much faster!
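
For anyone who would rather not load the whole table into memory first, the same per-chromosome split can be sketched as a single streaming pass in Python. This is only a sketch under assumptions: the input is tab-delimited with the chromosome in the first column, lines starting with '#' are header lines to skip, and the function names (open_text, split_by_chromosome) are made up for illustration.

```python
import gzip
import os


def open_text(path):
    """Open a plain-text or gzip-compressed file for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def split_by_chromosome(path, out_dir):
    """Write each data line to <out_dir>/<chromosome>.txt in one pass.

    Assumes a tab-delimited file with the chromosome in column 1.
    Lines starting with '#' and empty lines are skipped.
    Returns a dict mapping chromosome name to the number of lines written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    counts = {}
    try:
        with open_text(path) as src:
            for line in src:
                if line.startswith("#") or not line.strip():
                    continue
                chrom = line.rstrip("\n").split("\t", 1)[0]
                if chrom not in handles:
                    # One output file per chromosome, opened on first sight.
                    out_path = os.path.join(out_dir, chrom + ".txt")
                    handles[chrom] = open(out_path, "w")
                    counts[chrom] = 0
                handles[chrom].write(line)
                counts[chrom] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Calling split_by_chromosome("filename.txt", "by_chr") would leave files such as by_chr/1.txt and by_chr/2.txt. Only one line is held in memory at a time, so the file size should not matter much.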

P.S. How do you guys enter the code in a separate box like that?
xied75 replied

08-10-2012, 03:54 AM
To all who struggle with large files: go to http://mh-nexus.de/en/hxd/ and download HxD. You can use it to open and view a file of any size, at the speed of light.

Best,

dong
Kennels replied

08-09-2012, 05:52 PM
Originally posted by bpb9

Hello, I am also new to the world of genomic datasets, tabix and vcf files. I understand that vcf files are basically giant text files, however due to their size I am unable to open them. I was able to upload my file on Galaxy and view it there, and extract just the columns I wanted, but the file is still huge--something like 20 million rows. If I could split up the file by chromosome, it would be manageable size for working on. However I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!

Hi bpb9,

You can flexibly view or output the contents of your huge file by a series of commands in the terminal. awk, sed, grep, perl one liners would be good choices.

Assuming your .vcf file is tab delimited, you can use awk in the terminal. Open a terminal window on your Mac and 'cd' (change directory) into the directory where your file is saved (press Enter to execute a command), e.g.

Code:

cd /home/user/directorycontainingyourfile

If you do not know which directory you are in when you open your terminal, type

Code:

pwd

It should show you 'where' you are. If you are unsure about directory structures in Unix/Linux, do some googling; it should become apparent pretty quickly.

Then type the following:

Code:

awk ' { if ( $1 == "1") print $0 } ' filename.txt > output.txt

This means:
$1 refers to the first column, so if it equals 1 (the chromosome ID), awk prints every line that meets that condition. $0 means the entire line. You can substitute your chromosome's name for '1', e.g. chr1. You can test other columns by changing $1 to $2, etc. The '>' symbol means the output of the awk command is saved in a file called output.txt.

If you just want to look at what it does first, you can make it show you the lines without outputting to a file.

Code:

awk ' { if ( $1 == "1") print $0 } ' filename.txt | head

Same deal as above, except the '|' (pipe) takes the output from the awk command and gives it to head, which shows the first 10 lines of the output. You can vary the number of lines with the '-n' option, e.g. head -n 20 gives you 20 lines.

There are many parameters to these commands, so if you do some searching you should be able to get pretty flexible with them.
Btw, these commands work for a Linux platform. You might need to adjust on a Mac, but just try it out first.
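
If awk is not at hand, roughly the same filter-then-preview idea can be sketched in Python. The helper names below (matching_lines, head) are made up for this example. One difference from the awk one-liner: this sketch splits on tabs only, whereas awk splits on any whitespace by default.

```python
from itertools import islice


def matching_lines(lines, column, value, sep="\t"):
    """Yield lines whose 1-based `column` equals `value`.

    Roughly what awk's `$N == "value"` test does, but splitting on `sep`
    (a tab by default) instead of awk's default whitespace splitting.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= column and fields[column - 1] == value:
            yield line


def head(lines, n=10):
    """Return the first `n` items of an iterable, like the `head` command."""
    return list(islice(lines, n))
```

With an open file handle f, head(matching_lines(f, 1, "1"), 20) would give the first 20 lines for chromosome 1 without reading the rest of the file.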

hope that helps.
bpb9 replied

08-09-2012, 06:55 AM
Split VCF by chromosome?

Hello, I am also new to the world of genomic datasets, tabix and vcf files. I understand that vcf files are basically giant text files, however due to their size I am unable to open them. I was able to upload my file on Galaxy and view it there, and extract just the columns I wanted, but the file is still huge--something like 20 million rows. If I could split up the file by chromosome, it would be manageable size for working on. However I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!
JohnK@Genome_Quest replied

06-06-2011, 07:51 PM
Originally posted by bnfoguy

Hello,

I'm a first timer to bioinformatics research and have recently been working on a project involving genetic variations. I am supposed to use the latest 1k genomes release file which is "ALL.2of4intersection.2010084.genotypes.vcf.gz". I have downloaded the file which is about 61.2GB but am facing trouble in extracting its contents. I would really appreciate some guidance in this matter.

I used to use Cygwin when I used Windows. Once you figure out how Cygwin's files are setup and tied in to Windows, which should be fairly easy to figure out, you can figure out how to navigate to whatever directory in your Windows system that is storing your data. Then you can use your gunzip, and all those other 'great' 'nix binaries:

Cygwin

http://cygwin.com/

Once you've done this, you can also use this basic perl command line (CML) template to parse whatever columns:

< file_name perl -e 'while(<>){ $line = $_; ($var1, $var2, $var3) = split("\t", $_); print "$var1\n"; }' > new_file

You can modify this to get the job done, and once you figure out the basic syntax from above I'm sure you'll be a perl CML guru.

Last edited by JohnK@Genome_Quest; 06-06-2011, 07:54 PM.
BAMseek replied

06-06-2011, 05:33 PM
Hi Bnfoguy,
The .tbi files are external indexes that help locate regions within the .gz files. It uses a binning scheme similar to the one used for quickly doing range queries on BAM files. The Tabix program can create and use the indexes to do range queries, using commands like I showed in my earlier post. Looks like you have to build the program from source and that would most easily be done on Linux or Mac. On Windows, you might have the most luck trying to work with the TabixReader.java file that comes in the Tabix download, but that would take some programming skills to create a working program, I would think.

The .tbi files are not human readable and would really only need to be looked at if you were interested in understanding how the binning scheme works. The Tabix program would take care of building and using those indexes for you. In case you are interested in how the index works, here is the .tbi schema tabix.pdf
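
For the curious, the start of a .tbi file can be peeked at with a few lines of Python. This is a sketch based on my reading of the index layout in the Tabix format description, so please check the field order against tabix.pdf before relying on it. The assumptions are: the index is itself gzip/BGZF-compressed, it starts with the magic bytes "TBI\1", and eight little-endian 32-bit integers follow before the list of sequence names. The function name read_tbi_header is made up.

```python
import gzip
import struct


def read_tbi_header(path):
    """Read the leading header fields of a Tabix (.tbi) index.

    Assumed layout (to be checked against the Tabix format description):
    magic "TBI\\1", then eight little-endian int32 values, then `l_nm`
    bytes of NUL-terminated sequence names.
    """
    with gzip.open(path, "rb") as handle:
        magic = handle.read(4)
        if magic != b"TBI\x01":
            raise ValueError("file does not start with the TBI magic bytes")
        values = struct.unpack("<8i", handle.read(32))
        n_ref, fmt, col_seq, col_beg, col_end, meta, skip, l_nm = values
        raw_names = handle.read(l_nm)
    names = [n.decode("ascii") for n in raw_names.split(b"\x00") if n]
    return {
        "n_ref": n_ref,        # number of reference sequences
        "format": fmt,         # preset code used when indexing
        "col_seq": col_seq,    # column holding the sequence name
        "col_beg": col_beg,    # column holding the start position
        "col_end": col_end,    # column holding the end position
        "meta": chr(meta),     # leading character of header lines
        "skip": skip,          # number of lines skipped at the top
        "names": names,
    }
```

This only reads the header; the bins and linear index that follow are what Tabix actually uses for the range queries.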

Hope that at least partially answers your question. I know that is a lot to bite off.

BAMseek
bnfoguy replied

06-06-2011, 01:27 PM
Hello BAMseek,

Thanks for your suggestions. I will try these and let you know if I have any success. Would you know of a method to read the .tbi file? Like opening it with notepad or any other windows program.

Thank you again,

Bnfoguy
BAMseek replied

06-05-2011, 10:17 PM
Hi bnfoguy,

I added VCF support to the BAMseek large file viewer. You can find the download here. You should be able to open both the uncompressed text file and the compressed gz file. Let me know how it works out for you.

While adding the VCF support, I found out some additional information that you may find useful. Those .gz files on 1000 genomes are actually BGZF-compressed files - you can decompress them as usual using gzip but you can also jump to locations within the file and begin decompressing from there which allows you to extract chunks from the file without decompressing the whole thing.
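
Because these files decompress with ordinary gzip tools, one way to look inside without ever writing the huge uncompressed file to disk is to stream it line by line. A minimal Python sketch, assuming the standard gzip module and a made-up function name (iter_data_lines); it reads the file as a series of gzip blocks from start to end and does not use the random-access feature described above.

```python
import gzip


def iter_data_lines(path):
    """Yield the non-header lines of a gzip- or BGZF-compressed VCF.

    The file is decompressed on the fly, one line at a time, so nothing
    is written to disk and memory use stays small. Header lines (those
    starting with '#') are skipped.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield line
```

For example, looping over iter_data_lines("file.vcf.gz") and stopping after a few lines is a quick way to get familiar with the columns.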

You may have noticed that those files also have an associated .tbi file in the 1000 genomes repository. These are Tabix index files which allow you to extract all features overlapping a genomic region, without requiring you to download the entire file locally first.

For example,

tabix ftp://ftp.1000genomes.ebi.ac.uk/vol1...4.sites.vcf.gz 1:2,000,000-2,100,000

would query the file on the ftp server and give you back all features on chromosome 1 between 2 million and 2.1 million. A nice thread on this subject can be found here.
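
If Tabix cannot be built (on Windows, say) and the file is already downloaded, a much slower fallback is to scan the compressed file from the start and keep the lines in the region. To be clear, the sketch below is not how Tabix works: it uses no index and reads every line. It also tests only the POS column, while Tabix reports features overlapping the region. The function names (parse_region, scan_region) are made up, and the region string is assumed to look like the one in the tabix command above.

```python
import gzip


def parse_region(region):
    """Parse 'chrom:start-end' into (chrom, start, end).

    Commas inside the numbers are allowed, e.g. '1:2,000,000-2,100,000'.
    """
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    return chrom, start, end


def scan_region(path, region):
    """Yield VCF data lines whose POS (column 2) lies in the region.

    Linear scan over a gzip/BGZF-compressed file; both ends inclusive.
    """
    chrom, start, end = parse_region(region)
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.split("\t", 2)
            if len(fields) < 2 or fields[0] != chrom:
                continue
            if start <= int(fields[1]) <= end:
                yield line
```

For repeated queries on a large file the index-based Tabix route will be far quicker; this is only meant for a one-off look.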

Hope that helps!
BAMseek replied

06-01-2011, 01:32 PM
Hi bnfoguy,

Not quite a direct answer to your question, but I thought I would mention a tool I have been working on that addresses a similar issue of trying to view very large alignment files. It is called BAMseek and is available at http://code.google.com/p/bamseek/ . Currently, it works for BAM and SAM files but I could easily extend that to work on VCF files. This would allow you to at least view the file, get familiar with its contents, and even do some copying from the file. For more complex needs or repetitive tasks (such as extracting a large number of regions), then a knowledge of command line and scripting is always useful - such as perl or python. As mentioned above, vcftools might have what you need too.
swbarnes2 replied

06-01-2011, 11:57 AM
According to some googling, zgrep will work.
RDW replied

06-01-2011, 10:35 AM
Originally posted by bnfoguy

Is there a way that I could use specific command line prompts or programs to access specific portions of it?

You could always use 'less':

less (Unix) - Wikipedia

http://en.wikipedia.org/wiki/Less_%28Unix%29

This is standard on Linux and other Unix-like systems, and there are versions for Windows that a quick search will find. But to do anything sensible with this file, you're going to need a program that knows how to parse it and extract what you need (this might, of course, be something simple you could write yourself, or maybe one of the utilities from the vcftools package).
bnfoguy replied

06-01-2011, 07:44 AM
Thanks for your reply,

I have manually extracted the file but as you said the traditional text editors like Notepad or Word are having trouble opening the file due to its size. Is there a way that I could use specific command line prompts or programs to access specific portions of it? Linux is new to me and I am still learning the commands. Would python snippets be of any help?
ulz_peter replied

05-31-2011, 10:15 PM
As far as I know there was a new release of 1000genome SNP calls:

ftp://ftp-trace.ncbi.nih.gov/1000gen...hase1_release/

However, these files are compressed with GNU zip (gzip). I found a link to the Windows version:

Gzip for Windows

http://gnuwin32.sourceforge.net/packages/gzip.htm


You should then be able to decompress the file and view it using a text editor of your choice (as .vcf files are nothing but plain text files).

Nevertheless:

1) Uncompressing large files takes a very, very long time. If you use a conventional PC this could be in the hours range.

2) I don't know if you could actually open the resulting .vcf file as it is extremely large (62GB is the compressed version!)

3) Are you sure you need the genotypes file? I guess some participants of the 1000genomes project will know better, but as far as I know the genotypes files contain all the individual genotypes. That's what makes them that big. The .sites files contain a more condensed version that should include the same sites but no individual genotypes.

Hope that helps
