  • Accessing .vcf.gz files on a Windows platform

    Hello,

    I'm new to bioinformatics research and have recently been working on a project involving genetic variation. I am supposed to use the latest 1000 Genomes release file, "ALL.2of4intersection.2010084.genotypes.vcf.gz". I have downloaded the file, which is about 61.2GB, but am having trouble extracting its contents. I would really appreciate some guidance in this matter.

  • #2
    As far as I know, there is a newer release of the 1000 Genomes SNP calls:

    ftp://ftp-trace.ncbi.nih.gov/1000gen...hase1_release/

    However, these files are compressed with the GNU zip program (gzip). I found a link for the Windows version:


    You should then be able to decompress the file and view it using a text editor of your choice (as .vcf files are nothing but plain text files).

    Nevertheless:

    1) Uncompressing a file this large takes a very long time. On a conventional PC it could take hours.

    2) I don't know whether you will actually be able to open the resulting .vcf file, as it will be extremely large (62GB is the compressed size!).

    3) Are you sure you need the genotypes file? Participants in the 1000 Genomes project will know better, but as far as I know the genotypes file contains every individual's genotypes, which is what makes it so big. The .sites file is a more condensed version: it should include the same sites but no individual genotypes.

    Hope that helps
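
    Points 1 and 2 can be sidestepped by reading the compressed file as a stream instead of unpacking it to disk. A minimal Python sketch, assuming only the standard library; the function name `head_gz` is made up for illustration:

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file.

    Only as much of the archive as is needed gets decompressed, so this
    returns quickly even when the file is tens of gigabytes.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

    Calling `head_gz("ALL.2of4intersection.2010084.genotypes.vcf.gz", 20)` would return the first 20 lines, which is usually enough to see the header and column layout.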



    • #3
      Thanks for your reply,

      I have manually extracted the file but, as you said, traditional text editors like Notepad or Word have trouble opening it because of its size. Is there a way that I could use specific command line prompts or programs to access specific portions of it? Linux is new to me and I am still learning the commands. Would Python snippets be of any help?



      • #4
        Originally posted by bnfoguy
        Is there a way that I could use specific command line prompts or programs to access specific portions of it?
        You could always use 'less':



        This is standard on Linux and other Unix-like systems, and there are versions for Windows that a quick search will find. But to do anything sensible with this file, you're going to need a program that knows how to parse it and extract what you need (this might, of course, be something simple you could write yourself, or maybe one of the utilities from the vcftools package).
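
        As an idea of how simple a self-written parser can be, here is a Python sketch that turns VCF data lines into named fields. It assumes tab-delimited lines with the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the names `Variant` and `parse_vcf` are invented for this example.

```python
from collections import namedtuple

# The eight fixed VCF columns; any format/genotype columns that follow
# are kept together, unparsed, in `rest`.
Variant = namedtuple("Variant", "chrom pos id ref alt qual filter info rest")


def parse_vcf(lines):
    """Yield a Variant for every data line, skipping '#' header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
        yield Variant(chrom, int(pos), vid, ref, alt, qual, flt, info, fields[8:])
```

        Because it is a generator, it can be fed an open file handle and will never hold more than one line in memory.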



        • #5
          According to some googling, zgrep will work.
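
          For anyone without zgrep on Windows, the same idea (search the compressed file without unpacking it) can be sketched in Python. This is only an approximation of what zgrep does, and `grep_gz` is a made-up name:

```python
import gzip
import re


def grep_gz(path, pattern, limit=None):
    """Yield lines of a gzip-compressed text file that match a regex.

    `limit` optionally stops the search after that many matches, which
    is handy for a quick look at a very large file.
    """
    regex = re.compile(pattern)
    found = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if regex.search(line):
                yield line.rstrip("\n")
                found += 1
                if limit is not None and found >= limit:
                    return
```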



          • #6
            Hi bnfoguy,

            Not quite a direct answer to your question, but I thought I would mention a tool I have been working on that addresses a similar issue of trying to view very large alignment files. It is called BAMseek and is available at http://code.google.com/p/bamseek/ . Currently, it works for BAM and SAM files but I could easily extend that to work on VCF files. This would allow you to at least view the file, get familiar with its contents, and even do some copying from the file. For more complex needs or repetitive tasks (such as extracting a large number of regions), then a knowledge of command line and scripting is always useful - such as perl or python. As mentioned above, vcftools might have what you need too.



            • #7
              Hi bnfoguy,

              I added VCF support to the BAMseek large file viewer. You can find the download here. You should be able to open both the uncompressed text file and the compressed gz file. Let me know how it works out for you.

              While adding the VCF support, I found out some additional information that you may find useful. Those .gz files on 1000 genomes are actually BGZF-compressed files - you can decompress them as usual using gzip but you can also jump to locations within the file and begin decompressing from there which allows you to extract chunks from the file without decompressing the whole thing.
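
              The reason this works is that the file is a series of independently compressed blocks. The Python sketch below imitates that with plain concatenated gzip members standing in for BGZF blocks (real BGZF additionally stores each block's size in a gzip extra field, which is not reproduced here). The helper names are invented for illustration.

```python
import gzip
import zlib


def build_blocks(chunks):
    """Compress each chunk as its own gzip member.

    Returns the concatenated bytes and the offset at which each member starts.
    """
    data, offsets = b"", []
    for chunk in chunks:
        offsets.append(len(data))
        data += gzip.compress(chunk)
    return data, offsets


def read_block(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decoding stops at the end of this member; later members are ignored.
    return decoder.decompress(data[offset:])
```

              An ordinary gzip reader still sees one valid file and returns all the chunks in order, while `read_block` can start in the middle as long as it is given a block boundary.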

              You may have noticed that those files also have an associated .tbi file in the 1000 genomes repository. These are Tabix index files which allow you to extract all features overlapping a genomic region, without requiring you to download the entire file locally first.

              For example,
              would query the file on the ftp server and give you back all features on chromosome 1 between 2 million and 2.1 million. A nice thread on this subject can be found here.
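
              To make clear what such a region query returns, here is a plain linear scan in Python. It is not tabix and does not use the .tbi index: it reads every line, which is exactly the cost the index avoids. The function name `in_region` is hypothetical, and positions are treated as 1-based and inclusive.

```python
def in_region(lines, chrom, start, end):
    """Yield VCF data lines on `chrom` whose POS lies in [start, end]."""
    for line in lines:
        if line.startswith("#"):
            continue  # header lines carry no position
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # blank or malformed line
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line.rstrip("\n")
```

              For chromosome 1 between 2 million and 2.1 million, the call would be `in_region(handle, "1", 2000000, 2100000)`.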

              Hope that helps!



              • #8
                Hello BAMseek,

                Thanks for your suggestions. I will try these and let you know if I have any success. Would you know of a method to read the .tbi file? Like opening it with notepad or any other windows program.

                Thank you again,

                Bnfoguy



                • #9
                  Hi Bnfoguy,
                  The .tbi files are external indexes that help locate regions within the .gz files. They use a binning scheme similar to the one used for fast range queries on BAM files. The Tabix program can create and use the indexes to do range queries, using commands like the one I showed in my earlier post. It looks like you have to build the program from source, which is most easily done on Linux or Mac. On Windows, you might have the most luck working with the TabixReader.java file that comes in the Tabix download, but I would think it takes some programming skill to turn that into a working program.

                  The .tbi files are not human readable and would really only need to be looked at if you were interested in understanding how the binning scheme works. The Tabix program would take care of building and using those indexes for you. In case you are interested in how the index works, here is the .tbi schema tabix.pdf
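
                  For the curious, the binning idea can be shown in a few lines. The sketch below is a Python port of the `reg2bin` function published in the SAM/BAM specification for BAM indexing; since the scheme here is described as similar, take it as an illustration of the principle and not as a reading of the .tbi file itself. Coordinates are 0-based and half-open.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains the interval [beg, end).

    Bins are nested in levels of 16 kbp, 128 kbp, 1 Mbp, 8 Mbp and 64 Mbp,
    with bin 0 covering the whole 512 Mbp range.
    """
    end -= 1  # work with the last base actually covered
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0
```

                  A single SNP lands in one of the small 16 kbp bins, while a wide feature is filed under a larger bin. A range query then only has to inspect the bins that can overlap the requested region.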

                  Hope that at least partially answers your question. I know that is a lot to bite off.

                  BAMseek



                  • #10
                    Originally posted by bnfoguy
                    Hello,

                    I'm a first timer to bioinformatics research and have recently been working on a project involving genetic variations. I am supposed to use the latest 1k genomes release file which is "ALL.2of4intersection.2010084.genotypes.vcf.gz". I have downloaded the file which is about 61.2GB but am facing trouble in extracting its contents. I would really appreciate some guidance in this matter.
                    I used to use Cygwin when I used Windows. Once you figure out how Cygwin's files are set up and tied in to Windows, which should be fairly easy, you can navigate to whichever directory on your Windows system is storing your data. Then you can use gunzip and all those other 'great' 'nix binaries:



                    Once you've done this, you can also use this basic perl command line (CML) template to parse whatever columns:

                    perl -e 'while (<>) { chomp; my ($var1, $var2, $var3) = split /\t/; print "$var1\n"; }' < file_name > new_file

                    You can modify this to get the job done, and once you figure out the basic syntax from above I'm sure you'll be a perl CML guru.
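
                    For comparison, here is a rough Python counterpart of the same column-picking template. It is only a sketch: `cut_columns` is an invented name, and column numbers are 0-based, so column 0 plays the role of $var1 above.

```python
import sys


def cut_columns(lines, columns, out=sys.stdout):
    """Write the chosen 0-based tab-separated columns of each line to `out`.

    Columns that a line does not have are silently skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        picked = [fields[i] for i in columns if i < len(fields)]
        out.write("\t".join(picked) + "\n")
```

                    Passing an open file as `lines` and another as `out` gives the same effect as the `<` and `>` redirections in the perl version.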
                    Last edited by JohnK@Genome_Quest; 06-06-2011, 07:54 PM.



                    • #11
                      Split VCF by chromosome?

                      Hello, I am also new to the world of genomic datasets, tabix and vcf files. I understand that vcf files are basically giant text files, however due to their size I am unable to open them. I was able to upload my file on Galaxy and view it there, and extract just the columns I wanted, but the file is still huge--something like 20 million rows. If I could split up the file by chromosome, it would be manageable size for working on. However I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!



                      • #12
                        Originally posted by bpb9
                        Hello, I am also new to the world of genomic datasets, tabix and vcf files. I understand that vcf files are basically giant text files, however due to their size I am unable to open them. I was able to upload my file on Galaxy and view it there, and extract just the columns I wanted, but the file is still huge--something like 20 million rows. If I could split up the file by chromosome, it would be manageable size for working on. However I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!
                        Hi bpb9,

                        You can flexibly view or output the contents of your huge file by a series of commands in the terminal. awk, sed, grep, perl one liners would be good choices.

                        Assuming your .vcf file is tab delimited, you can use awk in the terminal. Open a terminal window on your mac, and 'cd' (change directory) into the directory where your file is saved. (press enter to execute a command)
                        e.g.
                        Code:
                        cd /home/user/directorycontainingyourfile
                        If you do not know which directory you are in when you open your terminal, type

                        Code:
                        pwd
                        It should show you 'where' you are. If you are unsure about directory structures in unix/linux, do some googling, it should become apparent pretty quick.

                        Then type the following:

                        Code:
                        awk ' { if ( $1 == "1") print $0 } ' filename.txt > output.txt
                        This means:
                        $1 refers to the first column (the chromosome ID), so if it equals 1, awk outputs the whole line. $0 means the entire line. You can substitute the name of your chromosome for '1', e.g. chr1. You can test other columns by changing $1 to $2, etc. The '>' symbol saves the output of the awk command in a file called output.txt

                        If you just want to look at what it does first, you can make it show you the lines without outputting to a file.

                        Code:
                        awk ' { if ( $1 == "1") print $0 } ' filename.txt | head
                        Same deal as above, except the '|' (pipe) takes the output of the awk command and gives it to head, which shows the first 10 lines of the output. You can vary the number of lines with the '-n' option, e.g. head -n 20 gives you 20 lines.

                        There are many parameters to these commands, so if you do some searching you should be able to build pretty flexible queries.
                        Btw, these commands work on a Linux platform. You might need to adjust them on a Mac, but just try it out first.
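
                        The awk command above pulls out one chromosome per run. If you want every chromosome in its own file after a single pass, a short Python sketch could look like this. It assumes tab-delimited lines with the chromosome in the first column and drops '#' header lines; the function name `split_by_chrom` and the `<chromosome>.txt` naming are choices made for this example.

```python
import os


def split_by_chrom(lines, out_dir):
    """Write each data line to <out_dir>/<chromosome>.txt in a single pass.

    Returns the sorted list of chromosome names that were seen.
    """
    handles = {}
    try:
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                # One output file per chromosome, opened on first sight.
                handles[chrom] = open(os.path.join(out_dir, chrom + ".txt"), "w")
            handles[chrom].write(line if line.endswith("\n") else line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

                        Used as `with open("filename.txt") as fh: split_by_chrom(fh, ".")`, it keeps only one line in memory at a time, so file size is not a problem.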

                        hope that helps.



                        • #13
                          To all who struggle with large files: go to http://mh-nexus.de/en/hxd/ and download HxD. You can use it to open and view a file of any size, at the speed of light.

                          Best,

                          dong



                          • #14
                            Thanks for the tips. I will definitely try this and let you know how it goes. I ended up using R yesterday to do this, but it took about an hour just to read in the file. After that I was able to split the files by converting the txt file to a data frame and splitting it into smaller files based on the value in the first column (the chromosome number):

                            Code:
                            # txtfilename: the table already read into R (that read was the slow step)
                            colnames(txtfilename) <- c("CHR", "POSITION", "ID", "Allele1Ref", "Allele2Var", "Ancestral")
                            dataframe <- data.frame(txtfilename)
                            # Split the rows by chromosome, then write one <chromosome>.txt file per group
                            apart <- split(dataframe, dataframe$CHR)
                            invisible(lapply(apart, function(x)
                              write.table(x, quote = FALSE, row.names = FALSE, col.names = TRUE, file = paste(x$CHR[1], ".txt", sep = ""))))

                            This way I have one file named after each chromosome.

                            The way you guys suggested is probably much faster!

                            P.S. How do you guys enter the code in a separate box like that?



                            • #15
                              Dong, it appears that program is only for Windows users.

                              Kennels, your way was SO much faster! Thanks!
