  • xied75
    replied
    Originally posted by bpb9:
    Dong, it appears that program is only for Windows users.
    Yes, but the thread title mentioned 'Windows' there.

    Anyway, don't use R to process text files, that's not what it was designed for.


  • bpb9
    replied
    Dong, it appears that program is only for Windows users.

    Kennels, your way was SO much faster! Thanks!


  • bpb9
    replied
    Thanks for the tips. I will definitely try this and let you know how it goes. I ended up using R yesterday to do this, but it took about an hour just to read in the file. After that I was able to split the files by converting the txt file to a data frame and splitting it into smaller files based on the value in the first column (the chromosome number):

    Code:
    # txtfilename is the table already read into R; name its six columns
    colnames(txtfilename) <- c("CHR", "POSITION", "ID", "Allele1Ref", "Allele2Var", "Ancestral")
    dataframe <- data.frame(txtfilename)
    # Split the rows by chromosome, then write each piece to "<CHR>.txt"
    apart <- split(dataframe, dataframe$CHR)
    lapply(apart, function(x)
      write.table(x, quote = FALSE, row.names = FALSE, col.names = TRUE, file = paste(x$CHR[1], ".txt", sep = "")))

    This way I have one file named after each chromosome.
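    Since most of the time above went into reading the whole file into memory, here is a minimal Python sketch of the same split done in a single streaming pass. The function name `split_by_first_column` is made up for illustration, and skipping lines that start with '#' is an assumption about VCF-style headers. Unlike the R version, it does not write a column-name row to each output file.

```python
import os


def split_by_first_column(path, out_dir, delimiter="\t"):
    """Append each data line of `path` to <out_dir>/<first column>.txt in one pass."""
    handles = {}
    try:
        with open(path) as src:
            for line in src:
                # Assumption: '#' lines are VCF-style headers; blank lines are ignored.
                if line.startswith("#") or not line.strip():
                    continue
                key = line.split(delimiter, 1)[0].rstrip("\n")
                if key not in handles:
                    handles[key] = open(os.path.join(out_dir, key + ".txt"), "w")
                handles[key].write(line)
    finally:
        # Close every per-chromosome output file, even if reading failed midway.
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

    Only one line is held in memory at a time, and one output file stays open per distinct value in the first column, which is a small number for chromosomes.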

    The way you guys suggested is probably much faster!

    P.S. How do you guys enter the code in a separate box like that?


  • xied75
    replied
    To all who struggle with large files: go to http://mh-nexus.de/en/hxd/ and download HxD. You can use it to open and view files of any size, at the speed of light.

    Best,

    dong


  • Kennels
    replied
    Originally posted by bpb9:
    Hello, I am also new to the world of genomic datasets, tabix and VCF files. I understand that VCF files are basically giant text files, but because of their size I am unable to open them. I was able to upload my file to Galaxy and view it there, and to extract just the columns I wanted, but the file is still huge, something like 20 million rows. If I could split the file by chromosome, each piece would be a manageable size to work on. However, I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!
    Hi bpb9,

    You can flexibly view or output the contents of your huge file with a series of commands in the terminal. awk, sed, grep and Perl one-liners would be good choices.

    Assuming your .vcf file is tab-delimited, you can use awk in the terminal. Open a terminal window on your Mac, and 'cd' (change directory) into the directory where your file is saved. (Press Enter to execute a command.)
    e.g.
    Code:
    cd /home/user/directorycontainingyourfile
    If you do not know which directory you are in when you open your terminal, type

    Code:
    pwd
    It should show you 'where' you are. If you are unsure about directory structures in unix/linux, do some googling, it should become apparent pretty quick.

    Then type the following:

    Code:
    awk ' { if ( $1 == "1") print $0 } ' filename.txt > output.txt
    This means:
    $1 refers to the first column, so if it equals 1 (the chromosome ID), awk outputs the whole line; $0 means the entire line. You can substitute the name of your chromosome for "1", e.g. chr1. You can test other columns by changing $1 to $2, etc. The '>' symbol means the output of the awk command is saved in a file called output.txt.

    If you just want to look at what it does first, you can make it show you the lines without outputting to a file.

    Code:
    awk ' { if ( $1 == "1") print $0 } ' filename.txt | head
    Same deal as above, except the '|' (pipe) takes the output of the awk command and gives it to head, which shows the first 10 lines of the output. You can vary the number of lines with the '-n' option, e.g. head -n 20 gives you 20 lines.
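    For anyone more comfortable in Python, the filter-then-preview pattern can be sketched roughly as below. The names `rows_where` and `head` are made up for illustration, and the sketch splits strictly on tabs, whereas awk splits on any whitespace by default.

```python
from itertools import islice


def rows_where(lines, column, value, delimiter="\t"):
    """Yield lines whose 1-based `column` equals `value`, like awk's $column == value."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) >= column and fields[column - 1] == value:
            yield line


def head(lines, n=10):
    """Return the first `n` items of any iterable, like piping into `head -n`."""
    return list(islice(lines, n))
```

    Because `rows_where` is a generator, calling `head(rows_where(handle, 1, "1"), 20)` on an open file handle stops reading as soon as 20 matching lines have been found, so a quick preview does not scan the whole file.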

    There are many parameters to these commands, so if you do some searching you should be able to build pretty flexible queries.
    Btw, these commands work on Linux. You might need to adjust them on a Mac, but just try them out first.

    hope that helps.


  • bpb9
    replied
    Split VCF by chromosome?

    Hello, I am also new to the world of genomic datasets, tabix and VCF files. I understand that VCF files are basically giant text files, but because of their size I am unable to open them. I was able to upload my file to Galaxy and view it there, and to extract just the columns I wanted, but the file is still huge, something like 20 million rows. If I could split the file by chromosome, each piece would be a manageable size to work on. However, I can't figure out how to do this in Galaxy. Is there a specific program or command (in Galaxy or elsewhere) that can do this? It seems like such a simple task, but I can't find any obvious way to do it. If it helps, I am a Mac user. Thanks!


  • JohnK@Genome_Quest
    replied
    Originally posted by bnfoguy:
    Hello,

    I'm a first-timer in bioinformatics research and have recently been working on a project involving genetic variations. I am supposed to use the latest 1k genomes release file, which is "ALL.2of4intersection.2010084.genotypes.vcf.gz". I have downloaded the file, which is about 61.2GB, but am having trouble extracting its contents. I would really appreciate some guidance in this matter.
    I used to use Cygwin when I was on Windows. Once you figure out how Cygwin's file system is set up and mapped onto Windows, which should be fairly easy, you can navigate to whatever directory on your Windows system stores your data. Then you can use gunzip and all those other 'great' 'nix binaries.



    Once you've done this, you can also use this basic Perl command-line (CML) template to parse whatever columns you need:

    perl -e 'while (<>) { chomp; ($var1, $var2, $var3) = split("\t", $_); print "$var1\n"; }' < file_name > new_file

    You can modify this to get the job done, and once you figure out the basic syntax from above I'm sure you'll be a perl CML guru.
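    A rough Python counterpart of the column-picking template, in case it is easier to modify. The function name `select_columns` is made up for illustration, and columns that a short line does not have are silently skipped, which is a design choice of this sketch.

```python
def select_columns(lines, columns, delimiter="\t"):
    """Yield only the requested 1-based columns of each delimited line."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Keep the requested columns in the order given, skipping any that are missing.
        yield delimiter.join(fields[i - 1] for i in columns if i <= len(fields))
```

    Passing an open file handle as `lines` and `[1]` as `columns` gives the same first-column output as the one-liner above.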
    Last edited by JohnK@Genome_Quest; 06-06-2011, 07:54 PM.


  • BAMseek
    replied
    Hi Bnfoguy,
    The .tbi files are external indexes that help locate regions within the .gz files. They use a binning scheme similar to the one used for fast range queries on BAM files. The Tabix program can create and use these indexes to do range queries, with commands like the one I showed in my earlier post. It looks like you have to build the program from source, which is most easily done on Linux or Mac. On Windows, you might have the most luck working with the TabixReader.java file that comes in the Tabix download, but I would think it takes some programming skill to turn that into a working program.

    The .tbi files are not human readable and would really only need to be looked at if you were interested in understanding how the binning scheme works. The Tabix program would take care of building and using those indexes for you. In case you are interested in how the index works, here is the .tbi schema tabix.pdf
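    For the curious, the core of the binning scheme can be sketched in a few lines of Python. This follows the `reg2bin` calculation as I understand it from the SAM specification, which the Tabix index shares. It takes zero-based, half-open coordinates and covers only the bin arithmetic, not the on-disk .tbi layout.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains the interval [beg, end)."""
    end -= 1  # work with the last covered base
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # 16 kb bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # 128 kb bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # 1 Mb bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # 8 Mb bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # 64 Mb bins
    return 0                       # the single 512 Mb bin
```

    Small features land in the fine 16 kb bins, while a feature that straddles a bin boundary is pushed up to the next coarser level, so a range query only has to inspect the handful of bins that can overlap the requested region.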

    Hope that at least partially answers your question. I know that is a lot to bite off.

    BAMseek


  • bnfoguy
    replied
    Hello BAMseek,

    Thanks for your suggestions. I will try these and let you know if I have any success. Would you know of a method to read the .tbi file, like opening it with Notepad or any other Windows program?

    Thank you again,

    Bnfoguy


  • BAMseek
    replied
    Hi bnfoguy,

    I added VCF support to the BAMseek large file viewer. You can find the download here. You should be able to open both the uncompressed text file and the compressed gz file. Let me know how it works out for you.

    While adding the VCF support, I found out some additional information that you may find useful. Those .gz files on 1000 genomes are actually BGZF-compressed files - you can decompress them as usual using gzip but you can also jump to locations within the file and begin decompressing from there which allows you to extract chunks from the file without decompressing the whole thing.
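    The idea behind this can be illustrated with plain gzip in Python: several independently compressed blocks, concatenated, still form a valid gzip stream, and each block can be decompressed on its own if you know where it starts. This is a simplified sketch only. Real BGZF blocks also record their own size in an extra header field, which is not modelled here, and the two data lines are made up.

```python
import gzip

# Two independently compressed blocks, concatenated into one valid gzip stream.
block_a = gzip.compress(b"1\t100\trs1\n")
block_b = gzip.compress(b"2\t200\trs2\n")
archive = block_a + block_b

# Ordinary decompression reads straight through every block.
whole = gzip.decompress(archive)

# Knowing the offset of a block is enough to decompress from that point only.
tail = gzip.decompress(archive[len(block_a):])
```

    Here `whole` holds both lines and `tail` holds only the second one, without the first block ever being decompressed. An index essentially supplies those block offsets for each genomic region.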

    You may have noticed that those files also have an associated .tbi file in the 1000 genomes repository. These are Tabix index files which allow you to extract all features overlapping a genomic region, without requiring you to download the entire file locally first.

    For example,
    would query the file on the ftp server and give you back all features on chromosome 1 between 2 million and 2.1 million. A nice thread on this subject can be found here.
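    Without an index, the same kind of region query can be approximated by scanning every line, which is far slower but needs nothing beyond Python. A minimal sketch, assuming a 'chrom:start-end' region string with inclusive 1-based coordinates and matching on the POS column only (variants that start before the region but extend into it are not detected); the function names are made up for illustration.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end) with integer coordinates."""
    chrom, _, span = region.partition(":")
    start, _, end = span.replace(",", "").partition("-")
    return chrom, int(start), int(end)


def records_in_region(lines, region):
    """Yield VCF data lines whose CHROM matches and whose POS lies in [start, end]."""
    chrom, start, end = parse_region(region)
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t", 2)
        if len(fields) >= 2 and fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line
```

    The difference from an indexed query is purely in cost: this reads the whole file, while the index lets the reader jump straight to the blocks that can contain the region.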

    Hope that helps!


  • BAMseek
    replied
    Hi bnfoguy,

    Not quite a direct answer to your question, but I thought I would mention a tool I have been working on that addresses a similar issue of trying to view very large alignment files. It is called BAMseek and is available at http://code.google.com/p/bamseek/ . Currently it works for BAM and SAM files, but I could easily extend it to work on VCF files. This would allow you to at least view the file, get familiar with its contents, and even do some copying from the file. For more complex needs or repetitive tasks (such as extracting a large number of regions), knowledge of the command line and a scripting language such as Perl or Python is always useful. As mentioned above, vcftools might have what you need too.


  • swbarnes2
    replied
    According to some googling, zgrep will work.
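    Where zgrep is not available (for example on a plain Windows install), a rough Python equivalent can search the compressed file without unpacking it to disk first. The function name `zgrep` and the `limit` parameter are choices of this sketch and do not mirror the real tool's options.

```python
import gzip
import re


def zgrep(pattern, gz_path, limit=None):
    """Yield lines of a gzip-compressed text file that match `pattern`, decompressing on the fly."""
    regex = re.compile(pattern)
    matched = 0
    with gzip.open(gz_path, "rt") as handle:
        for line in handle:
            if regex.search(line):
                yield line
                matched += 1
                # Stop early once enough matches were found, if a limit was given.
                if limit is not None and matched >= limit:
                    return
```

    Only the part of the file that is actually read gets decompressed, so asking for a few matches near the top of a multi-gigabyte file returns quickly.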


  • RDW
    replied
    Originally posted by bnfoguy:
    Is there a way that I could use specific command line prompts or programs to access specific portions of it?
    You could always use 'less':



    This is standard on Linux and other Unix-like systems, and there are versions for Windows that a quick search will find. But to do anything sensible with this file, you're going to need a program that knows how to parse it and extract what you need (this might, of course, be something simple you could write yourself, or maybe one of the utilities from the vcftools package).
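    As an illustration of the "something simple you could write yourself" option, here is a minimal Python sketch of a VCF line parser. It relies only on the layout of the format: '##' meta-information lines, one '#CHROM ...' header line, then tab-separated records. The function name `read_vcf` is made up, and all values are left as strings.

```python
VCF_FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_vcf(lines):
    """Yield one dict per VCF data line, keyed by the column names on the '#CHROM' line."""
    columns = VCF_FIXED  # fallback if the file has no header line
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank or meta-information line
        if line.startswith("#"):
            columns = line[1:].split("\t")  # header line: '#CHROM', 'POS', 'ID', ...
            continue
        yield dict(zip(columns, line.split("\t")))
```

    Because it is a generator over lines, it works equally well on an open text file or on a handle from `gzip.open(path, "rt")`, and never holds more than one record in memory.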


  • bnfoguy
    replied
    Thanks for your reply,

    I have manually extracted the file, but as you said, traditional text editors like Notepad or Word have trouble opening it because of its size. Is there a way that I could use specific command line prompts or programs to access specific portions of it? Linux is new to me and I am still learning the commands. Would Python snippets be of any help?


  • ulz_peter
    replied
    As far as I know there was a new release of 1000genome SNP calls:

    ftp://ftp-trace.ncbi.nih.gov/1000gen...hase1_release/

    However, these files are compressed with GNU zip (gzip). I found a link for the Windows version:


    You should then be able to decompress the file and view it using a text editor of your choice (as .vcf files are nothing but plain text files).

    Nevertheless:

    1) Uncompressing large files takes a very, very long time. If you use a conventional PC this could be in the hours range.

    2) I don't know if you could actually open the resulting .vcf file as it is extremely large (62GB is the compressed version!)

    3) Are you sure you need the genotypes file? I guess some participants of the 1000genomes project will know better, but as far as I know the genotypes files contain all the individual genotypes. That's what makes them so big. The .sites files contain a more condensed version, which should include the same sites but no individual genotypes.
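    To make the size difference in point 3 concrete: in a VCF, the per-individual genotypes are the columns after the eighth (INFO) column, so dropping them leaves one short line per site. The sketch below shows that idea on a stream of lines. It is not a claim about how the official .sites files were produced, and the function name `drop_genotypes` is made up.

```python
def drop_genotypes(lines, keep=8):
    """Yield VCF lines cut down to the first `keep` tab-separated columns (CHROM through INFO)."""
    for line in lines:
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
        else:
            # Header and data lines lose the FORMAT column and every sample column.
            yield "\t".join(line.rstrip("\n").split("\t")[:keep]) + "\n"
```

    With hundreds of samples per line, nearly all of the file's bulk sits in the columns this removes.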

    Hope that helps
