  • Separate multi-allelic VCF lines to multiple rows

    Recent VCF versions (4.1+) allow a single locus to span multiple rows in the file when there are multiple alleles. The older convention of listing multiple alleles on the same line is also valid. Unfortunately, some analyses require one convention and some the other. Are there any tools or scripts that can take a VCF file with multiple alleles on one line and split them onto separate lines, including the genotypes in the sample columns?
    Thanks!

  • #2
    Did you find a tool for this? I'm looking too.



    • #3
      No, I didn't. I ended up having to write my own custom Ruby script to do it.



      • #4
        I wrote something in C++ (https://github.com/ekg/vcflib/blob/m...eakmulti.cpp):

        % vcfbreakmulti --help
        usage: vcfbreakmulti [options] [file]

        If multiple alleles are specified in a single record, break the record into
        multiple lines, preserving allele-specific INFO fields.
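
For readers who want to see the idea in script form, here is a minimal Python sketch of breaking one multi-allelic record into one row per ALT allele. It is not the vcfbreakmulti implementation. The function names and the `PER_ALT_INFO_KEYS` default are hypothetical, and the sketch makes several simplifying assumptions: per-allele INFO keys are passed in by hand rather than read from the header, a genotype that points at a different ALT allele is written as `.` on the current row (other tools may choose differently), alleles are not trimmed or left-aligned, and per-genotype FORMAT fields such as AD or PL are copied unchanged.

```python
import re

# Hypothetical default: INFO keys assumed to hold one value per ALT allele.
PER_ALT_INFO_KEYS = frozenset({"AC", "AF"})


def remap_gt(gt, alt_index):
    """Rewrite one GT string so that only ALT number alt_index survives as 1."""
    alleles = re.split(r"[/|]", gt)
    seps = re.findall(r"[/|]", gt)  # keep the original phasing separators
    out = []
    for allele in alleles:
        if allele in ("0", "."):
            out.append(allele)      # reference or missing stays as-is
        elif allele == str(alt_index):
            out.append("1")         # the allele kept on this row
        else:
            out.append(".")         # a different ALT: marked missing (assumption)
    result = out[0]
    for sep, allele in zip(seps, out[1:]):
        result += sep + allele
    return result


def split_info(info, alt_pos, n_alts, per_alt_keys):
    """Keep only the alt_pos-th value of each per-allele INFO field."""
    if info == ".":
        return info
    fields = []
    for item in info.split(";"):
        key, eq, value = item.partition("=")
        values = value.split(",")
        if eq and key in per_alt_keys and len(values) == n_alts:
            value = values[alt_pos]
        fields.append(key + eq + value)
    return ";".join(fields)


def split_multiallelic(line, per_alt_keys=PER_ALT_INFO_KEYS):
    """Return a list of VCF data lines, one per ALT allele in `line`."""
    text = line.rstrip("\n")
    cols = text.split("\t")
    if text.startswith("#") or len(cols) < 8:
        return [text]               # header or malformed line: pass through
    alts = cols[4].split(",")
    if len(alts) < 2:
        return [text]               # already one ALT per row
    gt_pos = None
    if len(cols) > 8:
        fmt = cols[8].split(":")
        gt_pos = fmt.index("GT") if "GT" in fmt else None
    rows = []
    for pos, alt in enumerate(alts):
        new = cols[:]
        new[4] = alt
        new[7] = split_info(cols[7], pos, len(alts), per_alt_keys)
        if gt_pos is not None:
            for s in range(9, len(cols)):
                parts = cols[s].split(":")
                if gt_pos < len(parts):
                    parts[gt_pos] = remap_gt(parts[gt_pos], pos + 1)
                new[s] = ":".join(parts)
        rows.append("\t".join(new))
    return rows


if __name__ == "__main__":
    record = "1\t100\t.\tA\tC,G\t50\tPASS\tDP=20;AC=3,1\tGT:DP\t1/2:10\t0|1:8"
    for row in split_multiallelic(record):
        print(row)
```

With the example record above, the `1/2` genotype becomes `1/.` on the row for `C` and `./1` on the row for `G`, and `AC=3,1` is split into `AC=3` and `AC=1`. Reading the `Number=A` declarations from the VCF header would be a more robust way to decide which INFO keys to split.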



        • #5
          @ekg

          I tried to compile vcflib but got some errors. Below is the output of the make command. Sorry that it is in Italian; I can repeat it in English if needed.

          bw
          Andrea

          ---------------------------

          elmaffo@arc-HP8200i7 ~/Scaricati/vcflib $ make
          cd tabixpp && make
          make[1]: ingresso nella directory "/home/elmaffo/Scaricati/vcflib/tabixpp"
          make[2]: ingresso nella directory "/home/elmaffo/Scaricati/vcflib/tabixpp"
          gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE bgzf.c -o bgzf.o
          bgzf.c: In function ‘bgzf_close’:
          bgzf.c:630:8: warning: variable ‘count’ set but not used [-Wunused-but-set-variable]
          gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE kstring.c -o kstring.o
          gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE knetfile.c -o knetfile.o
          knetfile.c: In function ‘khttp_connect_file’:
          knetfile.c:418:2: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
          knetfile.c: In function ‘kftp_send_cmd’:
          knetfile.c:239:2: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
          gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE index.c -o index.o
          gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE bedidx.c -o bedidx.o
          ar -cru libtabix.a bgzf.o kstring.o knetfile.o index.o bedidx.o
          ranlib libtabix.a
          gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE main.c -o main.o
          gcc -g -Wall -O2 -fPIC -o tabix main.o -lm -lz -L. -ltabix
          ./libtabix.a(bgzf.o): nella funzione "deflate_block":
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:311: riferimento non definito a "deflate"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:313: riferimento non definito a "deflateEnd"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:305: riferimento non definito a "deflateInit2_"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:329: riferimento non definito a "deflateEnd"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:345: riferimento non definito a "crc32"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:346: riferimento non definito a "crc32"
          ./libtabix.a(bgzf.o): nella funzione "inflate_block":
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:380: riferimento non definito a "inflateInit2_"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:385: riferimento non definito a "inflate"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:391: riferimento non definito a "inflateEnd"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bgzf.c:387: riferimento non definito a "inflateEnd"
          ./libtabix.a(bedidx.o): nella funzione "ks_getuntil":
          /home/elmaffo/Scaricati/vcflib/tabixpp/bedidx.c:11: riferimento non definito a "gzread"
          ./libtabix.a(bedidx.o): nella funzione "bed_read":
          /home/elmaffo/Scaricati/vcflib/tabixpp/bedidx.c:103: riferimento non definito a "gzdopen"
          ./libtabix.a(bedidx.o): nella funzione "ks_getc":
          /home/elmaffo/Scaricati/vcflib/tabixpp/bedidx.c:11: riferimento non definito a "gzread"
          ./libtabix.a(bedidx.o): nella funzione "bed_read":
          /home/elmaffo/Scaricati/vcflib/tabixpp/bedidx.c:138: riferimento non definito a "gzclose"
          /home/elmaffo/Scaricati/vcflib/tabixpp/bedidx.c:103: riferimento non definito a "gzopen64"
          collect2: error: ld returned 1 exit status
          make[2]: *** [tabix] Errore 1
          make[2]: uscita dalla directory "/home/elmaffo/Scaricati/vcflib/tabixpp"
          make[1]: *** [all-recur] Errore 1
          make[1]: uscita dalla directory "/home/elmaffo/Scaricati/vcflib/tabixpp"
          make: *** [tabixpp/tabix.o] Errore 2



          • #6
            @Andrea

            Don't worry, I speak Italian.

            It looks to me like zlib is missing: http://stackoverflow.com/questions/1...late-with-zlib

            Is zlib installed on your system?



            • #7
              @ekg

              Here is the list of zlib-related packages on my Ubuntu 12.10 box:

              Installed: zlib1g, zlib1g-dev, zlib1g:i386
              Not installed: zlib-bin, zlib-gst, zlib1g-dbg, zlibc

              Do I need any of the "not installed" ones?

              Thanks
              Andrea



              • #8
                @ekg

                I found out the issue was related to tabixpp, as described by guillermo-carrasco in this thread (see the last messages in the thread):

                Hi, I'm trying to compile tabixpp on Ubuntu, but am getting a lot of "undefined reference to xxxxx" errors (https://gist.github.com/2309326). Am I missing a step? Is this a known issue? rob-> make ...


                I edited the Makefile in the tabixpp folder as suggested in the thread. Everything compiled.

                Bye



                • #9
                  Here you go

                  For anyone who is interested, I ended up writing a couple of scripts for splitting and merging multi-allelic lines.
                  They are available in the "utils" directory of the Atlas2 trunk.
                  http://sourceforge.net/projects/atlas2/



                  • #10
                    Here is another small tool to do the same thing, written in Python:

                    moonso/vcf_parser on GitHub: a simple VCF parser, based on PyVCF.
