Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Separating indels and snps in vcf

    Task at hand: Remove simple indel loci from vcf file.

    ------ example vcf file excerpt ----------
    CHR POS ID REF ALT
    1 20293 S_20293 A G
    1 22689 S_22689 A -
    1 23251 S_23251 - T
    --------------------------------------------

    $ bcftools filter -e'%TYPE="snp"' in.vcf > indels.vcf
    Results in a VCF file containing only header

    $ bcftools filter -e'%TYPE="indels"' in.vcf > snp.vcf
    Results in a VCF file containing all variants (snps and indels)

    It appears that bcftools is not treating the last two sites above as indels, but as snps. I saw the same behvior with --remove-indels filter in vcftools.

    Does anyone know what's going on?

  • #2
    What VCF version should that be (and what software called those)? It is not proper VCF4.1 or 4.2 - dashes are not allowed as bases, so no wonder that the tools you tried couldn't identify those as indels.

    VCF 4.2 spec, check section 1.4.1 for the REF and ALT fields, as well as section 5.2 for examples with properly formatted indels: http://samtools.github.io/hts-specs/VCFv4.2.pdf
    Last edited by sarvidsson; 03-12-2015, 06:58 AM. Reason: Added spec link

    Comment


    • #3
      sarvidsson: My VCF files were generated using the Tassel GBS pipeline and the header indicates they conform to 4.0 specification.

      Comment


      • #4
        Then their conformity is broken: http://www.1000genomes.org/node/101; the allowed contents of the REF and ALT fields are the same for 4.0 as 4.1.
        Last edited by sarvidsson; 03-12-2015, 07:05 AM.

        Comment


        • #5
          That's interesting. According to the 1000genomes 4.0 spec, the '.' in the ALT column indicates monomorphic site. That's clearly not the same as a '-' which is supposed to indicate an indel, right?

          Comment


          • #6
            A quick and dirty way to filter these files would be to
            Code:
            grep -v $'\t-\t' in.vcf > no_indels.vcf
            in case there are no other tab-separated fields with dashes. Check that with:
            Code:
            grep $'\t-\t' in.vcf > indels.vcf

            Comment


            • #7
              Originally posted by balsampoplar View Post
              That's interesting. According to the 1000genomes 4.0 spec, the '.' in the ALT column indicates monomorphic site. That's clearly not the same as a '-' which is supposed to indicate an indel, right?
              Exactly. I'd check with the TASSEL people that those records are really indels, perhaps they can fix the VCF output for a future version.

              Comment


              • #8
                Thanks. One liners are always useful. I have been doing this by first subsetting REF/ALT columns containing dashes in R and then using that list with --positions filter in vcftools to separate out snp and indels in individual vcf files.

                Still I would like to figure out what dash versus periods mean in the REF/ALT fields.

                Edit: Will post a question to the Tassel list.
                Last edited by balsampoplar; 03-12-2015, 07:14 AM. Reason: cross connection

                Comment


                • #9
                  Originally posted by balsampoplar View Post
                  Thanks. One liners are always useful. I have been doing this by first subsetting REF/ALT columns containing dashes in R and then using that list with --positions filter in vcftools to separate out snp and indels in individual vcf files.
                  You don't need to load the whole thing in R for that,

                  Code:
                  cut -f 3,4 in.vcf | grep '\-$' | cut -f 1 > ref_dash_ID-list
                  cut -f 3,5 in.vcf | grep '\-$' | cut -f 1 > alt_dash_ID-list
                  would be much faster

                  Comment


                  • #10
                    I have been only bringing in the first 5 columns in R, but I like your solution as well. Thanks.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Exploring the Dynamics of the Tumor Microenvironment
                      by seqadmin




                      The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                      07-08-2024, 03:19 PM
                    • seqadmin
                      Exploring Human Diversity Through Large-Scale Omics
                      by seqadmin


                      In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                      06-25-2024, 06:43 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Today, 07:20 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-16-2024, 05:49 AM
                    0 responses
                    35 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-15-2024, 06:53 AM
                    0 responses
                    38 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-10-2024, 07:30 AM
                    0 responses
                    41 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X