  • balsampoplar
    replied
    I have only been bringing the first 5 columns into R, but I like your solution as well. Thanks.


  • sarvidsson
    replied
    Originally posted by balsampoplar:
    Thanks. One liners are always useful. I have been doing this by first subsetting REF/ALT columns containing dashes in R and then using that list with --positions filter in vcftools to separate out snp and indels in individual vcf files.
    You don't need to load the whole thing into R for that. Something like

    Code:
    cut -f 3,4 in.vcf | grep '\-$' | cut -f 1 > ref_dash_ID-list
    cut -f 3,5 in.vcf | grep '\-$' | cut -f 1 > alt_dash_ID-list
    would be much faster.
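A variant of the same idea, as a minimal sketch: a single awk pass that skips header lines and tests REF and ALT for an exact "-" instead of grepping for a trailing dash. The function name `dash_ids` is hypothetical, and the column positions (ID in 3, REF in 4, ALT in 5) are assumed from the excerpt in the first post.

```shell
# Hypothetical helper: print the ID (column 3) of every record whose
# REF (column 4) or ALT (column 5) is exactly "-".
# Lines starting with "#" (meta and header lines) are skipped.
dash_ids() {
    awk -F'\t' '!/^#/ && ($4 == "-" || $5 == "-") { print $3 }' "$1"
}

# Usage sketch (not run here):
#   dash_ids in.vcf > dash_ID-list
#   vcftools --vcf in.vcf --exclude dash_ID-list --recode --out snps_only
#   vcftools --vcf in.vcf --snps dash_ID-list --recode --out indels_only
```

The resulting list holds one ID per line, which is the form the vcftools `--snps` and `--exclude` options expect, so no CHROM/POS list is needed.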


  • balsampoplar
    replied
    Thanks. One liners are always useful. I have been doing this by first subsetting REF/ALT columns containing dashes in R and then using that list with --positions filter in vcftools to separate out snp and indels in individual vcf files.

    Still, I would like to figure out what dashes versus periods mean in the REF/ALT fields.

    Edit: Will post a question to the Tassel list.
    Last edited by balsampoplar; 03-12-2015, 07:14 AM. Reason: cross connection


  • sarvidsson
    replied
    Originally posted by balsampoplar:
    That's interesting. According to the 1000genomes 4.0 spec, the '.' in the ALT column indicates monomorphic site. That's clearly not the same as a '-' which is supposed to indicate an indel, right?
    Exactly. I'd check with the TASSEL people that those records are really indels, perhaps they can fix the VCF output for a future version.


  • sarvidsson
    replied
    A quick and dirty way to filter these files would be
    Code:
    grep -v $'\t-\t' in.vcf > no_indels.vcf
    provided that no other tab-separated fields consist of a single dash. Check that with:
    Code:
    grep $'\t-\t' in.vcf > indels.vcf
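If dashes might turn up in other columns, the test can be restricted to REF and ALT. The following is a minimal sketch rather than a tested pipeline: the function name `split_dash_indels` is hypothetical, it assumes REF is column 4 and ALT is column 5, and it only recognises an allele that is exactly "-" (not multi-allelic ALT values such as "A,-").

```shell
# Hypothetical helper: split a VCF that uses "-" for indel alleles into
# a SNP file and an indel file. Header lines are copied to both outputs.
# Usage: split_dash_indels in.vcf snp.vcf indels.vcf
split_dash_indels() {
    in_vcf=$1
    snp_out=$2
    indel_out=$3
    # Create or empty both outputs so they exist even if nothing is written.
    : > "$snp_out"
    : > "$indel_out"
    awk -F'\t' -v snp="$snp_out" -v indel="$indel_out" '
        /^#/                   { print > snp; print > indel; next }  # header lines
        $4 == "-" || $5 == "-" { print > indel; next }               # dash allele
                               { print > snp }                       # everything else
    ' "$in_vcf"
}
```

Unlike the grep version, this keeps the header in both files, so downstream tools that need the "#CHROM" line can still read them.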


  • balsampoplar
    replied
    That's interesting. According to the 1000genomes 4.0 spec, the '.' in the ALT column indicates monomorphic site. That's clearly not the same as a '-' which is supposed to indicate an indel, right?


  • sarvidsson
    replied
    Then their conformity is broken: http://www.1000genomes.org/node/101; the allowed contents of the REF and ALT fields are the same for 4.0 as 4.1.
    Last edited by sarvidsson; 03-12-2015, 07:05 AM.


  • balsampoplar
    replied
    sarvidsson: My VCF files were generated using the Tassel GBS pipeline and the header indicates they conform to 4.0 specification.


  • sarvidsson
    replied
    What VCF version is that supposed to be (and what software called those variants)? It is not proper VCF 4.1 or 4.2: dashes are not allowed as bases, so it is no wonder that the tools you tried couldn't identify those records as indels.

    VCF 4.2 spec, check section 1.4.1 for the REF and ALT fields, as well as section 5.2 for examples with properly formatted indels: http://samtools.github.io/hts-specs/VCFv4.2.pdf
    Last edited by sarvidsson; 03-12-2015, 06:58 AM. Reason: Added spec link
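To see how many records in a file fall outside what the spec allows, a rough check along these lines may help. It is a simplified sketch, not a validator: the function name `nonconforming_alleles` is hypothetical, REF and ALT are assumed to be columns 4 and 5, symbolic alleles are accepted on shape alone ("<...>"), and anything containing "[" or "]" is accepted as a breakend without further inspection.

```shell
# Hypothetical check: print CHROM, POS, ID, REF, ALT for records whose
# REF or ALT does not look like a VCF 4.2 allele (simplified rules).
nonconforming_alleles() {
    awk -F'\t' '
        /^#/ { next }                              # skip meta and header lines
        {
            bad = ($4 !~ /^[ACGTNacgtn]+$/)        # REF must be a base string
            n = split($5, alt, ",")                # ALT may list several alleles
            for (i = 1; i <= n; i++) {
                a = alt[i]
                if (a == ".") continue                      # no alternate allele
                if (a ~ /^[ACGTNacgtn*]+$/) continue        # base string or "*"
                if (a ~ /^<[^<>]+>$/) continue              # symbolic allele
                if (index(a, "[") || index(a, "]")) continue  # breakend, unchecked
                bad = 1
            }
            if (bad) print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5
        }
    ' "$1"
}
```

On the excerpt from the first post, this would report the two records with a dash in REF or ALT and leave the A/G site alone.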


  • balsampoplar
    started a topic Separating indels and snps in vcf

    Separating indels and snps in vcf

    Task at hand: Remove simple indel loci from vcf file.

    ------ example vcf file excerpt ----------
    CHR POS ID REF ALT
    1 20293 S_20293 A G
    1 22689 S_22689 A -
    1 23251 S_23251 - T
    --------------------------------------------

    $ bcftools filter -e'%TYPE="snp"' in.vcf > indels.vcf
    Results in a VCF file containing only header

    $ bcftools filter -e'%TYPE="indels"' in.vcf > snp.vcf
    Results in a VCF file containing all variants (snps and indels)

    It appears that bcftools is treating the last two sites above as SNPs rather than indels. I saw the same behavior with the --remove-indels filter in vcftools.

    Does anyone know what's going on?
