Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • shrutijha
    replied
    Error in Last step

    Originally posted by Petr View Post
    Hi,

    I've just extended the documentation of fill-aa to answer this frequently asked question:

    About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
    ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
    and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

    Example:
    # Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
    # following command for each file manually
    bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
    samtools faidx human_ancestor_1.fa.gz

    # .. or use this loop (tested in bash shell)
    ls human_ancestor_*.fa.bz2 | while read IN; do
    OUT=`echo $IN | sed 's,bz2$,gz,'`
    CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
    bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
    samtools faidx $OUT
    done

    # After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
    samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

    # Now the files are ready to use with fill-aa. Note that the VCF file
    # should be sorted (see vcf-sort), otherwise the performance would be seriously
    # affected.
    cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

    Hope this helps,
    Petr

    I am trying to run:
    Code:
    cat chr3.vcf | fill-aa -a human_ancestor_3.fa.bz | gzip -c >out.sort.ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.geno                                                                                      types.vcf


    But I am getting the following error:

    [W::fai_fetch] Reference 3:60069-160069 not found in FASTA file, returning empty sequence
    Failed to fetch sequence in 3:60069-160069
    The command "samtools faidx human_ancestor_3\.fa\.bz 3\:60069\-160069" returned non-zero status 256.
    >3:60069-160069
    at /usr/local/share/perl5/FaSlice.pm line 56.
    FaSlice::throw('FaSlice=HASH(0x2227f08)', 'The command "samtools faidx h uman_ancestor_3\.fa\.bz 3\:60069...', '.\x{a}', '>3:60069-160069\x{a}') called a t /usr/local/share/perl5/FaSlice.pm line 79
    FaSlice::cmd('FaSlice=HASH(0x2227f08)', 'samtools faidx human_ancestor_3 \.fa\.bz 3\:60069\-160069') called at /usr/local/share/perl5/FaSlice.pm line 125
    FaSlice::read_chunk('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/ local/share/perl5/FaSlice.pm line 153
    FaSlice::get_base('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/lo cal/bin/fill-aa line 148
    main::fill_aa('HASH(0x1d52a68)', 'human_ancestor_3.fa.bz') called at /us r/local/bin/fill-aa line 18

    Leave a comment:


  • Petr
    replied
    Hi,
    this README describes what dashes and dots mean
    ftp://ftp.1000genomes.ebi.ac.uk/vol1...gnments/README

    Leave a comment:


  • laura
    replied
    When no genome is complete and many of the genomes used in the alignments are of mixed quality this is unavoidable

    Leave a comment:


  • bpb9
    replied
    Nevermind, I've just answered my own question, I think. Clicking on Petr's link above (ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments), there is a readme.ancestralalignments txt file in the folder which says that "." means no ancestral information, while "-" means lineage-specific (not present in ancestor, only in human). Is this accurate?

    If so, what does one do about all positions for which the ancestral is "."? It seems a waste to ignore these. Could some alternative approach be used to call an ancestral allele at these positions?
    Last edited by bpb9; 10-02-2012, 11:14 AM. Reason: typo

    Leave a comment:


  • bpb9
    replied
    Anyone know what the "-" and "." mean in fill-aa?

    I used the fill-aa feature in vcf tools to pull the ancestral alleles, according to thousand genomes, for variants I have typed in another population. However, I did not get an ancestral allele call for all of the positions. Most of the positions are biallelic SNPs, so the ancestral allele is listed A, T, G, or C as the case may be.

    But some of the SNPs' ancestral alleles are filled in by a "-" or a "." or sometimes "-GT" of "C-A", etc. for variants bigger than one nucleotide. In some cases, if the variant is many base pairs, for example TGCAAT, the ancestral allele is filled in with "------" or "......".

    I am not sure what "-" and "." means. If "." means reference, why isn't the reference allele just listed instead? Does "-" mean not present, or does it mean variable gap? I apologize if this is a silly question, but I couldn't find the details on the vcf tools page. Please advise if you know! Thanks!

    Leave a comment:


  • bioinfosm
    replied
    Originally posted by jlc_1020 View Post
    Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

    I've downloaded the correct files now, and then bzipped them so I could used Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the snp variants. Thanks!
    Is it possible for you to share the final solution, and where the ancestral allele sequence can be obtained from?

    I know Petr provided most of the snippets, but if you can summarize the working version you got, would be tremendously helpful.

    Also could you share how you are using the ancestral allele information. I was thinking of using it as in the LOF paper by macArthur et al.

    Thanks!!

    Leave a comment:


  • jlc_1020
    replied
    Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

    I've downloaded the correct files now, and then bzipped them so I could used Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the snp variants. Thanks!

    Leave a comment:


  • laura
    replied
    Which ones are you using

    The ones in here
    ftp://ftp.1000genomes.ebi.ac.uk/vol1...ical/reference

    Are from ensembl 59 which was based on grch37

    Leave a comment:


  • jlc_1020
    replied
    Ok, so I've run into a new but related problem with this dataset. The human ancestral alignments above are based on the NCBI36 build of the human genome, while the May 2011 SNP calls for the 1000 genomes are based on GRCh37. So when I run fill-aa it fills in the wrong alleles.

    Is there a newer human ancestral alignment that would work? Or some other known workaround for this discrepancy?

    Leave a comment:


  • jlc_1020
    replied
    That worked great, thanks for all your help!

    Leave a comment:


  • Petr
    replied
    Hi,

    I've just extended the documentation of fill-aa to answer this frequently asked question:

    About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
    ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
    and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

    Example:
    # Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
    # following command for each file manually
    bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
    samtools faidx human_ancestor_1.fa.gz

    # .. or use this loop (tested in bash shell)
    ls human_ancestor_*.fa.bz2 | while read IN; do
    OUT=`echo $IN | sed 's,bz2$,gz,'`
    CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
    bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
    samtools faidx $OUT
    done

    # After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
    samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

    # Now the files are ready to use with fill-aa. Note that the VCF file
    # should be sorted (see vcf-sort), otherwise the performance would be seriously
    # affected.
    cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

    Hope this helps,
    Petr

    Leave a comment:


  • jlc_1020
    replied
    Thanks for the help. I think I've almost got this working now. But I'm still getting this error message when I run the fill-aa script:

    FIXME: the sequence names not in '>(chr)?\S+' format [?BCuman_ancestor_2.fa.gz.fai
    at //FaSlice.pm line 56
    FaSlice::throw('FaSlice=HASH(0x10096cc40)', 'FIXME: the sequence names not in \'>(chr)?\S+\' format [\x{1f}\x{8b}\x{8}\x{4}\x{0}...') called at //FaSlice.pm line 92

    I'm not sure what this means. The human ancestral sequences are paired with a .bed index file, but the fill-aa script looks for a .fa.gz.fai, so I made one with samtools by doing:

    samtools faidx human_ancestor_2.fa | bgzip -c >human_ancestor_2.fa.gz.fai

    Is the problem with the format of my index file?

    Leave a comment:


  • laura
    replied
    The current 1000 genomes release does not contain ancestral alleles unfortunately

    You can get ancestral alleles using the vcftools fill-aa script



    and use the Human Ancestral sequence from

    ftp://ftp.1000genomes.ebi.ac.uk/vol1...37_e59.tar.bz2

    Which is based on a 6way EPO alignment from ensembl 59

    Leave a comment:


  • jlc_1020
    started a topic 1000 genomes ancestral alleles?

    1000 genomes ancestral alleles?

    Hello all,

    I've downloaded a subset of SNPs using tabix and then did an allele count using vcftools. I used the -get-INFO AA to get the identity of the ancestral allele for my SNPs but it returned all question marks.

    My question is, what is the next best way to get ancestral alleles from the 1000 genomes SNPs dataset. I've tried using UCSCs liftover to chimp and macaque, but the alleles don't always match up (i.e. sometimes the ancestral allele is neither one of the human variants).

    Any help would be greatly appreciated.

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 03-27-2024, 06:37 PM
0 responses
13 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-27-2024, 06:07 PM
0 responses
11 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
53 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
69 views
0 likes
Last Post seqadmin  
Working...
X