  • bioinfun
    Junior Member
    • Jun 2011
    • 4

    synonymous snps from vcf

    Hi

    Does anyone have ideas on how to identify synonymous and non-synonymous SNPs programmatically from VCF files? I ran mpileup on several hundred bacterial genomes to produce the VCF file.

    Thanks
  • ymc
    Senior Member
    • Mar 2010
    • 496

    #2
    Well, you can either write your own tool to do that or try ANNOVAR.


    • iansealy
      Member
      • Oct 2010
      • 15

      #3
      Or Ensembl's VEP (http://www.ensembl.org/tools.html) or snpEff (http://snpeff.sourceforge.net/) or...


      • bioinfun
        Junior Member
        • Jun 2011
        • 4

        #4
        Thanks guys but....

        I am trying to program it myself, and I thought I could get some leads on how to do this from a VCF file.

        What do you think of this quick way of doing it:

        1- get the nucleotide sequence of the CDS that contains the SNP
        2- perform 6-frame translation
        3- compare with the translated reference sequence
        4- if the sequences differ, the SNP from step 1 is non-synonymous; if they are identical, it is synonymous.

        Not accurate but will give you an idea. What do you guys think?
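Steps 1-4 above can be sketched in Python. This is a minimal sketch, assuming the reading frame of the CDS is already known from the annotation, so only the one affected in-frame codon is translated rather than all six frames. `classify_snp` and its arguments are hypothetical names for illustration, not part of any tool mentioned in this thread. The codon table is the standard one, which the bacterial table (11) shares apart from alternative start codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(cds, offset, alt):
    """Classify a single-base substitution inside a coding sequence.

    cds    -- coding sequence, 5'->3' on the coding strand, starting in frame
    offset -- 0-based position of the SNP within cds (assumed convention)
    alt    -- alternate base, already on the coding strand
    Returns (label, ref_aa, alt_aa).
    """
    cds = cds.upper()
    codon_start = offset - offset % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("SNP falls in an incomplete trailing codon")
    pos_in_codon = offset % 3
    alt_codon = ref_codon[:pos_in_codon] + alt.upper() + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa
```

For a gene on the reverse strand, both the CDS and the alternate base would need to be reverse-complemented before calling the function.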


        • SeekAnswers
          Member
          • Mar 2012
          • 21

          #5
          You can try comparing the coordinates in the VCF with the coding-region starts and ends in RefSeq to see where each variant falls, and make the determination based on that.
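The coordinate comparison could look roughly like the sketch below. It assumes 1-based inclusive coordinates for both the VCF position and the CDS boundaries; the `CDS` tuple, `find_cds`, and `cds_offset` are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple, Optional


class CDS(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def find_cds(pos: int, features: List[CDS]) -> Optional[CDS]:
    """Return the first CDS that contains the 1-based position, or None."""
    for feature in features:
        if feature.start <= pos <= feature.end:
            return feature
    return None


def cds_offset(pos: int, feature: CDS) -> int:
    """0-based offset of pos from the first coding base of the feature."""
    if feature.strand == "+":
        return pos - feature.start
    return feature.end - pos
```

The offset modulo 3 gives the position within the codon. A linear scan is enough for a bacterial genome with a few thousand genes; a sorted list with bisection would be an option for larger annotations.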


          • swbarnes2
            Senior Member
            • May 2008
            • 910

            #6
            What I've done is use the coordinates from the VCF to get the sequence around and including the SNP. Then I blastx those sequences against a database of the proteins from that bacterium, and parse the blastx output to find out which changes cause amino acid differences.

            But yes, ANNOVAR is easier, if you can get an annotation file for it to compare against.


            • fanx
              Member
              • Sep 2012
              • 22

              #7
              bioinfun, I have a similar problem. Are there any solutions now? Thanks.


              • JackieBadger
                Senior Member
                • Mar 2009
                • 385

                #8
                Look at this pub. "De novo Transcriptome Assembly and SNP Discovery in the Wing Polymorphic Salt Marsh Beetle Pogonus chalceus (Coleoptera, Carabidae)"

                Here is a quote from the primary author; please cite their paper if you use the script.

                "The script for finding amino acid changes uses several data files.

                - I searched the ORFs in the unigenes with this program: http://proteomics.ysu.edu/tools/OrfPredictor.html

                -> Output: a CDS file (DNA sequences of the ORFs) and a PEP file (AA sequences of the ORFs, which also contains the START, STOP and READINGFRAME of the ORFs)

                - SNP calling with SAMtools

                -> Output: VCF file (SNPs and their positions)

                - Perl script (SNP_in_ORF_nonsyn.pl) infers whether SNPs are located within an ORF and whether the SNP results in an amino acid change. The script gets the SNP position from the VCF file, mutates the position in the original sequence in the unigene fasta file, then translates that sequence according to its ORF (from the PEP file) and checks whether the original translation differs from the mutated one. The script uses BioPerl.

                -> Output: each line in the VCF file that contains a nonsynonymous SNP. At the end, the numbers of synonymous and nonsynonymous SNPs are also reported.

                I made the script and data available here: http://users.ugent.be/~slvbelle/NGS/

                (I added an example PEP and VCF file which should work)



                The script should be used as follows:

                ./SNP_in_ORF_nonsyn.pl Trinity_GC018ALL_unique.fasta PEP.fasta SNP.vcf > output"
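The mutate-translate-compare logic described in the quote can be illustrated with a short Python sketch. This is not the author's Perl script. It assumes a single forward-strand ORF given by 1-based inclusive start and end coordinates, and every function name here is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def parse_vcf_snps(lines):
    """Yield (chrom, pos, ref, alt) for single-base substitutions (pos is 1-based)."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            yield chrom, int(pos), ref.upper(), alt.upper()


def count_changes(seq, orf_start, orf_end, vcf_lines):
    """Return (synonymous, nonsynonymous) counts for SNPs inside the ORF."""
    syn = nonsyn = 0  # both counters start at zero
    ref_protein = translate(seq[orf_start - 1:orf_end])
    for _chrom, pos, ref, alt in parse_vcf_snps(vcf_lines):
        if not orf_start <= pos <= orf_end:
            continue
        if seq[pos - 1].upper() != ref:
            raise ValueError(f"REF {ref} does not match sequence at position {pos}")
        mutated = seq[:pos - 1] + alt + seq[pos:]
        if translate(mutated[orf_start - 1:orf_end]) == ref_protein:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

Multi-allelic records and indels are skipped here for simplicity. Each SNP is evaluated on its own, so two SNPs in the same codon are not combined.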


                • fanx
                  Member
                  • Sep 2012
                  • 22

                  #9
                  I haven’t tried it, but I think the script at the end of your post should work (definitely cite that PLoS paper). I also wonder whether there are alternative ways, because my case is much simpler. I sequenced a long, heterogeneous viral ORF on a HiSeq 2000, so ORF prediction is unnecessary. My goal is to calculate dS and dN over the viral ORF using a sliding window. The only tool I am aware of is CLC’s SNP analysis tool, from a publication (http://www.ncbi.nlm.nih.gov/pubmed/22278255). There may be other tools that can do this job too. Thanks in advance.
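As a rough illustration of the sliding-window idea, the sketch below tallies already-classified SNPs per window along a single ORF. It only produces raw counts: a proper dN/dS estimate would also normalise by the number of synonymous and non-synonymous sites in each window (for example with the Nei-Gojobori method), which is not shown. The function name, the `(offset, kind)` input format, and the default window and step sizes are all assumptions made for this example.

```python
def sliding_window_counts(snps, orf_length, window=300, step=150):
    """Count synonymous and non-synonymous SNPs per window.

    snps       -- iterable of (offset, kind); offset is 0-based within the ORF,
                  kind is "syn" or "nonsyn"
    orf_length -- ORF length in nucleotides
    Returns a list of (start, end, n_syn, n_nonsyn) with end exclusive.
    """
    snps = list(snps)
    results = []
    last_start = max(orf_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        in_window = [kind for offset, kind in snps if start <= offset < end]
        results.append((start, end,
                        in_window.count("syn"),
                        in_window.count("nonsyn")))
    return results
```

Choosing a window size that is a multiple of three keeps window boundaries aligned with codons.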


                  • fanx
                    Member
                    • Sep 2012
                    • 22

                    #10
                    JackieBadger, I tried the script. It produced these warnings:

                    Use of uninitialized value $countSyn in concatenation (.) or string at SNP_in_ORF_nonsyn.pl line 101, <GEN0> line 39393.
                    Use of uninitialized value $countNonSyn in concatenation (.) or string at SNP_in_ORF_nonsyn.pl line 102, <GEN0> line 39393.

                    Any advice, please?


                    • d1antho
                      Member
                      • Mar 2012
                      • 15

                      #11
                      SNPdat

                      SNPdat can be used for this.

                      (there is also a short tutorial in the downloads section)

                      You only need a VCF file as input, plus an annotation file (GTF) and a reference sequence (FASTA file). The annotation and sequence information can come from your own assembly and don't require any preprocessing.


                      • Steven VB
                        Junior Member
                        • Jul 2013
                        • 2

                        #12
                        Originally posted by fanx
                        JackieBadger, I tried the script. It came with:

                        Use of uninitialized value $countSyn in concatenation (.) or string at SNP_in_ORF_nonsyn.pl line 101, <GEN0> line 39393.
                        Use of uninitialized value $countNonSyn in concatenation (.) or string at SNP_in_ORF_nonsyn.pl line 102, <GEN0> line 39393.

                        any advice? pls.
                        Please check again: http://users.ugent.be/~slvbelle/NGS/

                        I made some modifications; it should work now.


                        • maoshigua
                          Junior Member
                          • Aug 2013
                          • 3

                          #13
                          JackieBadger, I tried the script. It failed with:
                          Error, Reference nucleotide does not equal the one in the original sequence at ./SNP_in_ORF_nonsyn_multiSNP.pl line 85, <GEN0> line 6.

                          Any suggestions, please?

                          Maoshigua
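An error like this usually means the REF base in a VCF record does not match the base at that position in the FASTA sequence. Possible causes include a VCF called against a different reference than the FASTA supplied, or a 0-based versus 1-based coordinate mix-up; which one applies here is not known from the thread. A small independent check, sketched in Python with the hypothetical helper `find_ref_mismatches`, can list the offending records before running the script:

```python
def find_ref_mismatches(sequences, vcf_lines):
    """List VCF records whose REF allele disagrees with the sequence.

    sequences -- dict mapping sequence name to sequence string
    vcf_lines -- iterable of VCF text lines (POS is 1-based per the VCF format)
    Returns a list of (chrom, pos, ref, found); found is None when the
    sequence name is missing from the FASTA.
    """
    mismatches = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        pos = int(pos)
        seq = sequences.get(chrom)
        if seq is None:
            mismatches.append((chrom, pos, ref, None))
            continue
        found = seq[pos - 1:pos - 1 + len(ref)].upper()
        if found != ref.upper():
            mismatches.append((chrom, pos, ref, found))
    return mismatches
```

If every record is off by exactly one position, a coordinate convention problem is the likely culprit; scattered mismatches or missing sequence names point more towards a different reference file.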


                          • Steven VB
                            Junior Member
                            • Jul 2013
                            • 2

                            #14
                            Hi maoshigua,

                            Can you send me ([email protected]) a sample of your data? I will try to fix it.

                            Cheers,
                            Steven


                            • maoshigua
                              Junior Member
                              • Aug 2013
                              • 3

                              #15
                              Hi Steven,
                              I have sent you the three input files. Thanks a lot.

                              Maoshigua
