  • Require dbSNP file in VCF format for Mycobacterium tuberculosis

    I need to run GATK's Base Quality Score Recalibration on Mycobacterium tuberculosis data. One of the required inputs is a dbSNP file for this bacterium, but I cannot obtain it in the VCF format that GATK's recalibration program requires. Is there a way to convert the Mycobacterium tuberculosis SNP data from http://www.ncbi.nlm.nih.gov/snp into VCF format?

    Appreciate the help.

  • #2
    I've written a Perl script for converting from dbSNP format to VCF format. It's not a perfect solution and comes with two pretty large caveats: firstly, it converts only SNVs, not indels; secondly, it has only been tested on the mm9 dbSNP128 file (the database I needed to convert).

    The file can be downloaded from https://sites.google.com/site/peterhickey/home/software and is further explained here http://statsandgenomes.wordpress.com...bsnp-vcf-file/

    Please feel free to modify the script as you desire.

    Cheers,
    Pete
    Last edited by PeteH; 01-24-2012, 03:33 PM.

    Comment


    • #3
      I encountered the same problem when trying to put microbial data through GATK, but I don't think that converting some file you downloaded off the internet is going to help.

      What do you think the software is going to do with that VCF, and is that operation going to help you answer the question your data is supposed to help you answer? It seems to me that including a VCF in that command is supposed to help you filter out SNPs already known in the population, so that you can concentrate on SNPs novel to your sample. But is that necessarily what you want from your experiment? If my sample has a KatG or GyrA mutation, I don't want those variants ignored because they are already described in the literature; I need to know they are there.

      Comment


      • #4
        Thanks a tonne, Pete. One more question: what kind of input did you give this script? That is, in what format did you download the SNPs from NCBI? It offers a variety of choices (text file, FASTA file, etc.). I am a beginner with Perl, so I'm still catching up with it.

        Comment


          • #6
            Originally posted by swbarnes2 View Post
            I encountered the same problem when trying to put microbial data through GATK, but I don't think that converting some file you downloaded off the internet is going to help.

            What do you think the software is going to do with that VCF, and is that operation going to help you answer the question your data is supposed to help you answer? It seems to me that including a VCF in that command is supposed to help you filter out SNPs already known in the population, so that you can concentrate on SNPs novel to your sample. But is that necessarily what you want from your experiment? If my sample has a KatG or GyrA mutation, I don't want those variants ignored because they are already described in the literature; I need to know they are there.
            ***********************************

            swbarnes2 - I think you have misunderstood how this works. Known SNPs need to be filtered out because the program, CountCovariates, works by counting how often bases in the reads from my organism mismatch the reference bases, and it treats those mismatches as sequencing errors. A real SNP will obviously mismatch, since by definition it is a base that differs between my organism and the reference at that position, so it is better to ignore those sites.
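
To make the idea concrete, here is a toy Python sketch of the counting step. It is a simplified illustration, not GATK's actual implementation: the function name, the input tuples, and the +1/+2 smoothing are assumptions made for the example. It bins bases by reported quality only, skips known variant sites, and turns the remaining mismatch rate into an empirical Phred score.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites):
    """Estimate an empirical Phred quality per reported quality score.

    observations: iterable of (chrom, pos, reported_qual, read_base, ref_base)
    known_sites:  set of (chrom, pos) tuples to exclude from the counts
    """
    # reported_qual -> [bases seen, mismatches seen]
    counts = defaultdict(lambda: [0, 0])
    for chrom, pos, qual, read_base, ref_base in observations:
        if (chrom, pos) in known_sites:
            # A mismatch here is probably real variation, not a sequencing error.
            continue
        counts[qual][0] += 1
        if read_base != ref_base:
            counts[qual][1] += 1

    result = {}
    for qual, (n_bases, n_mismatches) in counts.items():
        # Assumed +1/+2 smoothing so that zero mismatches does not give log10(0).
        error_rate = (n_mismatches + 1) / (n_bases + 2)
        result[qual] = -10 * math.log10(error_rate)
    return result
```

If a true SNP is left unmasked, its mismatches are counted as errors and the empirical quality for that bin comes out lower than it should.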

            Ashu

            Comment


            • #7
              Originally posted by ashuchawla View Post
              Thanks a tonne, Pete. One more question: what kind of input did you give this script? That is, in what format did you download the SNPs from NCBI? It offers a variety of choices (text file, FASTA file, etc.). I am a beginner with Perl, so I'm still catching up with it.
              Unfortunately I'm not familiar with bacterial genomics, and it doesn't appear that the SNPs for Mycobacterium tuberculosis are available from the site I downloaded the mouse data from (http://hgdownload.cse.ucsc.edu/downloads.html, or more specifically http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz).

              My script will only be useful if you can find your SNPs in a format similar to that of http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz.

              Comment


              • #8
                Originally posted by PeteH View Post
                Unfortunately I'm not familiar with bacterial genomics, and it doesn't appear that the SNPs for Mycobacterium tuberculosis are available from the site I downloaded the mouse data from (http://hgdownload.cse.ucsc.edu/downloads.html, or more specifically http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz).

                My script will only be useful if you can find your SNPs in a format similar to that of http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz.
                ************************************
                Yes, but they are available at http://www.ncbi.nlm.nih.gov/snp. If you put mycobacterium tuberculosis in the box right next to "for" and click "Go", you will see a long list of SNPs. You can download them by clicking "Send To".

                Comment


                • #9
                  Originally posted by ashuchawla View Post
                  ************************************
                  Yes, but they are available at http://www.ncbi.nlm.nih.gov/snp. If you put mycobacterium tuberculosis in the box right next to "for" and click "Go", you will see a long list of SNPs. You can download them by clicking "Send To".
                  I realise this, but the format is different to that supported by my script. Having looked at your file format I think it should be fairly simple to convert this to VCF with a bit of scripting.

                  Comment


                  • #10
                    Originally posted by PeteH View Post
                    I realise this, but the format is different to that supported by my script. Having looked at your file format I think it should be fairly simple to convert this to VCF with a bit of scripting.
                    Oh great, thanks for saying that. I am a newbie in this field and this is my first project, so I hope it works out.
                    So do you think I should modify your code or write a new script from scratch? What exactly do you mean by scripting here?

                    Also, I am not able to open the links that you posted. I don't know why; I will keep trying.

                    Comment


                    • #11
                      Originally posted by ashuchawla View Post
                      Oh great, thanks for saying that. I am a newbie in this field and this is my first project, so I hope it works out.
                      So do you think I should modify your code or write a new script from scratch? What exactly do you mean by scripting here?

                      Also, I am not able to open the links that you posted. I don't know why; I will keep trying.
                      Sorry, I'm not sure why my links aren't working for you.

                      By scripting I mean writing a program in, for example, the Perl or Python programming language. Your program should read in each line of your SNP file one-by-one, convert each line to VCF and write each converted-line to an output file.

                      Do you have any experience programming in a particular language? This sort of problem is a great way to learn basic text-parsing and text-manipulation. I'd start from scratch if I were you since you'll learn a lot more by doing it this way and also because my code is not going to be a lot of help.

                      Be sure that you have a good understanding of the subtleties of the dbSNP format and of VCF (VCF is described in detail at http://www.1000genomes.org/wiki/Anal...mat-version-41). For instance, VCF is a "1-based" format because the first position on a chromosome is called position 1; this is in contrast to "0-based" formats, where the first position on a chromosome is called position 0.
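
As a rough illustration of the line-by-line approach, here is a minimal Python sketch. It assumes a tab-delimited input laid out like the UCSC snp128 table (bin, chrom, chromStart, chromEnd, name, score, strand, refNCBI, refUCSC, observed, molType, class, ...) with 0-based start coordinates; that column order and the function names are assumptions for the example, and the NCBI download for Mycobacterium tuberculosis would need its own parsing. It handles SNVs only.

```python
import io

BASES = {"A", "C", "G", "T"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VCF_HEADER = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def snp_line_to_vcf(line):
    """Convert one UCSC-style SNP row to a VCF data line, or return None to skip it."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        return None
    chrom, chrom_start, name, strand = fields[1], int(fields[2]), fields[4], fields[6]
    ref, observed, snp_class = fields[8].upper(), fields[9].upper(), fields[11]
    if snp_class != "single" or ref not in BASES:
        return None  # SNVs only; indels and other classes are skipped
    alleles = observed.split("/")
    if strand == "-":
        # Observed alleles are reported on the SNP's strand; VCF uses the forward strand.
        alleles = [a.translate(COMPLEMENT) for a in alleles]
    alts = [a for a in alleles if a in BASES and a != ref]
    if not alts:
        return None
    pos = chrom_start + 1  # 0-based start -> 1-based VCF position
    return "\t".join([chrom, str(pos), name, ref, ",".join(alts), ".", ".", "."])


def convert(src, dst):
    """Read SNP rows from src and write a minimal VCF to dst (both file-like)."""
    dst.write(VCF_HEADER)
    for line in src:
        record = snp_line_to_vcf(line)
        if record is not None:
            dst.write(record + "\n")


# Example with an in-memory row; a real run would pass open(...) file handles.
demo = io.StringIO("585\tchr1\t3000018\t3000019\trs1\t0\t+\tC\tC\tC/T\tgenomic\tsingle\n")
out = io.StringIO()
convert(demo, out)
print(out.getvalue(), end="")
```

The `+ 1` on the start coordinate is the 0-based to 1-based shift; getting it wrong puts every record one base off.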

                      Good luck!

                      Comment


                      • #12
                        Originally posted by PeteH View Post
                        Sorry, I'm not sure why my links aren't working for you.

                        By scripting I mean writing a program in, for example, the Perl or Python programming language. Your program should read in each line of your SNP file one-by-one, convert each line to VCF and write each converted-line to an output file.

                        Do you have any experience programming in a particular language? This sort of problem is a great way to learn basic text-parsing and text-manipulation. I'd start from scratch if I were you since you'll learn a lot more by doing it this way and also because my code is not going to be a lot of help.

                        Be sure that you have a good understanding of the subtleties of the dbSNP format and of VCF (VCF is described in detail at http://www.1000genomes.org/wiki/Anal...mat-version-41). For instance, VCF is a "1-based" format because the first position on a chromosome is called position 1; this is in contrast to "0-based" formats, where the first position on a chromosome is called position 0.

                        Good luck!
                        Thank you so much, Pete. I appreciate your help. I have some experience in PL/SQL and Java. I should start on my code then. Thanks again.

                        Comment


                        • #13
                          Originally posted by ashuchawla View Post
                          Thank you so much, Pete. I appreciate your help. I have some experience in PL/SQL and Java. I should start on my code then. Thanks again.
                          Pete, one more question: did you have the positions of the SNPs in your dbSNP file? I cannot check this because I am unable to access your files. The POS field in the VCF file has to be populated from the corresponding position in the dbSNP file,
                          but I do not have those position numbers in the NCBI SNP data.

                          I just wanted to check how you managed this problem, or whether your SNP file already had the position numbers. Is there a way of mapping around 30k SNPs to a reference genome and getting their respective positions?

                          Comment


                          • #14
                            Apologies for the delay in my response. My data did have positions in the dbSNP file. I'm not sure how to deal with this issue in the Mycobacterium tuberculosis data. Sorry I couldn't be more help.

                            Comment
