  • How to find all gi's for an organism in the nucleotide database?

    I am trying to run a command that will give me a list of all gi's for a specific taxid. The idea is to eventually use the output.txt as an exclusion list for a blastn procedure.

    I am basing this off of the documentation cookbook found here.

    I reproduce the interesting cookbook command here below:
    Code:
    blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \
    tee exclude_me
    Let us suppose I want to find all the gi's associated with a given organism from the nt database. I have something like the following:

    Code:
    blastdbcmd -db nr -entry all -outfmt "%g %T" | awk '{ if ($2 == 7227) {print $1} }'
    I began the job on my local cluster, but I cannot help thinking that I might be making a mistake. My hope is that the above command will give me all the gi's associated with Drosophila (taxid = 7227). Does this look right to you all?
    Last edited by hlyates; 04-13-2015, 08:52 AM.
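
A minimal sketch of the filtering step in the command above, with the taxid passed as a parameter instead of being hard-coded. The helper name `gis_for_taxid` is hypothetical (it is not part of BLAST+), and the sketch assumes every input line has the form `gi taxid`, which is what `-outfmt "%g %T"` is expected to print.

```shell
# gis_for_taxid TAXID
# Reads "gi taxid" pairs on stdin and prints the gi of every line whose
# second column equals TAXID. Hypothetical helper, not a BLAST+ command.
gis_for_taxid() {
    awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Intended use (assumes blastdbcmd is on PATH and the database is local):
#   blastdbcmd -db nr -entry all -outfmt "%g %T" | gis_for_taxid 7227 > exclude_me
```

Keeping the filter separate from the `blastdbcmd` call makes it possible to test the awk logic on a few hand-written lines before starting a long job on the cluster.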

  • #2
    Did the above work? That is probably the way to exclude GI's that are in the nr db.

    One other way to get the list of gi's would be to do a taxonomy browser based search (e.g. http://www.ncbi.nlm.nih.gov/nuccore/?term=txid7227[Organism:noexp], replace the taxid with one for organism of interest, example is for fly). Click on display settings and then choose "gi list" in format column. Send the result to a file.

    • #3
      Assuming the BLAST nr database contains all the GI entries, your approach looks viable.

      You could also try using the NCBI Entrez interface, although that has complications too, e.g. http://blastedbio.blogspot.co.uk/201...-chimeras.html

      • #4
        Originally posted by GenoMax View Post
        Did the above work? That is probably the way to exclude GI's that are in the nr db.

        One other way to get the list of gi's would be to do a taxonomy browser based search (e.g. http://www.ncbi.nlm.nih.gov/nuccore/?term=txid7227[Organism:noexp], replace the taxid with one for organism of interest, example is for fly). Click on display settings and then choose "gi list" in format column. Send the result to a file.
        I need to do this on organism 6942, but I am getting no results on http://www.ncbi.nlm.nih.gov/nuccore/?term=txid6942. Okay, this is where it gets weird, I can exclude it from a blastn search online. See attachment.

        So why does 6942 not show up in my browser search when it does show up in the blastn exclusion? I really need that list of gi's for 6942, and I am very confused about why I'm getting this behavior.

        Can anyone help me figure out how to find the gi's for 6942? I know the script command I wrote works, but I am completely blindsided by the above behavior.

        • #5
          You need to search in the organism field, not the default search. Try:

          Code:
          http://www.ncbi.nlm.nih.gov/nuccore/?term=txid6942[orgn]
          (the square bracketed orgn, short for organism, should be part of the URL or search text)
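
As a small illustration of the point above, the search URL can be assembled from a taxid so that the `[orgn]` field tag is never left off. The helper name `nuccore_orgn_url` is hypothetical; the URL pattern is simply the one quoted in this post.

```shell
# nuccore_orgn_url TAXID
# Prints the nuccore search URL restricted to the organism field.
# Hypothetical helper; it only formats a string and makes no request.
nuccore_orgn_url() {
    case "$1" in
        ''|*[!0-9]*)
            echo "taxid must be a non-empty string of digits" >&2
            return 1
            ;;
    esac
    printf 'http://www.ncbi.nlm.nih.gov/nuccore/?term=txid%s[orgn]\n' "$1"
}

# Example: nuccore_orgn_url 6942
```

When pasting the result into a shell command, quote it, since unquoted square brackets are glob characters.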

          • #6
            Originally posted by maubp View Post
            You need to search in the organism field, not the default search. Try:

            Code:
            http://www.ncbi.nlm.nih.gov/nuccore/?term=txid6942[orgn]
            (the square bracketed orgn, short for organism, should be part of the URL or search text)
            Are you aware of an organism option for blastdbcmd, so that my script command does not return blank answers as well, sir? I thought -entry all would take care of this problem, but I am getting back blank results with my command-line approach. Why am I getting different results in the browser versus on the command line? I know my syntax is correct, but something is still missing for finding taxid=6942.
            Last edited by hlyates; 04-13-2015, 08:52 AM.

            • #7
              I know you are doing things this way because you have been specifically asked to do them this way. You could save a whole lot of time and effort, which you could put towards something more useful, by using BBSplit with a file of the tick sequences that you want to exclude. Could this be used to argue a case?

              • #8
                This should get you all entries that have "Amblyomma" in name from nr. You should be able to get the gi's you need from the headers.

                Code:
                $ blastdbcmd -db nr -entry all | grep "Amblyomma" > filename
                Last edited by GenoMax; 04-13-2015, 10:01 AM.

                • #9
                  Originally posted by GenoMax View Post
                  This should get you all entries that have "Amblyomma" in name from nr. You should be able to get the gi's you need from the headers.

                  Code:
                  $ blastdbcmd -db nr -entry all | grep "Amblyomma" > filename
                  The docs say this should be possible

                  I just don't understand why this is having so much trouble grabbing the gi and taxonomy info only and then letting me choose the taxid like they state in the docs I just provided. According to it, "%g %t" should get me going. I wonder, are these options case sensitive? I originally had it as "%g %T" which is what was on another ncbi official doc. This could be going down a rabbit hole.

                  Let's come full circle. It is a mystery why I have to specify organism and taxid on the browser search. I was wondering if there was a similar technique for the commandline. I guess not? I'll go with the grep approach if I have to, but that means I will have to write a python script to throw away everything except the gi.
                  Last edited by hlyates; 04-13-2015, 01:24 PM.
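
The cleanup step mentioned above (throwing away everything except the gi) does not necessarily need a separate Python script. Below is a minimal sketch in shell. The helper name `gis_from_headers` is hypothetical, and the sketch assumes the classic defline layout `>gi|NUMBER|db|accession| description`, with the deflines of merged entries joined by `>` on one header line; if your database prints headers differently, the patterns will need adjusting.

```shell
# gis_from_headers NAME
# Reads FASTA on stdin, splits each header line into its individual
# deflines, keeps those containing NAME, and prints only the gi number.
# Hypothetical helper; assumes ">gi|NUMBER|..." style deflines.
gis_from_headers() {
    grep '^>' \
        | tr '>' '\n' \
        | grep -F "$1" \
        | sed -n 's/^gi|\([0-9][0-9]*\)|.*/\1/p'
}

# Intended use (assumes blastdbcmd is on PATH and the database is local):
#   blastdbcmd -db nr -entry all | gis_from_headers "Amblyomma" > amblyomma.gi
```

Splitting the header before matching means that, for a merged record, only the gi's whose own defline mentions the organism are kept, not every gi on the line.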

                  • #10
                    I am not sure if the nr database is built to include txid information. So even though the command you have is right (may need single quotes '%g %T') it is not producing any output.

                    To get an authoritative answer email blast tech support @NCBI with this question: [email protected]

                    • #11
                      There is a file, gi_taxid_nucl.dmp.gz, which lists all gi numbers and related taxids, in 2-column format (column 1 is gi number, column 2 is taxid). It's quite useful, but very big.

                      You can get it here:
                      ftp://ftp.ncbi.nih.gov/pub/taxonomy/

                      Although, maybe what you're doing with blast is already equivalent; I'm not really sure.
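
A minimal sketch of pulling one organism's gi's out of that dump, assuming the two-column layout described above (gi, then taxid, separated by whitespace) and a locally downloaded copy of the file. The helper name `gis_from_dump` and the output file name are hypothetical.

```shell
# gis_from_dump TAXID DUMP_FILE
# Streams the gzipped gi-to-taxid dump and prints every gi whose taxid
# (column 2) equals TAXID. Hypothetical helper, not an NCBI tool.
gis_from_dump() {
    gzip -dc < "$2" | awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Intended use (assumes the dump has already been downloaded):
#   gis_from_dump 6942 gi_taxid_nucl.dmp.gz > exclude_6942.gi
# The resulting list could then be handed to blastn, for example through
# its -negative_gilist option.
```

Because the file is streamed rather than loaded, memory use stays small even though the dump itself is very large.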

                      • #12
                        Originally posted by GenoMax View Post
                        I am not sure if the nr database is built to include txid information. So even though the command you have is right (may need single quotes '%g %T') it is not producing any output.

                        To get an authoritative answer email blast tech support @NCBI with this question: [email protected]
                        Thanks. I am going to email them. I'll share what I learn. I know scripts can be picky and so will run it with the single quotes. Thank you kind sir for your assistance.

                        • #13
                          Originally posted by hlyates View Post
                          Are you aware of an organism option for blastdbcmd, so that my script command does not return blank answers as well, sir? I thought -entry all would take care of this problem, but I am getting back blank results with my command-line approach. Why am I getting different results in the browser versus on the command line? I know my syntax is correct, but something is still missing for finding taxid=6942.
                          The NR database contains merged records where the same protein sequence was found in multiple organisms - therefore it will have a primary identifier and secondary identifiers.

                          For these cases you should double check what happens with the taxonomy information - since you'd want to check all the taxonomy ids of the merged record. It might be that blastdbcmd -outfmt %T only gives the first taxonomy id, and so fails to find all your matches?
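
One way to run the double check suggested above is to compare the gi list produced by the `%T` filter against a reference list for the same database obtained some other way, and see which entries the filter missed. This is only a sketch: the helper name `missing_gis` and both file names are hypothetical, and each file is assumed to hold one gi per line.

```shell
# missing_gis FOUND_LIST REFERENCE_LIST
# Prints the gi's that appear in REFERENCE_LIST but not in FOUND_LIST.
# Hypothetical helper; both inputs are plain files with one gi per line.
missing_gis() {
    found_sorted=$(mktemp)
    ref_sorted=$(mktemp)
    sort -u "$1" > "$found_sorted"
    sort -u "$2" > "$ref_sorted"
    # comm -13 hides lines unique to the first file and lines common to
    # both, leaving only the lines unique to the reference list.
    comm -13 "$found_sorted" "$ref_sorted"
    rm -f "$found_sorted" "$ref_sorted"
}

# Intended use:
#   missing_gis from_blastdbcmd.gi reference.gi
```

If the output is non-empty and the missing gi's belong to merged records, that would be consistent with only the first taxonomy id being reported for each record.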
