  • blast+ and pre-formatted databases

    Hi,

    I have been using blast+ for a little while now to make custom local databases from fasta files, and I'm thinking about downloading and using the GenBank pre-formatted nr database. Before committing hard drive space to the whole thing, I downloaded and unzipped the first directories (nr.00 - nr.02) to give it a try, but I'm having a hard time figuring out what exactly to do with these.

    I looked in the BLAST+ manual and the only pertinent section I could find just says this:

    "The NCBI makes databases that are searchable on the NCBI web site (such as nr, refseq_rna, and swissprot) available on its FTP site. It is better to download the preformatted databases rather than starting with FASTA. The databases on the FTP site contain taxonomic information for each sequence, include the identifier indices for lookups, and can be up to four times smaller than the FASTA. The original FASTA can be generated from the BLAST database using blastdbcmd."


    Thinking that each directory already contained a blast database, I tried the command:

    blastn -db nr.00 -query query.fa -out Results.out

    BLAST Database error: No alias or index file found for nucleotide database [nr.00] in search path [/Users/Username/Desktop/Example/NonRedundant_BLAST/nr.00::/usr/bin/ncbi-blast-2.2.28+/db:]


    After this I tried makeblastdb to build the database, using various input files:

    makeblastdb -in nr -input_type blastdb -dbtype nucl -parse_seqids -out NonRedundant -title "GenBank Non-redundant"

    makeblastdb -in nr.00 -input_type blastdb -dbtype nucl -parse_seqids -out NonRedundant -title "GenBank Non-redundant"

    makeblastdb -in nr.00.phd -input_type blastdb -dbtype nucl -parse_seqids -out NonRedundant -title "GenBank Non-redundant"

    and so on for each of the files in the directory, and each time I got the same error as above.


    I see that the nr.00 directory has a file called nr.pal that has these contents:

    #
    # Alias file created 12/08/2013 01:27:33
    #
    TITLE All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
    DBLIST nr.00 nr.01 nr.02 nr.03 nr.04 nr.05 nr.06 nr.07 nr.08 nr.09 nr.10 nr.11 nr.12 nr.13 nr.14
    NSEQ 34869290
    LENGTH 12261267790

    Being the optimist, I tried to modify this file to just list nr.00 - nr.02 and had no luck (I know the NSEQ and LENGTH would be wrong but figured it was worth a shot).

    So, would I have to download the whole nr database in order to try it? What I really want is just the sequences from one model organism, but I don't see a species-specific pre-formatted blast database for it. And if I download the whole thing, then what? Should I put all of the files from each separately downloaded nr directory into one directory and try to build a single database using the nr.pal file? I'm probably missing something super-obvious here, but I'm stuck.

    Thanks,
    Andreanna
    Last edited by andreanna05; 12-12-2013, 07:25 AM.

  • #2
    Have you read the online documentation about the pre-formatted
    blast databases:

    ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html

    The preformatted database files are already formatted, so you don't
    need to run makeblastdb.
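
A minimal sketch of searching a partial download without running makeblastdb, assuming all unpacked volume files have been moved into a single directory. The `./blastdb` path and the `nr_partial` alias name are made up for illustration. Because nr holds protein sequences (its alias file is `nr.pal`, and its title mentions CDS translations), the sketch uses `blastx`, which assumes `query.fa` contains nucleotide sequences; `blastp` would be the choice for protein queries. Whether a hand-written alias file with only TITLE and DBLIST lines is accepted may depend on the BLAST+ version, so treat this as something to try rather than a guaranteed recipe.

```shell
#!/bin/sh
# Assumption: nr.00.*, nr.01.* and nr.02.* have all been unpacked into this
# one directory (hypothetical path).
db_dir="./blastdb"
mkdir -p "$db_dir"

# Write a protein alias file (.pal) that lists only the volumes downloaded so far.
# NSEQ and LENGTH are left out here rather than copied with wrong values.
cat > "$db_dir/nr_partial.pal" <<'EOF'
TITLE Partial nr (volumes 00-02 only)
DBLIST nr.00 nr.01 nr.02
EOF

# Run the search only if BLAST+, the volume files, and the query are present.
if command -v blastx >/dev/null 2>&1 && [ -f "$db_dir/nr.00.psq" ] && [ -f query.fa ]; then
    # -db takes the alias name without the .pal extension.
    blastx -db "$db_dir/nr_partial" -query query.fa -out Results.out
else
    echo "blastx, the nr volume files, or query.fa not found; skipping the search" >&2
fi
```

Once every volume has been downloaded into the same directory, the `nr.pal` file shipped with the database should make `-db nr` work directly, with no custom alias file needed.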



    • #3
      Originally posted by andreanna05
      What I really want is just the sequences from one model organism, but I don't see a species-specific pre-formatted blast database for it.
      Your best option then is to download the sequences from that model organism, and use makeblastdb to construct a BLAST database from them.

      You can find the makeblastdb documentation here: http://nebc.nerc.ac.uk/bioinformatic...keblastdb.html
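
The approach in reply #3 can be sketched as follows. The FASTA file here is a two-sequence toy stand-in for whatever you download for your organism, and the file and database names (`toy_organism.fa`, `toy_organism_db`) are hypothetical. `-dbtype nucl` assumes nucleotide sequences; protein FASTA would need `-dbtype prot`.

```shell
#!/bin/sh
# Toy FASTA standing in for the organism-specific download (hypothetical data).
fasta="toy_organism.fa"
cat > "$fasta" <<'EOF'
>seq1 toy sequence one
ACGTACGTACGTACGTACGT
>seq2 toy sequence two
TTGACCGGTAACCGGTTAAC
EOF
echo "$(grep -c '^>' "$fasta") sequences written to $fasta"

# Build the database only if makeblastdb is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    # The input is FASTA (the default input type), not an existing BLAST database.
    makeblastdb -in "$fasta" -dbtype nucl -parse_seqids \
        -out toy_organism_db -title "Toy organism sequences"
else
    echo "makeblastdb not found on PATH; skipping database build" >&2
fi
```

A database built this way would then be searched with something like `blastn -db toy_organism_db -query query.fa -out Results.out`.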



      • #4
        Here are two options to consider if you are only interested in creating a db of sequences from a specific organism. In either case you can create your own blast db (makeblastdb) once you get the sequences together.

        1. If you are not averse to downloading the files (there are multiple) for the nr blast index, then you could use the blastdbcmd command to extract sequences specific to your organism. Look for the section on extracting sequences using blastdbcmd in this manual: http://www.ncbi.nlm.nih.gov/books/NBK1763/

        From NCBI:
        Extract all human sequences from the nr database

        Although one cannot select GIs by taxonomy from a database, a combination of unix command line tools will accomplish this:

        $ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
        awk ' { if ($2 == 9606) { print $1 } } ' | \
        blastdbcmd -db nr -entry_batch - -out human_sequences.txt

        2. You could also use NCBI eutils to run a query that retrieves the sequence data you need. The manual for that is here: http://www.ncbi.nlm.nih.gov/books/NBK1058/
        Application #3, on retrieving large datasets, may work.
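
The taxonomy filter in option 1 can be tried on its own before the full nr download is in place. The sketch below wraps the awk step from the NCBI example in a small helper, `filter_by_taxid` (a name made up here), and runs it on three invented "GI taxid" lines. The real pipeline only runs if `blastdbcmd` can open a database called `nr`.

```shell
#!/bin/sh
# Hypothetical helper: read "ID TAXID" lines on stdin and print the IDs whose
# taxid equals the first argument. This is the awk step from the NCBI example.
filter_by_taxid() {
    awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Dry run on invented data: two human (9606) entries and one mouse (10090) entry.
sample_ids=$(printf '111 9606\n222 10090\n333 9606\n' | filter_by_taxid 9606)
echo "$sample_ids"

# The real extraction, run only if blastdbcmd can open a database named nr.
if command -v blastdbcmd >/dev/null 2>&1 && blastdbcmd -db nr -info >/dev/null 2>&1; then
    blastdbcmd -db nr -entry all -outfmt "%g %T" |
        filter_by_taxid 9606 |
        blastdbcmd -db nr -entry_batch - -out human_sequences.txt
else
    echo "blastdbcmd or the nr database not available; skipping extraction" >&2
fi
```

Swapping 9606 for the taxid of your model organism gives the organism-specific sequences, which can then be passed to makeblastdb.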
