Hi,
I have been using BLAST+ for a little while now to make custom local databases from FASTA files, and I'm thinking about downloading and using the GenBank pre-formatted nr database. Before committing hard drive space to the whole thing, I downloaded and unzipped the first three directories (nr.00 - nr.02) to give it a try, but I'm having a hard time figuring out what exactly to do with these.
I looked in the BLAST+ manual and the only pertinent section I could find just says this:
"The NCBI makes databases that are searchable on the NCBI web site (such as nr, refseq_rna, and swissprot) available on its FTP site. It is better to download the preformatted databases rather than starting with FASTA. The databases on the FTP site contain taxonomic information for each sequence, include the identifier indices for lookups, and can be up to four times smaller than the FASTA. The original FASTA can be generated from the BLAST database using blastdbcmd."
Thinking that each directory already contained a BLAST database, I ran the command below and got the error that follows it:
blastn -db nr.00 -query query.fa -out Results.out
BLAST Database error: No alias or index file found for nucleotide database [nr.00] in search path [/Users/Username/Desktop/Example/NonRedundant_BLAST/nr.00::/usr/bin/ncbi-blast-2.2.28+/db:]
After this I tried the command to make the database, using various input files:
makeblastdb -in nr -input_type blastdb -dbtype nucl -parse_seqids -out NonRedundant -title "GenBank Non-redundant"
makeblastdb -in nr.00 -input_type blastdb -dbtype nucl -parse_seqids -out NonRedundant -title "GenBank Non-redundant"
makeblastdb -in nr.00.phd -input_type blastdb -dbtype nucl -parse_seqids -out NonRedundant -title "GenBank Non-redundant"
and so on for each of the files in the directory. Each time I get the same error as above.
I see that the nr.00 directory has a file called nr.pal that has these contents:
#
# Alias file created 12/08/2013 01:27:33
#
TITLE All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
DBLIST nr.00 nr.01 nr.02 nr.03 nr.04 nr.05 nr.06 nr.07 nr.08 nr.09 nr.10 nr.11 nr.12 nr.13 nr.14
NSEQ 34869290
LENGTH 12261267790
Being the optimist, I tried to modify this file to just list nr.00 - nr.02 and had no luck (I know the NSEQ and LENGTH would be wrong but figured it was worth a shot).
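For anyone who wants to reproduce the alias-file edit, here is a minimal Python sketch of it, not a confirmed fix. It rewrites the DBLIST line so it names only the volumes that are actually on disk, and drops the NSEQ and LENGTH lines since they would no longer match. The function names, the output name nr_subset.pal, and the rule "a volume is present if its .pin index file exists" are assumptions made for illustration. It also assumes all the volume files sit in the same directory as the alias file.

```python
from pathlib import Path


def trim_alias(alias_text: str, present: set[str]) -> str:
    """Return alias-file text whose DBLIST names only volumes in `present`.

    NSEQ and LENGTH lines are dropped because their totals would be wrong
    for a subset (assumption: BLAST tolerates their absence).
    """
    out = []
    for line in alias_text.splitlines():
        key = line.split(maxsplit=1)[0] if line.strip() else ""
        if key == "DBLIST":
            # Keep the original volume order, minus anything not downloaded.
            kept = [v for v in line.split()[1:] if v in present]
            out.append("DBLIST " + " ".join(kept))
        elif key in ("NSEQ", "LENGTH"):
            continue
        else:
            # Comment lines, TITLE, and blank lines pass through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


def volumes_on_disk(directory: Path, ext: str = ".pin") -> set[str]:
    """Volume names (e.g. 'nr.00') that have an index file in `directory`."""
    return {p.name[: -len(ext)] for p in directory.glob("*" + ext)}


if __name__ == "__main__":
    db_dir = Path(".")  # hypothetical: directory holding all unpacked volumes
    alias = db_dir / "nr.pal"
    if alias.exists():
        trimmed = trim_alias(alias.read_text(), volumes_on_disk(db_dir))
        # Written under a new (hypothetical) name so the original is untouched.
        (db_dir / "nr_subset.pal").write_text(trimmed)
```

The resulting alias would then be passed by its base name (here, -db nr_subset). The .pal and .phd extensions and the "CDS translations" title suggest nr is a protein database, so a protein-side program such as blastp is probably the one to try with it rather than blastn.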
So, would I have to download the whole nr database in order to try it? What I really want is just the sequences from one model organism, but I don't see a species-specific pre-formatted BLAST database for it. And if I download the whole thing, then what? Should I put all of the files from each separately downloaded nr directory into one directory and try to build a single database using the nr.pal file? I'm probably missing something super-obvious here, but I'm stuck.
Thanks,
Andreanna