Hi all,
I have been trying to find a way to extract all microbial (and eukaryotic) sequences from the NT database, but I keep running into problems.
I tried downloading the GI lists for all bacterial entries from the NCBI nucleotide database, but the downloads always time out before the file is complete. Then I thought maybe I could get the GIs using blastdbcmd, so I tried the following:
Code:
blastdbcmd -db nt -entry all -outfmt '%g %T' | awk '{ if ($2 == "2") print $1 }' > ../gi/bacteria.gi
But that also failed, since the individual entries have their species taxon ID in the %T field rather than the taxon ID of the domain (2 for Bacteria).
Then I thought maybe I could get a list of all taxon IDs under bacteria, eukaryota, etc., but that doesn't appear to exist either.
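The closest thing I can think of is building that list myself from the NCBI taxonomy dump. Below is a rough sketch of the idea, not something I have run against the full NT. It assumes a nodes.dmp file from the taxonomy dump (fields separated by "|", tax ID first, parent tax ID second) and the two-column "GI taxid" output of the blastdbcmd call above. The function names and the toy tree are my own inventions for illustration, and the toy lineage is simplified, not the real one.

```python
from collections import defaultdict


def load_children(nodes_lines):
    """Map each parent tax ID to its child tax IDs from nodes.dmp-style lines."""
    children = defaultdict(list)
    for line in nodes_lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        taxid, parent = fields[0], fields[1]
        if taxid != parent:  # the root node lists itself as its own parent
            children[parent].append(taxid)
    return children


def descendants(children, root):
    """Return root plus every tax ID below it (iterative walk, no recursion)."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def filter_ids(id_taxid_lines, wanted):
    """Yield the first column of 'id taxid' lines whose tax ID is in wanted."""
    for line in id_taxid_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1] in wanted:
            yield parts[0]


# Toy data only: a simplified, made-up tree in nodes.dmp layout.
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tsuperkingdom\t|",
    "1224\t|\t2\t|\tphylum\t|",
    "562\t|\t1224\t|\tspecies\t|",
    "2759\t|\t1\t|\tsuperkingdom\t|",
    "9606\t|\t2759\t|\tspecies\t|",
]
# Toy stand-in for the '%g %T' output (the GI values are invented).
toy_hits = ["111 562", "222 9606", "333 1224"]

bacterial_taxids = descendants(load_children(toy_nodes), "2")
bacterial_gis = list(filter_ids(toy_hits, bacterial_taxids))
print(bacterial_gis)
```

On real data the same functions should accept open file handles instead of the toy lists, e.g. the nodes.dmp file and a file holding the saved blastdbcmd output (both file names are assumptions on my part).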
So in short: does anybody have an idea how I can extract all microbial sequences from the NT database (to make a custom database)? Any method that works is fine.
Thanks guys!