  • kga1978
    Senior Member
    • Nov 2010
    • 100

    Extracting all microbial sequences from NT

    Hi all,

    I have been trying to find a way to extract all microbial (and eukaryotic) sequences in the NT database but I am running into a bunch of problems.

    I have tried to download the GI lists for all bacterial entries from the NCBI nucleotide database, but the downloads always time out before the file is complete. Then I thought maybe I could get the GIs using blastdbcmd, but that also fails. I tried the following:

    Code:
    blastdbcmd -db nt -entry all -outfmt '%g %T' | awk '{ if ($2 == "2") print $1 }' > ../gi/bacteria.gi
    But that also failed, since the %T field holds each entry's own (species-level) taxon ID rather than the ID of its domain, so comparing it against 2 (Bacteria) does not select the bacterial entries.

    Then I thought maybe I could get a list of all taxon IDs for bacteria, eukaryota, etc., but that also doesn't appear to exist.

    So in short - does anybody have an idea how I can extract all microbial sequences (to make a custom database) from the NT database? Whatever method works....

    Thanks guys!
  • nickloman
    Senior Member
    • Jul 2009
    • 355

    #2
    Hey

    Not a full solution, but MEGAN provides files which map GIs to taxon IDs for nt and nr via this link: http://ab.inf.uni-tuebingen.de/data/...d/welcome.html

    Hope that helps


    • Richard Finney
      Senior Member
      • Feb 2009
      • 701

      #3
      Easiest method to get taxonomy ids ...
      Just check out this directory: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
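For what it's worth, a rough sketch of how the taxonomy dump could be turned into "every taxon ID under Bacteria". It assumes you have unpacked the taxdump archive from that directory and have its nodes.dmp file, whose first two columns are tax_id and parent tax_id separated by tab-pipe-tab. The function name descendant_taxids is made up for illustration.

```shell
# Sketch: print every taxid at or below a given root taxid.
# Assumes an NCBI-style nodes.dmp: tax_id <tab>|<tab> parent_tax_id <tab>|<tab> ...
# descendant_taxids is a hypothetical helper name, not an NCBI tool.
descendant_taxids() {
    root="$1"     # e.g. 2 for Bacteria
    nodes="$2"    # path to nodes.dmp
    awk -F'\t[|]\t' -v root="$root" '
        { parent[$1] = $2 }                     # remember each node and its parent
        END {
            for (id in parent) {
                cur = id
                # Walk up the tree until we hit the requested root
                # or the top of the tree (a node that is its own parent).
                while (cur != root && (cur in parent) && parent[cur] != cur)
                    cur = parent[cur]
                if (cur == root) print id
            }
        }
    ' "$nodes" | sort -n
}

# Usage (paths are assumptions):
#   descendant_taxids 2 nodes.dmp > bacteria.taxids
```

This walks up from every node, which is simple rather than fast, but it only needs one pass over the file plus the in-memory parent table.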

      ________________
      If you want bacterial and viral genomes as FASTA files ...

      Check out the documentation here :

      for NCBI file name extensions.

      You can ftp download data from NCBI here :
      ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
      Look for the all* files. The ftp://ftp.ncbi.nlm.nih.gov/genomes/B...all.fna.tar.gz file should contain all bacterial genomes.

      Viruses here : ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/

      "WGS bacteria OLD" is thereabouts; just look around. Draft genomes are thereabouts, too.

      _____

      Alternative way to get taxon IDs, for example for bacteria ...

      You can get the "all rpt" file via wget :
      wget ftp://ftp.ncbi.nlm.nih.gov/genomes/B...all.rpt.tar.gz
      Unzip and untar.

      Run the command
      find . -name '*.rpt' -exec grep Taxid {} \; | sort | uniq
      There you go.
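Once a list of taxon IDs is in hand (one ID per line), the blastdbcmd call from the first post could probably be finished off along these lines. This is only a sketch: filter_by_taxids and the file names are made up for illustration, and it assumes the input is the two-column "GI taxid" output that `-outfmt '%g %T'` produced in the original attempt.

```shell
# Sketch: keep only the GIs whose taxid appears in a list file.
# Reads "GI taxid" pairs on stdin, prints the matching GIs.
# filter_by_taxids is a hypothetical helper name.
filter_by_taxids() {
    awk -v list="$1" '
        # Load the wanted taxids (one per line) into a lookup table.
        BEGIN { while ((getline line < list) > 0) keep[line] }
        # Print the GI (column 1) when the taxid (column 2) is wanted.
        ($2 in keep) { print $1 }
    '
}

# Usage (file names are assumptions):
#   blastdbcmd -db nt -entry all -outfmt '%g %T' | filter_by_taxids bacteria.taxids > bacteria.gi
```

The list file has to contain bare IDs only, so any "Taxid" label text left over from the .rpt grep would need to be stripped first.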


      • kga1978
        Senior Member
        • Nov 2010
        • 100

        #4
        Wow, thanks so much guys - this was incredibly helpful! I've got it all covered now
