  • afitz
    Junior Member
    • Nov 2013
    • 2

    Obtaining Random Sequences from Given Taxonomic Grouping

    Hello,

    I apologize if this question is too simple; I am new to bioinformatics and am trying to complete my first independent project. I am trying to retrieve DNA sequences from a set of random organisms within a given taxonomic group. For example, I want to be able to input "Mammalia" and retrieve a subset of, say, 5 mammalian genomes. I have been looking into the NCBI resources, including the taxdump files, the Taxonomy database, and RefSeq, but am struggling to put these resources together in order to traverse a taxonomy and retrieve random sequences from different taxonomic levels.

    Any hints on how/where to begin would be appreciated so much! Thank you!!
  • afitz
    Junior Member
    • Nov 2013
    • 2

    #2
    "So you want a program that will parse a database of curated reference genome sequences based on user input, then extract a subset of those genomes from a subset of those reference genomes?"

    I don't necessarily need to extract a subset of the genomes - I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database. Thanks for your question!

    • gringer
      David Eccles (gringer)
      • May 2011
      • 845

      #3
      I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database.
      This sounds like too specific a task for pre-existing code, but that doesn't mean someone else hasn't thought similarly in the past and made their own solution. Traversing the NCBI taxonomy is somewhat difficult, but doable. You'd probably be working off the taxonomy data from here:

      ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

      In particular, nodes.dmp and names.dmp inside taxdump to get/parse the tree, and gi_taxid_nucl.dmp to get GenBank sequence ID (GI number) to taxID mappings. I guess the trick will be to filter those IDs to only have full chromosomal (or contig) sequences, rather than subsets of sequence.
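
A minimal sketch of the tree step, assuming the taxdump layout in which fields are separated by `\t|\t` and every row ends with `\t|`. The function names (`split_dmp`, `parse_nodes`, `parse_names`, `descendants`) are made up for illustration, and the few in-memory rows stand in for the real nodes.dmp and names.dmp (the parent links in the sample are simplified).

```python
from collections import defaultdict

def split_dmp(line):
    """Split one taxdump row into stripped fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]

def parse_nodes(lines):
    """Return {parent_taxid: [child_taxid, ...]} from nodes.dmp rows."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        fields = split_dmp(line)
        taxid, parent = int(fields[0]), int(fields[1])
        if taxid != parent:  # the root row lists itself as its own parent
            children[parent].append(taxid)
    return children

def parse_names(lines):
    """Return {scientific name: taxid} from names.dmp rows."""
    name_to_taxid = {}
    for line in lines:
        if not line.strip():
            continue
        # columns: tax_id, name_txt, unique name, name class
        fields = split_dmp(line)
        if fields[3] == "scientific name":
            name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid

def descendants(children, root):
    """Collect root and every taxid below it (iterative depth-first walk)."""
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found

# Toy rows in taxdump layout; real use would pass open("nodes.dmp") etc.
sample_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tsuperkingdom\t|\n",
    "40674\t|\t1\t|\tclass\t|\n",
    "9606\t|\t40674\t|\tspecies\t|\n",
    "10090\t|\t40674\t|\tspecies\t|\n",
]
sample_names = [
    "40674\t|\tMammalia\t|\t\t|\tscientific name\t|\n",
    "40674\t|\tmammals\t|\t\t|\tgenbank common name\t|\n",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
]

tree = parse_nodes(sample_nodes)
names = parse_names(sample_names)
mammal_taxids = descendants(tree, names["Mammalia"])
```

The resulting set of taxIDs is what you would then match against the ID-to-taxID mapping file to find candidate sequences.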

      Once you have NCBI accession numbers, you can retrieve the IDs and sequences using eSearch and eFetch:

      This chapter presents several examples of how the E-utilities can be used to build useful applications. These examples use Perl to create the E-utility pipelines, and assume that the LWP::Simple module is installed. This module includes the get function that supports HTTP GET requests. One example (Application 4) uses an HTTP POST request, and requires the LWP::UserAgent module. In Perl, scalar variable names are preceded by a "$" symbol, and array names are preceded by a "@". In several instances, results will be stored in such variables for use in subsequent E-utility calls. The code examples here are working programs that can be copied to a text editor and executed directly. Equivalent HTTP requests can be constructed in many modern programming languages; all that is required is the ability to create and post an HTTP request.


      You do need to be a bit careful when extracting tons of sequence with eFetch, because it limits how many sequences it will return in one request (something like 10,000).
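
One way to stay under that limit is to split the ID list into batches and issue one eFetch request per batch. The sketch below only builds the request URLs; the helper names (`batches`, `efetch_urls`) and the batch size of 200 are assumptions chosen for illustration, not values taken from the E-utilities documentation.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def batches(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def efetch_urls(ids, batch_size=200, db="nuccore"):
    """Yield one FASTA-returning eFetch URL per batch of IDs."""
    for chunk in batches(ids, batch_size):
        query = urlencode({
            "db": db,
            "id": ",".join(str(i) for i in chunk),
            "rettype": "fasta",
            "retmode": "text",
        })
        yield f"{EFETCH}?{query}"

# 450 placeholder IDs become three requests (200 + 200 + 50).
urls = list(efetch_urls(range(1, 451)))
```

Each URL can then be fetched with `urllib.request.urlopen(url).read()`, ideally with a short pause between requests so as not to hammer the NCBI servers.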

      Another problem for you will be what you mean by "random". The NCBI taxa aren't very well structured, so you will be getting quite a biased sample (i.e. weighted heavily on the more researched organisms) by picking sequences using a uniform distribution.
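
One possible way to soften that bias (only one reading of "random", and not something NCBI provides) is to walk down the tree, choosing uniformly among child taxa at each level, instead of choosing uniformly among all sequences. A sparsely sampled clade then gets the same chance as its heavily sequenced sibling. The tree and the names `sample_leaf` and `sample_leaves` below are hypothetical.

```python
import random

def sample_leaf(children, root, rng=random):
    """Walk from root to a leaf, picking a child uniformly at each step."""
    node = root
    while children.get(node):
        node = rng.choice(children[node])
    return node

def sample_leaves(children, root, k, seed=None, max_tries=1000):
    """Draw up to k distinct leaves by repeated random walks."""
    rng = random.Random(seed)
    picked = set()
    for _ in range(max_tries):
        if len(picked) == k:
            break
        picked.add(sample_leaf(children, root, rng))
    return sorted(picked)

# Toy tree: clade "B" has 50 sequenced species, clade "C" has only one.
toy_tree = {
    "A": ["B", "C"],
    "B": [f"b{i}" for i in range(50)],
    "C": ["c1"],
}

# Uniform choice over the 51 leaves would return "c1" about 2% of the time;
# the walk returns it about half of the time.
rng = random.Random(0)
draws = [sample_leaf(toy_tree, "A", rng) for _ in range(2000)]
c1_fraction = draws.count("c1") / len(draws)
```

Whether equal weight per clade is the right notion of "random" depends on the question being asked, so treat this as a starting point.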
