  • R human genomic sequence acquisition?

    Hello all,

    I am currently building an R package to assist with some database-driven (both in-house DBs as well as external, like UCSC) human genome analysis.

    In this package, I am trying to write a function which calculates the GC content of a portion of the human genome (indicated by inputs of chromosome, first nucleotide position, and last nucleotide position).

    To do this, I was hoping to pull the queried sequences from somewhere (presumably UCSC) and run a seqinr function to calculate GC content. However, there are a few problems, namely:

    1) I cannot find the table or location in the UCSC MySQL database that provides access to any sequences; and

    2) Using a preexisting package such as BSgenome requires loading the entire human genome as a library (~850 MB local download per genome; since we use both the hg18 and hg19 builds, that is far too inefficient for a simple sequence lookup); and

    3) Other preexisting packages, like biomaRt, cannot retrieve a variable-length stretch of *genomic* sequence (biomaRt requires a sequence "type", which has many options, but none come close to what I require).


    ...so I've turned to seqanswers. Does anyone either know a workaround for one of these three problems (such as possibly sourcing the BSgenomes from our in-house server, rather than loading them both locally with the library() command?), or does anyone know of any method of getting a desired genomic sequence based on chr/start/stop inputs?

    Alternatively, if anyone just knows a quick way to calculate GC content of a genomic segment given positional input in R, please let me know!

    Thanks a lot. I really appreciate any suggestions that will help here!

    Best,
    Ryan
    Last edited by RyanLCollins; 09-11-2013, 06:19 AM. Reason: specificity -- HUMAN genome
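
For the last question above (GC content of a segment once its sequence is in hand), the calculation itself is small. The following is a minimal sketch in Python, not the poster's code; the function name `gc_content` is hypothetical, and the same logic ports directly to R. It assumes the input may contain lowercase (soft-masked) bases and `N` gap characters, as UCSC sequence does, and leaves ambiguous bases out of the denominator.

```python
def gc_content(sequence):
    """Return the fraction of G/C among unambiguous bases (A, C, G, T).

    Case is ignored, so soft-masked (lowercase) bases are counted.
    N and other ambiguity codes are excluded from the denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("no A/C/G/T bases in sequence")
    return gc / total


if __name__ == "__main__":
    print(gc_content("ACGTnnnn"))  # 0.5
```

Whether gaps should be excluded or counted as non-GC is a design choice; excluding them avoids deflating the value in regions with assembly gaps.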

  • #2
    Do you plan to have the hg18/hg19 genomes locally available where you're deploying the package? If so, you can use Rsamtools to load just the relevant region for processing.
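
To illustrate why loading "just the relevant region" is cheap, here is a rough sketch of the idea behind an indexed FASTA lookup, written in plain Python rather than with Rsamtools itself. The function names `build_fai` and `fetch_region` are hypothetical, and the index mimics the `.fai` layout (length, byte offset, bases per line, bytes per line). It assumes every line of a sequence except the last has the same length.

```python
import os
import tempfile


def build_fai(fasta_path):
    """Return {name: (length, offset, line_bases, line_width)} for a FASTA file."""
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The first base starts right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch_region(fasta_path, index, chrom, start, end):
    """Fetch 1-based, inclusive [start, end] by seeking into the file."""
    length, offset, line_bases, line_width = index[chrom]
    if not 1 <= start <= end <= length:
        raise ValueError("region outside sequence bounds")

    def byte_pos(pos0):
        # Whole lines before the base, plus the position within its line.
        return offset + (pos0 // line_bases) * line_width + pos0 % line_bases

    first = byte_pos(start - 1)
    last = byte_pos(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.fa")
        with open(path, "w", newline="\n") as out:
            out.write(">chr1\nACGTACGTAC\nGGGGCCCCAA\nTT\n")
        idx = build_fai(path)
        print(fetch_region(path, idx, "chr1", 9, 12))  # ACGG
```

Only the bytes covering the requested interval are read, so the cost depends on the region size rather than on the size of the genome file.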


    • #3
      Originally posted by dpryan
      Do you plan to have the hg18/hg19 genomes locally available where you're deploying the package? If so, you can use Rsamtools to load just the relevant region for processing.

      Hi dpryan, thanks for the prompt reply! At present, we are planning on using RMySQL to access the UCSC MySQL database (which has all tables associated with hg18/hg19), but we don't plan on having the entire genomes locally available.

      Thanks for the suggestion though, I'll look into it further!

      Having never used Rsamtools before, would it be possible to source hg18/hg19 if we were to place them on a secure server? Or do the genomes both have to be strictly local?

      For further info, we are planning on distributing this package amongst roughly one dozen bioinformaticians in our group, all of whom will have access to a central cluster, but who will all be working from different local machines.

      Thanks again!


      • #4
        UCSC limits programmatic access to its services (based on the number of access attempts from an IP block over time); see https://genome.ucsc.edu/goldenPath/help/mysql.html

        If several people are going to query the database it may be more useful to have the data locally. You can find the database dumps for hg19 here: ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ (look for others elsewhere on the same ftp server)


        • #5
          Well, I believe that it needs to be available from the local file system, though that doesn't preclude just mounting a remote drive (we have a group drive available via smb/cifs and nfs). If you're running this on a cluster, then copying the files to one of the mountpoints available to each node might prove easiest (I do this with genome indices for alignments, though each node also has access to a filesystem that's also mounted on my desktop).


          • #6
            Thank you both for the replies!

            @GenoMax: Thank you for the heads up! I was unaware of the access limits per IP block. I'll ask around our group to estimate our expected requirements and go from there.

            @dpryan: Hmm, OK, thank you for the suggestion. Ideally I would prefer to find a workaround that keeps this package running on our analysts' local machines, although we have the capability to run it on a cluster if necessary.


            • #7
              Hello all,

              I believe I have found the solution to my problem in the package "DASiR". It allows sequence retrieval from DAS servers (including UCSC, of course).

              If others are interested in tackling a similar problem with R, you can find the details regarding DASiR here:
              http://www.bioconductor.org/packages...tml/DASiR.html

              Thanks for the help,
              Ryan
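
For readers curious what a DAS sequence lookup involves underneath a client like DASiR, here is a hedged sketch in Python; it is not DASiR's API. It assumes the DAS 1 convention of a `dna` command with a `segment=chr:start,stop` query and a `DASDNA`/`SEQUENCE`/`DNA` XML response. The helper names are hypothetical, the UCSC base URL is given from memory and should be checked, and the sample response is a made-up illustration, not real data. No network request is made.

```python
import xml.etree.ElementTree as ET

# Assumed base URL for the UCSC DAS server; verify before relying on it.
UCSC_DAS_BASE = "http://genome.ucsc.edu/cgi-bin/das"


def das_dna_url(base, build, chrom, start, end):
    """Build a DAS 'dna' request URL for a 1-based, inclusive segment."""
    return f"{base}/{build}/dna?segment={chrom}:{start},{end}"


def parse_das_dna(xml_text):
    """Extract the sequence from a DASDNA response, dropping all whitespace."""
    root = ET.fromstring(xml_text)
    dna = root.find("SEQUENCE/DNA")
    if dna is None or dna.text is None:
        raise ValueError("no DNA element in DAS response")
    return "".join(dna.text.split())


if __name__ == "__main__":
    print(das_dna_url(UCSC_DAS_BASE, "hg19", "chr1", 100000, 100100))
    # Illustrative response body only (hypothetical content).
    sample = (
        '<DASDNA><SEQUENCE id="chr1" start="100" stop="107">'
        '<DNA length="8">\nacgt\nGGCC\n</DNA></SEQUENCE></DASDNA>'
    )
    print(parse_das_dna(sample))  # acgtGGCC
```

Given the access limits mentioned earlier in the thread, a group making many such requests would likely want to cache results or space out queries.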
