  • RyanLCollins
    Junior Member
    • Sep 2013
    • 5

    R human genomic sequence acquisition?

    Hello all,

    I am currently building an R package to assist with some database-driven (both in-house DBs as well as external, like UCSC) human genome analysis.

    In this package, I am trying to write a function which calculates the GC content of a portion of the human genome (indicated by inputs of chromosome, first nucleotide position, and last nucleotide position).

    To do this, I was hoping to pull the queried sequences from somewhere (presumably UCSC) and run a seqinr function to calculate GC content. However, there are a few problems, namely:

    1) I cannot find the table/location in the UCSC MySQL database that holds the sequences themselves;

    2) Using a preexisting package such as BSgenome requires loading the entire human genome as a library (~850 MB local download per genome... and since we use both the hg18 and hg19 builds, that is far too inefficient for a simple sequence lookup); and

    3) Other preexisting packages, like biomaRt, cannot acquire a variable length of *genomic* sequence (biomaRt requires a sequence "type", which has many options, but none are close to what I require).


    ...so I've turned to seqanswers. Does anyone either know a workaround for one of these three problems (such as possibly sourcing the BSgenomes from our in-house server, rather than loading them both locally with the library() command?), or does anyone know of any method of getting a desired genomic sequence based on chr/start/stop inputs?

    Alternatively, if anyone just knows a quick way to calculate GC content of a genomic segment given positional input in R, please let me know!

    Thanks a lot. I really appreciate any suggestions that will help here!

    Best,
    Ryan
    Last edited by RyanLCollins; 09-11-2013, 06:19 AM. Reason: specificity -- HUMAN genome
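For reference, the GC calculation itself is small once the sequence for the region is in hand, whatever its source. A minimal sketch follows, written in Python purely for illustration (the thread is about R, and the same logic ports directly). The function name `gc_content` is hypothetical, and the choice to leave `N` and other ambiguity codes out of the denominator is an assumption, not something the original post specifies.

```python
def gc_content(seq: str) -> float:
    """Return the GC fraction of seq, counting only A, C, G, and T.

    Assumption: N and other ambiguity codes are excluded from the
    denominator; soft-masked (lowercase) bases are counted normally.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total
```

If gaps should count against the region instead, dividing by `len(seq)` is the alternative; which convention is right depends on the analysis.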
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Do you plan to have the hg18/hg19 genomes locally available where you're deploying the package? If so, you can use Rsamtools to load just the relevant region for processing.
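Loading just the relevant region from a local genome file usually relies on a FASTA index, which lets a reader seek straight to the requested bases instead of parsing the whole file. The sketch below shows that idea in plain Python, assuming a FASTA file with a samtools-style `.fai` index beside it (tab-separated name, length, byte offset, bases per line, bytes per line). It is not Rsamtools' implementation, and the names `FaiEntry`, `read_fai`, and `fetch_region` are hypothetical.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the line terminator


def read_fai(fai_path: str) -> dict:
    """Parse a samtools-style .fai index into {sequence name: FaiEntry}."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            name, length, offset, line_bases, line_width = line.split("\t")[:5]
            index[name] = FaiEntry(int(length), int(offset),
                                   int(line_bases), int(line_width))
    return index


def fetch_region(fasta_path: str, index: dict, chrom: str,
                 start: int, end: int) -> str:
    """Return bases start..end (1-based, inclusive) by seeking into the file."""
    entry = index[chrom]
    if not 1 <= start <= end <= entry.length:
        raise ValueError("region lies outside the sequence")

    def byte_offset(pos0: int) -> int:
        # Map a 0-based base position to its byte position in the file.
        full_lines, within_line = divmod(pos0, entry.line_bases)
        return entry.offset + full_lines * entry.line_width + within_line

    first = byte_offset(start - 1)
    last = byte_offset(end - 1)  # byte that holds the final requested base
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    # Drop the line breaks that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Only the bytes covering the region are read, so the cost of a lookup does not depend on the size of the genome file.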

    • RyanLCollins
      Junior Member
      • Sep 2013
      • 5

      #3
      Originally posted by dpryan:
      Do you plan to have the hg18/hg19 genomes locally available where you're deploying the package? If so, you can use Rsamtools to load just the relevant region for processing.
      Hi dpryan, thanks for the prompt reply! At present, we are planning on using RMySQL to access the UCSC MySQL database (which has all tables associated with hg18/hg19), but we don't plan on having the entire genomes locally available.

      Thanks for the suggestion though, I'll look into it further!

      Having never used Rsamtools before, would it be possible to source hg18/hg19 if we were to place them on a secure server? Or do both genomes have to be strictly local?

      For further info, we are planning on distributing this package amongst roughly one dozen bioinformaticians in our group, all of whom will have access to a central cluster, but who will all be working from different local machines.

      Thanks again!

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        UCSC limits programmatic access to their services (based on number of access attempts from IP block/time). https://genome.ucsc.edu/goldenPath/help/mysql.html

        If several people are going to query the database it may be more useful to have the data locally. You can find the database dumps for hg19 here: ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ (look for others elsewhere on the same ftp server)

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #5
          Well, I believe that it needs to be available from the local file system, though that doesn't preclude just mounting a remote drive (we have a group drive available via smb/cifs and nfs). If you're running this on a cluster, then copying the files to one of the mountpoints available to each node might prove easiest (I do this with genome indices for alignments, though each node also has access to a filesystem that's also mounted on my desktop).

          • RyanLCollins
            Junior Member
            • Sep 2013
            • 5

            #6
            Thank you both for the replies!

            @GenoMax: Thank you for the heads up! I was unaware of the access limits per IP block. I'll ask around our group to estimate our expected requirements and go from there.

            @dpryan: Hmm ok, thank you for the suggestion. I would prefer to find a workaround, although we have the capability to go that route if necessary. Ideally I'd like to keep this package running on our analysts' local machines, although if necessary we could run it on a cluster.

            • RyanLCollins
              Junior Member
              • Sep 2013
              • 5

              #7
              Hello all,

              I believe I have found the solution to my problem in the package "DASiR". It allows sequence retrieval from DAS servers (including UCSC, of course).

              If others are interested in tackling a similar problem with R, you can find the details regarding DASiR here:
              R package for programmatic retrieval of information from DAS servers


              Thanks for the help,
              Ryan
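To make the DAS approach concrete, here is a rough sketch of a sequence request made directly over HTTP, in Python for illustration only (it does not use DASiR and does not show DASiR's API). It assumes the UCSC DAS server answers `dna` requests at the base URL below and returns a `DASDNA` XML document with the bases inside a `DNA` element; check both before relying on them. The names `das_dna_url`, `parse_das_dna`, and `fetch_das_dna` are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed base URL for the UCSC DAS server; confirm it is still served.
UCSC_DAS = "https://genome.ucsc.edu/cgi-bin/das"


def das_dna_url(build: str, chrom: str, start: int, end: int) -> str:
    """Build a DAS 'dna' request for a 1-based, inclusive region."""
    return f"{UCSC_DAS}/{build}/dna?segment={chrom}:{start},{end}"


def parse_das_dna(xml_data) -> str:
    """Extract the sequence from a DASDNA response (str or bytes).

    Whitespace is removed and the result is upper-cased, which discards
    any lowercase soft-masking present in the response.
    """
    root = ET.fromstring(xml_data)
    dna = root.find("SEQUENCE/DNA")
    if dna is None or dna.text is None:
        raise ValueError("no <DNA> element in DAS response")
    return "".join(dna.text.split()).upper()


def fetch_das_dna(build: str, chrom: str, start: int, end: int,
                  timeout: float = 30.0) -> str:
    """Download and parse one region; requires network access."""
    url = das_dna_url(build, chrom, start, end)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_das_dna(response.read())
```

A call such as `fetch_das_dna("hg19", "chr1", 100000, 100100)` would return the region as a plain string, ready for a GC calculation. Because UCSC limits programmatic access per IP block, as noted in #4, a group making many such requests would likely want to cache results or fall back to local files.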
