Hello all,
I am currently building an R package to assist with some database-driven (both in-house DBs as well as external, like UCSC) human genome analysis.
In this package, I am trying to write a function which calculates the GC content of a portion of the human genome (indicated by inputs of chromosome, first nucleotide position, and last nucleotide position).
To do this, I was hoping to pull the queried sequences from somewhere (presumably UCSC) and run a seqinr function to calculate GC content. However, there are a few problems, namely:
1) I cannot find the table/location via the UCSC MySQL database to access any sequences; and
2) Using a preexisting package such as BSgenome requires loading the entire human genome via library() (~850 MB local download per genome; since we use both the hg18 and hg19 builds, that is far too inefficient for a simple sequence lookup); and
3) Other preexisting packages, like biomaRt, cannot acquire a variable length of *genomic* sequence (biomaRt requires a sequence "type", which has many options, but none are close to what I require).
...so I've turned to SEQanswers. Does anyone know a workaround for any of these three problems (for example, sourcing the BSgenome data packages from our in-house server rather than loading both locally with the library() command), or any other method of getting a desired genomic sequence based on chr/start/stop inputs?
Alternatively, if anyone just knows a quick way to calculate GC content of a genomic segment given positional input in R, please let me know!
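To be clear about what the calculation step would look like once a sequence is in hand, here is a minimal sketch in base R. It assumes the sequence for the chr/start/stop window has already been retrieved as a single character string; gc_content is a hypothetical helper name, not a function from seqinr or BSgenome, and the choice to skip N and other ambiguity codes is my own assumption.

```r
# Hypothetical helper: GC fraction of a DNA string.
# Assumption: N and other non-ACGT characters are excluded from the denominator.
gc_content <- function(dna) {
  # Split the upper-cased string into single characters
  bases <- strsplit(toupper(dna), "")[[1]]
  gc    <- sum(bases %in% c("G", "C"))
  acgt  <- sum(bases %in% c("A", "C", "G", "T"))
  # No unambiguous bases (e.g. an all-N gap): GC content is undefined
  if (acgt == 0) return(NA_real_)
  gc / acgt
}

# Usage: 4 G/C out of 8 unambiguous bases
gc_content("ATGCGCNNat")  # 0.5
```

The open problem is therefore only how to obtain the string for a given window without a full local genome.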
Thanks a lot. I really appreciate any suggestions that will help here!
Best,
Ryan