
**biznatch** · 06-28-2011, 01:25 PM

I don't know of any studies offhand, but I'm sure at least some of that has been looked at. For example, there's a "GC Percent" track in the UCSC browser, so someone has studied that.

You can download the human genome in a few different formats: http://hgdownload.cse.ucsc.edu/downloads.html#human

**qtrinh** · 06-29-2011, 01:01 PM

Originally posted by Fixee

Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human dna.

For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?

Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)

Cheers!

I did an A, C, G, T, and N frequency analysis a while back. For the 3-gigabase human reference genome, including the random and unknown sequences, I get:

A 27.25%
C 18.9%
G 18.9%
T 27.28%
N 7.64%

I haven't done the long k-mers analysis.

Q

**BAMseek** · 06-29-2011, 02:01 PM

I second the results given by qtrinh (with slightly different rounding). I used the hg19.2bit file from UCSC.

Total: 3,137,161,264 bases

A: 854,963,149 bases (27.25%)
C: 592,966,724 bases (18.90%)
G: 593,325,228 bases (18.91%)
T: 856,055,361 bases (27.29%)
N: 239,850,802 bases (7.65%)

These are the counts for one strand only. If you count bases from both strands, base complementarity means the A and T totals each equal A+T from the single-strand counts, and the C and G totals each equal C+G.
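For anyone who wants to check the arithmetic, here is a minimal Python sketch. It recomputes the percentages from the hg19 counts listed above and applies the both-strand rule. The function names (`base_counts`, `both_strand_counts`) are made up for illustration, and reading the sequence out of the .2bit file is not shown.

```python
from collections import Counter

def base_counts(seq):
    """Count A, C, G, T and N on the given strand (case-insensitive)."""
    counts = Counter(seq.upper())
    return {b: counts.get(b, 0) for b in "ACGTN"}

def both_strand_counts(single):
    """Combine one-strand counts with those of the complementary strand."""
    at = single["A"] + single["T"]
    cg = single["C"] + single["G"]
    return {"A": at, "T": at, "C": cg, "G": cg, "N": 2 * single["N"]}

# One-strand counts reported in this thread for hg19.
hg19 = {"A": 854963149, "C": 592966724, "G": 593325228,
        "T": 856055361, "N": 239850802}

total = sum(hg19.values())
percent = {b: round(100 * n / total, 2) for b, n in hg19.items()}
both = both_strand_counts(hg19)

print(total)
print(percent)
print(both)
```

On a real genome you would call `base_counts` on each chromosome sequence and sum the results rather than holding everything in one string.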

There has been work done to find over-represented patterns (a.k.a. motifs) in DNA using in silico (computational) methods. These motif-finding tools can be used to find biologically interesting patterns like transcription factor binding sites and paralogous genes. One example would be the random projection method (http://www.ncbi.nlm.nih.gov/pubmed/12015879), which starts its search by hashing k-mer sequences.
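To make the "hashing k-mers" idea concrete, here is a small Python sketch that counts every k-mer in a sequence with a hash table. This is only plain k-mer counting, not the random projection algorithm from the paper, and `count_kmers` is a name made up for this example.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all overlapping k-mers, skipping any window that contains N."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1
    return counts

# Toy sequence, not real genomic data.
counts = count_kmers("ACGTACGTNACG", 3)
print(counts.most_common(2))
```

For a whole genome and larger k, the table gets big quickly, so a dedicated k-mer counting tool is usually a better choice than a pure Python dictionary.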

I wish I could remember the paper, but I saw a graphic where they represent genomes as random walks. They start at the origin and move up 1 if the next base is an A, down 1 if it is a T, left 1 if it is a C, and right 1 if it is a G. If the bases were distributed randomly, you would expect the path to look like an ordinary random walk. The genome is not completely random due to things like genes, CpG islands, and repetitive regions. I know that people use hidden Markov models to model the distributions of DNA but am not too familiar with specific techniques.
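The walk described above is easy to sketch in Python. The step directions follow the description in this post (A up, T down, C left, G right). Skipping N and other ambiguous bases is an assumption of this sketch, and `dna_walk` is a made-up name.

```python
# Step directions as described above: A up, T down, C left, G right.
STEPS = {"A": (0, 1), "T": (0, -1), "C": (-1, 0), "G": (1, 0)}

def dna_walk(seq):
    """Return the list of (x, y) positions visited, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        if base not in STEPS:
            continue  # assumption: ignore N and other ambiguous bases
        dx, dy = STEPS[base]
        x += dx
        y += dy
        path.append((x, y))
    return path

# Toy sequence, not real genomic data.
path = dna_walk("AACGT")
print(path)
```

Plotting `path` with any 2D plotting library gives the kind of graphic mentioned above. A net drift up or down reflects an A versus T imbalance along the strand, and a drift left or right reflects a C versus G imbalance.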

**steven** · 06-30-2011, 06:10 AM

Originally posted by Fixee

Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human dna.

Yes, people have been looking at that for a while actually.
Since the early 80s for instance, statistics on base and kmer frequencies have been used in the domain of gene finding (detection of coding regions in the genome).

Originally posted by Fixee

For example, do G, C, A and T occur with equal frequency among the 3 gigabases?

No, they don't. And within each genome (especially in higher eukaryotes) you can find huge discrepancies. Look for "isochores" for instance (large regions of the human genome with relatively uniform GC content, some GC-rich and some GC-poor).
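A quick way to see this regional variation is to compute GC content in fixed-size windows along a chromosome. Below is a minimal Python sketch. The non-overlapping windows and the choice to exclude N from the denominator are assumptions of this example, and `gc_percent_windows` is a made-up name.

```python
def gc_percent_windows(seq, window):
    """GC percentage for each non-overlapping window of the given size.

    N bases are excluded from the denominator. A window with no called
    bases yields None. A trailing partial window is dropped.
    """
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        called = gc + chunk.count("A") + chunk.count("T")
        result.append(100.0 * gc / called if called else None)
    return result

# Toy sequence, not real genomic data.
profile = gc_percent_windows("GGCCAATTGCAT", 4)
print(profile)
```

On real data the window would be far larger than in this toy call.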

Originally posted by Fixee

Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?

Well you should

The coding property is the strongest constraint on the primary sequence, and your statistics mainly depend on the type of region you are looking at. In the domain of "gene finding" you will find a lot of relevant literature. Just for fun, here are a couple of references, from the oldest:

- Grantham et al., 1980
- Fickett, 1982
- Staden and McLachlan, 1982
- Gribskov et al., 1984 (codon usage)
- Claverie and Bougueleret, 1986 (k-mer frequencies)
- Fickett and Tung, 1992 (k-mers)

Then in the 90s people started modeling k-mer frequencies using probabilistic models like Markov Chains, but this is a long story..
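As a rough illustration of the Markov chain idea, here is a Python sketch that estimates first-order transition probabilities (the probability of each base given the previous one) from a sequence. It is a toy maximum-likelihood estimate, not the model from any of the papers listed above, and `transition_probs` is a made-up name.

```python
def transition_probs(seq):
    """Estimate P(next base | previous base) from adjacent base pairs.

    Pairs involving N or other ambiguous bases are skipped.
    """
    counts = {a: {b: 0 for b in "ACGT"} for a in "ACGT"}
    seq = seq.upper()
    for prev, nxt in zip(seq, seq[1:]):
        if prev in counts and nxt in counts:
            counts[prev][nxt] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        # A base never seen as a predecessor gets an all-zero row.
        probs[a] = {b: (n / total if total else 0.0) for b, n in row.items()}
    return probs

# Toy sequence, not real genomic data.
probs = transition_probs("ACACAG")
print(probs["A"])
```

Gene finders typically use higher-order chains, which condition on several preceding bases, and train separate models for coding and non-coding sequence. The counting step is the same idea with longer contexts.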

Oh, and don't forget that almost half of the genome is made of "repeated" regions. For instance look for the "Alu" sequence. Over-represented k-mers may correspond to these ones..

**steven** · 07-01-2011, 06:20 AM

Also, a recent one:

Error correction of high-throughput sequencing datasets with non-uniform coverage
Paul Medvedev, Eric Scott, Boyko Kakaradov and Pavel Pevzner
Bioinformatics, Vol. 27, ISMB 2011, pages i137–i141
doi:10.1093/bioinformatics/btr208

They definitely look at k-mers there.

**Michael.James.Clark** · 07-01-2011, 06:16 PM

Originally posted by Fixee

Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human dna.

For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?

Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)

Cheers!

Many, many, many, many times.

The genome isn't just a random assortment of nucleotides. In fact, if you look at the ratio of nucleotides to each other in coding regions compared to the whole genome, you'll see a dramatic difference (coding regions are GC-rich). Things get more interesting if you start looking at multiple genomes and generating statistics such as the transition:transversion ratio (closely linked with species) and the indel size distribution between regions.

I assume you can obtain data from any of a number of publicly available next-gen sequencing projects. Most non-clinical sequencing study results are freely available (1000 Genomes comes to mind).
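For reference, a transition is a purine-to-purine or pyrimidine-to-pyrimidine change (A<->G or C<->T), and every other single-base substitution is a transversion. Here is a minimal Python sketch of the ratio. The input format, a list of (reference base, alternate base) pairs, is an assumption of this example, and the function names are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snvs):
    """Transition:transversion ratio for (ref, alt) single-base pairs.

    Pairs that are identical or contain non-ACGT symbols are ignored.
    Returns None when there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None

# Toy variant list, not real data.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "G")])
print(ratio)
```

In practice the pairs would come from a variant call file rather than a hand-written list.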

**BAMseek** · 07-03-2011, 10:44 PM

I wish I could remember the paper, but I saw a graphic where they represent genomes as random walks.

The paper describing DNA walks can be found here.

Also, here is a review (somewhat old) of some of the visualization methods for analyzing DNA.


Seeking statistics on genomic data
