Hi all. I'm wondering if any studies have been done on the statistics of human genomic data, in particular on the distribution of bases and k-mers in human DNA.
For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?
Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)
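To make the question concrete, here is a minimal sketch of the counts I have in mind. It assumes the bases are available as plain FASTA text; the function names are illustrative and not taken from any library. Everything is held in memory, so it is only meant for small test sequences. For a full 3-gigabase genome a dedicated k-mer counting tool would likely be more practical.

```python
from collections import Counter


def read_fasta_sequences(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def base_frequencies(seq):
    """Return the fraction of A, C, G, T, ignoring N and other symbols."""
    counts = Counter(b for b in seq if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"} if total else {}


def kmer_counts(seq, k):
    """Count overlapping k-mers made up only of A, C, G, T."""
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy record, not real genomic data.
    toy = [">toy", "ACGTAC", "GTNN"]
    for name, seq in read_fasta_sequences(toy):
        print(name, base_frequencies(seq))
        print(kmer_counts(seq, 2).most_common(3))
```

Calling `kmer_counts(seq, k).most_common(n)` gives the n most frequent k-mers, which is the kind of "high frequency" list I am asking about.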
Cheers!
Whether a region is coding is the strongest constraint on the primary sequence, so your statistics will depend mainly on the type of region you are looking at. You will find a lot of relevant literature in the field of "gene finding". Just for fun, here are a couple of references, starting from the oldest: