When a genome is fully sequenced, it can be compared to other genomes. For example, humans and chimpanzees are said to share 96% of their genome.
My question is: what does this "96% similarity" really mean? I realize it's a boiled-down summary for the popular press, but what is the actual measurement or comparison behind it, and what tools make these kinds of comparisons?
Does it mean that the two genomes have been run through a global alignment against each other and only 4% of the sequence has no reasonable match? (Is it even practical to consider a global alignment of a roughly 3-billion-bp genome? Are there any tools that even try? Would such a global alignment have any meaning, given that even the chromosome count differs?)
Or perhaps it means something like "if you take a random 100 bp substring from the human genome, there's a 96% chance you can find the identical substring in the chimp genome." This makes some experimental and mathematical sense, but it is poorly defined because the answer depends on the substring length. For example, if you instead took a random 5 bp substring, a human would have 100% duplication with an oak tree, since a 5 bp string is so short that it is easy to match.
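To make that concern concrete, here is a toy sketch of the "random substring" measure. It assumes made-up random sequences rather than real genomes, a simulated 4% substitution rate, and function names (`kmer_set`, `shared_kmer_fraction`) that are my own illustration, not taken from any existing tool. It is not meant to represent how published genome comparisons are actually done.

```python
import random


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query, target, k, samples=1000, seed=0):
    """Estimate the chance that a random length-k substring of `query`
    occurs exactly somewhere in `target`."""
    rng = random.Random(seed)
    target_kmers = kmer_set(target, k)
    hits = 0
    for _ in range(samples):
        start = rng.randrange(len(query) - k + 1)
        if query[start:start + k] in target_kmers:
            hits += 1
    return hits / samples


# Toy data (assumption: uniform random bases, nothing like a real genome).
rng = random.Random(42)
genome_a = "".join(rng.choice("ACGT") for _ in range(20000))

# A "related" sequence: genome_a with roughly 4% of bases substituted.
genome_b = "".join(
    rng.choice("ACGT".replace(base, "")) if rng.random() < 0.04 else base
    for base in genome_a
)

# An unrelated random sequence of the same length.
unrelated = "".join(rng.choice("ACGT") for _ in range(20000))

for k in (5, 20, 100):
    related_score = shared_kmer_fraction(genome_a, genome_b, k)
    unrelated_score = shared_kmer_fraction(genome_a, unrelated, k)
    print(f"k={k}: related={related_score:.3f}, unrelated={unrelated_score:.3f}")
```

With short substrings even the unrelated sequence matches almost always, while with 100 bp substrings the sequence that is 96% identical base-by-base matches only rarely, because a single substitution anywhere in the window breaks an exact match. So a "96%" figure cannot be a statement about exact 100 bp matches either.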
In my particular example of chimp versus human, the comparison was done by the Eichler Lab, but it's unclear which of their papers describes the genome comparison itself.