hi,
I have 3 questions on the same problem:
I'm aligning reference DNAseq (illumina) to itself. I plot the gene coverage histogram and try to calculate the gene copy number assuming the peak represents single copy.
1) how does it happen that i get genes with 30 copies even tho i only allow a single alignment per read to be reported (-k 1)? one explanation is that the genome annotation is not correct, and that there are more copies for some genes than is actually annotated. Is there any other explanation?
2) Using -k30 parameter gives a gene coverage histogram of Poisson distribution with a very long right tail. Using the -k1 parameter gives a histogram with 2 peaks. How can I explain the 2 peaks (a small one followed by the big one, which doesn't fit the fact that the majority of genes have one copy, and a smaller number have 2, etc.)
3) I realize the gene coverage will not be the same in the 2 runs, but why does the order of genes with high-low coverage differ as well?
thanks,
I have 3 questions on the same problem:
I'm aligning reference DNAseq (illumina) to itself. I plot the gene coverage histogram and try to calculate the gene copy number assuming the peak represents single copy.
1) how does it happen that i get genes with 30 copies even tho i only allow a single alignment per read to be reported (-k 1)? one explanation is that the genome annotation is not correct, and that there are more copies for some genes than is actually annotated. Is there any other explanation?
2) Using -k30 parameter gives a gene coverage histogram of Poisson distribution with a very long right tail. Using the -k1 parameter gives a histogram with 2 peaks. How can I explain the 2 peaks (a small one followed by the big one, which doesn't fit the fact that the majority of genes have one copy, and a smaller number have 2, etc.)
3) I realize the gene coverage will not be the same in the 2 runs, but why does the order of genes with high-low coverage differ as well?
thanks,
Comment