What is the recommended standard for normalizing mutation counts to eliminate or reduce the impact of gene size, and where can I obtain the data needed to do this?
I am working with TCGA MAF files, so I have Entrez IDs and HUGO names for the genes. I would like to normalize the counts according to the size of the gene.
I THINK TCGA only includes mutations that fall within the expressed sequences, but I would have to double-check to be sure. (If any of you know the answer for certain, that would be appreciated.)
Would the answer change whether I should normalize by the total number of nucleotides in the exon sequences only (leaving out the intron lengths), or should I use the total gene start-to-stop length anyway?
The data are human. To be clear, by counts I mean the number of mutations per gene, per sample or per grouping of samples.
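To make the question concrete, here is a minimal Python sketch of the exon-length version of the normalization. It rests on several assumptions: the mutations have already been reduced to (gene, sample) pairs taken from the MAF `Hugo_Symbol` and `Tumor_Sample_Barcode` columns, the exon coordinates have already been pulled from some gene annotation, and overlapping exons from different transcripts should be counted once. The function names, gene names, and coordinates are hypothetical and the toy data is made up.

```python
from collections import Counter


def merged_length(intervals):
    """Bases covered by 1-based, inclusive (start, end) intervals.

    Overlapping or adjacent intervals are merged first, so a base shared
    by exons of several transcripts is counted only once.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def mutations_per_kb(mutation_rows, gene_lengths):
    """Mutations per kilobase for each (gene, sample) pair.

    mutation_rows: iterable of (gene, sample) tuples, one per mutation.
    gene_lengths:  dict mapping gene -> length in bases.
    Genes with no known length are skipped rather than guessed.
    """
    counts = Counter(mutation_rows)
    return {
        (gene, sample): n / (gene_lengths[gene] / 1000)
        for (gene, sample), n in counts.items()
        if gene in gene_lengths
    }


# Toy example with invented genes and coordinates (assumption, not real data).
exons = {
    "GENE_A": [(100, 199), (150, 249), (400, 499)],
    "GENE_B": [(1, 500)],
}
exon_lengths = {gene: merged_length(iv) for gene, iv in exons.items()}
rows = [("GENE_A", "S1"), ("GENE_A", "S1"), ("GENE_B", "S1")]
rates = mutations_per_kb(rows, exon_lengths)
```

Switching to start-to-stop gene length would only mean passing a different `gene_lengths` dictionary, so the choice between the two denominators is the part I am unsure about.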