I have sixteen RNA-Seq libraries that were aligned with TopHat. Counts of reads mapping to RefSeq genes were generated with htseq-count. My statistician collaborators need to normalize these counts for differences in sequencing depth. These are the options I am considering for the denominator:

- Total number of reads in the raw data (wc -l on the file from the sequencer)
- Total number of lines in the TopHat SAM file (wc -l on accepted_hits.sam)
- Number of unique reads that TopHat aligned to at least one location (sort | uniq | wc -l on the sequence field of the SAM file)
- Sum of counts across all genes within each library
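
To make the four options concrete, here is a minimal Python sketch of how each could be computed. The function names are made up for illustration. It assumes the raw data are four-line FASTQ records, the SAM file is uncompressed text, and the htseq-count output is two tab-separated columns with summary rows at the end (named with or without a leading `__`, depending on the HTSeq version). The toy inputs are invented purely to show how the numbers can differ.

```python
import io

# Summary rows that htseq-count appends after the gene rows.
# Older releases use bare names; newer ones prefix them with "__".
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}

def fastq_reads(handle):
    """Option 1: number of reads, assuming exactly four lines per FASTQ record."""
    return sum(1 for _ in handle) // 4

def sam_stats(handle):
    """Options 2 and 3: alignment lines, distinct read names, distinct sequences.

    Holds every name and sequence in memory, so this is only a sketch for
    small files. SEQ is stored reverse-complemented for reverse-strand
    alignments, so distinct SEQ values only approximate distinct reads.
    """
    n_lines = 0
    names, seqs = set(), set()
    for line in handle:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        n_lines += 1
        names.add(fields[0])          # QNAME column
        seqs.add(fields[9])           # SEQ column
    return n_lines, len(names), len(seqs)

def htseq_total(handle):
    """Option 4: sum of per-gene counts, skipping the summary rows."""
    total = 0
    for line in handle:
        if not line.strip():
            continue
        gene, count = line.rstrip("\n").split("\t")
        if gene.startswith("__") or gene in SPECIAL:
            continue
        total += int(count)
    return total

# Toy data: three reads, two with identical sequence, one multi-mapped.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n"
sam = (
    "@HD\tVN:1.0\n"
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r1\t0\tchr2\t500\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
)
htseq = "geneA\t10\ngeneB\t5\nno_feature\t7\nambiguous\t1\n"

print(fastq_reads(io.StringIO(fastq)))
print(sam_stats(io.StringIO(sam)))
print(htseq_total(io.StringIO(htseq)))
```

In the toy SAM input the multi-mapped read r1 contributes two alignment lines, and r1 and r2 share a sequence, so counting lines, read names, and sequences gives three different answers for the same file.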

Does anyone have feedback on which of these to use? For choice 1 above, the values range from 96396160 to 131352500 across the libraries.

Any help will be much appreciated,

Thanks,

Shurjo

