I am calculating RPKM to do a differential gene expression analysis between two libraries.
First, I am extracting read counts for each gene interval using
bedtools multicov -bams file.bam -bed gene.gff3 > gene_counts.gff &
To compute RPKM I use (10^9 * C)/(N * L), where C is the read count for a gene and L is the gene length in bp. As "N" I use the sum of the read counts over all the features reported by bedtools multicov.
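To make the calculation concrete, here is a rough Python sketch of the formula above. The function name `rpkm`, the gene names, and the numbers are made up for illustration only; N is taken as the sum of the per-feature counts, as described, which may not be the right choice.

```python
def rpkm(count, length_bp, total_reads):
    """Return RPKM = (10^9 * C) / (N * L).

    count       -- C, reads counted for the gene
    length_bp   -- L, gene length in base pairs
    total_reads -- N, total reads used for scaling
    """
    if length_bp <= 0 or total_reads <= 0:
        raise ValueError("length_bp and total_reads must be positive")
    return (1e9 * count) / (total_reads * length_bp)


# Hypothetical per-gene counts and lengths (not real data).
counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 2000}

# N as used in the post: the sum of counts over all features.
N = sum(counts.values())

values = {gene: rpkm(counts[gene], lengths[gene], N) for gene in counts}
for gene, value in values.items():
    print(gene, value)
```

Swapping N for the total number of mapped reads from samtools only changes the `total_reads` argument; the rest of the calculation stays the same.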
First question: why is the sum of the read counts over all features from bedtools multicov much higher than the total number of mapped reads I get from samtools?
Second question: the two libraries I am comparing have different sizes, so I have to normalize them before comparing them. Should I normalize before or after calculating RPKM?
Third question: is it better to remove rRNA and tRNA when doing differential gene expression analysis?
Thanks everybody