I am trying to write a Perl script to calculate RPKM for genes of interest, and I would like to verify that I am doing the calculation correctly. There are 31.8 million reads mapped to the genome.
Here are the GFF3 records for one gene as an example. There are 4,011 reads that map to this gene (between positions 4542759 and 4544980 on Chr2).
Chr2 MSU_osa1r6 gene 4542759 4544980 . + . ID=13102.t00754;Name=unknown gene
Chr2 MSU_osa1r6 mRNA 4542759 4544980 . + . ID=13102.m00974;Parent=13102.t00754
Chr2 MSU_osa1r6 five_prime_UTR 4542759 4543030 . + . Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4543031 4543177 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4543287 4543709 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4543836 4543952 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4544064 4544423 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 three_prime_UTR 4544424 4544980 . + . Parent=13102.m00974
There are 4 exons for this particular gene (the CDS rows above), which I count as a total of 1,043 base pairs.
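As a cross-check on that length, here is a minimal Python sketch (not my Perl script, but the logic should port directly) that sums the four CDS rows above. It assumes GFF3 coordinates are 1-based and inclusive, so a feature spans end - start + 1 bases. The names `GFF3_ROWS` and `feature_lengths` are made up for this illustration. Counted inclusively, the four features total 1,047 bp, while plain end - start gives 1,043 bp.

```python
# CDS rows copied from the listing above. Real GFF3 is tab-separated;
# splitting on whitespace is enough here because only the first five
# columns are read.
GFF3_ROWS = """\
Chr2 MSU_osa1r6 CDS 4543031 4543177 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4543287 4543709 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4543836 4543952 . + 0 Parent=13102.m00974
Chr2 MSU_osa1r6 CDS 4544064 4544423 . + 0 Parent=13102.m00974
"""


def feature_lengths(rows, feature_type="CDS", inclusive=True):
    """Return the length in bp of every row of the given feature type."""
    lengths = []
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[2] != feature_type:
            continue
        start, end = int(fields[3]), int(fields[4])
        # Inclusive coordinates need the +1; without it each feature
        # comes out one base short.
        lengths.append(end - start + (1 if inclusive else 0))
    return lengths


inclusive_total = sum(feature_lengths(GFF3_ROWS))
exclusive_total = sum(feature_lengths(GFF3_ROWS, inclusive=False))
print(inclusive_total)  # 1047
print(exclusive_total)  # 1043
```

Note that this only covers the CDS rows. If the UTR rows are meant to count as part of the exons, the feature type filter would need to include them as well.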
So the RPKM for this particular gene is (4,011 reads / 1.043 kb of exon) / 31.8 million mapped reads = 120.9 RPKM.
Is my calculation correct?
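For reference, the same arithmetic as a short Python sketch (again just an illustration, not my Perl code). The function name `rpkm` and its parameter names are my own, and it assumes the usual definition: reads per kilobase of feature per million mapped reads.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Numbers from this post: 4,011 reads, 1,043 bp, 31.8 million mapped reads.
print(round(rpkm(4011, 1043, 31_800_000), 1))  # 120.9

# Same gene if the four CDS rows are measured inclusively
# (end - start + 1), which gives 1,047 bp instead of 1,043 bp.
print(round(rpkm(4011, 1047, 31_800_000), 1))  # 120.5
```

The two lengths differ by only 4 bp, so the result shifts by less than half an RPKM unit here, but the difference would matter more for short genes with many exons.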
Also, if there are reads that map entirely or partially to intron regions, should those reads be excluded from the calculation?
This gene also has 3 other alternatively spliced forms. Which splice form is the correct one to use?
Thanks in advance