Why does Cufflinks with mask (-M) option have lower FPKM for mRNA genes?

    Dear all,

    I was curious how much the -M (mask file) option can improve the FPKM estimates from Cufflinks. The manual says:

    -M/--mask-file Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.
    So I would expect that providing a mask file containing rRNA, tRNA, mitochondrial genes, etc. would decrease the "total mapped reads" (i.e. the denominator), which would lead to increased FPKM values. But what I actually see is that, for most mRNA genes, the FPKM values with the -M option are smaller than those without -M. See the attached figures (I expected most of the dots to be under the red dotted line, which is x=y).
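
The expectation above follows from the textbook FPKM definition, FPKM = 10^9 * C / (N * L), where C is the number of fragments assigned to a transcript, N is the number of mapped fragments used for normalization, and L is the transcript length in bases. Here is a minimal Python sketch of that arithmetic. The helper name `fpkm` and all numbers are made up for illustration, and it does not reproduce Cufflinks' actual estimation, which uses effective lengths and a likelihood model.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments."""
    return 1e9 * fragments / (transcript_length_bp * total_mapped_fragments)


# Hypothetical numbers, for illustration only.
gene_fragments = 500
gene_length_bp = 2_000
total_fragments = 20_000_000
masked_fragments = 5_000_000  # e.g. rRNA / chrM fragments removed by a mask

# Same gene, same fragment count; only the denominator N changes.
without_mask = fpkm(gene_fragments, gene_length_bp, total_fragments)
with_mask = fpkm(gene_fragments, gene_length_bp, total_fragments - masked_fragments)

print(f"without mask: {without_mask:.2f}")
print(f"with mask:    {with_mask:.2f}")
```

Under this simple model, every unmasked gene would be scaled up by the same factor, N / (N - N_masked), so a decrease for most mRNA genes would mean something other than the denominator changed.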

    I have to admit that -M does greatly reduce the FPKM for rRNA genes. Still, it's a mystery why most mRNA genes have lower FPKM after applying the -M option. Has anyone observed something similar?

    By the way, here are my Cufflinks arguments with -M:

    cufflinks --library-type fr-unstranded -o cufflink_w_M -p 8 -G /data/iGenome/Homo_sapiens/UCSC/hg19/Annotation/Genes/gencode.v13.annotation.karotyped.gtf -M /data/iGenome/Homo_sapiens/UCSC/hg19/Annotation/Genes/chrM.rRNA.tRNA.gtf --multi-read-correct accepted_hits.bam
    and without -M:

    cufflinks --library-type fr-unstranded -o cufflink_wo_M -p 8 -G /data/iGenome/Homo_sapiens/UCSC/hg19/Annotation/Genes/gencode.v13.annotation.karotyped.gtf --multi-read-correct accepted_hits.bam


    P.S. Sorry for re-posting my question here. It was originally posted on Biostar (
    Cufflinks Mask Option?

    Hi Xianjun

    Were you able to figure out how the cufflinks mask option works?

    I am trying to figure this out myself. In particular, I am unsure whether the masked transcripts are excluded from the FPKM denominator. I ran Cufflinks with and without the mask option on the same dataset, and the FPKM values for my genes of interest didn't change much. Given how I created the mask file, I should have seen a difference if the masked transcripts were being excluded from the denominator. This leads me to believe that the masked transcripts are still counted in the FPKM denominator, but perhaps not used in estimating the fragment size distribution.

    Any insights?
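
One way to make the comparison above concrete is to look at the per-gene ratio of FPKM with -M to FPKM without -M. If masked fragments were dropped from the denominator, the ratio should sit near a constant greater than 1 for unmasked genes. A ratio near 1 would suggest they are still being counted. The sketch below assumes the FPKM values have already been loaded into dictionaries (for example from each run's genes.fpkm_tracking file). The function name `fpkm_ratios`, the `min_fpkm` cutoff, and the gene values are hypothetical.

```python
from statistics import median


def fpkm_ratios(with_mask, without_mask, min_fpkm=1.0):
    """Per-gene ratio FPKM(with -M) / FPKM(without -M) for genes present in both runs.

    Genes below min_fpkm in the unmasked run are skipped, since ratios of
    near-zero values are dominated by noise.
    """
    return {
        gene: with_mask[gene] / base
        for gene, base in without_mask.items()
        if gene in with_mask and base >= min_fpkm
    }


# Hypothetical FPKM values, for illustration only.
without_mask = {"GENE_A": 12.5, "GENE_B": 40.0, "GENE_C": 3.0, "GENE_D": 0.2}
with_mask = {"GENE_A": 16.0, "GENE_B": 52.0, "GENE_C": 3.9, "GENE_D": 0.1}

ratios = fpkm_ratios(with_mask, without_mask)
print(f"genes compared: {len(ratios)}")
print(f"median ratio:   {median(ratios.values()):.2f}")
```

Using the median rather than the mean keeps a few genes with large changes from dominating the summary.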



      I figured it out by using the "--compatible-hits-norm" option. See details here: