**dpryan** · 03-01-2014, 09:13 AM

It's generally a bad idea to deduplicate RNAseq reads. That will seriously deflate the expression of highly expressed genes and likely tank your power. Having a few highly expressed genes driving correlation is unsurprising.
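
A minimal Python sketch of why deduplication hits highly expressed genes hardest. It assumes single-end reads drawn uniformly along one transcript and treats reads with the same start position and strand as duplicates, which only approximates what a position-based deduplication tool does. The function name `simulate_dedup` and all numbers are hypothetical, chosen for illustration.

```python
import random

def simulate_dedup(transcript_len, read_len, n_reads, seed=0):
    """Return (raw_count, dedup_count) for reads drawn uniformly from one transcript.

    Mimics position-based duplicate removal: reads sharing a start
    position and strand are collapsed to a single read.
    """
    rng = random.Random(seed)
    n_starts = transcript_len - read_len + 1  # possible start positions
    reads = [(rng.randrange(n_starts), rng.choice("+-")) for _ in range(n_reads)]
    return n_reads, len(set(reads))

# Short, highly expressed transcript (about 30 aa, so roughly 90 nt).
short_raw, short_dedup = simulate_dedup(90, 50, 100_000)
# Long, weakly expressed transcript.
long_raw, long_dedup = simulate_dedup(3000, 50, 20)

print(f"short: {short_raw} reads -> {short_dedup} after dedup")
print(f"long:  {long_raw} reads -> {long_dedup} after dedup")
```

Under these assumptions the short transcript can keep at most 2 × 41 = 82 reads no matter how many were sequenced, while the weakly expressed long transcript loses almost nothing, so the deflation falls mainly on the highly expressed genes.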

**mparida85** · 03-02-2014, 08:14 AM

Thanks, dpryan, for your reply. My only worry is that the genes that are unusually highly expressed in one biological replicate versus another encode peptides of no more than 25-30 aa. For example, after I run the deduplication program, the gene with ID GSOIDG00016539001 gets an FPKM of 0, compared with an FPKM of 4.88E+06 when not deduplicated.
I am a little confused about how this is working out.
Please comment.
My pipeline:
GTF file processing:
before:
scaffold_1 Gaze gene 588 3589 2089.2116 + . ID=GSOIDG00000001001;Name=GSOIDG00000001001:gamma-aminobutyric acid (gaba-a) subunit alpha 1;Note= GSOIDG00000001001
scaffold_1 Gaze mRNA 588 3589 2089.2116 + . ID=GSOIDT00000001001;Name=GSOIDT00000001001;Parent=GSOIDG00000001001;Note= GSOIDG00000001001

after:
scaffold_1 Gaze gene 588 3589 2089.2116 + . ID=GSOIDG00000001001;Name=GSOIDG00000001001:gamma-aminobutyric acid (gaba-a) subunit alpha 1;Note= GSOIDG00000001001
scaffold_1 Gaze mRNA 588 3589 2089.2116 + . Parent=GSOIDG00000001001;;Name=GSOIDT00000001001;Note= GSOIDG00000001001

Next, I collected all the rRNA, mtRNA, and other non-coding genes from the GTF and added them to the mask file.

Finally, deduplication and Cufflinks:

java -Xmx4g -jar /Users/mparida/Qualifier/picard-tools-1.79/MarkDuplicates.jar INPUT=$CONTROL1_1 OUTPUT=$DATA/Sample_1/Sample_1_SORTED_DEDUPED.bam ASSUME_SORTED=true METRICS_FILE=$DATA/Sample_1/Sample_1.metrics VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true

cufflinks -o $DATA/Sample_1/CUFFLINKS2/ -p 8 -b $REF -u --library-type fr-unstranded -N --compatible-hits-norm -G $ANNOTATION -M $MASKFILE $DATA/Sample_1/Sample_1_SORTED_DEDUPED.bam

Please comment.
I would be very grateful for your help.
Rocky

**dpryan** · 03-02-2014, 01:03 PM

I would argue that it's generally better to simply omit genes (or other features of interest) from testing when they appear to be outliers. If you were to use DESeq2, this gene would likely just get flagged and omitted from testing due to Cook's distance. The benefit of that approach is that you're not artificially decreasing your power among the more highly expressed genes.
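
To illustrate the Cook's distance idea, here is a simplified sketch for a single gene. It uses an ordinary least-squares fit of one mean per condition on log-scale values, which is a stand-in and not the negative binomial GLM or the cutoff that DESeq2 itself uses. The function name `cooks_distances` and the expression values are made up, and each condition needs at least two samples.

```python
def cooks_distances(groups):
    """Cook's distance per sample for a one-way (group means) least-squares fit.

    `groups` is a list of lists of log-scale expression values, one list
    per condition. Returns distances in the same flattened order.
    """
    n = sum(len(g) for g in groups)
    p = len(groups)  # one fitted mean per condition
    residuals, leverages = [], []
    for g in groups:
        mean = sum(g) / len(g)
        for y in g:
            residuals.append(y - mean)
            leverages.append(1.0 / len(g))  # hat value in a group-means model
    mse = sum(e * e for e in residuals) / (n - p)
    return [
        (e * e) / (p * mse) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]

control = [5.0, 5.2, 4.9]
treated = [5.1, 5.0, 12.0]  # last replicate is the unusually high one
d = cooks_distances([control, treated])
worst = max(range(len(d)), key=d.__getitem__)
print(f"most influential sample index: {worst}, Cook's distance: {d[worst]:.2f}")
```

The replicate with the aberrant value dominates the fit for its condition and gets by far the largest distance, which is the kind of signal used to drop a gene from testing instead of altering the reads for every gene.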

You might consider using something like CQN to see if there are particular length-bias differences among your samples that you can try to correct for.

**mparida85** · 03-02-2014, 07:22 PM

Hi dpryan,
Thanks for your reply. Questions:
a) What is a length-bias difference, and where can I read about it?
b) Also, what does CQN stand for?
Rocky

**dpryan** · 03-03-2014, 01:56 AM

CQN is described in this paper and available from Bioconductor. You'll not use it with cuffdiff, as cuffdiff is only useful for the simplest of experiments. Use DESeq2/edgeR/etc. instead.


RNA-seq DEDUPING of Reads
