  • mparida85
    Member
    • Jan 2014
    • 17

    RNA-seq DEDUPING of Reads

    Hi all
    I have an RNA-seq dataset that I am working on.
    I have 2 biological replicates for each of 6 samples.
    case a) When I deduplicate with Picard tools, I see a 99% correlation between the 2 replicates.
    case b) When I do not deduplicate, I also see a 99% correlation between the 2 replicates. In this case, however, there is an outlier gene with a very high FPKM in both replicates that drives the R-squared to 0.99; without that outlier gene I see an R-squared of 0.40.

    I am trying to understand whether this difference, 0.99 in case a) versus 0.40 in case b), is solely due to duplicated reads. If that's the case, is it wise to remove these duplicated reads using Picard MarkDuplicates?

    I am using Cufflinks v2.1.1 for this analysis.
    Please comment.
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    It's generally a bad idea to deduplicate RNAseq reads. That will seriously deflate the expression of highly expressed genes and likely tank your power. Having a few highly expressed genes driving correlation is unsurprising.
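To illustrate why a single highly expressed gene can dominate the correlation, here is a minimal sketch on made-up data. All numbers are hypothetical (2000 simulated genes, a noise level chosen so the replicates agree only modestly, and one extreme gene with values of the same order as the FPKM quoted later in the thread). It is not the poster's data, and the helper `pearson_r` is written out only so that the example needs no external libraries.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical FPKM values for 2000 ordinary genes in two replicates:
# a shared signal plus independent noise, clamped at zero.
signal = [rng.uniform(0, 100) for _ in range(2000)]
rep1 = [max(0.0, s + rng.gauss(0, 22)) for s in signal]
rep2 = [max(0.0, s + rng.gauss(0, 22)) for s in signal]

# R-squared over the ordinary genes only.
r2_without = pearson_r(rep1, rep2) ** 2

# Add one hypothetical gene that is extremely high in both replicates.
rep1_out = rep1 + [4.9e6]
rep2_out = rep2 + [4.6e6]
r2_with = pearson_r(rep1_out, rep2_out) ** 2

# The same comparison on log2(FPKM + 1), where the outlier has less leverage.
log_rep1 = [math.log2(v + 1) for v in rep1_out]
log_rep2 = [math.log2(v + 1) for v in rep2_out]
r2_log = pearson_r(log_rep1, log_rep2) ** 2

print(f"R^2 without the outlier gene: {r2_without:.3f}")
print(f"R^2 with the outlier gene:    {r2_with:.6f}")
print(f"R^2 with outlier, log scale:  {r2_log:.3f}")
```

On data like this, the raw-scale R-squared with the outlier is close to 1 even though the remaining genes agree only modestly, which is why correlations are often inspected on a log scale or with a rank-based measure.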

    • mparida85
      Member
      • Jan 2014
      • 17

      #3
      Thanks dpryan for your reply. My only worry is that these genes, which are unusually highly expressed in one biological replicate vs. another, encode peptides of no more than 25-30 aa. For example, after I run the deduping program, gene GSOIDG00016539001 gets an FPKM of 0, down from an FPKM of 4.88E+06 when not deduped.
      I am a little confused about how this is working out.
      Please comment.
      My pipeline:
      gtf file processing :
      before :
      scaffold_1 Gaze gene 588 3589 2089.2116 + . ID=GSOIDG00000001001;Name=GSOIDG00000001001:gamma-aminobutyric acid (gaba-a) subunit alpha 1;Note= GSOIDG00000001001
      scaffold_1 Gaze mRNA 588 3589 2089.2116 + . ID=GSOIDT00000001001;Name=GSOIDT00000001001;Parent=GSOIDG00000001001;Note= GSOIDG00000001001

      after:
      scaffold_1 Gaze gene 588 3589 2089.2116 + . ID=GSOIDG00000001001;Name=GSOIDG00000001001:gamma-aminobutyric acid (gaba-a) subunit alpha 1;Note= GSOIDG00000001001
      scaffold_1 Gaze mRNA 588 3589 2089.2116 + . Parent=GSOIDG00000001001;;Name=GSOIDT00000001001;Note= GSOIDG00000001001

      Next, I collected all the rRNA, mtRNA and other non-coding genes from the gtf and added them to the mask file.

      Finally DEDUPING and cufflinks:


      java -Xmx4g -jar /Users/mparida/Qualifier/picard-tools-1.79/MarkDuplicates.jar INPUT=$CONTROL1_1 OUTPUT=$DATA/Sample_1/Sample_1_SORTED_DEDUPED.bam ASSUME_SORTED=true METRICS_FILE=$DATA/Sample_1/Sample_1.metrics VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true

      cufflinks -o $DATA/Sample_1/CUFFLINKS2/ -p 8 -b $REF -u --library-type fr-unstranded -N --compatible-hits-norm -G $ANNOTATION -M $MASKFILE $DATA/Sample_1/SAMPLE1_SORTED_DEDUPED.bam

      Please comment.
      I will be very grateful for your concern.
      Rocky
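One way to see why position-based deduplication hits short, highly expressed genes hardest is to count how many distinct read placements a transcript can hold. The sketch below is a simplification under stated assumptions: single-end reads, a duplicate defined as a read sharing start position and strand with another read, reads lying fully inside the transcript, a hypothetical read length of 50 nt, and a hypothetical 90 nt transcript (roughly a 30 aa coding sequence). None of these values are given in the thread, and the function names are made up for illustration. It shows the ceiling that deduplication imposes; it does not by itself account for an FPKM of exactly 0.

```python
import random


def max_unique_reads(transcript_len, read_len):
    """Upper bound on reads surviving dedup: one per start position per strand."""
    if read_len > transcript_len:
        raise ValueError("read is longer than the transcript")
    return 2 * (transcript_len - read_len + 1)


def simulate_dedup(n_reads, transcript_len, read_len, rng):
    """Place reads uniformly at random and keep one per (start, strand)."""
    n_starts = max_unique_reads(transcript_len, read_len) // 2
    seen = set()
    for _ in range(n_reads):
        seen.add((rng.randrange(n_starts), rng.choice("+-")))
    return len(seen)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
READ_LEN = 50  # hypothetical read length

for label, tlen in (("short transcript (90 nt)", 90),
                    ("long transcript (3000 nt)", 3000)):
    print(label, "ceiling:", max_unique_reads(tlen, READ_LEN))
    for n in (10, 1000, 100000):
        kept = simulate_dedup(n, tlen, READ_LEN, rng)
        print(f"  {n:>6} reads sequenced -> {kept} kept after dedup")
```

Under these assumptions the short transcript can never keep more than 82 reads however many were sequenced, so its measured expression saturates long before that of a longer gene.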

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        I would argue that it's generally better to simply omit genes (or other features of interest) from testing when they seem to be outliers. If you were to use DESeq2, this gene would likely just get flagged and omitted from testing due to Cook's distance. The benefit of that approach is that you're not artificially decreasing your power among the more highly expressed genes.

        You might consider using something like CQN to see if there are particular length-bias differences among your samples that you can try to correct for.

        • mparida85
          Member
          • Jan 2014
          • 17

          #5
          Hi dpryan
          Thanks for your reply. Questions:
          a) What is a length-bias difference, and where can I read about it?
          b) Also, what does CQN stand for?
          Rocky

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            CQN is described in this paper and available from Bioconductor. You'll not use it with cuffdiff, as cuffdiff is only useful for the simplest of experiments. Use DESeq2/edgeR/etc. instead.
