  • inbarpl
    Junior Member
    • Jul 2011
    • 4

    duplicate reads in Illumina short, single end reads of RNAseq data

    Dear all,
    I am performing QC on FASTQ files of Illumina 76 bp single-end RNA-seq reads. I keep getting indications that there are many PCR duplicates: the FastQC report indicates a 74.1% sequence duplication level, but no list of overrepresented sequences is given. When I ran SAMtools duplicate removal (rmdup), 74.9% of the sequences were found to be duplicates. When comparing the mapping results before and after duplicate removal, I see that the highly expressed genes have the highest fraction of duplicates (which were removed). This is not the first Illumina single-end short-read RNA-seq experiment in which I have seen this phenomenon.
    I keep wondering whether this is an experimental artifact (in which case, should we repeat the experiment?) or a valid result (in which case I must then believe that a few different inserts, generated from different copies of the same RNA transcript, were cleaved at the same base, giving inserts with identical 5’ ends which were then sequenced).
    I would be glad to hear your opinion on this matter.
    Many thanks
    Inbar
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    With 76-mer single reads, even for a perfectly diverse library, the theoretical depth limit at any point is 152 if you use rmdup. So any gene that has more coverage than that ceiling is going to be whacked down to 152x, and you won't be able to quantify expression of those highly expressed genes.
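
To illustrate why a depth ceiling breaks quantification, here is a minimal sketch. It assumes the 152x ceiling stated above and uses made-up true depths for two genes; `cap_depth` is a hypothetical helper, not part of any tool.

```python
# Best-case depth that can survive duplicate removal at one position.
CEILING = 152  # 76 bp single-end reads on both strands (figure from the post)


def cap_depth(true_depth, ceiling=CEILING):
    """Hypothetical helper: depth after de-duplication cannot exceed the ceiling."""
    return min(true_depth, ceiling)


# Made-up depths, for illustration only.
low_gene, high_gene = 100, 10_000

true_ratio = high_gene / low_gene
observed_ratio = cap_depth(high_gene) / cap_depth(low_gene)
print(true_ratio, observed_ratio)
```

Under these assumed numbers, a 100-fold difference in real coverage would show up as roughly a 1.5-fold difference after de-duplication.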

    That library sounds awfully non-diverse, but if your sample is dominated by a couple of genes at super high levels, maybe it's accurate. I guess you could examine the highly represented reads. Do they cover whole genes as if the sample had a huge amount of that RNA? Or is there just one position that has 100K reads, and adjacent positions have much less?
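
One rough way to ask the question above (whole gene covered versus one pile-up) is to tabulate read start positions within a gene. This is only a sketch: `start_site_profile` is a hypothetical helper, the start coordinates are made up, and in practice they would come from your alignments.

```python
from collections import Counter


def start_site_profile(starts):
    """Summarise how read start positions are spread within one gene."""
    counts = Counter(starts)
    total = sum(counts.values())
    top_start, top_count = counts.most_common(1)[0]
    return {
        "distinct_starts": len(counts),
        "top_start": top_start,
        "top_fraction": top_count / total,  # share of reads at the busiest start
    }


# Made-up data: reads spread evenly over 100 start sites.
broad = list(range(1, 101)) * 10
# Made-up data: almost all reads begin at a single position.
spike = [50] * 1000 + list(range(1, 11))

print(start_site_profile(broad))
print(start_site_profile(spike))
```

A high `top_fraction` with few distinct starts would look more like an amplification artifact; many distinct starts with a low `top_fraction` would look more like genuinely high coverage.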

    • arvid
      Senior Member
      • Jul 2011
      • 156

      #3
      Exactly, I'd have a look at the shape of the read alignments before de-duplication to see whether it looks like PCR or simply very high coverage. 74% isn't exceptionally high; I usually see 60-80% for libraries which look OK.
      In any case, de-duplicating reads for downstream quantification is a delicate matter, as it is difficult to tell PCR copies apart from valid high-coverage reads, as swbarnes2 pointed out.

      • inbarpl
        Junior Member
        • Jul 2011
        • 4

        #4
        swbarnes2, Thanks a lot for your answer,
        I guess this is exactly the case in my data set: the samples are from Arabidopsis, so I guess that the Rubisco gene is the dominant one in the library. I will check what you've recommended using IGV. Sorry for my ignorance, but could you please explain the definition of "theoretical depth limit" and the calculation you did to derive it for my parameters?
        many thanks
        Inbar

        • swbarnes2
          Senior Member
          • May 2008
          • 910

          #5
          If you filter single end data for uniqueness, you will have at most two reads beginning at every point: one in the forward direction, one in the reverse.

          So with 76-mers, the base at position 100 will be covered by at most 152 reads: 76 in the forward direction, starting at bases 25-100, and 76 in the reverse direction, starting from 100-175. You can't have three reads all running forward, starting at position 75, because your rmdup will get rid of two of them.
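
The arithmetic above can be checked with a small simulation. This is a sketch under simplifying assumptions: reads are represented only by their leftmost aligned coordinate and strand, duplicates are defined as reads sharing both, and the function names are hypothetical rather than anything from SAMtools.

```python
READ_LEN = 76


def dedup_single_end(reads):
    """Keep one read per (leftmost start, strand), a stand-in for rmdup-style filtering."""
    return list(dict.fromkeys(reads))  # drops repeats, keeps first occurrence


def depth_at(pos, reads, read_len=READ_LEN):
    """Count reads whose aligned span [start, start + read_len) covers pos."""
    return sum(1 for start, _strand in reads if start <= pos < start + read_len)


# A saturated made-up library: 5 copies of a read at every start on both strands.
copies = 5
reads = [
    (start, strand)
    for start in range(1, 201)
    for strand in ("+", "-")
    for _ in range(copies)
]

before = depth_at(100, reads)
after = depth_at(100, dedup_single_end(reads))
print(before, after)  # 760 152
```

However many copies the library contains, the de-duplicated depth at position 100 stays at 76 starts (25 through 100) times 2 strands.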

          With paired end, you can have three reads which run in the forward direction starting at base 75, if their mates all start at different sites, because if their mates are at different sites, they must have come from different fragments. So there's a ceiling there too, depending on how variable your insert sizes are, but it's far higher than the ceiling for single read runs.
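
The difference between the two cases comes down to what counts as "the same" read. A minimal sketch, assuming duplicates are defined by a tuple key and using made-up coordinates; `dedup` and the two key functions are hypothetical, not the SAMtools implementation.

```python
def dedup(reads, key):
    """Keep the first read seen for each distinct key."""
    seen = set()
    kept = []
    for read in reads:
        k = key(read)
        if k not in seen:
            seen.add(k)
            kept.append(read)
    return kept


# (start, strand, mate_start); made-up values. The last pair repeats the first.
pairs = [(75, "+", 250), (75, "+", 262), (75, "+", 281), (75, "+", 250)]


def single_end_key(read):
    return (read[0], read[1])  # start and strand only


def paired_end_key(read):
    return (read[0], read[1], read[2])  # mate position also distinguishes fragments


print(len(dedup(pairs, single_end_key)))  # 1
print(len(dedup(pairs, paired_end_key)))  # 3
```

With the mate position in the key, only the pair that matches at both ends is dropped, which is why the paired-end ceiling grows with the spread of insert sizes.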
