
  • Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data

    Hi all,

    I am working with a BS-Seq dataset and I came across this result that puzzles me a bit.

    I ran fastqc on the fastq files first and got an estimated duplication level of 36.83% (fastqc plot attached).

    Afterwards, I mapped the data using Bismark: Here's the mapping report:

    Number of paired-end alignments with a unique best hit: 165375035
    Mapping efficiency: 71.3%
    Sequences with no alignments under any condition: 52756927
    Sequences did not map uniquely: 13328411

    The number of sequences that did not map uniquely is less than 10% of the number of uniquely mapped sequences.
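The "less than 10%" figure can be checked directly from the report. The snippet below is only a quick arithmetic sketch using the three counts quoted above; the variable names are made up for illustration and are not Bismark output fields.

```python
# Counts copied from the Bismark mapping report quoted above.
unique_best_hit = 165_375_035   # paired-end alignments with a unique best hit
no_alignment = 52_756_927       # sequences with no alignments under any condition
not_unique = 13_328_411         # sequences that did not map uniquely

# Non-uniquely mapped sequences relative to uniquely mapped ones.
ratio_to_unique = not_unique / unique_best_hit

# Non-uniquely mapped sequences relative to the sum of the three listed categories.
# (Assumption: this sum is used only as a rough denominator here.)
listed_total = unique_best_hit + no_alignment + not_unique
ratio_to_listed = not_unique / listed_total

print(f"non-unique / unique:        {ratio_to_unique:.2%}")
print(f"non-unique / listed total:  {ratio_to_listed:.2%}")
```

Either way, the non-uniquely mapped fraction is far below the 36.83% duplication level reported by fastqc.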

    So I can only think of two possibilities here:

    1- Our dataset really contains a high level of polyclonality (therefore we'll have to worry about it and improve the protocol we use to prepare the BS-Seq library). This would imply that >20% of the duplicate reads are not mapped at all explaining the difference in duplication levels between fastqc and bismark. Have any bismark users come across something like this before?

    2- Could it be that there is something about the way fastqc estimates the duplicate levels that artificially boosts the numbers of duplicates in our dataset? I'm not really sure about this because I used fastqc in the past and it always seemed to work really well but I wonder if there is something about bisulfite converted reads that could cause this behaviour

    Thanks a lot in advance for your answers!

  • #2
    Something more about this. Going through the SEQanswers post related to fastqc I've found a link to this page:



    where Simon Andrews mentions that fastqc only uses the first 50bp of each sequence to search for duplicates. Since the reads in my dataset are 100bp long, I guess the duplication levels can be inflated by only considering the first 50bp when looking for identical reads. So now I'm thinking that the correct answer is the 2nd possibility.
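To illustrate why comparing only the first 50bp can raise the apparent duplication level, here is a toy sketch. The reads are invented, and `duplicate_fraction` is a hypothetical helper that uses a simplified definition (fraction of reads that repeat another read); it is not how fastqc itself computes its percentage.

```python
from collections import Counter

def duplicate_fraction(reads, prefix_len=None):
    """Fraction of reads that are repeats, optionally comparing only a prefix."""
    keys = [r[:prefix_len] if prefix_len else r for r in reads]
    distinct = len(Counter(keys))
    return 1 - distinct / len(keys)

# Toy 100bp reads (made up): read_a and read_b share their first 50bp only.
shared_start = "ACGT" * 12 + "AC"      # 50 bp
read_a = shared_start + "A" * 50
read_b = shared_start + "C" * 50
read_c = "T" * 100
reads = [read_a, read_b, read_c, read_c]

full = duplicate_fraction(reads)         # compares all 100bp
first50 = duplicate_fraction(reads, 50)  # compares only the first 50bp

print(f"full length: {full:.0%}, first 50bp: {first50:.0%}")
```

Truncation can only merge reads that were distinct at full length, so the prefix-based estimate is never lower than the full-length one.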



    • #3
      Hi gcarbajosa,

      As you mentioned, FastQC determines an approximate level of sequence duplication by storing the first 50bp of the first 200,000 different sequences it encounters in a sequencing file. These duplicated sequences may, for example, be adapter contamination (which would not map at all in Bismark), but they could also be duplicate reads that were amplified by PCR during library construction. Such reads might align perfectly well and uniquely to the genome even though they are technical duplicates.
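The tracking scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not FastQC's actual code: `approx_duplicate_fraction` is a hypothetical function, and the final "fraction of repeats" is a simplified summary that may differ from the percentage FastQC reports.

```python
def approx_duplicate_fraction(reads, prefix_len=50, max_tracked=200_000):
    """Estimate duplication from prefixes of the first `max_tracked` distinct sequences."""
    counts = {}
    for read in reads:
        key = read[:prefix_len]
        if key in counts:
            counts[key] += 1          # repeat of a tracked sequence
        elif len(counts) < max_tracked:
            counts[key] = 1           # start tracking a new sequence
        # New sequences seen after the cap is reached are ignored.
    total = sum(counts.values())
    return 1 - len(counts) / total if total else 0.0

# Toy input (made up) with a tiny cap so the effect of the cap is visible.
toy = ["AAAA", "CCCC", "AAAA", "GGGG", "GGGG"]
capped = approx_duplicate_fraction(toy, prefix_len=4, max_tracked=2)
uncapped = approx_duplicate_fraction(toy, prefix_len=4, max_tracked=10)
print(capped, uncapped)
```

Note that nothing in this estimate depends on where, or whether, a read maps, which is why it cannot be compared directly with the non-unique mapping count.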

      So essentially, the number of reads mapping non-uniquely (which are discarded) and the number of duplicated reads are not the same thing, and Bismark does not specifically output anything regarding duplication levels. I hope this helps.
