  • Very high duplication of sequences in ChIP-Seq sequencing results

    Hey,
    I recently got several ChIP-seq datasets back from our collaborators. When I analysed them I was quite surprised to find that the results were poor, especially since the antibodies all work in ChIP and these particular ChIPs definitely worked: they were QC'd as thoroughly as possible for enrichment at known binding sites.

    Anyway, it was suggested that I run the raw sequencing files through FastQC, which I did. I had already noticed a high level of read duplication in the libraries during the analysis, and sure enough FastQC picked up on it as well. All of the libraries fail miserably on this metric.

    My question is: where does this high level of read duplication come from? Surely it has to be from the PCR amplification in the library prep protocol (I did 18 cycles). Should I expect much better results if I used fewer rounds of PCR - 12, 14, something like that?

    Thanks
    Optimus

  • #2
    Are you using the multiplexing primers from Illumina? We had similar problems and got a much better yield of unique fragments after switching to full-length adaptors.


    • #3
      No, we aren't using the multiplexing primers.


      • #4
        Check out this SEQanswers thread ("Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc") and the threads linked therein, and also http://ngsbuzz.blogspot.com/2010/10/...explained.html


        • #5
          a) What are your criteria for calling the results "poor"?

          b) I don't agree that read duplications are PCR artifacts in general. Did you evaluate your data with and without de-duplication? Did you run a control sample (input)? If so, do you observe high duplication in the control as well?
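          A quick way to make that with/without comparison is position-based de-duplication, which is the same logic tools like samtools rmdup or Picard MarkDuplicates apply to mapped reads. A minimal Python sketch (the read tuples here are invented for illustration):

```python
def deduplicate(reads):
    """Keep one read per (chrom, start, strand) key -- the
    position-based logic used by samtools rmdup / Picard MarkDuplicates.
    `reads` is an iterable of (chrom, start, strand, name) tuples."""
    seen = {}
    for chrom, start, strand, name in reads:
        key = (chrom, start, strand)
        if key not in seen:  # first read at each position wins
            seen[key] = (chrom, start, strand, name)
    return list(seen.values())

# Toy example: r2 duplicates r1; r3 maps to the same position on the
# opposite strand, so it is kept as an independent fragment.
reads = [
    ("chr1", 100, "+", "r1"),
    ("chr1", 100, "+", "r2"),
    ("chr1", 100, "-", "r3"),
    ("chr2", 500, "+", "r4"),
]
unique = deduplicate(reads)
dup_rate = 1 - len(unique) / len(reads)  # 0.25 in this toy set
```

          If peaks called from the de-duplicated and raw data look similar, the duplication is mostly inflating coverage rather than creating false peaks.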


          • #6
            It sounds like a ligation buffer problem. If the ligation buffer isn't stored properly or has been through too many freeze-thaw cycles, it stops working. If your ligation efficiency is very low, the complexity of your library is reduced, and after PCR you get duplication because of the low amount of correctly ligated starting material.

            I have seen this problem quite a few times.


            • #7
              In general you want to do as few PCR cycles as possible. There's no point in doing 18 cycles if 12 would give you enough material. We've seen library diversity increase dramatically from a reduction of only 3-4 PCR cycles.

              Most of the time, though, the PCR cycles aren't the root cause of the problem. Usually some earlier step in the library prep is losing too much material, which then necessitates the extra PCR cycles to make the library. Getting as much material as possible through library construction is the key. Many of our scientists have found that eliminating some of the intermediate cleanup steps reduces material loss, which seems to make a big difference, especially with small amounts of starting material.

              It's also worth checking that you really have a duplication problem. FastQC makes a general check of duplication levels, but failing the duplicate-sequences test doesn't necessarily indicate a problem. In particular, if your ChIP has a relatively small number of highly enriched sites, you may have saturated the set of potential read start sites and will then start generating duplicates no matter what. The real warning sign is duplicated reads in an enriched region which isn't saturated, since those suggest the duplication is technical rather than biological. Since you said you could already see the duplication when you looked at the data, I suspect it is a real problem in your case, but it's worth checking.
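              The saturation effect above is easy to quantify with a simple occupancy model: if N reads fall uniformly on G possible start positions, the expected number of distinct starts is G * (1 - (1 - 1/G)^N), so once N exceeds G most new reads are unavoidable "duplicates". A sketch (the 10,000-position figure is an arbitrary illustration, not a ChIP-seq constant):

```python
def expected_unique_starts(n_reads, n_positions):
    """Expected number of distinct start sites when n_reads land
    uniformly at random on n_positions possible start positions."""
    return n_positions * (1 - (1 - 1 / n_positions) ** n_reads)

# With ~10,000 usable start positions in strongly enriched regions,
# deeper sequencing mostly re-samples starts that were already seen:
for n_reads in (10_000, 100_000, 1_000_000):
    uniq = expected_unique_starts(n_reads, 10_000)
    dup_frac = 1 - uniq / n_reads
    print(f"{n_reads:>9} reads -> {dup_frac:.0%} duplicate fraction")
```

              The point is that high duplicate fractions at high depth can be purely biological when the enriched footprint is small.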


              • #8
                Originally posted by OptimusBrien
                High duplication in single-end ChIP-seq libraries is a typical result of PCR, and it can directly reflect how difficult your protein is to isolate via cross-linking or how effective your antibodies are. Say you had a very small initial amount of DNA and had to amplify the library up to a workable amount for sequencing: you bind your antibodies, perform your ChIP, wash, PCR, and sequence. Naturally, you'll end up with many clones of the same fragment because of the small amount of starting DNA. When you map the reads you'll see a large number of duplicates, and you should remove them: they give you uninformative coverage and directly hurt your peak-calling. As the above poster said, you want as many unique start sites (uniquely mapped reads) as possible.

                One work-around is making sure you've ChIP'ed your binding sites effectively, which sometimes requires changes in the methods and materials. I've personally seen duplication levels as high as 90% and as low as 40%, and in all cases I remove the duplicates. That said, this is all relative: I might consider keeping them if throughput were very, very low. One important thing to consider is the method used to create your inputs/controls, since an adequate background can also rescue a low-throughput experiment.
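                Duplication levels like the 90%/40% figures above can be estimated from the raw reads even before mapping, which is roughly what FastQC's duplicate-sequences module reports. This is a simplified sketch of the idea, not FastQC's exact algorithm (FastQC samples a subset of reads and truncates long ones before counting):

```python
from collections import Counter

def sequence_duplication(seqs, prefix=50):
    """Fraction of reads that are extra copies of a sequence already
    seen, counting exact matches on the first `prefix` bases."""
    counts = Counter(s[:prefix] for s in seqs)
    return 1 - len(counts) / sum(counts.values())

# Toy set: three copies of one read plus two singletons -> 2 of 5
# reads are redundant copies, i.e. a 40% duplication level.
seqs = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTTTAAAA", "GGGGCCCC"]
dup = sequence_duplication(seqs)  # 0.4
```

                Comparing this pre-mapping estimate with the post-mapping, position-based duplicate rate helps separate PCR clones from reads that merely pile up at saturated sites.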


                • #9
                  A lot of great points have been made in this thread already. I'd just add that I was able to substantially decrease the number of amplification PCR cycles by using the Kapa HF polymerase instead of Phusion. I'm not a Kapa salesman; I'm just convinced it's a superior product.

                  That being said, as noted previously, the polymerase is not the root cause of your problem, if it is even a problem at all. While it won't solve things by itself, it will push them in the direction you want, i.e. greater efficiency through the library generation process, with reduced GC-bias as a bonus.
                  --------------
                  Ethan
