  • DunderChief
    Junior Member
    • Aug 2012
    • 6

    Is my ChIP-seq data garbage?

    I received some ChIP-seq data with a very high level of sequence duplication (over 90% of the reads). The experiment was looking at H3K4me3. I aligned with bowtie2, ran rmdup, and ended up with only about 1 million unique mapped reads. Most of the peaks that MACS calls have only 5 reads in them. Is the data complete garbage, or can I get something legitimate out of these peaks?
  • xubeisi
    Junior Member
    • Dec 2010
    • 2

    #2
    It seems so. Check the MACS model file: if the distance between the Watson and Crick strand peaks is small, the data is useless. You may also want to check the FASTQ files, since duplication this high could well be due to adapters.
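
    To make the "Watson & Crick distance" concrete: forward-strand tags tend to pile up on one side of a binding site and reverse-strand tags on the other, and the offset between the two piles is what MACS reports as d. Below is a minimal sketch of that idea on a single toy peak, assuming tags are given as 5' end positions. The function name and the toy numbers are made up for illustration, and this is not MACS's actual model-building code.

```python
from collections import Counter

def strand_offset(forward_starts, reverse_starts):
    """Distance between the most common forward and reverse 5' positions.

    A crude stand-in for the d value reported by a peak caller; real tools
    combine many high-confidence peaks rather than looking at one region.
    """
    fwd_mode = Counter(forward_starts).most_common(1)[0][0]
    rev_mode = Counter(reverse_starts).most_common(1)[0][0]
    return rev_mode - fwd_mode

# Toy peak: forward tags cluster near 1000, reverse tags near 1100.
forward = [998, 1000, 1000, 1000, 1003, 1005]
reverse = [1095, 1100, 1100, 1100, 1102, 1104]
print(strand_offset(forward, reverse))  # 100
```

    If the two strands pile up almost on top of each other, the offset shrinks toward the read length or below, which is the "small distance" warning sign mentioned above.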


    • Tobikenobi
      Member
      • Mar 2013
      • 17

      #3
      Originally posted by xubeisi
      ... check MACS model file, if the watson & crick distance is small, this means it's useless...
      How small are we talking about?


      • xubeisi
        Junior Member
        • Dec 2010
        • 2

        #4
        Originally posted by Tobikenobi
        How small are we talking about?
        ~100 should be fine; to me, samples with less than 50 are trash.


        • simonandrews
          Simon Andrews
          • May 2009
          • 870

          #5
          Have you actually looked at your data (both before and after duplication)?

          Simply looking at the pattern of mapped reads will very quickly tell you if you're wasting your time spending more effort on your analysis.


          • Tobikenobi
            Member
            • Mar 2013
            • 17

            #6
            Sorry to hijack this thread...

            Originally posted by xubeisi
            ~100 should be fine, to me, samples less than 50 are trash
            Depending on what number I enter as mfold in MACS (>10), I can get anything from d=51 to d=118. Does that tell me anything, and is it desirable to go for the highest d possible?
            Thank you very much!


            • Tobikenobi
              Member
              • Mar 2013
              • 17

              #7
              Originally posted by simonandrews
              Have you actually looked at your data (both before and after duplication)?

              Simply looking at the pattern of mapped reads will very quickly tell you if you're wasting your time spending more effort on your analysis.
              Could you please specify what you mean by "before and after duplication"?

              Also, what would I expect to see in the case of high duplication levels (I am looking at ~75% duplication according to FastQC myself)?


              • simonandrews
                Simon Andrews
                • May 2009
                • 870

                #8
                Originally posted by Tobikenobi
                Could you please specify what you mean by "before and after duplication"?

                Also, what would I expect to see in the case of high duplication levels (I am looking at ~75% duplication according to FastQC myself)?
                High duplication can come from a few different sources. It could be that you've got very good enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions which have enormous coverage or you could have more general low level duplication of your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or it could be a small number of sites with huge coverage.

                If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

                Don't think that you should always deduplicate your data. There are definite downsides to doing so: for high-coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we would only deduplicate if we could see that there was a problem with the data which deduplication would help to fix.
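
                The point that one headline duplication figure can hide very different libraries can be illustrated with two toy datasets. This is only a sketch: duplication is taken here as 1 - unique/total, which is an assumption (FastQC and other tools use their own estimates, so the copy numbers needed to reach a given percentage differ), and the positions are invented.

```python
from collections import Counter

def duplication_summary(positions):
    """Return (duplication fraction, tallest stack) for mapped read positions."""
    counts = Counter(positions)
    total = sum(counts.values())
    duplication = 1 - len(counts) / total
    return duplication, max(counts.values())

# Library A: 250 positions, each covered by exactly 4 identical reads.
even_library = [pos for pos in range(250) for _ in range(4)]

# Library B: 249 positions seen once, plus one tower of 751 reads.
tower_library = list(range(249)) + [9999] * 751

print(duplication_summary(even_library))   # (0.75, 4)
print(duplication_summary(tower_library))  # (0.75, 751)
```

                Both libraries have 1000 reads and 250 distinct positions, so the summary percentage is identical; only looking at how the reads are distributed separates the two cases.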


                • Tobikenobi
                  Member
                  • Mar 2013
                  • 17

                  #9
                  Originally posted by simonandrews
                  High duplication can come from a few different sources. It could be that you've got very good enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions which have enormous coverage or you could have more general low level duplication of your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or it could be a small number of sites with huge coverage.

                  If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

                  Don't think that you should always deduplicate your data. There are definite downsides to doing so: for high-coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we would only deduplicate if we could see that there was a problem with the data which deduplication would help to fix.
                  Thank you very much for your help!
                  I actually looked at the data before and after filtering for duplicates and have attached a picture of my four samples before (top four tracks) and after deduplication (lower four tracks). Your second suggestion of isolated towers seems to be the case, as I saw similar things across all chromosomes.
                  I then went on to try peak calling on my original files (I only clipped the adapters and trimmed a little of the 3' end). For this, I randomly removed lines from the input so that treatment and control have equal numbers of tags. MACS then gives me the following output in the peaks.xls file:

                  # This file is generated by MACS
                  # ARGUMENTS LIST:
                  # name = E_2_mfold_20
                  # format = SAM
                  # ChIP-seq file = /galaxy/main_pool/pool7/files/005/979/dataset_5979847.dat
                  # control file = /galaxy/main_pool/pool7/files/005/965/dataset_5965128.dat
                  # effective genome size = 1.87e+09
                  # tag size = 50
                  # band width = 300
                  # model fold = 20
                  # pvalue cutoff = 1.00e-05
                  # Ranges for calculating regional lambda are : peak_region,1000,5000,10000
                  # unique tags in treatment: 2868667
                  # total tags in treatment: 22927127
                  # unique tags in control: 8014554
                  # total tags in control: 22927127

                  # d = 51

                  Especially in the treatment, the unique tags are very low compared to the control. This makes FDR unreliable.
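
                  For reference, the unique fractions implied by the tag counts in the MACS header above can be worked out directly. This is plain arithmetic on the reported numbers; the helper name is made up for illustration.

```python
def unique_fraction(unique_tags, total_tags):
    """Share of tags that MACS counted as unique."""
    return unique_tags / total_tags

# Values copied from the "unique tags" / "total tags" header lines.
treatment = unique_fraction(2_868_667, 22_927_127)
control = unique_fraction(8_014_554, 22_927_127)

print(f"treatment: {treatment:.1%} unique, {1 - treatment:.1%} duplicated")
print(f"control:   {control:.1%} unique, {1 - control:.1%} duplicated")
# treatment: 12.5% unique, 87.5% duplicated
# control:   35.0% unique, 65.0% duplicated
```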

                  Is it advisable to deduplicate the data and then try peak calling?
                  Also, as I have two replicates, would it be reasonable to combine them to obtain more unique reads, and then try the peak calling again?

                  Again, thank you very much for your input!

                  • simonandrews
                    Simon Andrews
                    • May 2009
                    • 870

                    #10
                    It might be worth noting that MACS does an internal deduplication of your data whilst peak calling. It works out the likely duplication level in your data and then removes any tags which are duplicated above that level when calling peaks. It may not remove as much data as doing a complete strict deduplication, but it does look at this information.
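
                    The difference between strict deduplication and removing only tags "duplicated above that level" can be sketched as a per-position cap. This is an illustration under assumptions, not MACS's code: the function name is hypothetical, and the cap max_dup is simply passed in, since how MACS derives the level from the data is not covered here.

```python
from collections import defaultdict

def cap_duplicates(tags, max_dup):
    """Keep at most max_dup tags per (chrom, position, strand) key."""
    seen = defaultdict(int)
    kept = []
    for tag in tags:
        if seen[tag] < max_dup:
            kept.append(tag)
        seen[tag] += 1
    return kept

# Toy data: a tower of 5 identical tags, a pair, and a singleton.
tags = [("chr1", 100, "+")] * 5 + [("chr1", 250, "-")] * 2 + [("chr2", 40, "+")]

print(len(cap_duplicates(tags, max_dup=1)))  # 3 (strict deduplication)
print(len(cap_duplicates(tags, max_dup=2)))  # 5 (tower trimmed, pair kept)
```

                    With a cap above 1, modest duplication that could plausibly come from real enrichment survives, while extreme towers are still cut down.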

                    I had a look at the image you posted, but at that resolution it's hard to see what's going on. It's not unusual to see a few huge outliers in the data (which can skew the scale on the y-axis). What matters more is what happens at a local level, especially the actual pattern of mapped reads rather than quantitated values.


                    • Tobikenobi
                      Member
                      • Mar 2013
                      • 17

                      #11
                      So if I understand correctly, it may not be necessary at all to deduplicate the data before using MACS, as it will attempt this on its own.
                      Moreover, if I deduplicated the data myself, I would discard true duplicates that arise from sequencing depth. So deduplicating would only make sense if I wanted an accurate FDR from MACS, which I can only get if I adjust the unique tag number beforehand?
