  • Is my ChIP-seq data garbage?

    I received some ChIP-seq data with a very high level of sequence duplication (over 90% of the reads). The experiment was looking at H3K4me3. I aligned with bowtie2 and ran rmdup, and ended up with only about 1 million unique reads mapped. Most of the peaks MACS calls contain only about 5 reads. I'm wondering if the data is complete garbage or if I can still get something legitimate out of these peaks.

  • #2
    It seems so. Check the MACS model file: if the Watson/Crick strand distance is small, the data are useless. You may also want to check with FastQC; duplication this high could well be due to adapters.
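    For reference, the Watson/Crick distance being discussed is essentially the shift that best aligns forward- and reverse-strand read starts; MACS reports it as d. A minimal sketch on synthetic data (purely illustrative; this is not MACS's code, and `best_shift` is a made-up helper):

```python
import numpy as np

# Synthetic library: 20 binding sites, fragments ~150 bp, so forward-strand
# read starts sit ~75 bp left of each site and reverse-strand starts ~75 bp right.
rng = np.random.default_rng(0)
genome, frag_len = 10_000, 150
fwd = np.zeros(genome)
rev = np.zeros(genome)
for site in rng.integers(500, genome - 500, size=20):
    for _ in range(50):
        centre = site + rng.integers(-20, 21)    # binding-site jitter
        fwd[centre - frag_len // 2] += 1         # 5' end, forward strand
        rev[centre + frag_len // 2] += 1         # 5' end, reverse strand

def best_shift(fwd, rev, max_shift=400):
    # The shift maximising the strand cross-correlation estimates the
    # fragment length -- the "Watson & Crick distance" in the model file.
    scores = [np.corrcoef(fwd[:-k], rev[k:])[0, 1] for k in range(1, max_shift)]
    return int(np.argmax(scores)) + 1

print(best_shift(fwd, rev))  # 150 for this synthetic library
```

    If the recovered shift collapses towards the read length or below, the two strands are not flanking real binding events, which is why a small d is a red flag.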

    Comment


    • #3
      Originally posted by xubeisi View Post
      ... check the MACS model file: if the Watson/Crick strand distance is small, the data are useless ...
      How small are we talking about?

      Comment


      • #4
        Originally posted by Tobikenobi View Post
        How small are we talking about?
        ~100 should be fine; to me, samples with less than 50 are trash.

        Comment


        • #5
          Have you actually looked at your data (both before and after deduplication)?

          Simply looking at the pattern of mapped reads will very quickly tell you whether you're wasting your time spending more effort on your analysis.

          Comment


          • #6
            Sorry to hijack this thread...

            Originally posted by xubeisi View Post
            ~100 should be fine; to me, samples with less than 50 are trash.
            Depending on what number I enter as mfold in MACS (>10), I can get anything from d=51 to d=118. Does that tell me anything, and is it desirable to go for the highest d possible?
            Thank you very much!

            Comment


            • #7
              Originally posted by simonandrews View Post
              Have you actually looked at your data (both before and after deduplication)?

              Simply looking at the pattern of mapped reads will very quickly tell you whether you're wasting your time spending more effort on your analysis.
              Could you please specify what you mean by `before and after deduplication`?

              Also, what would I expect to see in the case of high duplication levels? (I am looking at ~75% duplication according to FastQC myself.)

              Comment


              • #8
                Originally posted by Tobikenobi View Post
                Could you please specify what you mean by `before and after deduplication`?

                Also, what would I expect to see in the case of high duplication levels? (I am looking at ~75% duplication according to FastQC myself.)
                High duplication can come from a few different sources. It could be that you've got very good enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions which have enormous coverage or you could have more general low level duplication of your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or it could be a small number of sites with huge coverage.

                If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

                Don't assume that you should always deduplicate your data. There are definite downsides to doing so: for high-coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we only deduplicate if we can see a problem with the data which deduplication would help to fix.
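                To make that concrete, here's a toy sketch (not FastQC's actual estimator; `dup_rate` is a hypothetical helper) showing how two very different libraries can report the same duplication level:

```python
def dup_rate(copy_counts):
    # Fraction of reads a strict deduplication would remove:
    # 1 - (distinct fragments / total reads).
    total = sum(copy_counts)
    return 1 - len(copy_counts) / total

# Scenario 1: saturated coverage -- every fragment sequenced exactly 4 times.
even = [4] * 1_000_000
# Scenario 2: mostly singletons plus two enormous PCR towers.
towers = [1] * 999_998 + [1_500_001, 1_500_001]

print(dup_rate(even), dup_rate(towers))  # 0.75 0.75
```

                Deduplication would throw away real signal in the first scenario but is essential in the second; the duplication percentage alone cannot distinguish them, which is why looking at the mapped reads matters.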

                Comment


                • #9
                  Originally posted by simonandrews View Post
                  High duplication can come from a few different sources. It could be that you've got very good enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions which have enormous coverage or you could have more general low level duplication of your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or it could be a small number of sites with huge coverage.

                  If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

                  Don't assume that you should always deduplicate your data. There are definite downsides to doing so: for high-coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we only deduplicate if we can see a problem with the data which deduplication would help to fix.
                  Thank you very much for your help!
                  I actually looked at the data before and after filtering for duplicates and have attached a picture of my four samples before (top four tracks) and after deduplication (lower four tracks). Your second suggestion of isolated towers seems to be the case, as I saw similar things across all chromosomes.
                  I then went on to try peak calling on my original files (only clipped the adapters and trimmed a little off the 3' end), for which I randomly selected and omitted lines in the input to get equal numbers of tags. MACS then gives me the following output in the peaks.xls file:

                  # This file is generated by MACS
                  # ARGUMENTS LIST:
                  # name = E_2_mfold_20
                  # format = SAM
                  # ChIP-seq file = /galaxy/main_pool/pool7/files/005/979/dataset_5979847.dat
                  # control file = /galaxy/main_pool/pool7/files/005/965/dataset_5965128.dat
                  # effective genome size = 1.87e+09
                  # tag size = 50
                  # band width = 300
                  # model fold = 20
                  # pvalue cutoff = 1.00e-05
                  # Ranges for calculating regional lambda are : peak_region,1000,5000,10000
                  # unique tags in treatment: 2868667
                  # total tags in treatment: 22927127
                  # unique tags in control: 8014554
                  # total tags in control: 22927127

                  # d = 51

                  Especially in the treatment, the number of unique tags is very low compared to the control. This makes the FDR unreliable.

                  Is it advisable to deduplicate the data and then try peak calling?
                  Also, as I have two replicates, would it be reasonable to combine them to obtain more unique reads, and then try the peak calling again?

                  Again, thank you very much for your input!
                  Attached Files
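                  The "randomly selected and omitted lines" step above can be sketched as follows (illustrative only; for real SAM/BAM files `samtools view -s` subsamples while keeping the header intact):

```python
import random

def downsample(reads, target, seed=42):
    # Draw exactly `target` reads uniformly at random, without replacement,
    # to equalise tag counts between treatment and control before peak calling.
    return random.Random(seed).sample(reads, target)

reads = [f"read_{i}" for i in range(10_000)]
subset = downsample(reads, 2_500)
print(len(subset))  # 2500
```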

                  Comment


                  • #10
                    It might be worth noting that MACS does an internal deduplication of your data whilst peak calling. It works out the likely duplication level in your data and then removes any tags which are duplicated above that level when calling peaks. It may not remove as much data as doing a complete strict deduplication, but it does look at this information.
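                    For the curious, the idea behind that internal handling can be sketched with a simple Poisson model (MACS actually uses a binomial test, so treat this as an approximation; `max_dup_poisson` is a hypothetical helper, not a MACS function):

```python
import math

def max_dup_poisson(total_tags, genome_size, pvalue=1e-5):
    # Under random fragmentation, tag counts per genomic position are roughly
    # Poisson(total_tags / genome_size). Find the smallest k whose tail
    # probability P(X > k) drops below the cutoff; stacks deeper than k at a
    # single position look like PCR artefacts rather than real coverage.
    lam = total_tags / genome_size
    pmf = math.exp(-lam)          # P(X = 0)
    cdf = pmf
    k = 0
    while 1 - cdf >= pvalue:
        k += 1
        pmf *= lam / k            # P(X = k)
        cdf += pmf
    return max(k, 1)              # always keep at least one tag per position

# With the tag counts and genome size from the MACS log above:
print(max_dup_poisson(22_927_127, 1.87e9))  # 2
```

                    At this depth only a couple of tags per exact position are consistent with random coverage, so an automatic filter along these lines removes the towers while leaving ordinary enrichment alone.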

                    I had a look at the image you posted, but at that resolution it's hard to see what's going on. It's not unusual to see a few huge outliers in the data (which can skew the scale on the y-axis); it's what happens at a more local level that matters, especially the actual pattern of mapped reads rather than quantitated values.

                    Comment


                    • #11
                      So if I understand correctly, it may not be necessary to deduplicate the data before running MACS at all, as it will attempt this on its own.
                      Moreover, if I deduplicated myself, I would also discard genuine duplicates that simply arise from sequencing depth. So deduplicating would really only make sense if I wanted the accurate FDR from MACS, which I can only get by adjusting the unique tag number beforehand?

                      Comment
