**xubeisi** · 01-24-2013, 08:12 PM

It seems so. Check the MACS model file: if the distance between the Watson and Crick strand peaks is small, the data are probably useless. You may also want to check the FASTQ file, as this high duplication could be due to adapters.

**Tobikenobi** · 03-28-2013, 09:55 PM

Originally posted by xubeisi:

... check MACS model file, if the watson & crick distance is small, this means it's useless...

How small are we talking about?

**xubeisi** · 03-29-2013, 01:30 AM

Originally posted by Tobikenobi:

How small are we talking about?

A distance of ~100 should be fine; to me, samples with less than 50 are trash.
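
For readers unfamiliar with the model file: d is the offset between the pile-up of + strand (Watson) tags and the pile-up of - strand (Crick) tags around a binding site. The Python sketch below is only a toy illustration of estimating such an offset by trying every shift and counting matches. It is not how MACS builds its model, and the function name `estimate_d` and the `max_shift` parameter are made up for this example.

```python
from collections import Counter


def estimate_d(plus_positions, minus_positions, max_shift=300):
    """Return the shift (bp) at which the most + strand tags line up with - strand tags.

    Hypothetical helper for illustration only; MACS uses paired peaks instead.
    """
    minus_counts = Counter(minus_positions)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Number of + strand tags with a - strand tag exactly `shift` bp downstream.
        score = sum(minus_counts[pos + shift] for pos in plus_positions)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Toy data: every - strand tag sits 120 bp downstream of a + strand tag.
plus_tags = [100, 102, 105, 500, 503]
minus_tags = [pos + 120 for pos in plus_tags]
print(estimate_d(plus_tags, minus_tags))
```

On real data the tags around a weak or absent enrichment do not line up at any particular shift, which is one way a very small or unstable d can arise.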

**simonandrews** · 03-29-2013, 03:34 AM

Have you actually looked at your data (both before and after duplication)?

Simply looking at the pattern of mapped reads will very quickly tell you if you're wasting your time spending more effort on your analysis.

**Tobikenobi** · 03-31-2013, 05:27 AM

Sorry to hijack this thread...

Originally posted by xubeisi:

～100 should be fine, to me, samples less than 50 are trash

Depending on what number I enter as mfold in MACS (>10), I can get anything from d=51 to d=118. Does that tell me anything, and is it desirable to go for the highest d possible?
Thank you very much!

**Tobikenobi** · 03-31-2013, 04:10 PM

Originally posted by simonandrews:

Have you actually looked at your data (both before and after duplication)?

Simply looking at the pattern of mapped reads will very quickly tell you if you're wasting your time spending more effort on your analysis.

Could you please specify what you mean by `before and after duplication`?

Also, what would I expect to see in the case of high duplication levels (I am seeing ~75% duplication according to FastQC myself)?

**simonandrews** · 04-02-2013, 12:23 AM

Originally posted by Tobikenobi:

Could you please specify what you mean by `before and after duplication`?

Also, what would I expect to see in the case of high duplication levels (I am seeing ~75% duplication according to FastQC myself)?

High duplication can come from a few different sources. It could be that you've got very well enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions which have enormous coverage, or you could have more general low-level duplication of your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or it could be a small number of sites with huge coverage.

If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

Don't think that you should always deduplicate your data. There are definite downsides to doing so: for high coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we would only deduplicate if we could see that there was a problem with the data which deduplication would help to fix.
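
To make the point about different sources concrete, here is a minimal Python sketch showing two toy libraries with the same overall duplication level but very different structure. The function `duplication_summary` and the toy read lists are invented for this illustration; the duplication level is taken here simply as 1 minus unique positions over total reads, which is not necessarily how FastQC computes its figure.

```python
from collections import Counter


def duplication_summary(read_positions):
    """Return (duplicated fraction, {copies per site: number of sites}).

    read_positions: iterable of hashable keys such as (chrom, start, strand).
    """
    reads_per_site = Counter(read_positions)
    total_reads = sum(reads_per_site.values())
    unique_sites = len(reads_per_site)
    dup_fraction = 1 - unique_sites / total_reads
    # How many sites were seen once, twice, ... (the shape of the duplication).
    copies_histogram = dict(Counter(reads_per_site.values()))
    return dup_fraction, copies_histogram


# Library A: general duplication, every one of 25 sites sequenced 4 times.
even_library = [("chr1", i, "+") for i in range(25) for _ in range(4)]

# Library B: 24 sites seen once plus a single "tower" seen 76 times.
tower_library = [("chr1", i, "+") for i in range(24)] + [("chr1", 999, "+")] * 76

for name, library in (("even", even_library), ("towers", tower_library)):
    fraction, histogram = duplication_summary(library)
    print(name, fraction, histogram)
```

Both toy libraries contain 100 reads at 25 distinct positions, so the headline percentage is identical; only the per-site histogram (or a look at the mapped reads) tells them apart.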

**Tobikenobi** · 04-03-2013, 09:27 PM

Thank you very much for your help!
I actually looked at the data before and after filtering for duplicates and have attached a picture of my four samples before (top four tracks) and after de-duplication (lower four tracks). Your second scenario of isolated towers seems to be the case, as I saw similar things across all chromosomes.
I then went on to try peak calling on my original files (I only clipped the adapters and trimmed a little off the 3' end), for which I randomly selected and omitted lines in the input to get equal numbers of tags. MACS then gives me the following output in the peaks.xls file:

# This file is generated by MACS
# ARGUMENTS LIST:
# name = E_2_mfold_20
# format = SAM
# ChIP-seq file = /galaxy/main_pool/pool7/files/005/979/dataset_5979847.dat
# control file = /galaxy/main_pool/pool7/files/005/965/dataset_5965128.dat
# effective genome size = 1.87e+09
# tag size = 50
# band width = 300
# model fold = 20
# pvalue cutoff = 1.00e-05
# Ranges for calculating regional lambda are : peak_region,1000,5000,10000
# unique tags in treatment: 2868667
# total tags in treatment: 22927127
# unique tags in control: 8014554
# total tags in control: 22927127
# d = 51

Especially in the treatment, the number of unique tags is very low compared to the control (about 12.5% versus about 35% of the total tags). This makes the FDR unreliable.
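
The unique-tag fractions can be read straight off the header above. A small Python sketch of that arithmetic, where the four tag lines are copied from the output and the helper name `read_tag_counts` is made up for this example:

```python
import re

# The four tag-count lines copied from the peaks.xls header above.
MACS_HEADER = """\
# unique tags in treatment: 2868667
# total tags in treatment: 22927127
# unique tags in control: 8014554
# total tags in control: 22927127
"""


def read_tag_counts(header_text):
    """Parse '# <unique|total> tags in <sample>: N' lines into a nested dict."""
    counts = {}
    pattern = re.compile(r"#\s*(unique|total) tags in (\w+):\s*(\d+)")
    for line in header_text.splitlines():
        match = pattern.match(line)
        if match:
            kind, sample, value = match.groups()
            counts.setdefault(sample, {})[kind] = int(value)
    return counts


tag_counts = read_tag_counts(MACS_HEADER)
unique_fraction = {
    sample: c["unique"] / c["total"] for sample, c in tag_counts.items()
}
for sample, fraction in unique_fraction.items():
    print(f"{sample}: {fraction:.1%} of tags are unique")
```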

Is it advisable to de-duplicate the data and try peak calling then?
Also, as I have two replicates, would it be reasonable to combine the two replicates to obtain more unique reads, and then try the peak calling again?
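
For reference, the random removal of input lines described above could be sketched in Python roughly as follows. This is an assumption about the procedure, not the exact steps that were run: the function name `subsample_sam_lines` and the `seed` parameter are hypothetical, header lines are taken to be those starting with `@`, and everything is held in memory, which may not suit very large files.

```python
import random


def subsample_sam_lines(lines, n_keep, seed=0):
    """Keep all '@' header lines plus n_keep randomly chosen alignment lines.

    The original order of the kept alignment lines is preserved.
    """
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if not line.startswith("@")]
    if n_keep > len(records):
        raise ValueError("cannot keep more records than the file contains")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept_indices = set(rng.sample(range(len(records)), n_keep))
    sampled = [rec for i, rec in enumerate(records) if i in kept_indices]
    return header + sampled


# Toy SAM-like input: two header lines and ten alignment lines.
toy_lines = ["@HD\tVN:1.0", "@SQ\tSN:chr1\tLN:1000"] + [
    f"read{i}\t0\tchr1\t{i + 1}" for i in range(10)
]
subsampled = subsample_sam_lines(toy_lines, n_keep=4, seed=1)
print(len(subsampled))
```

Note that subsampling the input this way equalises total tags only; it does nothing about the difference in unique tags between treatment and control.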

Again, thank you very much for your input!

Attached file: combined_tracks copy.jpg (76.0 KB)

**simonandrews** · 04-03-2013, 11:30 PM

It might be worth noting that MACS does an internal deduplication of your data whilst peak calling. It works out the likely duplication level in your data and then removes any tags which are duplicated above that level when calling peaks. It may not remove as much data as doing a complete strict deduplication, but it does look at this information.
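
As a rough illustration of what "working out the likely duplication level" can mean: as far as I understand the MACS 1.4 documentation, the `auto` setting of `--keep-dup` derives the maximum number of tags allowed at one position from a binomial distribution with a 1e-5 p-value cutoff. The Python sketch below follows that idea under my own assumptions (one trial per tag, success probability 1/genome size, strands ignored); it is not MACS's code, and `max_dup_tags` is a made-up name.

```python
import math


def max_dup_tags(total_tags, genome_size, pvalue=1e-5):
    """Smallest k such that P(more than k tags at one position) < pvalue.

    Assumes tags land uniformly at random: X ~ Binomial(total_tags, 1/genome_size).
    """
    p = 1.0 / genome_size
    pmf = math.exp(total_tags * math.log1p(-p))  # P(X = 0)
    cdf = pmf
    k = 0
    while 1.0 - cdf >= pvalue and k < total_tags:
        # Recurrence: P(X = k+1) = P(X = k) * (n - k) / (k + 1) * p / (1 - p)
        pmf *= (total_tags - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return max(k, 1)  # always allow at least one tag per position


# Values taken from the MACS header posted earlier in the thread.
print(max_dup_tags(total_tags=22927127, genome_size=1.87e9))
```

A cap derived this way grows with sequencing depth, which is why it can remove less than a strict one-read-per-position deduplication.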

I had a look at the image you posted, but at that resolution it's hard to see what's going on. It's not unusual to see a few huge outliers in the data (which can skew the scale on the y-axis). What matters more is what happens at a local level, especially the actual pattern of mapped reads rather than quantitated values.

**Tobikenobi** · 04-03-2013, 11:54 PM

So if I understand correctly, it may not be necessary at all to deduplicate the data before using MACS, as it will attempt this on its own.
Moreover, if I deduplicated the data myself, I would discard true duplicates that arise from sequencing depth. So deduplicating would only make sense if I really wanted an accurate FDR from MACS, which I can only get if I adjust the unique tag number beforehand?

Is my chip-seq data garbage?
