**kushald** · 07-12-2013, 03:28 AM

You could consider removing those duplicates using SAMTOOLS (rmdup).

**Heisman** · 07-12-2013, 07:22 AM

I don't think he/she is referring to PCR duplicates, but rather to biological replicates.

gene_x, could you explain your experiment in a bit more detail? More specifically, describe how many samples you've sequenced, how many lanes of data you have for each sample, and how many libraries were generated for each sample (if there is more than one lane of data for each sample), and then please briefly describe what you're hoping to do (find peaks in both samples, find peaks present in one set of samples vs. another, etc.).

**gene_x** · 07-12-2013, 11:31 AM

I just realized my latest post is asking roughly the same question.

Heisman, I want to reanalyze/replicate some ENCODE ChIP-seq data, and I couldn't be sure from their description how they did it. Basically, I want to find peaks in sequencing data that includes replicates.

**Heisman** · 07-12-2013, 11:46 AM

I see the similarity between this and your other post I just responded to:

If the same sample was sequenced multiple times with the exact same library, then the only differences in the data would be due to differences in the sequencing itself. In this case you could align each lane of data separately and give them separate RG IDs but the same LB and SM (library and sample) IDs.

If the same sample was sequenced multiple times with different libraries (i.e., you prepped the sample twice), then you can do the above but make sure the LB ID is different in addition to the RG ID.
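As a rough illustration of the two cases above, here is a minimal Python sketch that formats SAM `@RG` header lines for each lane. The `Lane` class, the IDs (`sample1`, `lib1`, etc.), and the `ILLUMINA` platform value are all made up for the example; it only shows which fields stay the same and which change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    """Hypothetical description of one lane of sequencing data."""
    rg_id: str                  # ID: unique per lane / sequencing run
    library: str                # LB: shared by lanes of the same library prep
    sample: str                 # SM: shared by everything from the same sample
    platform: str = "ILLUMINA"  # PL: assumed platform for this example


def rg_header_line(lane: Lane) -> str:
    """Format one @RG header line as tab-separated TAG:VALUE fields."""
    fields = [
        f"ID:{lane.rg_id}",
        f"LB:{lane.library}",
        f"SM:{lane.sample}",
        f"PL:{lane.platform}",
    ]
    return "@RG\t" + "\t".join(fields)


# Case 1: the same library sequenced on two lanes (RG differs, LB and SM match).
same_library = [
    Lane("sample1_lib1_lane1", "lib1", "sample1"),
    Lane("sample1_lib1_lane2", "lib1", "sample1"),
]

# Case 2: the sample was prepped twice (RG and LB differ, SM matches).
two_libraries = [
    Lane("sample1_lib1_lane1", "lib1", "sample1"),
    Lane("sample1_lib2_lane1", "lib2", "sample1"),
]

for lane in same_library + two_libraries:
    print(rg_header_line(lane))
```

How the read group actually gets attached at alignment time depends on the aligner, so check its documentation for the relevant option.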

If you have completely different samples that are true biological replicates, then you probably don't want to merge the raw or aligned data at all; rather, you'll probably want to call peaks on the two samples separately and then compare the results in some capacity (e.g., using IDR: https://sites.google.com/site/anshul...e/projects/idr)

Honestly, I'm not experienced enough with peak calling to give great advice, but the above should be solid.
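To make "call peaks separately and then compare" a bit more concrete, here is a minimal sketch of the simplest possible comparison: keeping the peaks from one replicate that overlap any peak in the other. This is not IDR, just a plain interval-overlap check, and the peak coordinates are invented. Peaks are assumed to be `(chrom, start, end)` tuples with half-open coordinates, as in BED files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def shared_peaks(rep1, rep2):
    """Return the peaks in rep1 that overlap at least one peak in rep2."""
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]


# Invented peak calls for two biological replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep2 = [("chr1", 150, 250), ("chr2", 300, 400)]

print(shared_peaks(rep1, rep2))  # [('chr1', 100, 200)]
```

The nested loop is quadratic, which is fine for a toy example but would want sorting or an interval index for real peak sets.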

**gene_x** · 07-12-2013, 12:00 PM

What do RG and LB ID stand for, and what are they?

**Heisman** · 07-12-2013, 12:05 PM

RG = read group
LB = library
SM = sample
PL = platform
ID = ID (identification, haha)

So RG:ID is shorthand for read group ID, for example.
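If it helps to see these tags in context, here is a small sketch that pulls them out of an `@RG` header line from a SAM file. The example line and its values (`run1_lane3`, `libA`, `sampleX`) are made up, and `parse_rg_line` is just a helper name for this illustration.

```python
def parse_rg_line(line: str) -> dict:
    """Split a SAM @RG header line into a {TAG: value} dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    tags = {}
    for field in fields[1:]:
        # Each field looks like TAG:VALUE; split on the first colon only.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return tags


# Made-up header line for illustration.
example = "@RG\tID:run1_lane3\tLB:libA\tSM:sampleX\tPL:ILLUMINA"
rg = parse_rg_line(example)

print("read group ID:", rg["ID"])
print("library:", rg["LB"])
print("sample:", rg["SM"])
print("platform:", rg["PL"])
```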

If you're getting into this stuff for the first time and it's not a one-off, I'd glance/read through this: http://samtools.sourceforge.net/SAM1.pdf

Library matters when removing duplicate reads. If you sequence the same sample with different libraries, you don't want to remove reads that appear as duplicates between different libraries (because they come from different biological template strands). If you sequence the same library multiple times, though, reads that appear as duplicates are typically removed, as they are more likely due to PCR amplification of the same original biological template strand (with some exceptions, particularly if you have high coverage).
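Here is a toy sketch of that idea: reads count as duplicates only when they share a position and strand within the same library. The `Read` tuple and the example reads are invented, and real duplicate-marking tools use more careful criteria than this, so treat it as an illustration of the library logic only.

```python
from collections import namedtuple

# Hypothetical minimal read record for this example.
Read = namedtuple("Read", "name library chrom pos strand")


def mark_duplicates(reads):
    """Return names of reads that repeat an earlier (library, chrom, pos, strand)."""
    seen = set()
    duplicates = []
    for read in reads:
        # Including the library in the key means identical positions in
        # different libraries are never treated as duplicates of each other.
        key = (read.library, read.chrom, read.pos, read.strand)
        if key in seen:
            duplicates.append(read.name)
        else:
            seen.add(key)
    return duplicates


reads = [
    Read("r1", "lib1", "chr1", 1000, "+"),
    Read("r2", "lib1", "chr1", 1000, "+"),  # same library, same position: duplicate
    Read("r3", "lib2", "chr1", 1000, "+"),  # different library: kept
]

print(mark_duplicates(reads))  # ['r2']
```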

Read group can be important independently of library if some of the sequencing runs were of poor quality, and because much of the GATK toolset uses or requires RG to be set.
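For the bad-run case, one way to picture it is filtering alignment records by their `RG:Z:` tag. The sketch below works on plain SAM text lines; the read names, the `lane1`/`lane2` IDs, and the function names are made up for the example.

```python
def read_group_of(sam_line):
    """Return the RG:Z: tag value of a SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start after the 11 mandatory columns.
    for field in fields[11:]:
        if field.startswith("RG:Z:"):
            return field[len("RG:Z:"):]
    return None


def drop_read_groups(sam_lines, bad_rg_ids):
    """Yield header lines and any alignments not in the given read groups."""
    for line in sam_lines:
        if line.startswith("@") or read_group_of(line) not in bad_rg_ids:
            yield line


# Made-up SAM content: one header line and two alignments from different lanes.
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:lane1",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:lane2",
]

# Pretend lane2 came from a poor-quality run.
for line in drop_read_groups(sam_lines, {"lane2"}):
    print(line)
```

Note that the header is passed through unchanged here, so the `@RG` entry for a dropped read group would still be listed.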

How to handle duplicates data?