**kushald** · 07-12-2013, 03:28 AM

You could consider removing those duplicates using SAMTOOLS (rmdup).

**Heisman** · 07-12-2013, 07:22 AM

Originally posted by kushald:

You could consider removing those duplicates using SAMTOOLS (rmdup).

I don't think he/she is referring to PCR duplicates, but rather to biological replicates.

gene_x, could you explain your experiment in a bit more detail? More specifically, describe how many samples you've sequenced, how many lanes of data you have for each sample, and how many libraries were generated for each sample (if there is more than one lane of data per sample). Then please briefly describe what you're hoping to do (find peaks in both samples, find peaks present in one set of samples vs. another, etc.).

**gene_x** · 07-12-2013, 11:31 AM

I just realized my latest post is asking roughly the same question.

Heisman, I wanted to reanalyze/replicate some ENCODE ChIP-seq data, and I couldn't be sure how they did it based on their description. Basically, the goal is to find peaks in duplicated sequencing data.

**Heisman** · 07-12-2013, 11:46 AM

I see the similarity between this and your other post I just responded to:

If the same sample was sequenced multiple times with the exact same library, then the only difference in the data would be due to differences in the sequencing itself. In this case you could align each lane of data separately and give them separate RG IDs but the same LB and SM (library and sample) IDs.

If the same sample was sequenced multiple times with different libraries (i.e., you prepped the sample twice), then you can do the above, but make sure the LB ID is different in addition to the RG ID.

If you have completely different samples that are true biological replicates, then you probably don't want to merge the raw or aligned data at all; rather, you'll probably want to call peaks on the two samples separately and then compare the results in some capacity (e.g., using IDR: https://sites.google.com/site/anshul...e/projects/idr).

Honestly, I'm not experienced enough with peak calling to give great advice, but the above should be solid.
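For the biological-replicate case, "call peaks separately and then compare" could look roughly like the sketch below. It is a plain interval-overlap filter, not the IDR procedure, and the function name, the `(chrom, start, end)` tuple layout, and the half-open coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def overlapping_peaks(rep1, rep2):
    """Return peaks from rep1 that overlap at least one peak in rep2.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    This is a simple overlap filter, NOT the IDR procedure.
    """
    # Index the second replicate's peaks by chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in rep2:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in rep1:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < e2 and s2 < end for s2, e2 in by_chrom[chrom]):
            kept.append((chrom, start, end))
    return kept


# Hypothetical peak calls from two replicates.
rep1_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
rep2_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]
shared = overlapping_peaks(rep1_peaks, rep2_peaks)
print(shared)
```

A real comparison would also take peak scores into account, which is what IDR is designed for.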

**gene_x** · 07-12-2013, 12:00 PM

What do RG and LB ID stand for, and what are they?

**Heisman** · 07-12-2013, 12:05 PM

Originally posted by gene_x:

What do RG and LB ID stand for, and what are they?

RG = read group
LB = library
SM = sample
PL = platform
ID = ID (identification, haha)

So RG:ID is shorthand for read group ID, for example.
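To make the tags concrete, here is a small sketch that formats `@RG` header lines for the two "same sample" scenarios discussed earlier in the thread. The helper name, the sample/library/lane names, and the default platform value are hypothetical; only the `@RG` record and its `ID`, `SM`, `LB`, and `PL` tags come from the SAM format.

```python
def rg_header_line(rg_id, sample, library, platform="ILLUMINA"):
    """Format one tab-separated @RG header line with ID, SM, LB and PL tags."""
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}", f"PL:{platform}"]
    return "\t".join(fields)


# Same sample, same library, two lanes: distinct RG IDs, shared LB and SM.
same_library = [
    rg_header_line("sampleA.lib1.lane1", "sampleA", "lib1"),
    rg_header_line("sampleA.lib1.lane2", "sampleA", "lib1"),
]

# Same sample prepped twice: the LB tag differs as well as the RG ID.
second_library = rg_header_line("sampleA.lib2.lane1", "sampleA", "lib2")

for line in same_library + [second_library]:
    print(line)
```

How these lines get attached to the alignments depends on the aligner or tool you use, so check its documentation for the exact option.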

If you're getting into this stuff for the first time and it's not a one-off, I'd read through the SAM specification: http://samtools.sourceforge.net/SAM1.pdf

Library matters when removing duplicate reads. If you sequence the same sample with different libraries, you don't want to remove reads that appear as duplicates between libraries, because they come from different biological template strands. If you sequence the same library multiple times, though, people typically do want to remove reads that appear as duplicates, since they are more likely due to PCR amplification of the same original biological template strand (with some exceptions, particularly if you have high coverage).
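A toy sketch of that idea: reads are only treated as duplicates of each other when they share a library as well as a mapping position and strand. This is not how samtools or any other real tool implements duplicate removal; the dictionary keys and the function name are assumptions made for illustration.

```python
def mark_duplicates(reads):
    """Flag reads that share library, chromosome, position and strand with an earlier read.

    Each read is a dict with keys: name, library, chrom, pos, strand.
    Returns the set of read names flagged as duplicates.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        # The library is part of the key, so identical positions in
        # different libraries are never counted as duplicates.
        key = (read["library"], read["chrom"], read["pos"], read["strand"])
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads: r1 and r2 come from lib1, r3 from lib2, all at the same position.
reads = [
    {"name": "r1", "library": "lib1", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"name": "r2", "library": "lib1", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"name": "r3", "library": "lib2", "chrom": "chr1", "pos": 1000, "strand": "+"},
]
dups = mark_duplicates(reads)
print(dups)
```

Only `r2` is flagged here: `r3` maps to the same position but belongs to a different library, so it is kept.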

Read group can be important independent of library if some of the sequencing runs were of bad quality, and because a lot of the software in the GATK toolset uses/requires RG to be set.
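As a rough illustration of why per-run read groups help, the sketch below drops every read from a run that turned out to be bad while keeping the rest of the merged data. The `rg` key, the read group IDs, and the function name are hypothetical and stand in for whatever your tools expose.

```python
def drop_read_groups(reads, bad_rg_ids):
    """Return only the reads whose read group ID is not in bad_rg_ids."""
    bad = set(bad_rg_ids)
    return [read for read in reads if read["rg"] not in bad]


# Hypothetical merged data from two lanes of the same library.
merged_reads = [
    {"name": "r1", "rg": "sampleA.lib1.lane1"},
    {"name": "r2", "rg": "sampleA.lib1.lane2"},
    {"name": "r3", "rg": "sampleA.lib1.lane1"},
]

# Suppose lane2 was a poor-quality run.
good_reads = drop_read_groups(merged_reads, ["sampleA.lib1.lane2"])
print([read["name"] for read in good_reads])
```

Without distinct RG IDs per lane, the reads from the bad run could not be told apart after merging.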

How to handle duplicates data?