I am getting conflicting information on how to assign the read group (RG) fields ID, LB, PU, and SM for an exome analysis I am working on.
Can someone just clarify for me how I should assign the RG please?
If you don't want to read the details, take a look at this table. I just need clarification on how to assign the RG ID, SM, LB for these samples, since they were multiplexed and come from different libraries, with some samples being pooled.
Sample  Technical Replicate  Flow Cell ID  Lane ID  Library
A       1                    AXXX2         1        Group 1
A       2                    AXXX2         2        Group 1
B       1                    AXXX2         1        Group 1
B       2                    DXCX5         1        Group 1
G       1                    AXXX2         1        Group 2
G       2                    DXCX5         1        Group 2
Is this correct?
RG ID:AXXX2.1 SM:A LB:Group_1
RG ID:AXXX2.2 SM:A LB:Group_1
RG ID:AXXX2.1 SM:B LB:Group_1
RG ID:DXCX5.1 SM:B LB:Group_1
RG ID:AXXX2.1 SM:G LB:Group_2
RG ID:DXCX5.1 SM:G LB:Group_2
or this
RG ID:AXXX2.A.1 SM:A LB:Group_1
RG ID:AXXX2.A.2 SM:A LB:Group_1
RG ID:AXXX2.B.1 SM:B LB:Group_1
RG ID:DXCX5.B.1 SM:B LB:Group_1
RG ID:AXXX2.G.1 SM:G LB:Group_2
RG ID:DXCX5.G.1 SM:G LB:Group_2
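To make the difference between the two candidates concrete, here is a minimal Python sketch (my own illustration, not taken from any tool) that builds both ID styles from the table and reports which IDs would end up attached to more than one sample. The function names are hypothetical. As far as I understand the SAM specification, each @RG line must have a unique ID and carries a single SM, so shared IDs would matter if the per-sample files are ever merged into one BAM.

```python
# Rows from the table: (sample, replicate, flow_cell, lane, library)
ROWS = [
    ("A", 1, "AXXX2", 1, "Group_1"),
    ("A", 2, "AXXX2", 2, "Group_1"),
    ("B", 1, "AXXX2", 1, "Group_1"),
    ("B", 2, "DXCX5", 1, "Group_1"),
    ("G", 1, "AXXX2", 1, "Group_2"),
    ("G", 2, "DXCX5", 1, "Group_2"),
]


def id_flowcell_lane(row):
    """First candidate: ID = flow cell + lane."""
    _sample, _rep, flow_cell, lane, _lib = row
    return f"{flow_cell}.{lane}"


def id_flowcell_sample_lane(row):
    """Second candidate: ID = flow cell + sample + lane."""
    sample, _rep, flow_cell, lane, _lib = row
    return f"{flow_cell}.{sample}.{lane}"


def shared_ids(make_id, rows):
    """Return the IDs that would be attached to more than one sample."""
    samples_per_id = {}
    for row in rows:
        samples_per_id.setdefault(make_id(row), set()).add(row[0])
    return {
        rg_id: sorted(samples)
        for rg_id, samples in samples_per_id.items()
        if len(samples) > 1
    }


scheme_1 = shared_ids(id_flowcell_lane, ROWS)
scheme_2 = shared_ids(id_flowcell_sample_lane, ROWS)
print("flow cell + lane:", scheme_1)
print("flow cell + sample + lane:", scheme_2)
```

With the six rows above, the first style reuses AXXX2.1 for samples A, B, and G and DXCX5.1 for B and G, while the second style gives every row its own ID.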
I'm working with 36 different biological samples that were run with 100 bp PE.
The libraries were pooled in batches of 12, so there are three batches.
Here are the three issues I'm considering.
1) For one of the pooled library batches (Group 2), the first 12 samples were sequenced on two different flow cell IDs.
2) For the second pooled library batch (Group 1), the second 12 samples were sequenced on the same flow cell ID, but on two separate lanes.
3) For the last pooled library batch (Group 3), the last 12 samples were sequenced on the same flow cell ID and the same lane ID, but twice (two different runs).
How do I assign an appropriate RG ID, LB, and SM for these samples?
From what I understand:
All 12 samples from a single batch/group will share the same library ID (LB).
The SM is unique to each sample, but since each sample has two technical replicates, I need to differentiate the technical replicates for the same sample in the RG ID.
For the read group ID, I have read two conflicting answers.
The first was that the ID should simply be Flow_Cell_ID:Lane_ID.
The second was that the ID should be Flow_Cell_ID:SM:Lane_ID.
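For reference, this is roughly what I would end up writing into the header under the second convention. It is only a sketch: `rg_header_line` is a hypothetical helper, I am using dots instead of colons as the separator to match my examples above, PU as flow cell plus lane is my assumption (some guides also append the sample barcode, which is not in my table), and PL:ILLUMINA is likewise an assumption.

```python
def rg_header_line(flow_cell, lane, sample, library, platform="ILLUMINA"):
    """Format one tab-separated @RG header line (hypothetical helper)."""
    rg_id = f"{flow_cell}.{sample}.{lane}"  # second convention: flow cell, sample, lane
    pu = f"{flow_cell}.{lane}"              # assumed platform unit: flow cell + lane
    fields = [
        f"ID:{rg_id}",
        f"SM:{sample}",
        f"LB:{library}",
        f"PU:{pu}",
        f"PL:{platform}",                   # assumed platform
    ]
    return "@RG\t" + "\t".join(fields)


line = rg_header_line("AXXX2", 1, "A", "Group_1")
print(line)
```

Under the first convention only the `rg_id` line would change, to flow cell plus lane.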
Should the read group ID be unique to each SM, or should it identify only the flow cell and lane? As I understand it, the read group is used during recalibration to account for whether a sample's data came from the same lane or not. Since the samples were multiplexed in groups of 12, wouldn't it be more informative for all samples run on the same flow cell and lane to share one read group ID, in order to increase the corrective power?