  • Confused about RG ID LB even after reading all posts/GATK best practices

    I am getting conflicting information on how to assign RG ID, LB, PU, SM for an exome analysis I am working with.

    Can someone just clarify for me how I should assign the RG please?

    If you don't want to read the details, take a look at this table. I just need clarification on how to assign the RG ID, SM, LB for these samples, since they were multiplexed and come from different libraries, with some samples being pooled.

    Sample  Technical Replicate  Flow Cell ID  Lane ID  Library
    A       1                    AXXX2         1        Group 1
    A       2                    AXXX2         2        Group 1
    B       1                    AXXX2         1        Group 1
    B       2                    DXCX5         1        Group 1
    G       1                    AXXX2         1        Group 2
    G       2                    DXCX5         1        Group 2

    Is this correct?

    RG ID:AXXX2.1 SM:A LB:Group_1
    RG ID:AXXX2.2 SM:A LB:Group_1
    RG ID:AXXX2.1 SM:B LB:Group_1
    RG ID:DXCX5.1 SM:B LB:Group_1
    RG ID:AXXX2.1 SM:G LB:Group_2
    RG ID:DXCX5.1 SM:G LB:Group_2

    or this

    RG ID:AXXX2.A.1 SM:A LB:Group_1
    RG ID:AXXX2.A.2 SM:A LB:Group_1
    RG ID:AXXX2.B.1 SM:B LB:Group_1
    RG ID:DXCX5.B.1 SM:B LB:Group_1
    RG ID:AXXX2.G.1 SM:G LB:Group_2
    RG ID:DXCX5.G.1 SM:G LB:Group_2


    I'm working with 36 different biological samples that were run with 100 bp PE.

    The libraries were pooled in batches of 12, so there are three batches.

    Here are the three issues I'm considering.

    1) For one of the pooled library batches (Group 2), the first 12 samples were sequenced on two different flow cells (two Flow Cell IDs).

    2) For the second pooled library batch (Group 1), the second 12 samples were sequenced on the same Flow Cell Id, but on two separate Lanes.

    3) For the last pooled library batch (Group 3), the last 12 samples were sequenced on the same Flow Cell ID, and same Lane ID, but twice (two different runs).

    How do I assign an appropriate RG ID, LB, and SM for these samples?

    From what I understand:

    Each 12 samples from a single batch/group will have the same unifying library id.
    The SM is unique to each sample, but since each sample has two technical replicates, I need to differentiate the technical replicates for the same sample in the RG ID.

    For the read group ID, I have read two conflicting answers.
    The first was that the ID should simply be Flow_Cell_ID:Lane_ID.
    The second was that the ID should be Flow_Cell_ID:SM:Lane_ID.

    Should the read group ID be unique for each SM? Or should it only identify the Flow Cell and Lane ID? The read group is used to recalibrate the data for the same sample based on whether it was run on the same lane or not, but since the samples were multiplexed in groups of 12, wouldn't it be informative for the read group ID to be common for all samples that were run on the same flow cell and lane in order to increase the corrective power?
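The two ID schemes described above can be compared directly on the table from this post. The following is a minimal sketch, not a GATK recipe: the function names are made up for illustration, and the only outside fact assumed is that the SAM specification requires every @RG ID to be unique within one header, which matters as soon as reads from several samples end up in the same BAM file.

```python
from collections import Counter

# Rows from the table in the post: (sample, replicate, flow cell, lane).
ROWS = [
    ("A", 1, "AXXX2", 1),
    ("A", 2, "AXXX2", 2),
    ("B", 1, "AXXX2", 1),
    ("B", 2, "DXCX5", 1),
    ("G", 1, "AXXX2", 1),
    ("G", 2, "DXCX5", 1),
]


def id_flowcell_lane(sample, flowcell, lane):
    """Scheme 1: the ID names only the flow cell and lane."""
    return f"{flowcell}.{lane}"


def id_flowcell_sample_lane(sample, flowcell, lane):
    """Scheme 2: the ID also carries the sample name."""
    return f"{flowcell}.{sample}.{lane}"


def collisions(make_id, rows=ROWS):
    """Return {ID: count} for every ID generated more than once."""
    counts = Counter(make_id(s, fc, lane) for s, _rep, fc, lane in rows)
    return {rg_id: n for rg_id, n in counts.items() if n > 1}


print(collisions(id_flowcell_lane))
print(collisions(id_flowcell_sample_lane))
```

With multiplexed samples, scheme 1 gives the same ID to different samples that shared a lane, while scheme 2 keeps every ID distinct.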
    Last edited by Studentlost; 03-05-2016, 08:06 PM.

  • #2
    An "LB" tag should only ever be associated with one sample. This refers to the physical library made from a sample and has absolutely nothing to do with pooling. The hierarchy is:

    SM: A biological sample
    LB: A library made from a single biological sample (if a sample has more than one, you have technical replicates)
    ID: A single instance of a given library. You might have more than one of these per library if you sequenced it on multiple flow cells or multiple lanes (honestly, I would just merge the lanes these days, though).
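As an illustration of the hierarchy above, here is a small sketch that writes one @RG header line per flow cell and lane for a sample. The helper names, the `lib_<sample>` library naming, and the choice of `flowcell.lane` for PU are assumptions made for this example. Only the tab-separated `TAG:VALUE` layout of @RG lines comes from the SAM format itself.

```python
def rg_header_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Format one SAM @RG header line (tab-separated TAG:VALUE fields)."""
    fields = [
        ("ID", rg_id),
        ("SM", sample),
        ("LB", library),
        ("PU", platform_unit),
        ("PL", platform),
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


def read_group_for(sample, flowcell, lane):
    """Hypothetical naming: one library per sample, one ID per flow cell and lane."""
    return rg_header_line(
        rg_id=f"{flowcell}.{sample}.{lane}",  # unique per sample, flow cell, and lane
        sample=sample,                        # SM: the biological sample
        library=f"lib_{sample}",              # LB: belongs to exactly one sample
        platform_unit=f"{flowcell}.{lane}",   # PU: where the reads were sequenced
    )


# Sample B was sequenced on two flow cells, so it gets two read groups
# that share the same SM and LB.
for flowcell, lane in [("AXXX2", 1), ("DXCX5", 1)]:
    print(read_group_for("B", flowcell, lane))
```

Both lines carry the same SM and LB and differ only in ID and PU, which is the nesting described above.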

    Practically speaking, the various tags should be unique. Whether you use this:

    Code:
    RG ID:AXXX2.1 SM:A LB:1
    RG ID:AXXX2.2 SM:A LB:1
    RG ID:AXXX2.1 SM:B LB:2
    RG ID:DXCX5.1 SM:B LB:2
    RG ID:AXXX2.1 SM:G LB:3
    RG ID:DXCX5.1 SM:G LB:3
    or this:

    Code:
    RG ID:red SM:A LB:Group_1
    RG ID:orange SM:A LB:Group_1
    RG ID:yellow SM:B LB:Group_2
    RG ID:green SM:B LB:Group_2
    RG ID:blue SM:G LB:Group_3
    RG ID:purple SM:G LB:Group_3
    or some other naming scheme, it doesn't matter. The only thing that matters is the association between, and the nesting of, the tags.
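Since only the nesting matters, it can be checked mechanically whatever names are chosen. The sketch below is one possible check under two assumptions: read groups are given as (ID, SM, LB) tuples, and all of them are destined for the same header, so IDs must not repeat. The function name is hypothetical.

```python
from collections import defaultdict


def nesting_problems(read_groups):
    """Check (ID, SM, LB) tuples: IDs must be unique, each LB must map to one SM."""
    problems = []
    seen_ids = set()
    samples_by_library = defaultdict(set)
    for rg_id, sample, library in read_groups:
        if rg_id in seen_ids:
            problems.append(f"ID {rg_id} is used more than once")
        seen_ids.add(rg_id)
        samples_by_library[library].add(sample)
    for library, samples in sorted(samples_by_library.items()):
        if len(samples) > 1:
            problems.append(f"LB {library} is shared by samples {sorted(samples)}")
    return problems


# The colour-named scheme from this post: arbitrary names, correct nesting.
colours = [
    ("red", "A", "Group_1"), ("orange", "A", "Group_1"),
    ("yellow", "B", "Group_2"), ("green", "B", "Group_2"),
    ("blue", "G", "Group_3"), ("purple", "G", "Group_3"),
]

# The second proposal from the question: LB reused across pooled samples.
pooled = [
    ("AXXX2.A.1", "A", "Group_1"), ("AXXX2.A.2", "A", "Group_1"),
    ("AXXX2.B.1", "B", "Group_1"), ("DXCX5.B.1", "B", "Group_1"),
    ("AXXX2.G.1", "G", "Group_2"), ("DXCX5.G.1", "G", "Group_2"),
]

print(nesting_problems(colours))
print(nesting_problems(pooled))
```

The colour scheme passes, while the pooled scheme is flagged because Group_1 is attached to both sample A and sample B.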



    • #3
      Originally posted by dpryan

      Thank you for your reply! Just to make sure I follow, the library is basically the sample name so long as it's a single sample and only technical replicates, correct?

      And the read group id is just a way to identify which flow cell/lane the sample was run on for each technical replicate?

      How does base recalibration work in terms of using the flow cell/lane as a covariate? 12 samples were multiplexed at a time, so the number of reads per sample is lower. Shouldn't all samples run on the same flow cell/lane be used together as a covariate when doing base recalibration of a single sample?

      Or am I misunderstanding the method?

      Thank you again!



      • #4
        Yeah, typically LB and SM are the same and ID is just a random unique identifier. I've never checked the source code for GATK to see exactly how it deals with lane as a covariate.
