What exactly is AddOrReplaceReadGroups (picard tools) doing?


  • What exactly is AddOrReplaceReadGroups (picard tools) doing?

    Hello folks,

    I have struggled with this for a few days now, but I still don't get it:

    What exactly does AddOrReplaceReadGroups do?
    http://picard.sourceforge.net/comman...laceReadGroups

    Also, I can't find a good definition of what a read group is. From some descriptions at http://www.broadinstitute.org/gsa/wi...sked_Questions I infer that it records which reads belong to which lane and chip, so that all reads in a read group share their own error model.

    But what does this tool actually do? I use it in our pipeline because one step fails without it. According to the description it replaces all read groups. But why? What might be wrong with the old groups, and how do the new groups differ from them?

    My call:
    Code:
    java -Djava.io.tmpdir="$TMPDIR" -jar /opt/biosw/picard-tools-1.45/AddOrReplaceReadGroups.jar RGLB="fastq/$basename.fastq" RGPL=solexa RGPU=run RGSM=9111 I="output/$basename.bam" O="output/$basename.sorted.bam" SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT
    Thanks in advance,
    Oliver
    Last edited by ocs; 06-07-2011, 10:54 PM. Reason: link added

  • #2
    A read group assigns an origin to a set of reads, so that a genotype can be assigned to that origin during SNP/indel calling. Without this step you get a set of SNPs, but you cannot assign them to a specific genotype. The AddOrReplaceReadGroups step is required by the GATK pipeline, because GATK assumes you want to call genotypes and not only SNPs. If you only need a raw set of SNPs, you can use the pileup format and the VarScan utility pileup2snp.
    Francois Sabot, PhD

    Be realistic. Demand the Impossible.
    www.wikiposon.org


    • #3
      Hello Francois,

      thank you for your quick answer. I get a glimpse of the idea, but your answer is not fully clear to me. By origin, do you mean where the reads came from physically (e.g. chip, lane)? I know what SNP calling is (locating SNPs in comparison to a reference genome), but what is genotype calling? I imagine it is the sum of all SNPs, but I'm not sure. Even knowing this, I can't see what this step is useful for. My understanding is that the read groups are determined by the sequencing technology, since it knows which reads were sequenced on which lanes and chips. So I think of these groups as a constant that should not be changed, and this is actually my problem.

      Thanks for any hints on this!


      • #4
        In my case the origin can be either a lane or the name of the individual/organism. You can have, e.g., 10 individuals tagged in a single lane, each mapped separately and then assigned to a group (e.g. Indiv1, Indiv2, ...). All reads from a single individual then carry the same RG tag at the end of their SAM line. When you merge those 10 SAM files, every read remains tagged with its origin.
        Then you ask, for example, the GATK Genotyper to 'call the genotype'. This means that SNPs are identified based on depth, quality, etc. And because each read can be assigned to a specific individual, the resulting VCF file can tell you, for example, 'Indiv1 has an A instead of a G at position chr01:234554'.

        This is genotype calling, i.e. assigning specific SNPs to a specific individual.
        Francois Sabot, PhD

        Be realistic. Demand the Impossible.
        www.wikiposon.org
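
To make the tagging described above concrete, here is a minimal sketch in plain shell and awk of what such a step does to SAM text. It is not Picard's implementation: the function name `add_read_group`, the read, and the group values are made up for illustration, and only the ID and SM fields are handled.

```shell
#!/bin/sh
# Toy sketch, NOT Picard's code: add_read_group is a hypothetical helper that
# rewrites SAM text on stdin so that every read belongs to one read group.
# Usage: add_read_group ID SAMPLE < in.sam > out.sam
add_read_group() {
    awk -v id="$1" -v sm="$2" '
        BEGIN { FS = OFS = "\t"; printed = 0 }
        /^@RG/ { next }            # drop old @RG header lines (the "replace" part)
        /^@/   { print; next }     # keep all other header lines as they are
        {
            # Emit the new @RG header once, just before the first alignment.
            if (!printed) { print "@RG", "ID:" id, "SM:" sm; printed = 1 }
            # Copy the record, skipping any old RG:Z: tag in the optional fields.
            out = $1
            for (i = 2; i <= NF; i++) {
                if (i >= 12 && $i ~ /^RG:Z:/) continue
                out = out OFS $i
            }
            print out, "RG:Z:" id  # the "add" part: tag the read with the group ID
        }
        END { if (!printed) print "@RG", "ID:" id, "SM:" sm }
    '
}

# Demo on a two-line header and one made-up read that carries an old tag.
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr01\tLN:1000\nread1\t0\tchr01\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:old\n' |
    add_read_group Indiv1 9111
```

After the demo, the header contains a single @RG line with ID:Indiv1 and SM:9111, and the read ends in RG:Z:Indiv1 instead of the old tag. A genotyper can then map each read to its @RG line and, through SM, to an individual.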



        • #5
          Hello Francois,

          thank you again for your answer. I now understand what a read group and genotype calling are. But the last part of my previous post is still unclear to me: I use the FASTQ files to align to the reference genome, but in the AddOrReplaceReadGroups step I give the same files as the read-group library. This seems redundant to me, doesn't it? Shouldn't the tool already have the read-to-read-group assignment? This is what still confuses me.

          Thanks,
          Oliver


          • #6
            Yes, it looks redundant at first, but if you did not specify the RG tag during mapping (as BWA, for example, allows), this information is not in the SAM file. You therefore need to add it, because a standard SAM header contains no reference to the origin of the reads.

            If you did specify it, there is no need to perform this step.
            Francois Sabot, PhD

            Be realistic. Demand the Impossible.
            www.wikiposon.org
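
As a rough sketch of specifying the read group at mapping time: aligners usually take the whole @RG header line as a single string with literal `\t` separators. The helper below only builds that string; `make_rg_line` and the example values are hypothetical, and the BWA invocation in the comment is an assumption that should be checked against the help text of the installed version.

```shell
#!/bin/sh
# Hypothetical helper: build an @RG header line with literal "\t" separators.
# Arguments: ID, sample (SM), library (LB), platform (PL), platform unit (PU).
make_rg_line() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s\\tPU:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example values only; they mirror the RGSM/RGLB/RGPL/RGPU options of the Picard call.
rg_line=$(make_rg_line run1 9111 lib1 solexa run)
printf '%s\n' "$rg_line"

# Assumed usage (verify with your BWA version before relying on it):
#   bwa sampe -r "$rg_line" ref.fa r1.sai r2.sai r1.fastq r2.fastq > out.sam
```

With the read group set this way, the aligner writes the @RG header and tags each read itself, so no separate tagging step is needed afterwards.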



            • #7
              Hi Ocs,

              If RG is not critical to your pipeline, you can use "VALIDATION_STRINGENCY=SILENT" to suppress the warning. I used this option a few months ago, but I am not sure whether it still works. Give it a try and report back. Picard is under very active and rapid development, as far as I can see.

              Douglas
              www.contigexpress.com
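
Before deciding whether the warning can safely be silenced, it may help to check how many reads actually lack a read group. The following is a small sketch under the assumption that the alignments are available as SAM text; `count_untagged_reads` is a made-up helper name, not part of Picard or samtools.

```shell
#!/bin/sh
# Hypothetical helper: count alignment records on stdin (SAM text) that carry
# no RG:Z: tag among their optional fields (columns 12 and later).
count_untagged_reads() {
    awk -F'\t' '
        /^@/ { next }                       # skip header lines
        {
            tagged = 0
            for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) tagged = 1
            if (!tagged) n++
        }
        END { print n + 0 }                 # print 0 when every read is tagged
    '
}

# Demo with two made-up reads: read1 is tagged, read2 is not.
printf 'read1\t0\tchr01\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:Indiv1\nread2\t0\tchr01\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\n' |
    count_untagged_reads

# On a real BAM file, SAM text can be produced with, for example:
#   samtools view -h output/sample.bam | count_untagged_reads
```

A result of 0 means every read already has a read group; anything else means downstream tools that depend on read groups will still lack that information, whatever the validation setting.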
