Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK SNP calling complete workflow

    Hi All,
    I wolud like to consult my GATK workflow for a pair end Illumina data.
    Generally I'm calling SNPs using following steps:

    Code:
    bwa aln -t 4 hg19.fa seq1.fastq > 1.sai
    bwa aln -t 4 hg19.fa seq2.fastq > 2.sai
    bwa sampe -r "@RG\tID:exomeID\tLB:exomeLB\tSM:exomeSM\tPL:illumina\tPU:exomePU" hg19.fa 1.sai 2.sai seq1.fastq seq2.fastq > original.sam
    
    java -Xmx5g -jar FixMateInformation.jar I=original.sam O=fixed.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
    java -Xmx5g -jar SortSam.jar I=fixed.sam SO=coordinate O=first.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
    java -Xmx5g -jar MarkDuplicates.jar I=first.bam O=marked.bam METRICS_FILE=metricsFile CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true
    
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T RealignerTargetCreator -R hg19.fa -o intervalsList -I marked.bam -known Mills_and_1000G_gold_standard.indels.hg19.vcf
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T IndelRealigner -R hg19.fa -I marked.bam -targetIntervals intervalsList -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o realigned.bam
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T TableRecalibration -R hg19.fa -I realigned.bam -o recalibrated.bam -recalFile recalFile
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T UnifiedGenotyper -R hg19.fa -I recalibrated.bam -o resultSNPs.vcf -D dbsnp_135.hg19.vcf -metrics UniGenMetrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance -L exomes.bed
    The SNP and indels databases I've downloaded from ftp://[email protected] (bunlde -> 1.5 -> hg19)

    The exome intervals I've gained using UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start

    I'm not sure if I'm doing it in a correct way. Is my way compact enough? What about yours workflows? Are your steps look simillar? Should I use more GATK subprograms to obtain more accurate results?

    Thanks in advance for suggestions

  • #2
    The Mills and 1000G gold standard indels are a very stringently curated list of indels. For the purposes of indel realignment in GATK you probably want one or more of the broader sets from the GATK bundle instead, though which ones I'm not completely sure.

    Overall though, your workflow looks fine. Bear in mind though, that it will only call SNPs and not indels (but since you're calling your output file resultSNPs I assume you know that already).

    Comment


    • #3
      Is the order of funtions important?
      Does this:
      Code:
      FixMateInformation
      SortSam
      MarkDuplicates
      RealignerTargetCreator
      IndelRealigner
      mean the same as this:
      Code:
      SortSam
      FixMateInformation
      RealignerTargetCreator
      IndelRealigner 
      MarkDuplicates

      Comment


      • #4
        Rocketknight indeed I didn't want indels, but now I need it. Do you know which option should I use to obtain a final file with SNPs and Indels?

        Comment


        • #5
          Yep, the only line you need to change is UnifiedGenotyper. Add the option -glm BOTH and the output VCF file will have SNPs and indels in it. Note that if you want to do Variant Quality Score Recalibration you'll have to do some things differently if you're recalibrating indels too.

          Comment


          • #6
            Some indel positions are off in dbsnp135_b37.vcf. How do u deal with them??

            Comment


            • #7
              "The exome intervals I've gained using UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start"
              Can you tell me how to get the exome intervals as you have referred

              Comment


              • #8
                Originally posted by vd4mindia View Post
                "The exome intervals I've gained using UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start"
                Can you tell me how to get the exome intervals as you have referred
                set these paramaters: http://pokazywarka.pl/1ve6r7/ and then click "get output". Adjust your parameters and then click "get BED".

                Comment


                • #9
                  Hi I am facing some problems with the Local realignment step around the indels, I am using the marked bam file after the PCR duplicate marking step but I am getting the following error
                  ##### ERROR MESSAGE: Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

                  I donot see such errors in the forums. Is it that after marknig the PCR duplicates the bam files needs to be sorted and indexed again? or just indexing and running the target realigner step will be sufficient? Am using the GATK version 2.3 am sure its not a serious problem but am not being able to troubleshoot it. I am trying to get a bit stringent in this local indel realignment so for more specificity am actually using the DBSNP137.vcf , 1000G indel vcf and the Mills and 1000 G gold standard vcf. Please guide me about this. I am using the GATK for the first time and trying to develop a pipeline. Below is the command line am using for this step

                  java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634_marked.sorted.bam -T RealignerTargetCreator -known /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/1000G_phase1.indels.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/Mills_and_1000G_gold_standard.indels.hg19.vcf -o SRR062634_marked.sorted.bam.inervals -S LENIENT

                  Comment


                  • #10
                    @vd4mindia: I don't know, maybe try to index it. You shoud have in your directory files SRR062634_marked.sorted.bam and SRR062634_marked.sorted.bai.
                    If this won't help I'm not able to solve this issue.

                    Comment


                    • #11
                      I thought fixMate is no longer required by the GATK pipeline? And if you want to perform the fixMate then sortSam, you can run FixMate with the SO options in SortSam, then your output will be automatically sorted in one go. It is also a good idea to always include Create_Index=TRUE in the pipeline to make sure you have the index for every steps just to be safe

                      Comment


                      • #12
                        @thedamian,

                        Yes I fixed it , I indexed it again and then made the call and the local realignment step is running now. But i need some inputs for the base quality recalibration step. In the above thread this step is written as below
                        java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf

                        Since during my Realigner Target Creator step I used dbSNP_137.vcf for known sites I should be using the same as well for the Base quality step right?

                        java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634_marked.sorted.bam -T RealignerTargetCreator -known /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/1000G_phase1.indels.hg19.vcf -known /scratch/GT/vdas/test_exome/exome/databases/Mills_and_1000G_gold_standard.indels.hg19.vcf -o SRR062634_marked.sorted.bam.inervals -S LENIENT

                        As you can see the known sites /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf

                        So my script for this step should be

                        For the Base calibrator( i see some place its written count covariates, which should I do) any suggestion but am following base calibrator below is the command

                        java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -nt 8 -T BaseRecalibrator -R /scratch/GT/vdas/test_exome/exome/hg19.fa -knownSites /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -I SRR062634.realigned.bam -o SRR062634.realigned.recal.csv -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -S LENIENT

                        for table recalibration

                        java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -T TableRecalibration -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I SRR062634.realigned.bam -o SRR062634.realigned.bam.recal.bam -BQSR SRR062634.realigned.recal.csv -S LENIENT

                        Please let me know if this looks correct or not?

                        Comment


                        • #13
                          Yes @choishingwan I checked and the fixmate step is retired and am not using it but I have another query in the above line please let me know and share your views

                          Comment


                          • #14
                            @vd4mindia: From your code, I guess you are running GATK version before 2.6 right? You are right about using the dbsnp137. From the look of it it seems correct. In case if you want to use the latest gatk, you can refer to the following site: http://gatkforums.broadinstitute.org...se-2-0-retired

                            Personally, I use the following code if it helps:

                            samtools sort <sample>.bwa.bam <sample>.bwa.sort

                            #Index
                            samtools index <sample>.bwa.sort.bam

                            #mark duplicate
                            java -Xmx5g -jar MarkDuplicates.jar INPUT=<sample>.bwa.sort.bam OUTPUT=<sample>.bwa.sort.deduped.bam METRICS_FILE=<sample>.duplicates REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE


                            #Realignment based on known insert sites (Using Java 1.7 from now on as required by GATK)
                            java -Xmx5g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Reference.fa -I <sample>.bwa.sort.deduped.arg.bam -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o <sample>.realign.intervals -S LENIENT

                            java -Xmx5g -jar GenomeAnalysisTK.jar -T IndelRealigner -R Reference.fa -I <sample>.bwa.sort.deduped.arg.bam -targetIntervals <sample>.realign.intervals -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o <sample>.bwa.sort.deduped.arg.realigned.bam -S LENIENT

                            java -Xmx5g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Reference.fa -l INFO -I <sample>.bwa.sort.deduped.arg.realigned.bam -knownSites 1000G_phase1.indels.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites dbsnp_137.hg19.vcf -o <sample>.recalibration_report.grp -S LENIENT

                            java -Xmx5g -jar GenomeAnalysisTK.jar -T PrintReads -R Reference.fa -l INFO -I <sample>.bwa.sort.deduped.arg.realigned.bam -BQSR <sample>.recalibration_report.grp -o <sample>.bwa.sort.deduped.arg.realigned.recalibrated.bam -S LENIENT
                            This is how we do the analysis here in our lab, not necessarily the correct way. We did use the latest GATK though, so some of the code might differ from yours.

                            Comment


                            • #15
                              @thedamain,

                              I need some information about the exomes.bed file which you are using. I see that most of the pipelines say that they generate it from the UCSC genome browser to restrict the output to exonic sequences, is there any specific cut off that must be used to generate this bed file for the hg19 ref gene or a simple selection will do? or do i have to specify some extra bp addition for both end to catch the splice sites? Please let me know. On what criteria is the bed file selected or it is just a representation bed file of the whole ref genome restricted only to the exonic regions? Please let me know

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 11:09 AM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Today, 06:13 AM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X