  • thedamian
    replied
    Rocketknight, indeed I didn't want indels before, but now I need them. Do you know which option I should use to obtain a final file with both SNPs and indels?
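    One possible direction, offered as a hedged sketch rather than a confirmed answer from this thread: UnifiedGenotyper has a `-glm` (genotype likelihoods model) argument that accepts SNP, INDEL, or BOTH. The helper below only prints the command so it can be reviewed first. The input file names come from the workflow in this thread; `build_ug_cmd` and the output name `result.vcf` are hypothetical, and whether `-glm BOTH` is available depends on the GATK version installed.

```shell
#!/usr/bin/env bash
# Sketch: print a UnifiedGenotyper command for a chosen likelihoods model.
# Assumptions: build_ug_cmd and result.vcf are illustrative names only;
# the other file names are copied from the workflow posted in this thread.
build_ug_cmd() {
    local glm="${1:-BOTH}"   # SNP, INDEL, or BOTH (default: BOTH)
    case "$glm" in
        SNP|INDEL|BOTH) ;;
        *) echo "unknown model: $glm" >&2; return 1 ;;
    esac
    # echo joins its arguments with single spaces, giving a one-line command.
    echo "java -Xmx5g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper" \
         "-R hg19.fa -I recalibrated.bam -glm ${glm}" \
         "-D dbsnp_135.hg19.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0" \
         "-L exomes.bed -o result.vcf"
}

# Print the command for review instead of running it.
build_ug_cmd BOTH
```

    Printing rather than executing keeps the sketch safe to try on a machine without GATK; the printed line can be pasted into a terminal once it looks right.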


  • thedamian
    replied
    Is the order of the steps important?
    Does this:
    Code:
    FixMateInformation
    SortSam
    MarkDuplicates
    RealignerTargetCreator
    IndelRealigner
    mean the same as this:
    Code:
    SortSam
    FixMateInformation
    RealignerTargetCreator
    IndelRealigner 
    MarkDuplicates
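    Whichever order is used, the GATK tools expect coordinate-sorted, indexed BAM input, so it can help to check what sort order each intermediate file declares. A minimal sketch, assuming the header is piped in as text; `sam_sort_order` is a hypothetical helper name, and it only reads the `SO:` tag on the `@HD` header line without verifying that the reads really are in that order.

```shell
#!/usr/bin/env bash
# Sketch: report the sort order declared in a SAM header read from stdin.
# The @HD line may carry an SO: tag (unsorted, queryname, coordinate, unknown).
sam_sort_order() {
    awk -F'\t' '
        /^@HD/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
            exit                     # only the @HD line matters; END still runs
        }
        END { if (!found) print "unknown" }
    '
}

# Demo with an inline header instead of a real file.
printf '@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:249250621\n' | sam_sort_order
```

    With samtools installed, the same function can be fed a real header, for example `samtools view -H first.bam | sam_sort_order`.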


  • Rocketknight
    replied
    The Mills and 1000G gold standard indels are a very stringently curated list of indels. For the purposes of indel realignment in GATK you probably want one or more of the broader sets from the GATK bundle instead, though I'm not completely sure which ones.

    Overall, your workflow looks fine. Bear in mind, though, that it will only call SNPs and not indels (but since you're calling your output file resultSNPs, I assume you know that already).


  • thedamian
    started a topic GATK SNP calling complete workflow

    GATK SNP calling complete workflow

    Hi All,
    I would like to get feedback on my GATK workflow for paired-end Illumina data.
    Generally, I call SNPs using the following steps:

    Code:
    bwa aln -t 4 hg19.fa seq1.fastq > 1.sai
    bwa aln -t 4 hg19.fa seq2.fastq > 2.sai
    bwa sampe -r "@RG\tID:exomeID\tLB:exomeLB\tSM:exomeSM\tPL:illumina\tPU:exomePU" hg19.fa 1.sai 2.sai seq1.fastq seq2.fastq > original.sam
    
    java -Xmx5g -jar FixMateInformation.jar I=original.sam O=fixed.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
    java -Xmx5g -jar SortSam.jar I=fixed.sam SO=coordinate O=first.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
    java -Xmx5g -jar MarkDuplicates.jar I=first.bam O=marked.bam METRICS_FILE=metricsFile CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true
    
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T RealignerTargetCreator -R hg19.fa -o intervalsList -I marked.bam -known Mills_and_1000G_gold_standard.indels.hg19.vcf
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T IndelRealigner -R hg19.fa -I marked.bam -targetIntervals intervalsList -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o realigned.bam
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T TableRecalibration -R hg19.fa -I realigned.bam -o recalibrated.bam -recalFile recalFile
    java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T UnifiedGenotyper -R hg19.fa -I recalibrated.bam -o resultSNPs.vcf -D dbsnp_135.hg19.vcf -metrics UniGenMetrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance -L exomes.bed
    I downloaded the SNP and indel databases from ftp://[email protected] (bundle -> 1.5 -> hg19).

    I obtained the exome intervals using the UCSC Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables?command=start

    I'm not sure whether I'm doing this correctly. Is my workflow compact enough? What about your workflows? Do your steps look similar? Should I use more GATK tools to obtain more accurate results?

    Thanks in advance for suggestions
