No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Pipeline to find somatic mutations

    Hi, I am trying to analyze Illumina paired-end reads of a tumor/normal study with two lanes per sample (i.e. normal.lane1, normal.lane2, tumor.lane1, tumor.lane2) to assess somatic mutations. My intention is to work with a single bam per individual as recommended in the GATK best practice variant detection in tumor-normal studies (

    Therefore, the workflow I am trying to perform is the following:
        For each reads_file (lane):
    - alignment using Bfast bwaaln / localalign / postprocess
    - sam to bam using samtools view
    - add read groups using picard AddOrReplaceReadGroups
    - remove duplicates using picard RemoveDuplicates
    Then, the bams of the two samples are merged:

        sample.bam <- picard MergeSamFile (tumor.lane1.bam, tumor.lane2.bam, normal.lane1.bam, normal.lane2.bam)
    On this single bam (which contains reads from the 4 lanes and also 4 lines of @RG readgroup annotations in the header), I performed two more processing steps:

        realigned.bam <- GATK_realign(individual.bam)
        recal.bam <- GATK_recalibrate(realigned.bam)
    Then I called the variants:

    GATK_call_for_variants(recal.bam) using the GATK UnifiedGenotyper and VariantFiltration

    I used the --genotypeFilterExpression argument in the GATK_VariantFiltration step to have results at sample level, so what I've obtained by using a dummy example with a few reads per sample in the resulting vcf file looks like:

    #CHROM    POS        ID REF  ALT  QUAL FILTER    INFO           FORMAT                    normal                 tumor
    chr1    115401136    .    G    A    40.14    PASS      (..)    GT:AD:DP:GQ:PL    1/1:0,1:1:3:29,3,0    1/1:0,2:2:3.01:44,3,0
    chr2    207625567    .    T    C    32.79    PASS      (..)    GT:AD:DP:GQ:PL    ./.		     1/1:0,2:2:6.01:64,6,0
    Therefore, ignoring the fact that the number of reads supporting the evidence is low in this case, could I interpret that the first SNP is a germline mutation (the same GT is present in both normal and tumor samples of the individual) whereas the second is a somatic mutation (the variant is present only in the normal sample)?

    I'm quite unsure how is the best way to interpret these results, also I'm not convinced whether the workflow I've used is the optimal.

    Any help/suggestion will be really appreciated!

  • #2
    Three suggestions:

    a) map your reads to:

    b) use two distinct mappers (e.g. bwaaln and bwa-sw), call mutations separately and then take the intersection.

    c) (optional) you may consider samtools contrast caller.


    • #3
      I think the question was more referring to how to discover the differences between paired normal and tumor samples from one single patient.

      The multi-sample vcf allows the following information:

      - It lists the variant positions and changes detected in any of the two samples.
      - In the columns after the FORMAT column, normal and tumor in the example, it displays the stats for each of the variants in the two samples. Obviously, there are variants that are recorded in both normal and tumor, and there are also those that are recorded in the normal sample only and not in the tumor and vice versa:
      - Variants present in both "normal" and "tumoral" may be interpreted as "germline" variants that are present in all cells of the patient, tumor or not.
      - Then, we can obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor").
      - Variants that appear in "normal" only, may point at deletions of this region in the tumor sample, given sufficient read coverage at the region of interest.

      However, the multi-sample vcf does not give the statistical information described in FORMAT for the respective sample that does NOT contain the variant (entry is ./. ).

      Maybe it would help to run the Unified Genotyper without distinguishing between the two samples, in order to get the overall genotype stats, but I am not sure about that.

      Any opinion on / experience with that?


      • #4
        There are my command lines for calling snps and indels with GATK. It did work well for me and might be useful for you.

        bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_1.fastq > ERR031030_1.sai

        bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030_2.sai

        bwa sampe -r "@RG\tID:ERR031030\tLB:ERR031030\tSM:ERR031030\tPL:ILLUMINA" ucsc.hg19 ERR031030_1.sai ERR031030_2.sai /home/li/prostate_cancer_RNA/ERR031030_1.fastq /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030.sam

        java -Xmx8g -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/SortSam.jar SO=coordinate INPUT=ERR031030.sam OUTPUT=ERR031030.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

        java -Xmx8g -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/MarkDuplicates.jar INPUT=ERR031030.bam OUTPUT=ERR031030.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -o ERR031030.bam.list -I ERR031030.marked.bam
        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -I ERR031030.marked.bam -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ERR031030.bam.list -o ERR031030.marked.realigned.bam

        java -Xmx8g -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/FixMateInformation.jar INPUT=ERR031030.marked.realigned.bam OUTPUT=ERR031030.marked.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -knownSites dbsnp_135.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o ERR031030.recal_data.grp

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -BQSR ERR031030.recal_data.grp --out ERR031030.marked.realigned.fixed.recal.bam

        java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I ERR031029.marked.realigned.fixed.recal.bam -I ERR031030.marked.realigned.fixed.recal.bam -D dbsnp_135.hg19.vcf -o ERR031030.raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.snpsonly.vcf -selectType SNP

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.indelsonly.vcf -selectType INDEL

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input ERR031030.snpsonly.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resourcemni,known=false,training=true,truth=false,prior=12.0 1000G_phase1.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.hg19.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ -mode SNP -recalFile ERR031030.snp.recal.vcf -tranchesFile ERR031030.snp.tranches.vcf -rscriptFile ERR031030.plots.R

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -input ERR031030.snpsonly.vcf -tranchesFile ERR031030.snp.tranches.vcf -recalFile ERR031030.snp.recal.vcf -o ERR031030.snps.filtered.vcf

        java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T VariantFiltration --variant ERR031030.snps.filtered.vcf -o --filterName "Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5" --filterExpression "HaplotypeScore > 13.0"

        By the way, to obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor"), what command line do you use? any suggestion?


        • #5

          I'm far (honestly) from being an expert in this issue, but as far as I know, if you perform the mutation call for the tumor sample and for the normal sample separately, once you have the final set of mutations for each then you simply substract one from the other to get the somatic ones.

          Of course, there is the option to use a variant caller which is already intended to deal with normal/tumor sample experiments. I've used Varscan, and seems pretty good for SNPs (not so much for indels), altough as I told you I'm not an expert in the field and for instance I was not able to use some of the filters that the second release of the tool made available. I also heard good things about SomaticSniper, but I did not use this one. At the end of the day, I think that this issue is far from being solved, and what many people do is to combine the results of several methods to obtain a list of highly confident somatic mutations.

          Hope it helps,


          • #6
            Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.


            • #7
              Originally posted by id0 View Post
              Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.
              Do I still need to use two aligners as suggested by lh3?


              Latest Articles


              • seqadmin
                Advanced Methods for the Detection of Infectious Disease
                by seqadmin

                The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
                11-27-2023, 01:15 PM
              • seqadmin
                Strategies for Investigating the Microbiome
                by seqadmin

                Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...
                11-09-2023, 07:02 AM





              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 10:35 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 12-05-2023, 02:24 PM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 12-05-2023, 07:37 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 12-04-2023, 08:23 AM
              0 responses
              Last Post seqadmin