  • Detection of somatic mutations in normal & tumour paired NGS data


    I am trying to detect mutations (SNPs, insertions and deletions) from DNA-seq data. I have both reference and tumoral samples.
    I am convinced that a simultaneous comparison of both samples is more rigorous than a "subtraction" method based on independent analyses. Thus, I'm interested in tools allowing simultaneous comparison.

    I have tried VarScan but there are important bugs in it.
    I have found JointSNVMix combined with mutationSeq.

    Could you suggest other tools for this problem?


  • #2
    You might try:
    SomaticSniper, but it will only report SNVs.

    The GATK's somatic indel detector.

    The samtools package's mpileup command plus bcftools in "paired mode".


    • #3
      Jane M,

      Thank you for your message. I must respectfully disagree with your statement that VarScan "has important bugs in it." There are dozens of groups using it with great success to detect variants in humans and model organisms, and to call somatic mutations in cancer datasets.

      However, I did realize that a few of your questions from this thread were outstanding, and I've done my best to answer them:

      I would like to also recommend another tool developed at our institute for somatic mutation calling, SomaticSniper:


      • #4
        Thank you both for your answers.

        Dan, thank you for your answers to my questions on the other topic.
        I must say that my main question/issue isn't solved:
        I'm sure that my "missing reads" have not been filtered out due to low mapping or base quality. And as I specified, some other people have the same problem.

        Could the problem come from the model not being suited to all kinds of data? Or from different versions of the JDK or JVM? Or from libraries or machine configuration?


        • #5

          Check out Bambino; it reports both SNVs and indels, and you can then annotate the calls with ANNOVAR.

          Last edited by patternist; 03-18-2012, 09:12 AM.


          • #6
            Thanks patternist, I didn't know about Bambino! I have found a publication (Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format, from January 2011) but the model is not described in it.
            Do you know where I can find the details concerning what is done in it?


            • #7

              VarScan2 is published; try it out and share your experience.
              It should be better than the first version (mentioned in the paper) and should also outperform SomaticSniper.


              • #8
                I have tried VarScan2 (2.2.8) in both modes (simple and somatic), and my impressions are at the beginning of this topic and here:


                • #9

                  I found this info in the VarScan 2 description:

                  Base alignment quality (BAQ) computation is turned on by default. BAQ is a phred-like score representing the probability that a read base is mis-aligned; it lowers the base quality score of mismatches that are near indels. This is to help rule out false positive SNP calls due to alignment artifacts near small indels. There have been recent suggestions, however, that BAQ may be too strict and cause real SNPs to be missed. Several users of the VarScan variant caller have reported that its read counts disagree with what is seen in IGV, or somatic mutations were missed when mpileup was used instead of pileup. These issues are almost always due to BAQ’s downgrade of base qualities to 0 or 1. This adjustment can’t be seen in IGV, but it’s below VarScan’s default base quality threshold. You can disable BAQ with the -B parameter, or perform a more sensitive BAQ calculation with -E. I’ve heard that the latter option will be turned on by default in the next version of SAMtools.
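To make the two options concrete, here is a minimal sketch that only prints the mpileup command line for each BAQ mode. The -B and -E flags are the ones named in the quoted text; the helper name `mpileup_cmd`, the `-f` reference argument, and the file names are assumptions for illustration, so adapt them to your own data.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a samtools mpileup command.
#   $1 = BAQ mode: "default" (BAQ on), "off" (-B), or "extended" (-E)
#   $2 = reference FASTA, $3 = sorted BAM (both placeholder names here)
mpileup_cmd() {
    case "$1" in
        off)      flag="-B" ;;   # disable BAQ entirely
        extended) flag="-E" ;;   # the more sensitive BAQ calculation
        *)        flag=""   ;;   # leave the samtools default (BAQ on)
    esac
    echo "samtools mpileup ${flag:+$flag }-f $2 $3"
}

# Example: show the three variants for a normal sample.
mpileup_cmd default reference.fa normal_sorted.bam
mpileup_cmd off reference.fa normal_sorted.bam
mpileup_cmd extended reference.fa normal_sorted.bam
```

Redirecting the output of the printed command to a .pileup file, once for the normal and once for the tumor sample, would give the two inputs that VarScan somatic expects.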

                  I hope it helps.


                  • #10
                    Thank you for the info, I hadn't read it.
                    Well, my case is rather this one: "Several users of the VarScan variant caller have reported that its read counts disagree with what is seen in IGV".
                    From what I read, I could solve the problem with the -B or -E parameters.
                    Could you please tell me where you got this info? I am wondering, since Dan Kobold, who is the VarScan maintainer, didn't suggest that to me a few days ago... Was the solution that you found proposed by the author?
                    Last edited by Jane M; 03-22-2012, 07:08 AM.


                    • #11

                      I found it here:
                      Details on the samtools mpileup command, base alignment quality (BAQ), multi-sample calling, and other features.

                      It depends on the samtools parameters; this could be the reason that Kobold didn't find it.




                      • #12
                        Hi airtime,

                        Thanks for the link.
                        Yesterday, I reran samtools with the -B option, then VarScan2, and all the "bugs" (the wrong read counts that I had noticed) are now gone: the counts are correct!

                        So thank you very much for the info! I have been experiencing this issue for 2-3 months and you solved it. Thanks a lot!

                        I must admit that I don't yet understand why this option can change the results so much.
                        For example, at one position, I have:
                        In IGV: 185 (normal sample, reference) 165 (normal sample, variant) 8 (tumoral sample, reference) 359 (tumoral sample, variant)
                        In VarScan2 (without the -B option in samtools): 183 (normal sample, reference) 4 (normal sample, variant) 8 (tumoral sample, reference) 14 (tumoral sample, variant)
                        In VarScan2 (with the -B option in samtools): 184 (normal sample, reference) 164 (normal sample, variant) 8 (tumoral sample, reference) 359 (tumoral sample, variant)
                        I am much more confident in the results now
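To see how large the discrepancy is, the counts above can be turned into variant allele frequencies. This is just a small arithmetic sketch; the helper name `vaf` is made up for illustration, and the numbers are the ones quoted for this position.

```shell
#!/bin/sh
# Hypothetical helper: variant allele frequency (%) from read counts.
#   $1 = reads supporting the reference, $2 = reads supporting the variant
vaf() {
    awk -v r="$1" -v v="$2" 'BEGIN {
        if (r + v == 0) { print "NA"; exit }      # no coverage at all
        printf "%.2f%%\n", 100 * v / (r + v)
    }'
}

echo "IGV, normal:            $(vaf 185 165)"
echo "IGV, tumor:             $(vaf 8 359)"
echo "VarScan no -B, normal:  $(vaf 183 4)"
echo "VarScan no -B, tumor:   $(vaf 8 14)"
echo "VarScan with -B, normal: $(vaf 184 164)"
echo "VarScan with -B, tumor:  $(vaf 8 359)"
```

With BAQ on, the normal sample's variant fraction falls from about 47% to about 2%, so a site that looks heterozygous in IGV appears almost reference-only to the caller.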

                        Now, I should apologize to Dan Kobold... The bugs were not in VarScan, sorry!
                        Dan, you told me that dozens of groups are using VarScan to detect variants. Maybe you could warn them about this issue, because those who are not using the -B or -E option for samtools are probably working with incorrect read counts.

                        The last issue that I'm experiencing with VarScan2 is the strand filter. I am running it this way:
                        java -Xmx10g -jar VarScan.v2.2.8.jar somatic /data/fibros_convertedAB_sorted.pileup /data/296_convertedAB_sorted.pileup --output-snp /data/output_varscan_AB.snp --output-indel /data/output_varscan_AB.indel --min-coverage 10 --min-coverage-normal 10 --min-coverage-tumor 10 --min-var-freq 0.1 --min-freq-for-hom 0.75 --normal-purity 1 --tumor-purity 1 --p-value 0.01 --somatic-p-value 0.01 --strand-filter 1 --min-avg-qual 25 --min-strands2 2 --min-reads2 3
                        then SomaticFilter:
                        java -Xmx20g -jar VarScan.v2.2.8.jar somaticFilter /data/output_varscan_AB.snp --min-strands2 2 --min-avg-qual 25 --min-var-freq 0.1 --p-value 0.05 --indel-file /data/output_varscan_AB.indel --output-file /data/output_somaticFilter_varscan_AB.snp
                        But I get such an output:
                        chrom position ref var normal_reads1 normal_reads2 normal_var_freq normal_gt tumor_reads1 tumor_reads2 tumor_var_freq tumor_gt somatic_status variant_p_value somatic_p_value tumor_reads1_plus tumor_reads1_minus tumor_reads2_plus tumor_reads2_minus
                        chr4 114260538 C T 35 40 53,33% Y 0 86 100% T LOH 1.0 9.50234823641282E-15 0 0 0 86
                        Why has this position not been filtered out by "--strand-filter 1"? To me, there is clearly a strand bias here...
                        Last edited by Jane M; 03-23-2012, 02:54 AM.


                        • #13

                          Thank you for this detailed post, and for following up on this strand question. Your site is homozygous in the tumor (due to LOH) but VarScan's strand filter currently only works on sites that are heterozygous in the tumor.

                          This is because it compares the strand representation of the reference allele to the strand representation of the variant allele. If no reference alleles are seen in the tumor, that comparison can't be made.
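For sites like this one, where no reference reads are left in the tumor, a crude stopgap is to look only at the strand split of the variant-supporting reads. The sketch below is not VarScan's own filter; the helper name `strand_flag` and the 0.1 cutoff are assumptions, and it relies on the column order shown in the output header above (tumor_reads2_plus and tumor_reads2_minus as columns 18 and 19).

```shell
#!/bin/sh
# Hypothetical helper: flag sites whose variant reads sit almost entirely
# on one strand. Reads VarScan somatic output on stdin.
#   $1 = minimum fraction of variant reads required on the minor strand
strand_flag() {
    awk -v min_frac="$1" '
    NR == 1 && $1 == "chrom" { print $0 "\tstrand_flag"; next }  # keep header
    {
        plus  = $18 + 0          # tumor_reads2_plus
        minus = $19 + 0          # tumor_reads2_minus
        total = plus + minus
        minor = (plus < minus) ? plus : minus
        # Flag only when there are variant reads and the minor strand is rare.
        flag = (total > 0 && minor / total < min_frac) ? "STRAND_BIAS" : "OK"
        print $0 "\t" flag
    }'
}

# The LOH site from the post above: all 86 variant reads on the minus strand.
echo 'chr4 114260538 C T 35 40 53,33% Y 0 86 100% T LOH 1.0 9.50234823641282E-15 0 0 0 86' |
    strand_flag 0.1
```

A fixed fraction like this ignores depth, so at low coverage it will flag sites by chance; it is meant as a first pass before a proper read-level filter.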

                          Your comment has me thinking, however, that the strand filtering capabilities in VarScan need some improvement. I'll work on that for the next release.

                          In the meantime, you might try the filtering strategy we outlined in the VarScan 2 paper, in which you run bam-readcount on all sites and then process the results with the VarScan 2 accessory script


                          • #14
                            For the record, I observed the same thing with mpileup and BAQ calculations on a few occasions. Specifically, I observed that some SNPs that were called fine with pileup were vanishing in mpileup, including some which had been verified with Sanger sequencing. When I looked at the pileup files made by mpileup and compared them to the .sam files, it was clear that mpileup was representing the quality scores of the alternate bases as being almost 0, while in the .sam the quality scores were high. The older pileup was faithfully carrying over the quality scores in the pileup output file. After a little investigating, I saw that the BAQ calculations, which are on by default in mpileup, were responsible. When I disabled them with -B, the quality scores in the pileup output files matched the quality scores in the .sam files, and the SNPs were callable.
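A quick way to check for this in your own files is to count, per position, how many base qualities in the pileup fall below the caller's threshold. The sketch below assumes single-sample pileup lines with six tab-separated columns (chromosome, position, reference base, depth, read bases, base qualities) and Phred+33 encoding; the helper name `low_qual_counts` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: for each pileup line on stdin, print
#   chrom, position, number of qualities, number below the threshold.
#   $1 = minimum base quality (Phred scale)
low_qual_counts() {
    awk -F '\t' -v min_q="$1" '
    BEGIN {
        # awk has no ord(), so build a character-to-code lookup table.
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    {
        low = 0
        n = length($6)                      # column 6 = base qualities
        for (i = 1; i <= n; i++)
            if (ord[substr($6, i, 1)] - 33 < min_q) low++
        printf "%s\t%s\t%d\t%d\n", $1, $2, n, low
    }'
}

# Toy line: two bases at Q40 ("I") and two pushed down to Q0 and Q2 ("!" and "#").
printf 'chr1\t100\tC\t4\t.,Tt\tII!#\n' | low_qual_counts 15
```

Running it on pileups made with and without -B for the same region should show whether the low-quality counts differ between the two.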


                            • #15
                              Thank you for the explanation Dan. Do you do a FET for the strand filter on sites that are heterozygous in the tumor? I guess that you can take the number of reads supporting the reference (in forward and reverse strands) in tumoral sample as theoretical counts and the number of reads supporting the variant (in forward and reverse strands) in tumoral sample as observed counts...
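To make the question concrete, this is roughly what a two-sided Fisher's exact test on the four tumor strand counts would look like. Whether VarScan does it this way is exactly what I am asking, so treat it as a generic sketch; the helper name `fisher_strand` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: two-sided Fisher's exact test on a 2x2 strand table.
#   $1 = ref reads on plus,  $2 = ref reads on minus
#   $3 = var reads on plus,  $4 = var reads on minus
fisher_strand() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" '
    # log(n!) by direct summation; adequate for read-depth-sized counts
    function lfact(m,    i, s) { s = 0; for (i = 2; i <= m; i++) s += log(i); return s }
    # log hypergeometric probability of one table with the observed margins
    function lprob(n11, n12, n21, n22,    num, den) {
        num = lfact(n11 + n12) + lfact(n21 + n22) + lfact(n11 + n21) + lfact(n12 + n22)
        den = lfact(n11) + lfact(n12) + lfact(n21) + lfact(n22) + lfact(n11 + n12 + n21 + n22)
        return num - den
    }
    BEGIN {
        r1 = a + b; c1 = a + c; n = a + b + c + d
        obs = lprob(a, b, c, d)
        lo = (c1 - (n - r1) > 0) ? c1 - (n - r1) : 0
        hi = (r1 < c1) ? r1 : c1
        p = 0
        # Sum the probabilities of all tables no more likely than the observed one.
        for (k = lo; k <= hi; k++) {
            lp = lprob(k, r1 - k, c1 - k, n - r1 - c1 + k)
            if (lp <= obs + 1e-7) p += exp(lp)
        }
        if (p > 1) p = 1
        printf "%.4g\n", p
    }'
}

fisher_strand 3 0 0 3      # fully separated strands in a tiny table
fisher_strand 10 10 10 10  # perfectly balanced strands
```

With zero reference reads in the tumor (as at my chr4 site) the first row is empty and the test always returns 1, which matches the explanation that the comparison cannot be made there.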

                              I don't have enough experience yet, but I assume that handling only the cases where sites are heterozygous in the tumor means filtering only about half of the data? Or is it known that, in tumours, we observe more heterozygous sites than homozygous mutated sites?

                              I have developed a basic filter to handle the "other half" of the cases, but for now, it's not very good. When do you think the next release will be ready? Any idea?

                              I will try bam-readcount to filter out more false positives!

