Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Discussion about MuTect and its parameters

    Hello,

    I am testing MuTect for a few days. I wanted to create this topic to gather our experience together and discuss results, choice of parameters...

    For me, the first difficulty was to get the bam usable by MuTect.
    The second difficulty is to find a correct file for the --intervals parameter. I generated a .bed file from the UCSC table on the advise of shyam_la, because I couldn't run MuTect with --intervals chr1:1500-2500; chr2:2500-3500 or by providing this information in a txt file... Has one of you experienced this issue?

    Then, the problem is, as always, the choice of parameters. To me, one inconvenient is that we don't know what are the parameters by default...

    At first, I ran Mutect this way:
    java -Xmx10g -jar muTect-1.0.27783.jar --analysis_type MuTect --reference_sequence ~/fasta/hg19.fasta -B:dbsnp,VCF dbsnp_132_b37.leftAligned.vcf --intervals hgTables_exons_plus_50.bed -B:cosmic,VCF hg19_cosmic_v54_120711.vcf --input_file:normal /data/patient/GatkSomaticIndelDetector/picard_s-fibros_converted_sorted.bam --input_file:tumor /data/patient/GatkSomaticIndelDetector/picard_s-296_converted_sorted.bam --out /data/patient/example_MuTect.call_stats.txt --coverage_file /data/patient/example_MuTect.coverage.wig.txt --fraction_contamination 0
    When I keep the "KEEP" and "COVERED" variants, I still have 11500 variants (when using a bed file of exons+50bp as --intervals parameter). I have such results:
    contig position ref_allele alt_allele score dbsnp_site covered power tumor_power normal_power total_pairs improper_pairs map_Q0_reads t_lod_fstar tumor_f contaminant_fraction contaminant_lod t_ref_count t_alt_count t_ref_sum t_alt_sum t_ins_count t_del_count normal_best_gt init_n_lod n_ref_count n_alt_count n_ref_sum n_alt_sum judgement
    chr1 9097687 C A 0 NOVEL COVERED 1 1 1 480 3 0 6.992835 0.016484 0 0 179 3 6856 122 0 0 CC 75.275321 263 1 10049 37 KEEP
    I don't know why it appears as variant...

    Then, I added the parameters:
    Code:
    --fraction_contamination 0.02 --min_qscore 20 --clipping_bias_pvalue_threshold 0.05
    Now, when I keep the "KEEP" and "COVERED" variants, I have 890 (instead of 11500) variants left. That's better but I still have "non significant" variant. Here we see the importance of the parameters

    About the parameters:
    The --clipping_bias_pvalue_threshold parameter fixes a threshold for the FET. That's a pity that the p-value is not provided because it is important to adjust it; I will have to perform the FET by myself in order to be able to adjust the test...

    I haven't found a parameter for filtering on the coverage. I don't want to keep variants with less than 10 reads in one of the 2 samples. Of course, I can do it by myself, but it could be implemented... Maybe I haven't found it in the parameters' list.

    I would like to ask the users of Mutect to share their experience and tell us which parameters they used. There is a list of 40/50 parameters and I would appreciate feedbacks on some of them

    Thank you,
    Jane
    Last edited by Jane M; 06-27-2012, 07:40 AM.

  • #2
    Hey Jane,

    Where did you find the documentation for these 40/50 additional parameters? I did not find them on the MuTect website. I am going to run it with fraction_contamination as my tumor sample is not as pure as yours since I am working with tiny solid tumors (I am assuming yours are leukemias given the very low contamination amount)..

    Thank you for that new piece of info!

    Comment


    • #3
      I'm reporting back on the parameter "fraction_contamination". It reduced my variant number by 25%, which is natural given my contamination was somewhere between 0.15 and 0.2.
      Also the tumor alternate allelic fraction (t_alt_count / t_ref_count), has improved significantly..

      Comment


      • #4
        Originally posted by shyam_la View Post
        Hey Jane,

        Where did you find the documentation for these 40/50 additional parameters? I did not find them on the MuTect website.
        Simply:
        java -Xmx10g -jar muTect-1.0.27783.jar --analysis_type MuTect
        usage: java -jar muTect-1.0.27783.jar -T <analysis_type> [-I <input_file>] [-SM <sample_metadata>] [-rbs <read_buffer_size>]
        [-et <phone_home>] [-rf <read_filter>] [-L <intervals>] [-XL <excludeIntervals>] [-R <reference_sequence>] [-B
        <rodBind>] [-BTI <rodToIntervalTrackName>] [-BTIMR <BTI_merge_rule>] [-ndrs] [-D <DBSNP>] [-dt <downsampling_type>]
        [-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-baq <baq>] [-baqGOP <baqGapOpenPenalty>] [-PF
        <performanceLog>] [-OQ] [-DBQ <defaultBaseQualities>] [-S <validation_strictness>] [-U <unsafe>] [-nt <num_threads>]
        [-im <interval_merging>] [-rgbl <read_group_black_list>] [--disable_experimental_low_memory_sharding] [-l
        <logging_level>] [-log <log_to_file>] [-h] [-o <out>] [--noop] [--tumor_sample_name <tumor_sample_name>]
        [--bam_tumor_sample_name <bam_tumor_sample_name>] [--normal_sample_name <normal_sample_name>] [--force_output]
        [--force_alleles] [--initial_tumor_lod <initial_tumor_lod>] [--initial_tumor_fstar_lod <initial_tumor_fstar_lod>]
        [--tumor_lod <tumor_lod>] [--fraction_contamination <fraction_contamination>] [--minimum_mutation_cell_fraction
        <minimum_mutation_cell_fraction>] [--normal_lod <normal_lod>] [--dbsnp_normal_lod <dbsnp_normal_lod>]
        [--minimum_normal_allele_fraction <minimum_normal_allele_fraction>] [--tumor_f_pretest <tumor_f_pretest>] [--min_qscore
        <min_qscore>] [--gap_events_threshold <gap_events_threshold>] [--heavily_clipped_read_fraction
        <heavily_clipped_read_fraction>] [--clipping_bias_pvalue_threshold <clipping_bias_pvalue_threshold>]
        [--min_mutant_sum_pretest <min_mutant_sum_pretest>] [--fraction_mapq0_threshold <fraction_mapq0_threshold>]
        [--pir_median_threshold <pir_median_threshold>] [--pir_mad_threshold <pir_mad_threshold>]
        [--max_alt_alleles_in_normal_count <max_alt_alleles_in_normal_count>] [--max_alt_alleles_in_normal_qscore_sum
        <max_alt_alleles_in_normal_qscore_sum>] [--max_alt_allele_in_normal_fraction <max_alt_allele_in_normal_fraction>] [-cov
        <coverage_file>] [-cov_q20 <coverage_20_q20_file>] [-pow <power_file>] [-tdf <tumor_depth_file>] [-ndf
        <normal_depth_file>] [--power_constant_qscore <power_constant_qscore>] [--absolute_copy_number_data
        <absolute_copy_number_data>] [--power_constant_af <power_constant_af>]

        -T,--analysis_type <analysis_type> Type of analysis to run
        -I,--input_file <input_file> SAM or BAM file(s)
        -SM,--sample_metadata <sample_metadata> Sample file(s) in JSON format
        -rbs,--read_buffer_size <read_buffer_size> Number of reads per SAM file to buffer in memory
        -et,--phone_home <phone_home> What kind of GATK run report should we generate? Standard is the default, can be verbose or NO_ET so nothing is posted to the run repository (NO_ET|STANDARD|STDOUT|AWS_S3)
        -rf,--read_filter <read_filter> Specify filtration criteria to apply to each read
        individually.
        -L,--intervals <intervals> A list of genomic intervals over which to operate. Can be explicitly specified on the command line or in a file.
        -XL,--excludeIntervals <excludeIntervals> A list of genomic intervals to exclude from processing. Can be explicitly specified on the command line or in a file.
        -R,--reference_sequence <reference_sequence> Reference sequence file
        -B,--rodBind <rodBind> Bindings for reference-ordered data, in the form :<name>,<type> <file>
        -BTIMR,--BTI_merge_rule <BTI_merge_rule> Indicates the merging approach the interval parser should use to combine the BTI track with other -L options (UNION| INTERSECTION)
        -ndrs,--nonDeterministicRandomSeed Makes the GATK behave non deterministically, that is, the
        random numbers generated will be different in every run
        CALCULATE_AS_NECESSARY|RECALCULATE)
        etc ...
        You can find these info in most of the java scripts (VarScan2 for example).

        I am going to run it with fraction_contamination as my tumor sample is not as pure as yours since I am working with tiny solid tumors (I am assuming yours are leukemias given the very low contamination amount)..
        Good guess

        Thank you for that new piece of info!
        You are welcome, that's the aim of the topic!

        Comment


        • #5
          Oh, thank you! I will try out a few of those parameters too..

          Comment


          • #6
            Thank very much for the information. I wonder if you can show the default values of those parameters such as tumor_f_pretest, initial_tumor_lob, tumor_lob etc... I followed the documentation on Mutect website and it seems like using default values is fine. But I want to know these values nevertheless. Thank in advance.

            Comment


            • #7
              Was anyone able to get the default values for all the parameters of mutect ?
              Also is there any parameter in mutect (which I do not find using the -h option) which could change the cutoffs for read counts (total and alt allele) for the tumor sample ?
              Also it seems that the --minimum_normal_allele_fraction parameter is not functioning (has no effect) or may be I do not understand it properly. Any inputs would be appreciated.
              Thanks

              Comment


              • #8
                I am using GenomeAnalysis-3.1-1/GenomeAnalysisTK.jar -T BaseRecalibrator --defaultBaseQualities 1; is it okay ? My samples are from PE101 Illumina of Cancer Cell lines with no normal control.
                My questions are: --defaultBaseQualities 1
                1. can I manually filter out low quality bases later ?
                2. I think, using --defaultBaseQualities 1 will keep all bad/low quality bases and I can mannualy filter them out later ...
                May someone please guide me...
                thank you

                Comment


                • #9
                  Mutect default parameters

                  Hello everyone,

                  I see the thread is little bit old but I guess there are even more people interested in Mutect now so hope the info could be useful.

                  About Mutect default parameters. I was also searching for a long.
                  Some filters are described in Mutect paper (http://www.nature.com/nbt/journal/v3.../nbt.2514.html).

                  And for the rest: I just realized that they are all listed in the vcf header.
                  Eg: clipping_bias_pvalue_threshold=0.05 ; tumor_f_pretest=0.0050 ; initial_tumor_lod=4.0 ; tumor_lod=6.3 ...
                  The thing is that Mutect doesn't output vcf by default (use -vcf parameter) - maybe that's why it took me so long to find it out

                  Comment


                  • #10
                    Hi all,

                    I realise this is an old thread, but I too only recently dove into MuTect.
                    What we've found out so far (running mutect 1.1.4), is that when you use the options:
                    java -Xmx4g -jar ../MuTect/muTect-1.1.4.jar -h
                    It will show the help options of GATK, not MuTect (mutect is based on the GATK framework
                    as far as I saw.

                    output shown here:
                    ---------------------------------------------------------------------------------
                    The Genome Analysis Toolkit (GATK) v2.2-25-g2a68eab, Compiled 2012/11/08 10:30:02
                    Copyright (c) 2010 The Broad Institute
                    For support and documentation go to http://www.broadinstitute.org/gatk
                    ---------------------------------------------------------------------------------
                    ---------------------------------------------------------------------------------
                    usage: java -jar muTect-1.1.4.jar -T <analysis_type> [-args <arg_file>] [-I <input_file>] [-rbs <read_buffer_size>] [-et
                    <phone_home>] [-K <gatk_key>] [-tag <tag>] [-rf <read_filter>] [-L <intervals>] [-XL <excludeIntervals>] [-isr
                    <interval_set_rule>] [-im <interval_merging>] [-ip <interval_padding>] [-R <reference_sequence>] [-ndrs]
                    [--disableRandomization] [-maxRuntime <maxRuntime>] [-maxRuntimeUnits <maxRuntimeUnits>] [-dt <downsampling_type>]
                    [-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-baq <baq>] [-baqGOP <baqGapOpenPenalty>] [-PF
                    <performanceLog>] [-OQ] [-BQSR <BQSR>] [-DIQ] [-EOQ] [-preserveQ <preserve_qscores_less_than>] [-DBQ
                    <defaultBaseQualities>] [-S <validation_strictness>] [-rpr] [-kpr] [-U <unsafe>] [-nt <num_threads>] [-nct
                    <num_cpu_threads_per_data_thread>] [-mte] [-bfh <num_bam_file_handles>] [-rgbl <read_group_black_list>] [-ped
                    <pedigree>] [-pedString <pedigreeString>] [-pedValidationType <pedigreeValidationType>] [-l <logging_level>] [-log
                    <log_to_file>] [-h]

                    -T,--analysis_type <analysis_type> Type of analysis to run
                    -args,--arg_file <arg_file> Reads arguments from the specified file
                    -I,--input_file <input_file> SAM or BAM file(s)
                    -rbs,--read_buffer_size <read_buffer_size> Number of reads per SAM file to buffer in
                    memory
                    -et,--phone_home <phone_home> What kind of GATK run report should we
                    generate? STANDARD is the default, can be
                    NO_ET so nothing is posted to the run
                    repository. Please see
                    -phone-home-and-how-does-it-affect-me#latest
                    for details. (NO_ET|STANDARD|STDOUT)
                    -K,--gatk_key <gatk_key> GATK Key file. Required if running with -et
                    NO_ET. Please see
                    -phone-home-and-how-does-it-affect-me#latest
                    for details.
                    -tag,--tag <tag> Arbitrary tag string to identify this GATK
                    run as part of a group of runs, for later
                    analysis
                    -rf,--read_filter <read_filter> Specify filtration criteria to apply to each
                    read individually
                    -L,--intervals <intervals> One or more genomic intervals over which to
                    operate. Can be explicitly specified on the
                    command line or in a file (including a rod
                    file)
                    -XL,--excludeIntervals <excludeIntervals> One or more genomic intervals to exclude
                    from processing. Can be explicitly specified
                    on the command line or in a file (including
                    a rod file)
                    -isr,--interval_set_rule <interval_set_rule> Indicates the set merging approach the
                    interval parser should use to combine the
                    various -L or -XL inputs (UNION|
                    INTERSECTION)
                    -im,--interval_merging <interval_merging> Indicates the interval merging rule we
                    should use for abutting intervals (ALL|
                    OVERLAPPING_ONLY)
                    -ip,--interval_padding <interval_padding> Indicates how many basepairs of padding to
                    include around each of the intervals
                    specified with the -L/--intervals argument
                    -R,--reference_sequence <reference_sequence> Reference sequence file
                    -ndrs,--nonDeterministicRandomSeed Makes the GATK behave non deterministically,
                    that is, the random numbers generated will
                    be different in every run
                    --disableRandomization Completely eliminates randomization from
                    nondeterministic methods. To be used mostly
                    in the testing framework where dynamic
                    parallelism can result in differing numbers
                    of calls to the generator.
                    -maxRuntime,--maxRuntime <maxRuntime> If provided, that GATK will stop execution
                    cleanly as soon after maxRuntime has been
                    exceeded, truncating the run but not exiting
                    with a failure. By default the value is
                    interpreted in minutes, but this can be
                    changed by maxRuntimeUnits
                    -maxRuntimeUnits,--maxRuntimeUnits <maxRuntimeUnits> The TimeUnit for maxRuntime (NANOSECONDS|
                    MICROSECONDS|MILLISECONDS|SECONDS|MINUTES|
                    HOURS|DAYS)
                    -dt,--downsampling_type <downsampling_type> Type of reads downsampling to employ at a
                    given locus. Reads will be selected
                    randomly to be removed from the pile based
                    on the method described here (NONE|ALL_READS|
                    BY_SAMPLE)
                    -dfrac,--downsample_to_fraction <downsample_to_fraction> Fraction [0.0-1.0] of reads to downsample to
                    -dcov,--downsample_to_coverage <downsample_to_coverage> Coverage [integer] to downsample to at any
                    given locus; note that downsampled reads are
                    randomly selected from all possible reads at
                    a locus
                    -baq,--baq <baq> Type of BAQ calculation to apply in the
                    engine (OFF|CALCULATE_AS_NECESSARY|
                    RECALCULATE)
                    -baqGOP,--baqGapOpenPenalty <baqGapOpenPenalty> BAQ gap open penalty (Phred Scaled).
                    Default value is 40. 30 is perhaps better
                    for whole genome call sets
                    -PF,--performanceLog <performanceLog> If provided, a GATK runtime performance log
                    will be written to this file
                    -OQ,--useOriginalQualities If set, use the original base quality scores
                    from the OQ tag when present instead of the
                    standard scores
                    -BQSR,--BQSR <BQSR> The input covariates table file which
                    enables on-the-fly base quality score
                    recalibration
                    -DIQ,--disable_indel_quals If true, disables printing of base insertion
                    and base deletion tags (with -BQSR)
                    -EOQ,--emit_original_quals If true, enables printing of the OQ tag with
                    the original base qualities (with -BQSR)
                    -preserveQ,--preserve_qscores_less_than <preserve_qscores_less_than> Bases with quality scores less than this
                    threshold won't be recalibrated (with -BQSR)
                    -DBQ,--defaultBaseQualities <defaultBaseQualities> If reads are missing some or all base
                    quality scores, this value will be used for
                    all base quality scores
                    -S,--validation_strictness <validation_strictness> How strict should we be with validation
                    (STRICT|LENIENT|SILENT)
                    -rpr,--remove_program_records Should we override the Walker's default and
                    remove program records from the SAM header
                    -kpr,--keep_program_records Should we override the Walker's default and
                    keep program records from the SAM header
                    -U,--unsafe <unsafe> If set, enables unsafe operations: nothing
                    will be checked at runtime. For expert
                    users only who know what they are doing. We
                    do not support usage of this argument.
                    (ALLOW_UNINDEXED_BAM|
                    ALLOW_UNSET_BAM_SORT_ORDER|
                    NO_READ_ORDER_VERIFICATION|
                    ALLOW_SEQ_DICT_INCOMPATIBILITY|
                    LENIENT_VCF_PROCESSING|ALL)
                    -nt,--num_threads <num_threads> How many data threads should be allocated to
                    running this analysis.
                    -nct,--num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread> How many CPU threads should be allocated per
                    data thread to running this analysis?
                    -mte,--monitorThreadEfficiency Enable GATK threading efficiency monitoring
                    -bfh,--num_bam_file_handles <num_bam_file_handles> The total number of BAM file handles to keep
                    open simultaneously
                    -rgbl,--read_group_black_list <read_group_black_list> Filters out read groups matching
                    <TAG>:<STRING> or a .txt file containing the
                    filter strings one per line.
                    -ped,--pedigree <pedigree> Pedigree files for samples
                    -pedString,--pedigreeString <pedigreeString> Pedigree string for samples
                    -pedValidationType,--pedigreeValidationType <pedigreeValidationType> How strict should we be in validating the
                    pedigree information? (STRICT|SILENT)
                    -l,--logging_level <logging_level> Set the minimum level of logging, i.e.
                    setting INFO get's you INFO up to FATAL,
                    setting ERROR gets you ERROR and FATAL level
                    logging.
                    -log,--log_to_file <log_to_file> Set the logging location
                    -h,--help Generate this help message
                    To get the mutect specific options, add -T MuTect to the command like this:
                    java -Xmx4g -jar ../MuTect/muTect-1.1.4.jar -T MuTect -h
                    Which will now show additional options like shown below.
                    Arguments for MalformedReadFilter:
                    -filterMBQ,--filter_mismatching_base_and_quals if a read has mismatching number of bases and base qualities, filter
                    out the read instead of blowing up.

                    Arguments for MuTect:
                    --noop used for debugging, basically
                    exit as soon as we get the
                    reads
                    --tumor_sample_name <tumor_sample_name> name to use for tumor in
                    output files
                    --bam_tumor_sample_name <bam_tumor_sample_name> if the tumor bam contains
                    multiple samples, only use
                    read groups with SM equal to
                    this value
                    --normal_sample_name <normal_sample_name> name to use for normal in
                    output files
                    --force_output force output for each site
                    --force_alleles force output for all alleles
                    at each site
                    --only_passing_calls only emit passing calls
                    --initial_tumor_lod <initial_tumor_lod> Initial LOD threshold for
                    calling tumor variant
                    --tumor_lod <tumor_lod> LOD threshold for calling
                    tumor variant
                    --fraction_contamination <fraction_contamination> estimate of fraction (0-1) of
                    physical contamination with
                    other unrelated samples
                    --minimum_mutation_cell_fraction <minimum_mutation_cell_fraction> minimum fraction of cells
                    which are presumed to have a
                    mutation, used to handle
                    non-clonality and
                    contamination
                    --normal_lod <normal_lod> LOD threshold for calling
                    normal non-germline
                    --dbsnp_normal_lod <dbsnp_normal_lod> LOD threshold for calling
                    normal non-variant at dbsnp
                    sites
                    --somatic_classification_normal_power_threshold Power threshold for normal to
                    <somatic_classification_normal_power_threshold> determine germline vs variant
                    --minimum_normal_allele_fraction <minimum_normal_allele_fraction> minimum allele fraction to be
                    considered in normal, useful
                    for normal sample contaminated
                    with tumor
                    --tumor_f_pretest <tumor_f_pretest> for computational efficiency,
                    reject sites with allelic
                    fraction below this threshold
                    --min_qscore <min_qscore> threshold for minimum base
                    quality score
                    --gap_events_threshold <gap_events_threshold> how many gapped events
                    (ins/del) are allowed in
                    proximity to this candidate
                    --heavily_clipped_read_fraction <heavily_clipped_read_fraction> if this fraction or more of
                    the bases in a read are
                    soft/hard clipped, do not use
                    this read for mutation calling
                    --clipping_bias_pvalue_threshold <clipping_bias_pvalue_threshold> pvalue threshold for fishers
                    exact test of clipping bias in
                    mutant reads vs ref reads
                    --fraction_mapq0_threshold <fraction_mapq0_threshold> threshold for determining if
                    there is relatedness between
                    the alt and ref allele read
                    piles
                    --pir_median_threshold <pir_median_threshold> threshold for clustered read
                    position artifact median
                    --pir_mad_threshold <pir_mad_threshold> threshold for clustered read
                    position artifact MAD
                    --required_maximum_alt_allele_mapping_quality_score required minimum value for
                    <required_maximum_alt_allele_mapping_quality_score> tumor alt allele maximum
                    mapping quality score
                    --max_alt_alleles_in_normal_count <max_alt_alleles_in_normal_count> threshold for maximum
                    alternate allele counts in
                    normal
                    --max_alt_alleles_in_normal_qscore_sum <max_alt_alleles_in_normal_qscore_sum> threshold for maximum
                    alternate allele quality score
                    sum in normal
                    --max_alt_allele_in_normal_fraction <max_alt_allele_in_normal_fraction> threshold for maximum
                    alternate allele fraction in
                    normal
                    --power_constant_qscore <power_constant_qscore> Phred scale quality score
                    constant to use in power
                    calculations
                    --absolute_copy_number_data <absolute_copy_number_data> Absolute Copy Number Data, as
                    defined by Absolute, to use in
                    power calculations
                    --power_constant_af <power_constant_af> Allelic fraction constant to
                    use in power calculations
                    -o,--out <out> Call-stats output
                    -vcf,--vcf <vcf> VCF output of mutation
                    candidates
                    -dbsnp,--dbsnp <dbsnp> VCF file of DBSNP information
                    -cosmic,--cosmic <cosmic> VCF file of COSMIC sites
                    -cov,--coverage_file <coverage_file> write out coverage in WIGGLE
                    format to this file
                    -cov_q20,--coverage_20_q20_file <coverage_20_q20_file> write out 20x of Q20 coverage
                    in WIGGLE format to this file
                    -pow,--power_file <power_file> write out power in WIGGLE
                    format to this file
                    -tdf,--tumor_depth_file <tumor_depth_file> write out tumor read depth in
                    WIGGLE format to this file
                    -ndf,--normal_depth_file <normal_depth_file> write out normal read depth in
                    WIGGLE format to this file

                    Available Reference Ordered Data types:
                    Name FeatureType Documentation
                    BCF2 VariantContext http://www.broadinstitute.org/gatk/g...BCF2Codec.html
                    BED BEDFeature http://www.broadinstitute.org/gatk/g..._BEDCodec.html
                    EXAMPLEBINARY Feature http://www.broadinstitute.org/gatk/g...naryCodec.html
                    GELITEXT GeliTextFeature http://www.broadinstitute.org/gatk/g...TextCodec.html
                    OLDDBSNP OldDbSNPFeature http://www.broadinstitute.org/gatk/g...bSNPCodec.html
                    VCF VariantContext http://www.broadinstitute.org/gatk/g..._VCFCodec.html

                    For a full description of this walker, see its GATKdocs at:
                    http://www.broadinstitute.org/gatk/g...ct_MuTect.html

                    So yes, I found the documentation not very helpfull at all, and I'm struggling with the package as well. Hope this info helps.

                    - Frank.

                    Comment


                    • #11
                      Mutect Parameters

                      I'm not sure how pertinent this will be considering the age of the thread but I thought I would reply.
                      I found the Mutect default parameters by running Mutect once with the following parameters:
                      --enable_extended_output \
                      --vcf
                      This is not the default. The resulting vcf header will have the defaults thresholds for Mutect. Here's an example of clip of my vcf header. THIS JUST AN EXAMPLE. I tweaked some of these parameters so you are not viewing the defaults. So make sure you run your own version of Mutect and look at the vcf headerto find the defaults. It seems silly to bury it in the vcf header and not posting it ANYWHERE else.
                      ...
                      downsample_to_coverage=1000 enable_experimental_downsampling=false baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 defaultBaseQualities=-1 validation_strictness=SILENT remove_program_records=false keep_program_records=false unsafe=null num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false logging_level=INFO log_to_file=null help=false noop=false enable_extended_output=true artifact_detection_mode=false tumor_sample_name=1002_tumor_52 bam_tumor_sample_name=null normal_sample_name=1002_Normal_53 force_output=false force_alleles=false only_passing_calls=false initial_tumor_lod=4.0 tumor_lod=6.3 fraction_contamination=0.02 minimum_mutation_cell_fraction=0.0 normal_lod=2.2 normal_artifact_lod=1.0 strand_artifact_lod=2.0 strand_artifact_power_threshold=0.9 dbsnp_normal_lod=5.5 somatic_classification_normal_power_threshold=0.95 minimum_normal_allele_fraction=0.0 tumor_f_pretest=0.0050 min_qscore=5 gap_events_threshold=3 heavily_clipped_read_fraction=0.3 clipping_bias_pvalue_threshold=0.05 fraction_mapq0_threshold=0.5 pir_median_threshold=10.0 pir_mad_threshold=3.0 required_maximum_alt_allele_mapping_quality_score=20 max_alt_alleles_in_normal_count=2 max_alt_alleles_in_normal_qscore_sum=20 max_alt_allele_in_normal_fraction=0.03 power_constant_qscore=30 absolute_copy_number_data=null power_constant_af=0.30000001192092896 vcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub
                      ...

                      So it's a complicated process to figure out why a single mutation may have failed Mutect filters. Here's how I do it.
                      1) Check your Mutect callstats.txt file and find your mutation. Check out the "failure reason" column and it should give you a reason, ie. normal_lod, f_star_tumor_lod, alt_allele_in_normal, etc.
                      2) Go to this website: http://gatkforums.broadinstitute.org...date-mutations
                      This site connects the failure reason to the column name in the callstats.txt output that it's associated with.
                      For example, if the failure reason is "alt_allele_in_normal" then go to the "n_alt_count" or "normal_f" column in your extended output callstats.txt file to find the value.
                      3) Look at the vcf header to find the default threshold:
                      (from above)
                      max_alt_alleles_in_normal_count=2
                      max_alt_allele_in_normal_fraction=0.03
                      4) Rerun mutect with these thresholds lowered accordingly to include your mutation. So adding parameters to your command like:
                      --max_alt_alleles_in_normal_count=5 \
                      --max_alt_allele_in_normal_fraction=0.2

                      You could also filter callstats.txt manually if you weren't concerned about getting corrected VCF and other files.

                      It's a wonder that Mutect finds anything interesting at all with the stringency of their filters. 90% of the time Mutect defaults filter out my primary somatic mutation in most of the cancer types I study. This is most likely because of the heterogeity of the tumor and the often mixed tumor contamainated nature of the paired normal. However, even with a lowered threshold of alt allelic fraction to 20% and the a count of about 6, it still will sometimes miss some mutations that are near indels. There's is a gap_threshold parameter in Mutect but if you've followed the GATK best practices protocol and realigned around indels those gaps are pretty frequent, and I've noticed a decrease sensitivity in mutation detection around these areas, while lowering the default threshold will explode the number of false positives.
                      Broad has yet to fix it's deprecated indel detector, but I predict future best practices pipeline will start with an indel caller first which will be used by the SNP detector (Mutect) to discover SNPs. Knowing the location of the indels, might fix these problems.

                      Comment


                      • #12
                        Mutect Parameters

                        I'm not sure how pertinent this will be considering the age of the thread but I thought I would reply, since I wasn't able to find any sources of information when I was struggling with this.
                        I found the Mutect default filters by running Mutect once with the following parameters:
                        --enable_extended_output \
                        --vcf
                        This is not the default. The resulting vcf header will have the defaults thresholds for Mutect. Here's an example of a clip of my vcf header. THIS JUST AN EXAMPLE. I tweaked some of these parameters so you are not viewing the defaults. So make sure you run your own version of Mutect and look at the vcf header to find the defaults. It seems silly to bury it in the vcf header and not posting it ANYWHERE else.

                        ##fileformat=VCFv4.1
                        ##FILTER=<ID=PASS,Description="Accept as a confident somatic mutation">
                        ...
                        ##MuTect="analysis_type=MuTect ...
                        ...
                        downsample_to_coverage=1000 enable_experimental_downsampling=false baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 defaultBaseQualities=-1 validation_strictness=SILENT remove_program_records=false keep_program_records=false unsafe=null num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false logging_level=INFO log_to_file=null help=false noop=false enable_extended_output=true artifact_detection_mode=false tumor_sample_name=1002_tumor_52 bam_tumor_sample_name=null normal_sample_name=1002_Normal_53 force_output=false force_alleles=false only_passing_calls=false initial_tumor_lod=4.0 tumor_lod=6.3 fraction_contamination=0.02 minimum_mutation_cell_fraction=0.0 normal_lod=2.2 normal_artifact_lod=1.0 strand_artifact_lod=2.0 strand_artifact_power_threshold=0.9 dbsnp_normal_lod=5.5 somatic_classification_normal_power_threshold=0.95 minimum_normal_allele_fraction=0.0 tumor_f_pretest=0.0050 min_qscore=5 gap_events_threshold=3 heavily_clipped_read_fraction=0.3 clipping_bias_pvalue_threshold=0.05 fraction_mapq0_threshold=0.5 pir_median_threshold=10.0 pir_mad_threshold=3.0 required_maximum_alt_allele_mapping_quality_score=20 max_alt_alleles_in_normal_count=2 max_alt_alleles_in_normal_qscore_sum=20 max_alt_allele_in_normal_fraction=0.03 power_constant_qscore=30 absolute_copy_number_data=null power_constant_af=0.30000001192092896 vcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub
                        ...

                        So to figure out why a single mutation may have failed Mutect filters is a complicated process. Here's how I do it.
                        1) Check your Mutect callstats.txt file and find your mutation. Check out the "failure reason" column and it should give you a reason, ie. normal_lod, f_star_tumor_lod, alt_allele_in_normal, etc.
                        2) Go to this website: http://gatkforums.broadinstitute.org...date-mutations
                        This site connects the failure reason to the column name in the callstats.txt output that it's associated with.
                        For example, if the failure reason is "alt_allele_in_normal" then go to the "n_alt_count" or "normal_f" column in your extended output callstats.txt file to find the value.
                        3) Look at the vcf header to find the default threshold:
                        (from above)
                        max_alt_alleles_in_normal_count=2
                        max_alt_allele_in_normal_fraction=0.03
                        4) Rerun mutect with these thresholds lowered/adjusted accordingly to include your mutation. So adding parameters to your command like:
                        --max_alt_alleles_in_normal_count=5 \
                        --max_alt_allele_in_normal_fraction=0.1

                        You could also filter callstats.txt manually if you weren't concerned about getting corrected VCF and other files.

                        It's a wonder that Mutect finds anything interesting at all with the stringency of their filters, it really depends on the purity of the your paired normal. 90% of the time Mutect's defaults filter out my primary somatic mutation in most of the cancer types I study. This is most likely because of the heterogeneity of the tumor and the often mixed tumor contamainated nature of the paired normal. However, even with a lowered threshold of alt allelic fraction to 20% and the a count of about 6, it still will sometimes miss some mutations that are near indels. There's is a gap_threshold parameter in Mutect but if you've followed the GATK best practices protocol and realigned around indels those gaps occur pretty frequently because you're realigned around them, and I've noticed a marked decrease in sensitivity in mutation detection around these areas. Lowering the gap_threshold defaults, however, will explode the number of false positives you get, so be careful which parameters you tweak and make sure you have a reason to tweak them.

                        Broad has yet to fix it's deprecated indel detector, but I predict future best practices pipeline will start with an indel caller first which will be subsequently used by the SNP detector (Mutect) to discover SNPs. Knowing the location of the indels, might fix this decreased sensitivity around indels.

                        Good luck!

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin




                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                          Yesterday, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        57 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        53 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        45 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        55 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X