I just used MuTect to call mutations, and it gave me 3800 mutations in a tab-delimited text file.
The columns are given below, from the MuTect website (https://confluence.broadinstitute.or...GATools/MuTect)
What tool/tools can I use to annotate this data? And how should I go about it, since this output has what I believe are a lot of redundant columns?
You may also notice that output has quite a few columns in it. Here are some of the more prominent ones along with their definitions:
contig - the contig location of this candidate
position - the 1-based position of this candidate on the given contig
ref_allele - the reference allele for this candidate
alt_allele - the mutant (alternate) allele for this candidate
tumor_name - name of the tumor as given on the command line, or extracted from the BAM
normal_name - name of the normal as given on the command line, or extracted from the BAM
score - for future development
dbsnp_site - is this a dbsnp site as defined by the dbsnp bitmask supplied to the caller
covered - was the site powered to detect a mutation (80% power for a 0.3 allelic fraction mutation)
power - tumor_power * normal_power
tumor_power - given the tumor sequencing depth, what is the power to detect a mutation at 0.3 allelic fraction
normal_power - given the normal sequencing depth, what power did we have to detect (and reject) this as a germline variant
total_pairs - total tumor and normal read depth which come from paired reads
improper_pairs - number of reads which have abnormal pairing (orientation and distance)
map_Q0_reads - total number of mapping quality zero reads in the tumor and normal at this locus
init_t_lod - deprecated
t_lod_fstar - CORE STATISTIC: Log of (likelihood tumor event is real / likelihood event is sequencing error )
tumor_f - allelic fraction of this candidate based on read counts
contaminant_fraction - estimate of contamination fraction used (supplied or defaulted)
contaminant_lod - log likelihood of ( event is contamination / event is sequencing error )
t_ref_count - count of reference alleles in tumor
t_alt_count - count of alternate alleles in tumor
t_ref_sum - sum of quality scores of reference alleles in tumor
t_alt_sum - sum of quality scores of alternate alleles in tumor
t_ins_count - count of insertion events at this locus in tumor
t_del_count - count of deletion events at this locus in tumor
normal_best_gt - most likely genotype in the normal
init_n_lod - log likelihood of ( normal being reference / normal being altered )
n_ref_count - count of reference alleles in normal
n_alt_count - count of alternate alleles in normal
n_ref_sum - sum of quality scores of reference alleles in normal
n_alt_sum - sum of quality scores of alternate alleles in normal
judgement - final judgement of site KEEP or REJECT (not enough evidence or artifact)
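As a rough sketch of how the table could be trimmed before annotation, the Python below keeps only rows whose judgement is KEEP and drops all but a handful of columns. The choice of columns to retain, the assumption that any comment lines start with "#", and the five-field (chrom, start, end, ref, alt) layout are illustrative guesses rather than anything MuTect prescribes, and the example rows are made up; check what your annotation tool actually expects as input.

```python
import csv
import io

# Columns kept for downstream annotation (an arbitrary, illustrative choice).
KEEP_COLUMNS = ["contig", "position", "ref_allele", "alt_allele",
                "tumor_f", "t_lod_fstar"]


def filter_kept_calls(handle, columns=KEEP_COLUMNS):
    """Yield one dict per KEEP row, restricted to the chosen columns."""
    # Assumption: any lines before the header row are comments starting with '#'.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        if row["judgement"] == "KEEP":
            yield {name: row[name] for name in columns}


def to_snv_record(call):
    """Return (chrom, start, end, ref, alt) for a single-base substitution."""
    pos = int(call["position"])  # 1-based, per the column definitions above
    return (call["contig"], pos, pos, call["ref_allele"], call["alt_allele"])


# Made-up two-row example; real files have many more columns.
example = (
    "## hypothetical header comment\n"
    "contig\tposition\tref_allele\talt_allele\ttumor_f\tt_lod_fstar\tjudgement\n"
    "1\t12345\tA\tG\t0.31\t25.4\tKEEP\n"
    "2\t67890\tC\tT\t0.02\t3.1\tREJECT\n"
)

calls = list(filter_kept_calls(io.StringIO(example)))
records = [to_snv_record(c) for c in calls]
for rec in records:
    print("\t".join(str(field) for field in rec))
```

For a real file, replace the io.StringIO(example) argument with an open file handle, e.g. open("call_stats.txt") where the filename is whatever you gave MuTect for its output.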