  • Need help annotating..

    I just used MuTect for calling mutations and it gave me 3800 mutations in a tab-delimited text file.

    The columns are given below, from the MuTect website (https://confluence.broadinstitute.or...GATools/MuTect)

    What tool/tools can I use to annotate this data? And how should I go about it, since this output has what I believe are a lot of redundant columns?


    You may also notice that the output has quite a few columns in it. Here are some of the more prominent ones along with their definitions:

    contig - the contig location of this candidate
    position - the 1-based position of this candidate on the given contig
    ref_allele - the reference allele for this candidate
    alt_allele - the mutant (alternate) allele for this candidate
    tumor_name - name of the tumor as given on the command line, or extracted from the BAM
    normal_name - name of the normal as given on the command line, or extracted from the BAM
    score - for future development
    dbsnp_site - is this a dbsnp site as defined by the dbsnp bitmask supplied to the caller
    covered - was the site powered to detect a mutation (80% power for a 0.3 allelic fraction mutation)
    power - tumor_power * normal_power
    tumor_power - given the tumor sequencing depth, what is the power to detect a mutation at 0.3 allelic fraction
    normal_power - given the normal sequencing depth, what power did we have to detect (and reject) this as a germline variant
    total_pairs - total tumor and normal read depth which come from paired reads
    improper_pairs - number of reads which have abnormal pairing (orientation and distance)
    map_Q0_reads - total number of mapping quality zero reads in the tumor and normal at this locus
    init_t_lod - deprecated
    t_lod_fstar - CORE STATISTIC: Log of (likelihood tumor event is real / likelihood event is sequencing error )
    tumor_f - allelic fraction of this candidate based on read counts
    contaminant_fraction - estimate of contamination fraction used (supplied or defaulted)
    contaminant_lod - log likelihood of ( event is contamination / event is sequencing error )
    t_ref_count - count of reference alleles in tumor
    t_alt_count - count of alternate alleles in tumor
    t_ref_sum - sum of quality scores of reference alleles in tumor
    t_alt_sum - sum of quality scores of alternate alleles in tumor
    t_ins_count - count of insertion events at this locus in tumor
    t_del_count - count of deletion events at this locus in tumor
    normal_best_gt - most likely genotype in the normal
    init_n_lod - log likelihood of ( normal being reference / normal being altered )
    n_ref_count - count of reference alleles in normal
    n_alt_count - count of alternate alleles in normal
    n_ref_sum - sum of quality scores of reference alleles in normal
    n_alt_sum - sum of quality scores of alternate alleles in normal
    judgement - final judgement of site KEEP or REJECT (not enough evidence or artifact)
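One way to cut down the redundant columns before annotating is to keep only the rows judged KEEP and a handful of fields. The following is a minimal Python sketch, not an official MuTect utility. It assumes the file has a header row using the column names listed above, and that any lines before the header start with `#`. The function name `keep_calls`, the choice of `CORE_COLUMNS`, and the two-row demo table are all made up for illustration.

```python
import csv
import io

# Illustrative subset of the columns described above; adjust to taste.
CORE_COLUMNS = [
    "contig", "position", "ref_allele", "alt_allele",
    "tumor_f", "t_ref_count", "t_alt_count", "judgement",
]


def keep_calls(handle, columns=CORE_COLUMNS):
    """Yield one dict per row whose judgement is KEEP, restricted to `columns`."""
    # Assumption: anything before the header row is a '#' comment line.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        if row["judgement"] == "KEEP":
            yield {name: row[name] for name in columns}


# Made-up two-row table, only to show the call pattern.
demo = io.StringIO(
    "## hypothetical version line\n"
    "contig\tposition\tref_allele\talt_allele\ttumor_f\t"
    "t_ref_count\tt_alt_count\tjudgement\n"
    "1\t12345\tA\tG\t0.25\t30\t10\tKEEP\n"
    "2\t67890\tC\tT\t0.02\t98\t2\tREJECT\n"
)
kept = list(keep_calls(demo))
print(len(kept), kept[0]["contig"], kept[0]["position"])
```

For a real file, pass an open file handle instead of the demo string, for example `with open("calls.txt") as fh: rows = list(keep_calls(fh))`, where `calls.txt` is a placeholder name.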

  • #2
    With a few lines of Perl (or equivalent) you can convert the MuTect output into something that ANNOVAR can read - I think ANNOVAR needs explicit start and end positions for the variant, even if these are identical (as with a SNP), so you'll need to duplicate MuTect's single variant position column:



    ANNOVAR only cares about the first 5 columns of data, but can (optionally) retain the other columns in its output, which can be useful.
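For anyone who would rather not write the Perl, the conversion described above can be sketched in Python. This is an assumption-laden sketch rather than a tested pipeline step: it assumes the MuTect file has a header row with the column names `contig`, `position`, `ref_allele` and `alt_allele`, that lines before the header start with `#`, and that every call is a single-base substitution so start and end are identical. The function name `mutect_to_annovar` and the demo row are hypothetical.

```python
import csv
import io


def mutect_to_annovar(in_handle, out_handle):
    """Write chr, start, end, ref, alt, then the remaining MuTect columns.

    Returns the number of variant rows written. No header row is written.
    """
    # Assumption: anything before the header row is a '#' comment line.
    data_lines = (line for line in in_handle if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")

    header = next(reader, None)
    if header is None:
        return 0  # empty input

    index = {name: i for i, name in enumerate(header)}
    leading = ("contig", "position", "ref_allele", "alt_allele")
    # Every other column is carried along after the first five.
    trailing = [i for name, i in index.items() if name not in leading]

    written = 0
    for row in reader:
        pos = row[index["position"]]
        writer.writerow(
            [row[index["contig"]], pos, pos,  # start == end for an SNV
             row[index["ref_allele"]], row[index["alt_allele"]]]
            + [row[i] for i in trailing]
        )
        written += 1
    return written


# Made-up single-row input, only to show the call pattern.
demo_in = io.StringIO(
    "## hypothetical version line\n"
    "contig\tposition\tref_allele\talt_allele\tjudgement\n"
    "1\t12345\tA\tG\tKEEP\n"
)
demo_out = io.StringIO()
count = mutect_to_annovar(demo_in, demo_out)
print(count)
print(demo_out.getvalue(), end="")
```

With real data, open the MuTect file for reading and a new file for writing, and pass both handles to the function. The position is simply written twice, which is the same duplication you could do by hand in a spreadsheet.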

    • #3
      Originally posted by RDW
      With a few lines of Perl (or equivalent) you can convert the MuTect output into something that ANNOVAR can read - I think ANNOVAR needs explicit start and end positions for the variant, even if these are identical (as with a SNP), so you'll need to duplicate MuTect's single variant position column:



      ANNOVAR only cares about the first 5 columns of data, but can (optionally) retain the other columns in its output, which can be useful.
      I am an MD, not a bioinformatician proper, and writing a few lines of Perl is beyond me. I would have given ANNOVAR a shot if I had a .vcf file as output from MuTect.

      In any case, I used SnpEff. It accepts text input for now, but support for that is unfortunately due to be discontinued in later builds. I retained only the first 4 columns of the MuTect output and it did a great job.

      Thank you.

      Q: Can IGV display annotated tab delimited text files graphically? Any other software recommendations for visualisation?

      • #4
        Hi Shyam_la,
        GeneTalk is designed for non-bioinformaticians analyzing human sequence variants, so this might be an option for you. Who gave you the data? There is a standard format for reporting variants in NGS resequencing projects, called variant call format (VCF). Usually your sequencing facility can provide your data in this format. The rest is done in GeneTalk: you just upload the data and the annotation runs automatically in the background. A video tutorial explains how to filter.
        best,
        peter

        • #5
          Originally posted by shyam_la
          I am an MD, not a bioinformatician proper, and writing a few lines of Perl is beyond me. I would have given ANNOVAR a shot if I had a .vcf file as output from MuTect.
          ANNOVAR can also accept text files - see my link. You just need an extra column that duplicates the variant position (since for SNPs, the start position is the same as the end). Any program that can manipulate tab-delimited text files can handle this, even Excel (just check that the first five columns don't get mangled!).

          • #6
            BLAST2GO program

            • #7
              Originally posted by krawitz
              Hi Shyam_la,
              GeneTalk is designed for non-bioinformaticians analyzing human sequence variants, so this might be an option for you. Who gave you the data? There is a standard format for reporting variants in NGS resequencing projects, called variant call format (VCF). Usually your sequencing facility can provide your data in this format. The rest is done in GeneTalk: you just upload the data and the annotation runs automatically in the background. A video tutorial explains how to filter.
              best,
              peter
              Hi Peter,

              Our sequencing facility provides only the raw reads. I'm doing all the downstream analyses. I have perfected my pipeline up to annotation, and my mutation caller, MuTect, provides only text output at the moment, as it is in beta stage (but provides excellent mutation calls in my opinion). VCF is not an option for now.
              Thank you anyway.

              • #8
                Originally posted by RDW
                ANNOVAR can also accept text files - see my link. You just need an extra column that duplicates the variant position (since for SNPs, the start position is the same as the end). Any program that can manipulate tab-delimited text files can handle this, even Excel (just check that the first five columns don't get mangled!).
                Oh, thank you! I had assumed from another source that ANNOVAR was not capable of that. Will give it a try too now!

                • #9
                  Originally posted by JackieBadger
                  BLAST2GO program
                  Just checked out their home page.. Sounds good! Will give it a shot right away..

                  • #10
                    Originally posted by shyam_la
                    Just checked out their home page.. Sounds good! Will give it a shot right away..
                    If you want to know whether these SNPs are non-synonymous (and you do not know the reading frame), you should run tBLASTx against the nr database. Once you have these results you can get GO annotations. You can also predict open reading frames using the ORFfinder program.
