  • Interpreting INDELs

    Hi all,
    How can I interpret indels? I have no previous knowledge about this.

    This is what the data looks like:

    chr3 4466963 . TGGAG TGGAGGAG 999 PASS AC1=5;AF1=0.4167;DP4=7,197,3,84;DP=347;FQ=999;G3=0.1667,0.8333,8.319e-50;HWE=0.0465;INDEL;MQ=44;MfGt=0/1;MinDP=28;NeqMfGt=1;PV4=1,1.5e-70,2e-112,1 GT:PL:DP:SP:GQ 0/1:93,0,255:33:0:95 0/0:0,122,255:59:0:99 0/1:53,0,241:42:0:55 0/1:139,0,250:59:7:99 0/1:78,0,255:70:3:80 0/1:62,0,238:28:0:64

    Thank you.

  • #2
    Your question is too general. What do you mean by "interpret"? I can tell you that you have a 'GAG' insertion at an estimated allele frequency of about 0.42 (AF1=0.4167); is that an "interpretation"?
    Plus you give no indication that you have looked at the manual for the program that created your file.

    In other words, if you want someone to help you then do your homework, give us more information and come back with a specific question(s).
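
As a small illustration of how the 'GAG' insertion can be read off the REF and ALT columns of the record in the first post, here is a minimal Python sketch. The function name `describe_indel` is made up for this example and is not part of samtools or vcftools; it only handles a single REF/ALT pair.

```python
def describe_indel(ref: str, alt: str) -> tuple[str, str]:
    """Return (kind, sequence) for one REF/ALT pair taken from a VCF line."""
    # Skip the leading bases that REF and ALT share.
    i = 0
    while i < min(len(ref), len(alt)) and ref[i] == alt[i]:
        i += 1
    ref_rest, alt_rest = ref[i:], alt[i:]
    # Skip any trailing bases that the remainders still share.
    while ref_rest and alt_rest and ref_rest[-1] == alt_rest[-1]:
        ref_rest, alt_rest = ref_rest[:-1], alt_rest[:-1]
    if alt_rest and not ref_rest:
        return "insertion", alt_rest
    if ref_rest and not alt_rest:
        return "deletion", ref_rest
    return "complex", f"{ref_rest}>{alt_rest}"


# REF and ALT from the chr3:4466963 record in the first post.
kind, seq = describe_indel("TGGAG", "TGGAGGAG")
print(kind, seq)  # insertion GAG
```

Because the sequence is repetitive, the same event could be written with the inserted bases at a different offset; the sketch simply reports the bases left over after trimming.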


    • #3
      OK, I would like to say that I am not a bioinformatician, and I am sorry if I ask very stupid questions; it is my first time doing this. Having said that:
      My objective is to compare the genotypes and find out which indels are common and which are unique to each genotype. I have extracted the indels from the VCF file using vcftools, so now I want to compare the genotypes, i.e. wildtype consisting of 3 libraries (0/1:93,0,255:33:0:95, 0/0:0,122,255:59:0:99, 0/1:53,0,241:42:0:55) and mutant consisting of 3 libraries (0/1:139,0,250:59:7:99, 0/1:78,0,255:70:3:80, 0/1:62,0,238:28:0:64).

      I am sorry if this is also a stupid question, but I don't know how else to put it, or what the right approach is when you are looking for variants that are common or unique between genotypes without regard to the reference.


      • #4
        Step 1) Understand your experiment well enough to ask sensible questions about it.

        It's going to be very hard for anyone here to help you with that.


        • #5
          Thank you for the additional information. At least it gets us closer to being able to help you. I also suspect that your native language is not English, so it may be hard to formulate good questions. However, your question is still too general. To put it in non-bioinformatic terms, your first post was like asking someone:

          "I would like to travel, please help me."

          Your second post, with more information, is like asking:

          "I would like to travel to Italy, do you have any suggestions?"

          In other words, it is a clearer question but still not one that allows a good answer.

          Going back to your post, it is still unclear to me what program created your VCF file. I do not know what all of those INFO tags mean. They should be explained in the header portion of the VCF file. All I can say at this point is that, yes, you have an indel on chromosome 3 that is of high quality (QUAL 999) and passes all filtering. You could certainly look for other indels with similar characteristics, put them in your paper, and thus have reasonable results. However, I suspect that is not what you want to accomplish.
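
Looking for "other indels of similar characteristics" could be sketched roughly as below. This is only an illustration in plain Python, not a vcftools feature: the helper name `is_passing_indel` and the `min_qual` threshold are assumptions made for this example, and it relies on the `INDEL` flag that appears in the INFO column of the posted record.

```python
def is_passing_indel(line: str, min_qual: float = 0.0) -> bool:
    """True for a VCF data line with the INDEL flag, FILTER=PASS and QUAL >= min_qual."""
    if line.startswith("#"):
        return False  # header lines describe the file, not a variant
    cols = line.split()
    if len(cols) < 8:
        return False  # not a complete VCF record
    qual, filt, info = cols[5], cols[6], cols[7]
    if filt != "PASS" or "INDEL" not in info.split(";"):
        return False
    return qual != "." and float(qual) >= min_qual


# First eight columns of the record from the first post (INFO shortened here).
record = "chr3 4466963 . TGGAG TGGAGGAG 999 PASS DP=347;INDEL;MQ=44"
print(is_passing_indel(record, min_qual=999))  # True
```

Applied line by line to a VCF file, this keeps only records comparable to the one shown; the right QUAL cutoff depends on the data.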


          • #6
            OK, people, is it really about the English, or about understanding the project? I don't think so. It may be because this is my first time handling such data; it is a learning curve, and I am not ashamed to ask more questions or shed more light just to get much-needed help. Having said that:

            The variants were called using samtools. I have only two individuals (with 3 libraries each), which are near-isogenic lines. So I am only interested in seeing which indels (that passed the filtering criteria) are in the wildtype and not in the mutant, and vice versa, i.e. absolute differences, meaning present in only one individual but not the other.

            At the moment I have extracted all the indels that pass the filtering criteria (i.e. with the PASS tag) using the command line.
            So my first question is: for SNPs I understand that REF = 0 and ALT = 1. Can I interpret the indels in the same way? E.g.:
            0/1:139,0,250:59:7:99 0/1:78,0,255:70:3:80 0/1:62,0,238:28:0:64

            0/1:93,0,255:33:0:95 0/0:0,122,255:59:0:99 0/1:53,0,241:42

            I want to compare a VCF file containing the 3 wt libraries (0/1:139,0,250:59:7:99 0/1:78,0,255:70:3:80 0/1:62,0,238:28:0:64)
            to another VCF file containing the mt libraries
            (0/1:93,0,255:33:0:95 0/0:0,122,255:59:0:99 0/1:53,0,241:42) to find chromosome positions that harbour indels that are unique to one individual or common to both. But I do not know how to do this with vcftools, and I don't know whether anybody has done this kind of thing before; that is why I came to this site.
            Sorry again, but we are all learning, aren't we?

            VCF header:
            ##samtoolsVersion=0.1.17 (r973:277)
            ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
            ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases ref-reverse alt-forward and alt-reverse bases">
            ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
            ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
            ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
            ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
            ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
            ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
            ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
            ##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
            ##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
            ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias baseQ bias mapQ bias and tail distance bias">
            ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
            ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger ( smaller) than in group2.">
            ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
            ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
            ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
            ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
            ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
            ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR RA AA genotypes (R=ref A=alt)">
            ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
            ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
            ##FORMAT=<ID=PL,Number=.,Type=Integer,Description="List of Phred-scaled genotype likelihoods number of values is (#ALT+1)*(#ALT+2)/2">
            ##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (given PV4) [0.0001]">
            ##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias [0.0001]">
            ##FILTER=<ID=MaxDP,Description="Maximum read depth [10000000]">
            ##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias [1e-100]">
            ##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs [10]">
            ##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
            ##FILTER=<ID=MinAB,Description="Minimum number of alternate bases [2]">
            ##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [10]">
            ##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias [0]">
            ##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [10]">
            ##FILTER=<ID=MinDP,Description="Minimum read depth [5]">
            ##FILTER=<ID=RefN,Description="Reference base is N []">
            ##source_20120121.1=./bin/vcftools/perl/vcf-annotate #NOME? +/d=5
            ##INFO=<ID=MinDP,Number=1,Type=Integer,Description="The smallest sample DP">
            ##INFO=<ID=MfGt,Number=1,Type=String,Description="The Most Frequent GenoType in the Cohort">
            ##INFO=<ID=NeqMfGt,Number=1,Type=Integer,Description="Number of sample mismatching The Most Frequent GenoType in the Cohort">
            ##FILTER=<ID=MfGtMis,Description="Less than [0] (or equal) samples have a genotype mismatching MfGt">
            ##FILTER=<ID=AltSup,Description="According to DP4 the alternative is observed in less than [2] reads in one of the direction">
            ##FILTER=<ID=SynAA,Description="The mutation affect a CDS but all the alternatives imply a amino-acid sequence identical to the reference amino-acid sequence">
            #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 110506_SN132_A_s_3_seq_GQJ-3 110506_SN132_A_s_3_seq_GQJ-4 110506_SN132_A_s_4_seq_GQJ-5 110506_SN132_A_s_4_seq_GQJ-6 110616_SN365_A_s_7_seq_GQJ-1 110616_SN365_A_s_7_seq_GQJ-2
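
The comparison described in this post (indels present in one individual and absent from the other) might be sketched in Python as follows. In the GT sub-field, `0` stands for the REF allele and `1` for the first ALT allele, for indels as well as SNPs. The function names and the assignment of sample columns to groups are assumptions for illustration only; the thread lists the wildtype and mutant libraries in two different orders, so the column indices must be checked against the real sample names.

```python
def genotype(sample_field: str) -> str:
    """Return the GT sub-field (e.g. '0/1') from one sample column."""
    return sample_field.split(":")[0]


def carries_alt(gt: str) -> bool:
    """True if any called allele is non-reference ('0' = REF, '.' = missing)."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def classify(line: str, group_a: list[int], group_b: list[int]) -> str:
    """Compare two groups of sample columns (0-based, counted after FORMAT)."""
    samples = line.split()[9:]
    a = [carries_alt(genotype(samples[i])) for i in group_a]
    b = [carries_alt(genotype(samples[i])) for i in group_b]
    if all(a) and all(b):
        return "common"
    if all(a) and not any(b):
        return "only_a"
    if all(b) and not any(a):
        return "only_b"
    return "inconsistent"  # libraries of one individual disagree


# Record from the first post; INFO is shortened to keep the line readable.
record = (
    "chr3 4466963 . TGGAG TGGAGGAG 999 PASS INDEL;DP=347 GT:PL:DP:SP:GQ "
    "0/1:93,0,255:33:0:95 0/0:0,122,255:59:0:99 0/1:53,0,241:42:0:55 "
    "0/1:139,0,250:59:7:99 0/1:78,0,255:70:3:80 0/1:62,0,238:28:0:64"
)
# Assumed grouping: first three columns are one individual, last three the other.
print(classify(record, group_a=[0, 1, 2], group_b=[3, 4, 5]))  # inconsistent
```

The example record comes out as "inconsistent" because one of the first three libraries is called 0/0 while the other two are 0/1, which is the kind of disagreement that has to be resolved before calling an indel unique to one individual.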


            • #7
              So, if I understood correctly, this is about comparing samples and identifying the mutations (indels) that differentiate them, right? From the title of this thread, I would not have guessed. Here are my suggestions.

              If you have two individuals and three libraries from each, you should pool together the libraries before discovering the indels, and have a vcf file with only two (not 6) samples. That way, you could easily use the most likely genotype of each sample to make the comparison, for example.
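
Pooling the reads before calling, as suggested above, is the cleaner route. If only the six-sample VCF is at hand, a rough after-the-fact approximation is to add up the Phred-scaled PL values of the libraries that belong to one individual and take the genotype with the lowest total. This Python sketch assumes the libraries are independent observations of the same individual, a single ALT allele (three PL values per sample), and it ignores that PL values are capped at 255; the function name `pooled_genotype` is made up for this example.

```python
def pooled_genotype(pls: list[list[int]]) -> tuple[str, list[int]]:
    """Combine per-library PL triples into one genotype call.

    Adding Phred-scaled likelihoods corresponds to multiplying the
    underlying likelihoods; the result is rescaled so the best genotype is 0.
    """
    totals = [sum(values) for values in zip(*pls)]
    best = min(totals)
    normalised = [t - best for t in totals]
    labels = ["0/0", "0/1", "1/1"]  # PL order for one ALT allele
    return labels[normalised.index(0)], normalised


# PL values of the three libraries listed as wildtype in post #3.
gt, pl = pooled_genotype([[93, 0, 255], [0, 122, 255], [53, 0, 241]])
print(gt, pl)  # 0/1 [24, 0, 629]
```

Here the single 0/0 library is outweighed by the other two, but the margin (24) is small, so a call like this deserves a closer look rather than being taken at face value.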

              If you want to be very specific and avoid as many false positives as possible, you may want to try tools designed for cancer genomics, which compare a tumor sample with a normal tissue sample and search for tumor-specific mutations. Even if your samples are not tumor and normal, you can just label them as such and give it a try. For indels, you have the SomaticIndelDetector in GATK. You feed it the BAM files of your samples, identified as either normal or tumor, and you obtain a VCF file with all the indels (common and specific), with the indels specific to the "tumor" sample labelled as "SOMATIC" in the INFO field.

              A limitation of this approach is that indels specific to the sample labelled as "normal" may not be marked. But you can just switch the labels and repeat the analysis.


              • #8
                Dear aforntacc

                I know that this post is old, but while reading about the filters you used on the variants, I noticed one particular filter called "AltSup" that I am interested in.

                ##FILTER=<ID=AltSup,Description="According to DP4 the alternative is observed in less than [2] reads in one of the direction">

                I did not find this option in vcftools, so I think you created or edited it yourself. I am trying to do the same; could you tell me how this filter works, please?
                I think this filter requires a sufficient number of alternate reads on both strands.
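
The original implementation of "AltSup" is not shown in this thread, but going by its header description, a check of that kind could be sketched in Python like this. DP4 holds four counts in the order ref-forward, ref-reverse, alt-forward, alt-reverse; the function names and the default threshold of 2 (taken from the `[2]` in the description) are assumptions for this example.

```python
def parse_info(info: str) -> dict[str, str]:
    """Split a VCF INFO column into a key -> value dict (flags map to '')."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def alt_supported(info: str, min_reads: int = 2) -> bool:
    """True if the ALT allele is seen in at least min_reads reads on each strand."""
    dp4 = parse_info(info).get("DP4")
    if not dp4:
        return False  # no DP4 tag, so strand support cannot be judged
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    return alt_fwd >= min_reads and alt_rev >= min_reads


# DP4 from the record in the first post: 3 forward and 84 reverse ALT reads.
print(alt_supported("AC1=5;AF1=0.4167;DP4=7,197,3,84;DP=347;INDEL"))  # True
```

A record failing this test would be the one tagged with the AltSup filter; a record such as `DP4=10,10,1,30` fails because only one forward read supports the alternative.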


