
  • Exome Analysis.. Annotation - unusual observation; need explanation..

    I have a bioinformatics query on the exome project we are running. We are using a NimbleGenV2 exome capture kit for target capture.

    It's an unusual sort of question that has been nagging me for more than a week now, and nobody has been able to provide a good answer yet:

    Let's say I have processed raw reads from a tumor-normal paired exome experiment and made them fit for mutation calling. I have two BAM files (one each for tumor and normal) that I feed into a mutation caller. Since it's an exome experiment, I restrict the calls in one of two ways:

    Case 1: I limit the variant calls to the target regions only, by using the .bed file from the NimbleGen website as an interval parameter.

    Now, theoretically all the mutation calls made by the caller are exonic or splicing. I have 2100 SNVs.

    I run these calls through annotation software and annotate them against a refGene set (Annovar, which uses the refGene set downloaded directly from UCSC). More than 92% of the SNVs are annotated as "exonic" or "splicing", as expected.


    Case 2: I limit the variant calls to exons + 10 bases only, by generating a .bed file of refGene exons from the UCSC Table Browser and using it as an interval parameter.

    Now, once again theoretically all the mutation calls made by the caller are exonic or splicing. I have 2700 SNVs.

    But when I run these calls through annotation software and annotate them against a refGene set (Annovar again), only approximately 65%-75% of the calls are exonic or splicing. The rest are annotated as intronic, upstream, downstream and a zillion other things.

    (1) My understanding is that the 2100 vs 2700 difference is due to possible misalignment of a fraction of the reads into non-target regions, and hence the extra 600 SNVs are, for the most part, false positive mutation calls (correct me if I am wrong).
    (2) The 92% vs 65-75%, on the other hand, is quite inexplicable. In both cases the caller was asked to call variants in exonic regions only: in the former case these were the capture target regions, and in the latter case the refGene set of exons obtained from the Table Browser. I would have expected >90% exonic variants in Case 2 as well.


    Have you noticed this before? Is there an explanation as to why (2) is happening?

  • #2
    Hi shyam_la,

    1) Try comparing the two bed files (NimbleGen and refGene) to see how different they are.
    2) Extending by 10 bp does not seem like much, but a big chunk of human exons are <200 bp, so the chance of getting non-exonic/splicing variants is quite big.
    3) If you are curious, try the NimbleGen bed file extended by 10 bp, and try the refGene one without the 10 bp extension. I am quite interested in what you get.

    Best regards,
    Douglas
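One way to put numbers on suggestions 1) and 3) is to measure how many bases each BED file covers and how many they share, with and without 10 bp of padding. The following is only a minimal sketch in plain Python: the function names (`parse_bed`, `merge`, `total_bases`, `shared_bases`) and the toy intervals are made up for illustration and are not taken from the NimbleGen or UCSC files discussed here. It assumes standard BED coordinates (0-based, half-open).

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]            # half-open [start, end), BED convention
Regions = Dict[str, List[Interval]]   # chromosome -> sorted, merged intervals


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def parse_bed(lines: Iterable[str], pad: int = 0) -> Regions:
    """Read BED lines into merged intervals, padding each side by `pad` bases."""
    raw: Regions = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments and track/browser header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        raw.setdefault(chrom, []).append((max(0, start - pad), end + pad))
    return {chrom: merge(ivs) for chrom, ivs in raw.items()}


def total_bases(regions: Regions) -> int:
    """Number of bases covered by a merged interval set."""
    return sum(end - start for ivs in regions.values() for start, end in ivs)


def shared_bases(a: Regions, b: Regions) -> int:
    """Bases covered by both sets (two-pointer sweep per chromosome)."""
    shared = 0
    for chrom in a.keys() & b.keys():
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        while i < len(xs) and j < len(ys):
            lo = max(xs[i][0], ys[j][0])
            hi = min(xs[i][1], ys[j][1])
            if lo < hi:
                shared += hi - lo
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return shared


# Toy data only; real input would be the lines of the two .bed files.
capture_lines = ["chr1\t100\t200", "chr1\t300\t400"]
refgene_lines = ["chr1\t150\t250", "chr1\t300\t350", "chr2\t0\t50"]

capture = parse_bed(capture_lines)
for pad in (0, 10):
    refgene = parse_bed(refgene_lines, pad=pad)
    print(pad, total_bases(capture), total_bases(refgene), shared_bases(capture, refgene))
```

With real files, `parse_bed(open(path))` would work the same way. A large number of refGene bases outside the shared region would point to exons that the capture kit does not target.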



    • #3
      Yeah, the bed files can vary, which will ultimately affect the statistics. One more thing I want to ask: are the 2100 included in the 2700 you get in Case 2?



      • #4
        Yes, the 2100 are included in the 2700. Of course the bed files vary, but that is not an explanation for my observation.



        • #5
          Hi,

          1) On IGV, they are not very different at the genomic level. If I zoom in to look at finer details, the NimbleGen one has a lot of exons missing that are present in the RefSeq one (which is expected).
          I will try out mutation calling without the +10 bp, though I doubt that's going to reduce the numbers very much.
          Will update with results.



          • #6
            Did it on 1 sample.
            Got 2492 SNVs (exons only) vs 2768 (exons + 10 bp).
            78% of those were annotated as exonic/splicing (exons only) vs 70% (exons + 10 bp).

            So, 8 percentage points of the difference are due to the extra 10 bp that I had used. But 78% is still a low proportion. Expected: at least 90%.



            • #7
              2100 × 0.92 = 1932
              2492 × 0.78 ≈ 1944

              So the absolute exonic/splicing numbers are quite close. Without examining carefully the difference in the two bed files and the actual SNV variants unique to the refgene bed file, it is hard to explain why.

              Douglas
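The arithmetic above, and the suggested next step of pulling out the SNVs that fall only in the refGene intervals, can be sketched in a few lines of Python. This is a rough illustration, not anyone's actual pipeline: the percentages are the rounded figures quoted in this thread (so the counts are approximate), and `in_regions`, `unique_to` and the toy coordinates are hypothetical. Intervals are assumed to be sorted, non-overlapping and half-open, as in BED.

```python
import bisect

# (total SNVs, approximate fraction annotated exonic/splicing), as quoted in the thread.
runs = {
    "NimbleGen targets": (2100, 0.92),
    "refGene exons": (2492, 0.78),
    "refGene exons + 10 bp": (2768, 0.70),
}


def exonic_count(total, fraction):
    """Approximate absolute number of exonic/splicing calls."""
    return round(total * fraction)


for label, (total, fraction) in runs.items():
    kept = exonic_count(total, fraction)
    print(f"{label}: ~{kept} exonic/splicing, ~{total - kept} other")


def in_regions(regions, chrom, pos):
    """True if 0-based `pos` lies inside one of the intervals for `chrom`.

    `regions` maps chromosome -> sorted, non-overlapping (start, end) tuples.
    VCF positions are 1-based, so pass POS - 1 when reading from a VCF.
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def unique_to(variants, only_in, not_in):
    """Variants that fall inside `only_in` but outside `not_in`."""
    return [
        (chrom, pos)
        for chrom, pos in variants
        if in_regions(only_in, chrom, pos) and not in_regions(not_in, chrom, pos)
    ]


# Toy coordinates for illustration only.
capture = {"chr1": [(100, 200), (300, 400)]}
refgene = {"chr1": [(150, 250), (300, 350)]}
variants = [("chr1", 160), ("chr1", 220), ("chr1", 320), ("chr1", 500)]

print(unique_to(variants, only_in=refgene, not_in=capture))
```

Running the annotation on just the calls returned by `unique_to` would show whether the intronic, upstream and downstream labels are concentrated in the regions that the capture BED does not cover.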



              • #8
                Yeah, exactly my thoughts..
                Thank you.
