
  • Exome quality

    I'm trying to analyze my first exome sequencing result and am having some problems.
    I ran a cloud analysis through the WEP site, using close to the default pipeline I had been reading about here (FastQC, NGS QC Toolkit, bwa, samtools, Picard, GATK, NGSrich, ANNOVAR and Wesparser). The software reported 200x coverage (I believe it only divided the number of nucleotides sequenced by the size of the exome, ~6,000,000), but some of the other statistics do not agree with that figure.

    My first impression of the FastQC report (file1.pdf attached) was good: high Phred scores, many reads, etc. However, it raised two flags: GC content and sequence duplication levels. What is the real impact of the second one?
    The Performance of Sample Enrichment file (file2.pdf attached) told me that just 36.75% of the exons had coverage above 30x. The same file contains a table listing several genes that were not covered. I would like some help: if these results are correct, the analysis may have been done wrong. In summary, what can I do? By my calculations, ~8% of all genes were not covered. With all this, I'm concerned about how much confidence I can place in my results.
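A high average depth and a low fraction of well-covered exons are not contradictory, because the average can be dominated by a few very deeply covered targets (and a naive "bases sequenced / exome size" figure also counts off-target and duplicate reads). The sketch below illustrates the difference between the two metrics. The function name `coverage_summary` and all numbers are made up for illustration; they are not taken from the attached reports or from NGSrich.

```python
def coverage_summary(target_depths, target_lengths, min_depth=30):
    """Return (length-weighted mean depth, fraction of targets at or above min_depth).

    target_depths  -- mean depth of each target region (hypothetical values)
    target_lengths -- length in bases of each target region
    """
    # Total aligned bases on target = sum of depth * length per region
    total_bases = sum(d * n for d, n in zip(target_depths, target_lengths))
    total_length = sum(target_lengths)
    mean_depth = total_bases / total_length

    # Breadth-style metric: how many targets reach the depth threshold
    covered = sum(1 for d in target_depths if d >= min_depth)
    return mean_depth, covered / len(target_depths)


# Illustrative data only: two very deep exons, three poorly covered ones
depths = [900.0, 650.0, 20.0, 10.0, 0.0]
lengths = [200, 200, 200, 200, 200]

mean_depth, frac = coverage_summary(depths, lengths)
print(f"mean depth: {mean_depth:.0f}x, targets >= 30x: {frac:.0%}")
```

In this toy case the mean depth looks excellent while most targets fall below 30x, which is why a per-exon report is more informative than a single average.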

    I appreciate your attention.

  • #2
    Hey famarques,

    File 1:
    Your FastQC report is pretty good.
    You need not worry at all.
    All base quality scores are above 30 except for the last few.

    --> 1) Regarding GC content, that is OK. The symbol is just a warning.
    --> 2) Your duplication levels: you need not worry about those either. By default, FastQC cannot give 100% confident values for sequence duplication.

    --> When you run the analysis, once you get to the samtools step, you can remove duplicates using either samtools rmdup
    or Picard MarkDuplicates, etc.
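As a rough illustration of what duplicate removal does conceptually, the sketch below groups reads by mapping position and strand and keeps the highest-quality read in each group. This is a deliberately simplified toy, not the actual logic of samtools rmdup or Picard MarkDuplicates (which also take mate positions and clipping into account); the read records and the `mark_duplicates` helper are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    reads -- list of dicts with keys: name, chrom, pos, strand, qual_sum
             (a simplified, hypothetical stand-in for aligned BAM records)
    """
    # Reads sharing chromosome, start position and strand are treated
    # as copies of the same original fragment.
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality, flag the rest
        best = max(group, key=lambda r: r["qual_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Illustrative records only
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 3000},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 3200},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 2900},
    {"name": "r4", "chrom": "chr2", "pos": 500, "strand": "+", "qual_sum": 3100},
]

dups = mark_duplicates(reads)
rate = len(dups) / len(reads)
print(f"duplicates: {sorted(dups)}, duplication rate: {rate:.0%}")
```

Because this works on aligned positions rather than raw sequence, it gives a more reliable duplication estimate than a pre-alignment check on the FASTQ file.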

    I really don't have much of an idea on that.
    By the way, could you let me know how you got those statistics? Which tool did you use for that?


    • #3
      Hello vishnuamaram,
      Thanks for the reply.

      Those statistics and metrics were generated with NGSrich (0.7.8) from the filtered BAM files.
      The main problem is: why were so many genes not covered, given that I had good sequence quality as well as a large number of reads?

      I got them from a cloud analysis; they have a pipeline for exome analysis.

      Here is how they describe their tool:

      "The WEP resource performs a complete whole-exome sequencing pipeline and provides easy access through interface to intermediate and final results.

      The pipeline is composed of several steps:
      Verification of input integrity, quality checks, read trimming and primer contamination removal;
      Gapped alignment;
      BAM conversion, sorting and indexing;
      Duplicates removal, as they result as PCR amplification bias;
      A local realignment around known IN-DELs position, carried on to delete the other artifacts;
      Quality score recalibration to refine some oddness caused by sequencing and mapping on quality scores;
      Variants (SNV and DIP) calling from the filtered mapping data obtained from the previous steps;
      Association of as many annotation as possible to the variant list (i.e. annotation stored in database like dbSNP, 1000 Genomes Project, etc.);
      Data post processing: raw outputs are parsed and stored into custom databases to allow cross-linking and intersections, statistics and much more.
      Through our tool a user can perform the whole analysis without knowing the underlying hardware and software architecture, dealing with both paired and single end data. The interface provides an easy and intuitive access for data submission and user-friendly web pages for annotated variant visualization.

      Non-IT mastered users can access through WEP to the most updated and tested whole exome sequencing algorithms, ad-hoc tuned to maximize the quality of variants called while minimizing artifacts and false positives."

