  • Evaluation of Targeted Enrichment Sequence Analysis Using NextGENe® Software and GS R

    Lan-Szu Chou,1 Megan Manion2 and CS Jonathan Liu2
    1Institute for Clinical and Experimental Pathology, ARUP Laboratories, Salt Lake City, UT 84108-1221.
    2SoftGenetics LLC, 100 Oakwood Avenue Ste 350, State College, PA 16801.
    The development of Next Generation sequencing technologies has greatly increased the speed and reduced the cost of producing DNA sequencing data. The Genome Sequencer FLX Instrument from Roche Applied Science (454 Sequencing) is able to produce up to 1 million reads of 100–400 bp each.
    Targeted genome enrichment (capture) techniques reduce the cost and improve the efficiency of sequencing by combining the flexibility of microarray or other capture technologies with the high data throughput of Next Generation sequencing (1). Allowing researchers to sequence only the regions of interest creates greater opportunity for disease study by providing greater coverage of the targeted regions in less time and at lower cost.
    NextGENe software and the GS Reference Mapper application of the Roche Genome Sequencer Data Analysis Software were used for SNP and indel detection on sequence capture reads from two datasets: s1, consisting of 254,743 reads with an average length of 210 bp, and s2, consisting of 271,921 reads with an average length of 222 bp. The DNA sample was obtained using NimbleGen Sequence Capture arrays and was sequenced using the Roche Genome Sequencer FLX Instrument. The resulting sequence data contain sequences from the target region as well as interference from the rest of the genome that was captured with the array. Results from the two software packages were compared. NextGENe software mapped more reads to the genome and to the targeted region, even at higher stringency settings, as shown in Table 1. Using more reads enables NextGENe software to statistically remove sequencing errors, producing highly accurate SNP and indel detection.
    Comparison of Results:
    NextGENe Alignment Methodology
    NextGENe’s Sequence Alignment module is used to accurately map reads to a reference sequence and detect mutations. The alignment algorithm identifies matching positions for each read using 12-mer sequences. When a match is found with the highest uniqueness score, the alignment is extended. Once reads are aligned, mutation positions are identified and highlighted. The aligned reads, consensus sequence, reference sequence and mutation calls, as well as complete annotation information, are displayed in a single view when the project is completed. Specialized reports are also available in the results.
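    The seed-and-extend idea described above can be illustrated with a minimal sketch. This is not NextGENe's implementation: the poster does not define the uniqueness score, so the sketch assumes that the read 12-mer with the fewest hits in the reference is the "most unique" seed, and it extends without gaps. All function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k=12):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def best_seed(read, index, k=12):
    """Return (read_offset, ref_pos) for the read k-mer with the fewest
    reference hits (an assumed stand-in for the uniqueness score), or None."""
    best, best_hits = None, None
    for offset in range(len(read) - k + 1):
        hits = index.get(read[offset:offset + k])
        if not hits:
            continue
        if best_hits is None or len(hits) < best_hits:
            best, best_hits = (offset, hits[0]), len(hits)
    return best


def extend(read, reference, seed):
    """Ungapped extension: place the read on the seed diagonal and
    list the reference positions where the read disagrees."""
    offset, ref_pos = seed
    start = ref_pos - offset
    mismatches = [
        start + i
        for i, base in enumerate(read)
        if 0 <= start + i < len(reference) and reference[start + i] != base
    ]
    return start, mismatches
```

    A read with a single substitution still anchors through any intact 12-mer, and the substituted position is then reported as a candidate variant.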
    Table 1: Comparison of alignment results from NextGENe software and GS Reference Mapper
    Figure 1: NextGENe’s alignment viewer provides detailed annotation of the reference sequence and the alignment results.
    GS Reference Mapper Alignment Methodology (Information from Roche Genome Sequencer Data Analysis Software Manual, Software Version 2.0.00, October 2008)
    The GS Mapper program (from the Roche Genome Sequencer Data Analysis Software package) aligns reads to the reference sequence for each chromosome sequentially. Reads can map to multiple locations in different chromosomes. Multiple contigs are then generated from the reads aligned in a local region, and a consensus sequence is created for each contig. Reads with a low matching percentage to the other reads aligned in the local region are discarded and are not used to generate the consensus sequences. (This can result in loss of data, especially in highly variable regions.) Variations between the reference and the consensus sequence for each contig, or for subsets within each contig, are detected and reported. The software outputs the contig consensus sequence(s) with quality values, the alignment of reads and contigs, the coverage and consensus accuracy (quality values) for each position, and a list of the positions of variants (2).
    2. Align to the whole genome
    Aligning reads to the whole genome allows for identification of sequence tags and for SNP and indel detection. NextGENe aligns the sequence reads either to the whole human genome at once or to one chromosome at a time. Producing statistics for the alignment of reads to the whole human genome requires a large amount of computer memory, and curating the mutation detection results takes a user in a diagnostic setting some time. Because of this, a simplified approach was used so that the analysis could be run on a laptop or desktop computer with as little as 4 GB of RAM. All of the reads are aligned to one chromosome at a time with 70% stringency, and the reference sequences in the covered regions are output. The covered references of all chromosomes are combined into an abridged reference. The abridged references are consistent between the two samples, indicating that DNA capture is specific to the probe design and that interference is determined by genome sequence. The abridged reference allows identification of variants using a desktop computer. It also speeds up the alignment process and facilitates human review for diagnostics.
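    As a rough illustration of how covered regions could be combined into an abridged reference, the sketch below merges overlapping alignment intervals per chromosome and extracts the corresponding reference sequence. The data layout (a dict of chromosome sequences, alignments as half-open `(name, start, end)` tuples) and the function names are assumptions made for the example; the poster does not describe NextGENe's internal format, and the 70% stringency filter is not modeled here.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def abridged_reference(chromosomes, alignments):
    """Collect the reference sequence of every covered region.

    chromosomes: {name: sequence}
    alignments:  iterable of (name, start, end), half-open coordinates
    Returns {"name:start-end": sequence} for each merged covered region.
    """
    covered = {}
    for name, start, end in alignments:
        covered.setdefault(name, []).append((start, end))
    abridged = {}
    for name, intervals in covered.items():
        for start, end in merge_intervals(intervals):
            abridged[f"{name}:{start}-{end}"] = chromosomes[name][start:end]
    return abridged
```

    Because only covered regions are kept, the second alignment pass works against a reference that is a small fraction of the genome, which is what makes the desktop-scale analysis plausible.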
    To remove linker sequences, NextGENe’s Sequence Operation Tool is used to trim the linker sequences from the ends of reads. Reads can be trimmed where the linker sequences are found. A closer view of the aligned reads can be used to determine the linker sequence.
    Figure 2: NextGENe’s Format Conversion tool trims the low quality ends of reads and removes linker sequences.
    1. Linker sequence trimming and poor quality trimming.
    The output from the Roche FLX Instrument consists of either two files, a sequence file (*.fna) and a quality file (*.qual), or one file (*.sff) containing quality scores and flowgrams with luminescence intensities.
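    For readers unfamiliar with the two-file layout, here is a minimal sketch of pairing a FASTA-style .fna file with its .qual file. It assumes the common convention that both files share the same read headers and that the .qual file holds one space-separated integer score per base; the function names are hypothetical and the binary .sff format is not handled.

```python
def parse_fasta_like(text):
    """Yield (read_id, payload) records from FASTA-style text.
    Payload lines of one record are joined with single spaces."""
    read_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, " ".join(chunks)
            fields = line[1:].split()
            read_id, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if read_id is not None:
        yield read_id, " ".join(chunks)


def load_reads(fna_text, qual_text):
    """Return {read_id: (sequence, [quality scores])}, checking that
    every read has exactly one score per base."""
    seqs = {rid: p.replace(" ", "") for rid, p in parse_fasta_like(fna_text)}
    quals = {rid: [int(q) for q in p.split()]
             for rid, p in parse_fasta_like(qual_text)}
    reads = {}
    for rid, seq in seqs.items():
        scores = quals.get(rid)
        if scores is None or len(scores) != len(seq):
            raise ValueError(f"sequence/quality mismatch for read {rid}")
        reads[rid] = (seq, scores)
    return reads
```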
    The linker sequence often appears at the beginning or the end of the read, or both, depending on the sequence library length. Linkers can also form dimers and trimers that attach to the sequence library. For this data, the linker sequence CTCGAGAATTCTGGATCCTC was found. The readout of the linker sequence may include errors. To tolerate such errors, six sequences were used: CTCGAGAATT (first 10 bases of the linker), CTGGATCCTC (last 10 bases of the linker), GAATTCTGGA (middle 10 bases of the linker) and the reverse complement of each. These were matched against the starting and ending sequence (within 25 bp) of each read to locate the linker. When found, these sequences were removed from the read.
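    The six-probe matching described above might look like the following sketch. The linker, the three 10-base probes and the 25 bp window come from the text; everything else is an assumption, in particular the choice to discard everything between a read end and the matched probe, since the poster does not say exactly how much is cut. The function names are hypothetical.

```python
LINKER = "CTCGAGAATTCTGGATCCTC"
WINDOW = 25  # probes are searched only within this many bases of a read end


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_probes(linker=LINKER):
    """First, middle and last 10 bases of the linker plus reverse complements."""
    forward = [linker[:10], linker[5:15], linker[-10:]]
    return forward + [reverse_complement(p) for p in forward]


def trim_linker(read, probes, window=WINDOW):
    """Remove linker-containing ends (assumed policy: cut from the read end
    through the innermost probe hit found inside the window)."""
    cut_start = 0
    head = read[:window]
    for probe in probes:
        i = head.rfind(probe)
        if i != -1:
            cut_start = max(cut_start, i + len(probe))

    cut_end = len(read)
    tail_offset = max(len(read) - window, cut_start)
    tail = read[tail_offset:]
    for probe in probes:
        i = tail.find(probe)
        if i != -1:
            cut_end = min(cut_end, tail_offset + i)
    return read[cut_start:cut_end]
```

    Using three overlapping 10-mers means that a sequencing error in one part of the 20-base linker still leaves at least one probe able to match exactly.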
    Sequence quality is often low at the beginnings and the ends of the reads. Quality scores were used to trim the sequences at the ends when the scores of 3 consecutive nucleotides, excluding homopolymers, were below 18. Reads were removed if the median quality was less than 20 or the length was less than 25 bp after trimming. In the analysis by NextGENe software, the Condensation Tool™ for error reduction was not used, to allow for an accurate comparison of the software packages.
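    One possible reading of these trimming rules is sketched below. The thresholds (18, 20, 25 bp, 3 consecutive bases) are from the text; the interpretation is an assumption: a 3-base window at a read end is dropped when all three scores are below 18 and the three bases are not identical, and this repeats until the end window passes. NextGENe's actual procedure may differ, and the function names are hypothetical.

```python
from statistics import median


def low_quality_window(bases, quals, threshold=18):
    """True if all 3 scores are below threshold and the bases are not
    a homopolymer (assumed meaning of 'excluding homopolymers')."""
    return (
        len(bases) == 3
        and len(set(bases)) > 1
        and all(q < threshold for q in quals)
    )


def quality_trim(seq, quals, threshold=18):
    """Drop low-quality 3-base windows from both ends of a read."""
    start, end = 0, len(seq)
    while end - start >= 3 and low_quality_window(
        seq[start:start + 3], quals[start:start + 3], threshold
    ):
        start += 3
    while end - start >= 3 and low_quality_window(
        seq[end - 3:end], quals[end - 3:end], threshold
    ):
        end -= 3
    return seq[start:end], quals[start:end]


def keep_read(seq, quals, min_median=20, min_length=25):
    """Apply the post-trimming filter: length >= 25 and median quality >= 20."""
    return len(seq) >= min_length and median(quals) >= min_median
```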
    Figure 3: Raw reads aligned to the entire genome are shown. Variant positions are highlighted in blue.
    The prevalence of variants at the ends of reads indicates the presence of the linker sequences.
    3. Align to abridged reference.
    The sequence reads are aligned to the abridged reference to remove the interference caused by the capture of other genomic regions and to determine the SNPs and indels that are truly from the target regions. For reads that match more than one location with some mismatches, NextGENe maps the read to the most likely location, using the highest uniqueness score. A read is allowed to map to multiple locations if it matches perfectly and the “Allow Ambiguous Locations” option is selected.
    Most commonly, linkers are found at the 5’ end of reads. However, some reads appear to have no linker sequence, because the linker contains many errors or is shorter than 10 bp, and other reads have linkers at the 3’ end. Reads are also found that have linker sequences at both ends. Although these are found at low frequency, they occur more often than reads with linkers at the 3’ end only, because they arise from short fragments, which amplify most efficiently. Least common are reads that contain a linker sequence dimer at the 5’ end.
    Figure 4: Linker sequences can be found with reads in a few different locations.
    NextGENe’s alignment tool was able to align a greater percentage of reads to the whole genome reference as well as to the target region of the genome compared to GS Mapper software (see Table 1).
    Reads that did not map to the human reference were checked. These were found to be short reads, with an average length of 100 bp, or reads with a high error rate. The short reads are likely to be sequencing artifacts.
    Insertion and deletion detection is an important aspect of NextGENe software. Short insertions and deletions of up to 30% of the read length can be detected using the reads aligned to the genome. For this data, a deletion of 60 nucleotides and an insertion of 30 nucleotides were detected with reads of 200 bp. Special attention is required for the detection of large insertions and deletions, such as a 300 bp Alu insertion. To check for a large indel event, unmatched sequences from a high stringency alignment can be assembled, and the assembled sequence can then be aligned to the genome with low stringency.
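    The 30% limit can be checked against the numbers quoted above with a one-line calculation. The helper name is hypothetical, and the limit is taken directly from the text rather than from any documented NextGENe setting.

```python
def max_indel_length(read_length, max_fraction_percent=30):
    """Largest indel (in bases) detectable from direct alignment, assuming
    the limit is a fixed percentage of the read length."""
    return read_length * max_fraction_percent // 100


# For 200 bp reads the limit is 60 bases, so the 60-nucleotide deletion and
# the 30-nucleotide insertion both fit, while a 300 bp Alu insertion does not.
limit_200 = max_indel_length(200)
```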
    1. T J Albert, et al. 2007. Direct selection of human genomic loci by microarray hybridization. Nature Methods. 4: 903-905.
    2. Roche Diagnostics GmbH. October 2008. Genome Sequencer Data Analysis Software Manual.
    Trademarks are property of their respective owners.
    Last edited by SoftGenetics; 09-03-2009, 11:51 AM.

  • #2
    Can you provide a link to this poster so we can download it?



    • #3
      hi it would be easier for you to download from our website:


      • #4
        go to our website and download it there:

