Evaluation of Targeted Enrichment Sequence Analysis Using NextGENe® Software and GS R

SoftGenetics

Registered Vendor

Join Date: Apr 2009

Posts: 32
#1

08-25-2009, 06:19 AM

Lan-Szu Chou,1 Megan Manion2 and CS Jonathan Liu2
1Institute for Clinical and Experimental Pathology, ARUP Laboratories, Salt Lake City, UT 84108-1221.
2SoftGenetics LLC, 100 Oakwood Avenue Ste 350, State College, PA 16801.
Introduction
The development of Next Generation sequencing technologies has greatly increased the speed and reduced the cost of producing DNA sequencing data. The Genome Sequencer FLX Instrument from Roche Applied Science (454 Sequencing) can produce up to 1 million reads of 100 to 400 bp each.
Targeted Genome Enrichment capture techniques reduce the cost and improve the efficiency of sequencing by combining the flexibility of microarray technology or other capture technologies with the high data throughput of Next Generation sequencing (1). Allowing researchers to sequence only the regions of interest creates greater opportunity for disease study by providing higher coverage of targeted regions in less time and at lower cost.
NextGENe software and the GS Reference Mapper application of the Roche Genome Sequencer Data Analysis Software were used for SNP and indel detection on sequence capture reads from two datasets: s1, consisting of 254,743 reads with an average length of 210 bp, and s2, consisting of 271,921 reads with an average length of 222 bp. The DNA sample was enriched using NimbleGen Sequence Capture arrays and sequenced on the Roche Genome Sequencer FLX Instrument. The resulting sequence data contain reads from the target region as well as interfering reads from elsewhere in the genome that were captured by the array. Results from the two software packages were compared. NextGENe software mapped more reads to the genome and to the targeted region, even at higher stringency settings, as shown in Table 1. Using more reads enables NextGENe software to statistically remove sequencing errors, producing highly accurate SNP and indel detection.
Comparison of Results:
Methodology
NextGENe Alignment Methodology
NextGENe’s Sequence Alignment module is used to accurately map reads to a reference sequence and detect mutations. The alignment algorithm identifies matching positions for each read using 12-mer sequences. When a match is found with the highest uniqueness score, the alignment is extended. Once reads are aligned, mutation positions are identified and highlighted. The aligned reads, consensus sequence, reference sequence and mutation calls, as well as complete annotation information, are displayed in a single view when the project is completed. Specialized reports are also available in the results.
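The 12-mer seeding step described above can be illustrated with a minimal sketch. This is not NextGENe's implementation: the function names are hypothetical, and the "uniqueness score" is approximated here by giving each seed a vote inversely proportional to the number of places it occurs in the reference, which is an assumption made only for illustration. The extension step is omitted.

```python
from collections import defaultdict

K = 12  # seed length stated in the text


def build_kmer_index(reference, k=K):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_offsets(read, index, k=K):
    """Let each seed in the read vote for the reference offset it implies.

    ASSUMPTION: a seed found at n reference positions contributes 1/n to each
    implied offset, so rarer (more unique) seeds count for more. This is an
    illustrative stand-in for a uniqueness score, not the vendor's formula.
    """
    votes = defaultdict(float)
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], [])
        for pos in hits:
            votes[pos - i] += 1.0 / len(hits)
    return votes


def best_offset(read, index, k=K):
    """Return the reference offset with the highest vote, or None if no seed hits."""
    votes = candidate_offsets(read, index, k)
    if not votes:
        return None
    return max(votes, key=votes.get)
```

A real aligner would then extend the alignment from the chosen offset and tolerate mismatches and gaps; the sketch only shows how short exact seeds narrow down where a read belongs.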
Table 1: Comparison of alignment results from NextGENe software and GS Reference Mapper
Figure 1: NextGENe’s alignment viewer provides detailed annotation of the reference sequence and the alignment results.
GS Reference Mapper Alignment Methodology (Information from Roche Genome Sequencer Data Analysis Software Manual, Software Version 2.0.00, October 2008)
The GS Mapper program (from the Roche Genome Sequencer Data Analysis Software package) aligns reads to the reference sequence for each chromosome sequentially. Reads can map to multiple locations in different chromosomes. Multiple contigs are then generated from the reads aligned in a local region, and a consensus sequence is created for each contig. Reads with a low matching percentage to the other reads aligned in the local region are discarded and are not used to generate the consensus sequences. (This can result in loss of data, especially in highly variable regions.) Variations between the reference and the consensus sequence for each contig, or for subsets within each contig, are detected and reported. The software outputs the contig consensus sequence(s) with quality values, the alignment of reads and contigs, the coverage and consensus accuracy (quality values) for each position, and a list of the positions of variants (2).
Procedure
2. Align to the whole genome
Aligning reads to the whole genome allows for identification of sequence tags and for SNP and indel detection. NextGENe aligns the sequence reads either to the whole human genome at once or to one chromosome at a time. Producing statistics for the alignment of reads to the whole human genome requires a large amount of computer memory, and curating the mutation detection results takes a user considerable time in diagnostic settings. Because of this, a simplified approach was used so that the analysis could be run on a laptop or desktop computer with as little as 4 GB of RAM. All of the reads are aligned to one chromosome at a time with 70% stringency, and the reference sequences in the covered regions are output. The covered references of all chromosomes are combined into an abridged reference. The abridged references are consistent between the two samples, indicating that DNA capture is specific to the probe design and that interference is determined by genome sequence. The abridged reference allows identification of variants using a desktop computer. It also speeds up the alignment process and facilitates human review for diagnostics.
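The construction of an abridged reference can be sketched as an interval-merging problem: collect the reference span covered by each aligned read, merge overlapping spans per chromosome, and extract the covered sequence. The input format below (tuples of chromosome name, start, end with 0-based half-open coordinates) and all function names are assumptions for illustration, not NextGENe's actual output format or API.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def abridged_reference(chromosomes, alignments):
    """Build an abridged reference from covered regions.

    chromosomes: dict mapping chromosome name -> sequence string.
    alignments:  iterable of (name, start, end), 0-based half-open
                 (hypothetical format chosen for this sketch).
    Returns a dict mapping "name:start-end" -> covered reference sequence.
    """
    by_chrom = {}
    for name, start, end in alignments:
        by_chrom.setdefault(name, []).append((start, end))

    records = {}
    for name in sorted(by_chrom):
        for start, end in merge_intervals(by_chrom[name]):
            records[f"{name}:{start}-{end}"] = chromosomes[name][start:end]
    return records
```

Because each chromosome is processed independently, only one chromosome's alignments need to be held in memory at a time, which matches the motivation given above for running on a 4 GB machine.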
NextGENe’s Sequence Operation Tool is used to trim linker sequences from the ends of reads, cutting each read where a linker sequence is found. A closer view of the aligned reads can be used to determine the linker sequence.
Figure 2: NextGENe’s Format Conversion tool trims the low quality ends of reads and removes linker sequences.
1. Linker sequence trimming and poor quality trimming.
The output from the Roche FLX Instrument consists of either two files, a sequence file (*.fna) and a quality file (*.qual), or a single file (*.sff) containing quality scores and flowgrams with luminescence intensities.
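For readers working with the two-file output, the following is a minimal sketch of pairing sequences with their quality scores. It assumes the common layout in which both files use FASTA-style ">" headers and the quality file holds whitespace-separated integer scores, possibly wrapped over several lines. The function names are hypothetical, and the binary .sff format is not handled.

```python
def read_fasta_like(handle, parse):
    """Read ">"-headed records and apply `parse` to each record's body lines."""
    records = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = parse(chunks)
            # Use the first whitespace-delimited token of the header as the read name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = parse(chunks)
    return records


def read_fna(handle):
    """Return {read name: sequence string}."""
    return read_fasta_like(handle, lambda chunks: "".join(chunks))


def read_qual(handle):
    """Return {read name: list of integer quality scores}."""
    return read_fasta_like(
        handle, lambda chunks: [int(q) for q in " ".join(chunks).split()]
    )


def pair_reads(fna_handle, qual_handle):
    """Return {read name: (sequence, qualities)}, checking that lengths agree."""
    seqs = read_fna(fna_handle)
    quals = read_qual(qual_handle)
    paired = {}
    for name, seq in seqs.items():
        scores = quals.get(name)
        if scores is None or len(scores) != len(seq):
            raise ValueError(f"missing or mismatched quality scores for {name}")
        paired[name] = (seq, scores)
    return paired
```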
The linker sequence often appears at the beginning or the end of a read, or both, depending on the length of the library fragment. Linkers can also form dimers and trimers that attach to the library fragment. For this data, the linker sequence CTCGAGAATTCTGGATCCTC was found. The readout of the linker sequence may include errors. To tolerate such errors, six sequences were used to detect the linker: CTCGAGAATT (first 10 bases of the linker), CTGGATCCTC (last 10 bases of the linker), GAATTCTGGA (middle 10 bases of the linker) and the reverse complement of each. These were matched against the start and end of each read (within 25 bp). When found, these sequences were removed from the read.
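The six-probe linker search can be sketched as follows. The linker, the three 10-base probes and the 25 bp window come from the text; everything else is an assumption made for illustration, including the function names, the rule that the cut is placed at the innermost probe hit, and the repeat-until-stable loop used to handle linker dimers. It is not the NextGENe implementation.

```python
LINKER = "CTCGAGAATTCTGGATCCTC"  # linker reported in the text
WINDOW = 25                      # probes must match within 25 bp of a read end

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def linker_probes(linker=LINKER):
    """First, middle and last 10 bases of the linker plus their reverse complements."""
    forward = [linker[:10], linker[5:15], linker[-10:]]
    return forward + [reverse_complement(p) for p in forward]


def find_all(text, pattern):
    """Return the start positions of every (possibly overlapping) exact match."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def trim_5prime(read, probes, window):
    """ASSUMPTION: cut through the end of the innermost probe hit starting in the window."""
    cut = 0
    for p in probes:
        for pos in find_all(read[:window + len(p) - 1], p):
            cut = max(cut, pos + len(p))
    return read[cut:]


def trim_3prime(read, probes, window):
    """ASSUMPTION: cut from the start of the innermost probe hit ending in the window."""
    cut = len(read)
    for p in probes:
        offset = max(0, len(read) - window - len(p) + 1)
        for pos in find_all(read[offset:], p):
            cut = min(cut, offset + pos)
    return read[:cut]


def trim_linker(read, window=WINDOW):
    """Trim both ends repeatedly until nothing changes (handles linker dimers)."""
    probes = linker_probes()
    while True:
        trimmed = trim_3prime(trim_5prime(read, probes, window), probes, window)
        if trimmed == read:
            return read
        read = trimmed
```

Using three overlapping 10-base probes means a linker with a sequencing error in one part is still likely to be caught by another probe, although in this sketch a damaged portion beyond the innermost hit would be left on the read.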
Sequence quality is often low at the beginnings and ends of reads. Quality scores were used to trim the ends of a read when the scores of 3 consecutive nucleotides, excluding homopolymers, were below 18. Reads were removed if, after trimming, the median quality was less than 20 or the length was less than 25 bp. In the analysis by NextGENe software, the Condensation Tool™ for error reduction was not used, to allow for an accurate comparison of the software packages.
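The trimming and filtering rule can be sketched as below. The thresholds (18, 20, 25 bp, window of 3) come from the text, but the text does not spell out the exact procedure, so two interpretations are assumed here: a terminal base is dropped while the three terminal bases all score below 18, and a window made of three identical bases (a homopolymer) never triggers trimming. The function names are hypothetical.

```python
from statistics import median

MIN_WINDOW_Q = 18   # all 3 bases in an end window must be below this to trim
MIN_MEDIAN_Q = 20   # reads with a lower median quality are removed
MIN_LENGTH = 25     # reads shorter than this after trimming are removed


def _low_quality_window(bases, quals):
    # ASSUMPTION: "excluding homopolymers" is read as "a window of three
    # identical bases does not count as a low-quality window".
    return len(set(bases)) > 1 and all(q < MIN_WINDOW_Q for q in quals)


def quality_trim(seq, quals):
    """Drop terminal bases while the 3 terminal bases are all low quality."""
    start, end = 0, len(seq)
    while end - start >= 3 and _low_quality_window(
        seq[start:start + 3], quals[start:start + 3]
    ):
        start += 1
    while end - start >= 3 and _low_quality_window(
        seq[end - 3:end], quals[end - 3:end]
    ):
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(seq, quals):
    """Keep a trimmed read only if it is long enough and of adequate median quality."""
    return len(seq) >= MIN_LENGTH and median(quals) >= MIN_MEDIAN_Q
```

Under this interpretation one or two low-scoring bases can remain at an end once the terminal window contains a good base; a stricter variant would remove the whole window instead.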
Figure 3: Raw reads aligned to the entire genome are shown. Variant positions are highlighted in blue.
The prevalence of variants at the ends of reads indicates the presence of the linker sequences.
3. Align to abridged reference.
The sequence reads are aligned to the abridged reference to remove the interference caused by the capture of other genomic regions and to determine which SNPs and indels are truly from the target regions. For reads that match more than one location with some mismatches, NextGENe maps the read to the most likely location, using the highest uniqueness score. A read is allowed to map to multiple locations if it matches perfectly and the “Allow Ambiguous Locations” option is selected.
Results
Most commonly, linkers are found at the 5’ end of reads. However, some reads appear to have no linker sequence, either because the linker contains many errors or because it is shorter than 10 bp, and other reads have linkers at the 3’ end. Reads with linker sequences at both ends are also found. Although these are found at low frequency, they occur more often than reads with a linker at the 3’ end only, because they arise from short fragments, which amplify most efficiently. Least common are reads that contain a linker sequence dimer at the 5’ end.
Figure 4: Linker sequences can be found with reads in a few different locations.
NextGENe’s alignment tool was able to align a greater percentage of reads to the whole genome reference as well as to the target region of the genome compared to GS Mapper software (see Table 1).
Reads that did not map to the human reference were checked. These were found to be either short reads, with an average length of 100 bp, or reads with a high error rate. The short reads are likely to be sequence artifacts.
Discussion
Insertion and deletion detection is an important aspect of NextGENe software. Short insertions and deletions of up to 30% of the read length can be detected using the reads aligned to the genome. For this data, a deletion of 60 nucleotides and an insertion of 30 nucleotides were detected with reads of 200 bp. Special attention is required for the detection of large insertions and deletions, such as a 300 bp Alu insertion. To check for a large indel event, unmatched sequences from a high stringency alignment can be assembled, and the assembled sequence can then be aligned to the genome with low stringency.
References
1. T J Albert, et al. 2007. Direct selection of human genomic loci by microarray hybridization. Nature Methods. 4: 903-905.
2. Roche Diagnostics GmbH. October 2008. Genome Sequencer Data Analysis Software Manual.
Trademarks are property of their respective owners.

Last edited by SoftGenetics; 09-03-2009, 11:51 AM.
dottomarco

Member

Join Date: Jul 2009

Posts: 32
#2

09-10-2009, 01:54 AM

Can you provide a link to this poster so we can download it?

Thanks
SoftGenetics

Registered Vendor

Join Date: Apr 2009

Posts: 32
#3

09-10-2009, 02:19 PM

Hi, it would be easier for you to download it from our website: http://www.softgenetics.com/Evaluati...ysis_Paper.pdf
SoftGenetics

Registered Vendor

Join Date: Apr 2009

Posts: 32
#4

09-10-2009, 02:19 PM

Go to our website and download it there: http://www.softgenetics.com/Evaluati...ysis_Paper.pdf