I want to simulate WGS with 100 bp reads and apply an error model that reflects real Illumina HiSeq sequencing as accurately as possible. Does anyone have suggestions about which programs (and parameters) could do this?
-
wgsim models errors as uniformly distributed along the reads and therefore assigns the same base quality to all bases, which is not realistic. I haven't tried MetaSim, but I think it allows you to use empirical error models. You may want to try it:
-
Hi oiiio,
We have recently written a FASTQ simulator which has the option of generating reads with an error rate following an exponential decay model. So if you simulate an error rate of, say, 1% over the entire read, the first cycles (possibly 50-70) will have hardly any errors. The quality then drops more sharply towards the last cycles, resulting in an overall error rate of 1% per base per read.
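To make the idea concrete, here is a minimal Python sketch of such a per-cycle error curve. It is not the simulator's actual code: the functional form (error probability growing as `exp(growth * cycle)`), the `growth` parameter and the function names are all assumptions for illustration. The only property taken from the description above is that the per-cycle error probabilities average out to the requested overall error rate.

```python
import math

def per_cycle_error_rates(read_length, mean_error, growth=0.1):
    """Per-cycle error probabilities that rise exponentially along the read
    and average to mean_error. 'growth' is a hypothetical shape parameter."""
    raw = [math.exp(growth * i) for i in range(read_length)]
    # Rescale so that the mean over all cycles equals the requested rate.
    scale = mean_error * read_length / sum(raw)
    return [r * scale for r in raw]

def phred(p, max_q=40):
    """Phred score for error probability p, capped at max_q."""
    if p <= 0:
        return max_q
    return min(max_q, int(round(-10 * math.log10(p))))

# 100 bp read with a 1% overall error rate.
rates = per_cycle_error_rates(100, 0.01)
# Quality string in Sanger encoding (Phred+33).
quals = "".join(chr(phred(p) + 33) for p in rates)
```

With these assumed settings the early cycles sit at the quality cap and the last cycles fall to low qualities, while the mean error probability stays at 1%. A steeper or flatter curve only needs a different `growth` value.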
The simulator was originally written for BS-Seq data but it works just as well for normal genomic data. Currently it only simulates single-end reads and features the following options:
- generate any number of sequences
- generate sequences of any length
- generate either completely random sequences or use genomic sequences (can be specified)
- adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually (default: 100)
- generate directional or non-directional libraries (only relevant for BS-Seq)
- write sequence out in base space or ABI color space format
- adjustable default Phred quality score (Sanger encoding, Phred+33) (default: 40)
- sequences can have a constant Phred quality throughout the read (with default quality)
- introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
- sequences can have quality scores following an exponential decay curve. The overall error rate for sequences of varying length follows the calculated error model, and the overall error rate can be specified by the user. For example, a 0.1% error rate will eventually harbour 0.1% SNPs resembling 'real' data error curves (cf. introducing a fixed number of SNPs per sequence).
- introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
- introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user can specify a mean insert size of their library, e.g. 150bp. The simulator then calculates a normal distribution of fragment sizes around this mean, and introduces variable bp of adapter sequence into the reads if the fragment size was smaller than the read length. Available for all error models.
- introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
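The variable adapter option can be sketched roughly as follows in Python. This is only an illustration under assumptions, not the simulator's implementation: the standard deviation `sd`, the adapter string, the function names and the choice to repeat the adapter when it is shorter than the gap are all hypothetical. The post only specifies a mean insert size, a normal distribution of fragment sizes, and adapter sequence at the 3' end whenever the fragment is shorter than the read.

```python
import random

# Illustrative adapter prefix; substitute the adapter of your own library.
ADAPTER = "AGATCGGAAGAGC"

def add_adapter_readthrough(read, fragment_len, adapter=ADAPTER):
    """Keep the first fragment_len bases and fill the 3' end with adapter.
    Reads from fragments at least as long as the read are left unchanged."""
    read_len = len(read)
    if fragment_len >= read_len:
        return read
    fragment_len = max(0, fragment_len)
    needed = read_len - fragment_len
    # Assumption: repeat the adapter if it is shorter than the gap to fill.
    filler = (adapter * (needed // len(adapter) + 1))[:needed]
    return read[:fragment_len] + filler

def simulate_reads_with_adapter(reads, mean_insert=150, sd=30, seed=None):
    """Draw one fragment size per read from a normal distribution and
    add adapter read-through where the fragment is too short."""
    rng = random.Random(seed)
    out = []
    for read in reads:
        fragment_len = int(round(rng.gauss(mean_insert, sd)))
        out.append(add_adapter_readthrough(read, fragment_len))
    return out
```

For example, `simulate_reads_with_adapter(["ACGT" * 25] * 1000, mean_insert=100, sd=30, seed=1)` returns 1000 reads of 100 bp, some of which end in adapter sequence.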
One more word on the error model. As it stands, the error model is applied to all reads uniformly, which is probably not exactly what a real dataset would look like. We have therefore generated a couple of different test data sets with various error levels (e.g. 0%, 0.1%, 0.2%, 0.5%, 1%, 2% and 5% errors, and thus miscalled bases) and simply concatenated the files to produce a slightly more realistic dataset.
Most of the features have been tested with FastQC and by various other means and work correctly. Just let me know if you are interested.
-
Thanks for the replies. Unfortunately, I really need a simulator that can do paired-end data, although yours sounds like a good tool.
I was looking at the MetaSim program, and I do not see an option in the 'new project' parameters that allows for Illumina data. Does anyone know how to enable this option?
Additionally, I found a simulator called simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) that can do Illumina data. This looks to be the program that I need, but I'm not really sure what parameters should be used for 100bp reads and a realistic error model. Any suggestions?
-
Originally posted by fkrueger: "introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default)."
-
Our simulator can work in one of two ways. It either uses the error model and introduces errors according to the per-base error probability, or it introduces a fixed number of errors per read, in which case the quality scores are kept constant. We have used the second mode to assess the influence of 1, 2, 3 etc. errors on certain mapping conditions, as this is not easy to tell if you use an error model.
Feel free to take a look here: http://www.bioinformatics.babraham.a...jects/sherman/.
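For anyone who wants to try the fixed-errors-per-read idea without the tool, here is a rough Python sketch. It is an assumption-laden illustration and not Sherman's code: the function names are hypothetical, substitutions are drawn uniformly from the other three bases, and the quality string is simply the constant Phred score in Sanger (Phred+33) encoding.

```python
import random

def introduce_fixed_errors(seq, n_errors, rng=random):
    """Substitute exactly n_errors distinct positions with a different base."""
    n_errors = min(n_errors, len(seq))
    positions = rng.sample(range(len(seq)), n_errors)
    bases = list(seq)
    for pos in positions:
        # Choose among the bases that differ from the current one.
        alternatives = [b for b in "ACGT" if b != bases[pos].upper()]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)

def fastq_record(name, seq, quality=40):
    """FASTQ record with a constant quality score (Phred+33)."""
    return f"@{name}\n{seq}\n+\n{chr(quality + 33) * len(seq)}\n"

rng = random.Random(1)
original = "ACGTACGTACGTACGTACGT"
mutated = introduce_fixed_errors(original, 3, rng)
record = fastq_record("read_1", mutated)
```

Because every read carries exactly the same number of mismatches, mapping results can be compared directly between runs with 1, 2 or 3 errors.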