  • Looking for the right WGS simulator

    I want to simulate WGS with 100 bp reads and apply an error model that most accurately reflects the reality of sequencing on an Illumina HiSeq. Does anyone have any suggestions about which programs (and parameters) could do this?

  • #2
    wgsim models errors as uniformly distributed along the reads, and therefore assigns the same base quality to all bases, which is not realistic. I haven't tried MetaSim, but I think it allows you to use empirical error models, so you may want to give it a try.

    • #3
      Hi oiiio,

      We have recently written a FastQ simulator which has the option of generating reads with an error rate following an exponential decay model. So if you simulate an overall error rate of, say, 1% over the entire read, the first cycles (possibly the first 50-70) will have hardly any errors; the quality then drops more sharply towards the last cycles, resulting in an overall error rate of 1% per base across the read.
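      To illustrate the idea, here is a minimal sketch of an exponentially decaying per-cycle error profile that is normalised so it still averages to the requested overall rate. The `decay` shape parameter is my own assumption for illustration, not a value taken from the simulator:

      ```python
      import math

      def error_profile(read_len, overall_rate, decay=5.0):
          """Per-cycle error probabilities that rise exponentially towards the
          3' end of the read while averaging to `overall_rate` across the read.
          `decay` is an assumed shape parameter controlling how sharply
          quality drops in the later cycles."""
          raw = [math.exp(decay * i / (read_len - 1)) for i in range(read_len)]
          # scale so the mean per-base error equals the requested overall rate
          scale = overall_rate * read_len / sum(raw)
          return [p * scale for p in raw]

      profile = error_profile(100, 0.01)
      print(sum(profile) / len(profile))  # mean per-base error, ~0.01 as requested
      print(profile[0], profile[-1])      # first cycles nearly error-free, last cycles much worse
      ```

      The corresponding Phred score for any cycle would be `-10 * log10(p)`, capped at the simulator's default of 40.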

      The simulator was originally written for BS-Seq data but it works just as well for normal genomic data. Currently it only simulates single-end reads and features the following options:

      - generate any number of sequences
      - generate sequences of any length
      - generate either completely random sequences or use genomic sequences (can be specified)
      - adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually (default: 100)
      - generate directional or non-directional libraries (only relevant for BS-Seq)
      - write sequence out in base space or ABI color space format
      - adjustable default Phred quality score (Sanger encoding, Phred+33) (default: 40)
      - sequences can have a constant Phred quality throughout the read (with default quality)
      - introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
      - sequences can have quality scores following an exponential decay curve. The overall error rate for sequences of varying length follows the calculated error model, and the overall error rate can be specified by the user. For example, a 0.1% error rate will eventually harbour 0.1% SNPs resembling 'real' data error curves (cf. introducing a fixed number of SNPs per sequence).
      - introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
      - introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user can specify a mean insert size of their library, e.g. 150bp. The simulator then calculates a normal distribution of fragment sizes around this mean, and introduces variable bp of adapter sequence into the reads if the fragment size was smaller than the read length. Available for all error models.
      - introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
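      The variable-adapter option above can be sketched as follows: draw a fragment size from a normal distribution around the stated mean insert size, and whenever the fragment is shorter than the read length, the difference is filled with adapter at the 3' end. The standard deviation here is my own assumption; the simulator's actual spread may differ:

      ```python
      import random

      def adapter_bp_per_read(read_len=100, mean_insert=150, sd=30, n=10000):
          """Sketch: for each of n reads, draw a fragment size ~ N(mean_insert, sd);
          a fragment shorter than the read length contributes
          (read_len - fragment) bp of adapter sequence at the 3' end."""
          counts = []
          for _ in range(n):
              frag = int(random.gauss(mean_insert, sd))
              counts.append(min(read_len, max(0, read_len - frag)))
          return counts

      counts = adapter_bp_per_read()
      # fraction of simulated reads that read into adapter at all
      print(sum(c > 0 for c in counts) / len(counts))
      ```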


      One more word about the error model. As it stands, the error model is applied to all reads uniformly, which is probably not exactly what a real dataset would look like. We have therefore generated a couple of different test datasets with various error levels (e.g. 0%, 0.1%, 0.2%, 0.5%, 1%, 2% and 5% errors, and thus miscalled bases) and simply concatenated the files to produce a slightly more realistic dataset.
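      A toy version of that mixing idea, using a hypothetical `mutate` helper in place of the simulator (in practice you would simply concatenate the simulator's per-error-level FastQ output files):

      ```python
      import random

      def mutate(seq, rate):
          """Introduce miscalled bases at the given per-base error rate (sketch)."""
          bases = "ACGT"
          return "".join(
              random.choice([b for b in bases if b != c]) if random.random() < rate else c
              for c in seq
          )

      # Simulate small batches at several error levels and pool them,
      # mirroring the concatenation of separately simulated FastQ files.
      levels = [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05]
      template = "ACGT" * 25  # a 100 bp read
      pool = [mutate(template, r) for r in levels for _ in range(100)]
      print(len(pool))  # 700 reads spanning all error levels
      ```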

      Most of the features have been tested and confirmed to work correctly with FastQC and by various other means; just let me know if you are interested.

      • #4
        Thanks for the replies. Unfortunately, I really need a simulator that can do paired-end data, although yours sounds like a good tool.

        I was looking at the MetaSim program, and I do not see an option in the 'new project' parameters that allows for Illumina data. Does anyone know how to enable this option?

        Additionally, I found a simulator called simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) that can do Illumina data. This looks to be the program that I need, but I'm not really sure what parameters should be used for 100bp reads and a realistic error model. Any suggestions?

        • #5
          Originally posted by fkrueger View Post
          introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
          I am interested in this program, but I am wondering: doesn't the above statement conflict with your statement about an error rate that varies by position? Or maybe you mean that all "miscalls" have the same Q score?

          • #6
            Our simulator either uses the error model, introducing errors according to the per-cycle error probability, or alternatively introduces a fixed number of errors per read, in which case the quality scores are kept constant. We have used the latter mode to assess the influence of 1, 2, 3 etc. errors under certain mapping conditions, as this is not easy to tell if you use an error model.
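            The fixed-error mode could be sketched like this (a hypothetical helper for illustration, not the simulator's actual code): exactly `n_errors` distinct positions are miscalled, and every base carries the same Phred quality.

            ```python
            import random

            def fixed_errors(read, n_errors, qual=40):
                """Sketch of the fixed-error mode: place exactly n_errors miscalls
                at distinct random positions; all bases keep a constant Phred
                quality (Sanger encoding, Phred+33)."""
                bases = "ACGT"
                seq = list(read)
                for pos in random.sample(range(len(seq)), n_errors):
                    seq[pos] = random.choice([b for b in bases if b != seq[pos]])
                quals = chr(qual + 33) * len(seq)
                return "".join(seq), quals

            read, quals = fixed_errors("ACGT" * 25, 3)
            print(sum(a != b for a, b in zip(read, "ACGT" * 25)))  # prints 3
            ```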

            Feel free to take a look here: http://www.bioinformatics.babraham.a...jects/sherman/.
