  • Looking for the right WGS simulator

    I want to simulate WGS data with 100 bp reads and apply an error model that reflects real Illumina HiSeq sequencing as accurately as possible. Does anyone have suggestions about which programs (and parameters) could do this?

  • #2
    wgsim models errors as uniformly distributed along the reads and therefore assigns the same base quality to all bases, which is not realistic. I haven't tried MetaSim, but I think it allows you to use empirical error models. You may want to try it:



    • #3
      Hi oiiio,

      We have recently written a FASTQ simulator that can generate reads whose error rate follows an exponential decay model. If you simulate an overall error rate of, say, 1%, the first cycles (possibly 50-70) will have hardly any errors; the quality then drops more sharply towards the last cycles, so that the error rate averaged over the whole read comes out at 1% per base.
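A minimal sketch of what such a per-cycle error profile could look like. This is not the simulator's actual code: the function names and the `growth` shape parameter are assumptions made for illustration. The per-cycle error probability rises exponentially along the read and is rescaled so that its mean equals the requested overall rate, then converted to Sanger (Phred+33) quality characters capped at Q40.

```python
import math

def exponential_error_profile(read_length, mean_error_rate, growth=0.1):
    """Per-cycle error probabilities that rise exponentially along the read.

    `growth` is an assumed shape parameter (not taken from the post); the
    curve is rescaled so the mean over all cycles equals mean_error_rate.
    """
    raw = [math.exp(growth * cycle) for cycle in range(read_length)]
    scale = mean_error_rate * read_length / sum(raw)
    return [r * scale for r in raw]

def phred_char(error_prob, offset=33, max_q=40):
    """Convert an error probability to a Phred+33 character, capped at max_q."""
    if error_prob <= 0:
        return chr(max_q + offset)
    q = int(round(-10 * math.log10(error_prob)))
    return chr(min(max_q, max(0, q)) + offset)

profile = exponential_error_profile(read_length=100, mean_error_rate=0.01)
quality_string = "".join(phred_char(p) for p in profile)
print(quality_string)
```

With these assumed settings the early cycles sit at the Q40 cap and the quality only falls off towards the 3' end, while the average error rate over the read stays at 1%.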

      The simulator was originally written for BS-Seq data but it works just as well for normal genomic data. Currently it only simulates single-end reads and features the following options:

      - generate any number of sequences
      - generate sequences of any length
      - generate either completely random sequences or use genomic sequences (can be specified)
      - adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually (default: 100)
      - generate directional or non-directional libraries (only relevant for BS-Seq)
      - write sequence out in base space or ABI color space format
      - adjustable default Phred quality score (Sanger encoding, Phred+33) (default: 40)
      - sequences can have a constant Phred quality throughout the read (with default quality)
      - introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
      - sequences can have quality scores following an exponential decay curve. The overall error rate for sequences of varying length follows the calculated error model, and the overall error rate can be specified by the user. For example, with a 0.1% error rate the reads will harbour 0.1% SNPs overall, distributed along the read like 'real' data error curves (cf. introducing a fixed number of SNPs per sequence).
      - introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
      - introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user can specify a mean insert size of their library, e.g. 150bp. The simulator then calculates a normal distribution of fragment sizes around this mean, and introduces variable bp of adapter sequence into the reads if the fragment size was smaller than the read length. Available for all error models.
      - introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
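The variable adapter option in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the simulator's implementation: the helper names, the standard deviation of 30 bp, the random toy genome and the adapter string are all placeholders. A fragment length is drawn from a normal distribution around the mean insert size, and whenever the fragment is shorter than the read length the read runs through into the adapter.

```python
import random

ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter; substitute your library's sequence

def draw_fragment_length(rng, mean_insert=150, sd=30):
    """Draw a fragment length from a normal distribution (sd is an assumption)."""
    return max(1, int(round(rng.gauss(mean_insert, sd))))

def read_from_fragment(fragment, read_length, adapter=ADAPTER):
    """Sequence read_length cycles of a fragment, reading into the adapter
    when the fragment is shorter than the read."""
    read = fragment[:read_length]
    if len(read) < read_length:
        read += adapter[: read_length - len(read)]
    # Note: the read stays shorter than read_length if the adapter runs out.
    return read

rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(10_000))  # toy reference
reads = []
for _ in range(1000):
    frag_len = draw_fragment_length(rng)
    start = rng.randrange(len(genome) - frag_len + 1)
    reads.append(read_from_fragment(genome[start:start + frag_len], 100))

contaminated = sum(1 for r in reads if ADAPTER[:5] in r[-len(ADAPTER):])
print(f"{contaminated} of {len(reads)} reads appear to end in adapter sequence")
```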


      One more word on the error model. As it stands, the error model is applied to all reads uniformly, which is probably not exactly what a real dataset looks like. We have therefore generated a couple of test data sets with various error levels (e.g. 0%, 0.1%, 0.2%, 0.5%, 1%, 2% and 5% errors, and thus miscalled bases) and simply concatenated the files to produce a slightly more realistic dataset.

      Most of the features have been tested with FastQC and by various other means and work correctly. Just let me know if you are interested.
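A rough sketch of that mixing idea, assuming an even split of reads across the error levels (the post does not say what proportions were used) and simple per-base substitutions rather than the position-dependent model. The function names are hypothetical.

```python
import random

# Error levels listed in the post, as fractions.
ERROR_LEVELS = [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05]

def mutate(seq, error_rate, rng):
    """Replace each base, with probability error_rate, by a different base."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def mixed_error_reads(templates, rng, levels=ERROR_LEVELS):
    """Simulate one batch per error level and concatenate the batches,
    mirroring the concatenation of per-level files (even split assumed)."""
    reads = []
    for i, level in enumerate(levels):
        for seq in templates[i::len(levels)]:
            reads.append(mutate(seq, level, rng))
    return reads

rng = random.Random(0)
templates = ["ACGT" * 25] * 70  # toy error-free 100 bp reads
reads = mixed_error_reads(templates, rng)
print(len(reads), "reads across", len(ERROR_LEVELS), "error levels")
```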



      • #4
        Thanks for the replies. Unfortunately, I really need a simulator that can do paired-end data, although yours sounds like a good tool.

        I was looking at the MetaSim program, and I do not see an option in the 'new project' parameters that allows for Illumina data. Does anyone know how to enable this option?

        Additionally, I found a simulator called simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) that can do Illumina data. This looks to be the program that I need, but I'm not really sure what parameters should be used for 100bp reads and a realistic error model. Any suggestions?



        • #5
          Originally posted by fkrueger
          introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
          I am interested in this program, but I am wondering: doesn't the above statement conflict with your statement about an error rate that varies by position? Or maybe you mean that all "miscalls" have the same Q score?



          • #6
            These are two separate modes. Our simulator either uses the error model and introduces errors according to the per-base error probability, or it introduces a fixed number of errors per read, in which case the quality scores are kept constant. We have used the latter to assess the influence of 1, 2, 3 etc. errors under certain mapping conditions, as this is not easy to tell if you use an error model.

            Feel free to take a look here: http://www.bioinformatics.babraham.a...jects/sherman/.
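For illustration, the fixed-error mode could be sketched like this. It is an assumed reimplementation of the idea rather than the tool's own code, and `add_fixed_errors` is a hypothetical name: exactly `n_errors` distinct positions are substituted, and every base gets the same Phred+33 quality (Q40 by default).

```python
import random

def add_fixed_errors(seq, n_errors, rng, quality=40, offset=33):
    """Introduce exactly n_errors substitutions at distinct positions and
    return the read together with a constant-quality Phred+33 string."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_errors):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases), chr(quality + offset) * len(bases)

template = "ACGT" * 25
read, qual = add_fixed_errors(template, 3, random.Random(1))
mismatches = sum(a != b for a, b in zip(read, template))
print(mismatches, qual[:10])
```

Because the number of mismatches per read is known exactly, mapping results can be broken down by 1, 2, 3 and so on errors, which is harder to do when errors are drawn from a probabilistic model.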
