  • Simulate Illumina read-pairs


    I want to simulate read-pairs using a read-length greater than 35 (up to 75). If I run MAQ, this works:

    maq simulate -N 1000 -1 35 -2 35 human_b37_chr22.fasta calib-36.dat

    But this does not:

    maq simulate -N 1000 -1 70 -2 70 human_b37_chr22.fasta calib-36.dat

    calib-36.dat was downloaded from The flags "-1" and "-2" do not work if the user wants read lengths greater than those specified in the .dat file.

    If anybody knows of simulators for other sequencers, that would be helpful too. I want to simulate sequencing a genome, with random errors introduced by the sequencing process.

  • #2

    I think the up-to-date simulator is wgsim.

    lh3/wgsim on GitHub: reads simulator.


    Wgsim was modified from MAQ's read simulator by dropping dependencies to other
    source codes in the MAQ package and incorporating patches from Colin Hercus
    which allow to simulate INDELs longer than 1bp. Wgsim was originally released
    in the SAMtools software package. I forked it out in 2011 as a standalone
    project. A few improvements were also added in this course.


    • #3
      Software list

      I am currently reviewing software for this purpose, so I know of quite a few options. For most of these you can just google "[prog name] simulation genome" or similar and find them in the top few hits. For the Illumina one you need to write to them and ask for it, and as far as I know it is not official.

      * wgsim -> PE only, uniform error
      * dwgsim -> Position specific error. PE only
      * metasim -> PE only, specialized for simulating from a population
      * in-house Illumina C++ -> doesn't model mate-pair chimeras; uses sampled Illumina error strings as the error for the output. Doesn't model base-specific error though: when an error occurs, it is the same regardless of the underlying base.
      * in-house Illumina Perl -> adds proper handling of mate-pair simulation, but uses the same base-level error strategy as the C++ version, which is the main reason we chose to write our own. Doesn't model PE contamination in MP libraries, but the developer notes it would be easy to generate PE reads separately and mix them into the output file. Although ours ended up being backwards, we still successfully modeled different error rates depending on the underlying base.
      * PEMer -> no mate-pair chimeras
      * reseqsim -> focuses on SV analysis, doesn't do MP modeling
      * simnext -> flat error rate like wgsim
      * mason -> doesn't model mate-pair chimeras
      * flux-capacitor -> models RNA-seq reads
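To make the distinction in the list above concrete, here is a minimal Python sketch of a flat ("uniform") error model next to a position-specific one. It is not how any of the listed tools is implemented; the function names, the linear ramp, and the example rates are all assumptions chosen for illustration.

```python
import random

def uniform_error_rates(read_length, rate):
    """Flat model: the same substitution probability at every cycle."""
    return [rate] * read_length

def position_specific_error_rates(read_length, start_rate, end_rate):
    """Position-specific model (assumed linear ramp from first to last cycle)."""
    if read_length == 1:
        return [start_rate]
    step = (end_rate - start_rate) / (read_length - 1)
    return [start_rate + i * step for i in range(read_length)]

def apply_errors(read, rates, rng):
    """Replace each base by a different base with its per-position probability."""
    if len(read) != len(rates):
        raise ValueError("need exactly one error rate per base")
    out = []
    for base, p in zip(read, rates):
        if rng.random() < p:
            # Pick uniformly among the other three bases (not base-specific).
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)
read = "ACGT" * 10
flat = apply_errors(read, uniform_error_rates(len(read), 0.02), rng)
ramped = apply_errors(read, position_specific_error_rates(len(read), 0.001, 0.1), rng)
print(flat)
print(ramped)
```

A base-specific model, as described for the in-house simulator, would additionally make the probability (or the replacement base) depend on `base`; that is left out here.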

      And of course there is the one I wrote which we used in the first Assemblathon:


      • #4
        Thanks everyone for your replies.

        I want a sequencing error simulator that matches the Illumina data in the 1000 Genomes Project, which is where I am getting my data from. (Illumina-specific is not a die-hard requirement, but it helps a bit. The type of error should not depend on the read length.)

        I need read-pairs. Read length should be specifiable by the user. The insert size should follow a random distribution - Normal or whatever - that can be specified. SimSeq seems to satisfy those criteria at the moment but I have not tried it yet.

        I have my own tailored donor genome for a particular kind of mutation that needs sequencing errors.
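The requirements above (read pairs, user-specified read length, insert size drawn from a specifiable distribution) can be sketched in a few lines of Python. This is only an illustration, not SimSeq or any other tool: the function names are made up, "insert size" is assumed to mean the outer fragment length, the distribution is assumed Normal, and errors are flat substitutions only.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def add_substitution_errors(seq, error_rate, rng):
    """Flat error model: each base is replaced with probability error_rate."""
    return "".join(
        rng.choice([b for b in "ACGT" if b != base]) if rng.random() < error_rate else base
        for base in seq
    )

def simulate_read_pairs(genome, n_pairs, read_length, insert_mean, insert_sd,
                        error_rate=0.0, seed=None):
    """Sample forward/reverse read pairs from fragments of Normal-distributed length."""
    if not 1 <= read_length <= len(genome):
        raise ValueError("read_length must be between 1 and the genome length")
    rng = random.Random(seed)
    pairs = []
    attempts = 0
    while len(pairs) < n_pairs:
        attempts += 1
        if attempts > 1000 * n_pairs:
            raise RuntimeError("insert size distribution does not fit the genome")
        insert = int(round(rng.gauss(insert_mean, insert_sd)))
        if insert < read_length or insert > len(genome):
            continue  # reject fragments shorter than a read or longer than the genome
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        read1 = fragment[:read_length]                        # forward strand, 5' end
        read2 = reverse_complement(fragment[-read_length:])   # opposite strand, other end
        pairs.append((add_substitution_errors(read1, error_rate, rng),
                      add_substitution_errors(read2, error_rate, rng)))
    return pairs

# Toy donor genome; in practice this would be read from a FASTA file.
genome = "".join(random.Random(42).choice("ACGT") for _ in range(2000))
noisy_pairs = simulate_read_pairs(genome, n_pairs=5, read_length=75,
                                  insert_mean=300, insert_sd=30,
                                  error_rate=0.01, seed=1)
for r1, r2 in noisy_pairs:
    print(r1, r2)
```

Swapping `rng.gauss` for another sampler is all it takes to change the insert size distribution.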


        • #5
          If I want to use dwgsim for simulating read-pairs, can anyone explain the flags for me?

          What do -e and -E mean technically? What are the error rates relative to?

          I think that -r is the mutation rate per base pair. Can that be confirmed?

          What does -R, the fraction of indels, mean? Fraction of what?

          -X and -y are also confusing. What are those probabilities relative to?
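One plausible reading of -r, -R and -X, assuming dwgsim keeps the parameterisation described in the wgsim usage text ("rate of mutations", "fraction of indels", "probability an indel is extended"), is sketched below in Python. This is an interpretation for illustration, not the tools' actual code; the 50/50 split between insertions and deletions is an assumption, and `dwgsim -h` should be checked for the authoritative definitions.

```python
import random

def mutate(reference, mutation_rate, indel_fraction, indel_extend_prob, rng):
    """Mutate a reference under an assumed wgsim-style parameterisation.

    mutation_rate     : per-base probability that a mutation starts here (like -r)
    indel_fraction    : fraction of those mutations that are indels (like -R)
    indel_extend_prob : probability of extending an indel by one more base (like -X)
    """
    if not 0.0 <= indel_extend_prob < 1.0:
        raise ValueError("indel_extend_prob must be in [0, 1)")
    out = []
    i = 0
    while i < len(reference):
        base = reference[i]
        if rng.random() >= mutation_rate:
            out.append(base)            # no mutation at this position
            i += 1
            continue
        if rng.random() >= indel_fraction:
            # Substitution: replace with one of the other three bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
            i += 1
            continue
        # Indel: length 1, extended geometrically with indel_extend_prob.
        length = 1
        while rng.random() < indel_extend_prob:
            length += 1
        if rng.random() < 0.5:          # assumed 50/50 deletion vs insertion
            i += length                 # deletion: skip `length` reference bases
        else:
            out.append(base)            # insertion: keep base, then add random bases
            out.extend(rng.choice("ACGT") for _ in range(length))
            i += 1
    return "".join(out)

reference = "ACGT" * 25
donor = mutate(reference, mutation_rate=0.05, indel_fraction=0.15,
               indel_extend_prob=0.3, rng=random.Random(7))
print(donor)
```

Under this reading, -R is a fraction of the mutations produced by -r, and -X is relative to an indel that has already started.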

