
  • Simulating FastQ libraries for BS-Seq or normal applications using Sherman

    We have just made available a FastQ simulation script, termed Sherman, for high-throughput bisulfite (or standard genomic) sequencing datasets. It can generate single-end or paired-end data in both nucleotide-/base-space (such as from the Illumina platform) and color-space (such as from the SOLiD platform).

    Sherman was designed to assess the influence of common problems observed in many Next-Gen Sequencing libraries on the primary analysis of BS-Seq data. Thus, it allows the user to introduce various 'contaminants' into the simulated libraries, including basecall errors (following an exponential decay model), SNPs, Illumina adapter fragments and more.

    These are the main features:
    • Generate any number of sequences of any length
    • Generate either completely random sequences or use genomic sequences (genome can be specified)
    • Generate single-end or paired-end data with variable fragment sizes
    • Adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually
    • Generate directional or non-directional libraries
    • Generate sequences in base-space or SOLiD color-space format
    • Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
    • Sequences can have constant Phred qualities throughout the read or can have quality scores following an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base- and color-space data)
    • Introduce a variable number of random SNPs into each read
    • Introduce a fixed amount of adapter sequence at the 3' end of all sequences
    • Introduce a variable amount of adapter sequence at various positions at the 3' end of reads
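To make the conversion-rate feature concrete, here is a minimal Python sketch of context-dependent bisulfite conversion. It is not Sherman's code (Sherman is a Perl script). The function name `bisulfite_convert` and its parameters are hypothetical, it only handles the top strand in base-space, and it assumes a C with no following base in the read is treated as CH.

```python
import random

def bisulfite_convert(seq, cg_rate, ch_rate, rng=None):
    """Convert C to T with a context-dependent probability.

    cg_rate and ch_rate are percentages (0-100). Hypothetical helper,
    not part of Sherman.
    """
    rng = rng or random.Random()
    seq = seq.upper()
    out = []
    for i, base in enumerate(seq):
        if base == "C":
            next_base = seq[i + 1] if i + 1 < len(seq) else None
            # Assumption: a C without a following base counts as CH here.
            rate = cg_rate if next_base == "G" else ch_rate
            # random() is in [0, 1), so 100 always converts and 0 never does.
            if rng.random() * 100 < rate:
                base = "T"
        out.append(base)
    return "".join(out)
```

For example, `bisulfite_convert("ACGTCCA", 0, 100)` leaves the CpG cytosine intact and converts the two CH cytosines.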

    While adding the paired-end option, we gave Sherman a major overhaul, so it should now run much quicker and be less memory-intensive. Initially, Sherman was designed to generate the kinds of library contaminations we were interested in, but if you have any ideas or suggestions which could be implemented (_easily_) we would love to hear from you.

    Sherman can be found at

  • #2
    identical qualities?

    Hi, this looks to be quite useful.

    I call like:

    ./Sherman -n 100000 -l 50 -cr 0 --colorspace --error_rate 1 --genome_folder ~/data/hg19/ --quality 30
    If I do the following, I get only 1 line of output:
    $ awk '(NR %2 == 0)' simulated_QV.qual | uniq
    i.e. there is no randomness in the quality values.
    Is this as intended?



    • #3
      Hi Brent,

      It is true that all reads have the same quality values at each position. This is modeled so that, on average, there is a certain chance (1% in your case) of incorporating a sequencing error, spread over the entire sequence. Randomness enters at the point where the error is actually introduced: for each bp individually, this is decided randomly against the Phred score (= probability that a basecall is wrong).

      Hope this isn't too confusing.
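The idea in the reply above can be sketched in a few lines of Python. This is an illustration only, not Sherman's implementation: the function names are hypothetical, it covers base-space only, and it takes the per-position qualities as given, because the thread does not describe how Sherman derives the decay curve from `--error_rate`. The only fixed fact used is the standard Phred relation, error probability = 10^(-Q/10).

```python
import random

def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def introduce_errors(seq, quals, rng=None):
    """Decide per base, against its Phred score, whether to miscall it.

    Hypothetical helper. Every read can share the same quals list, yet
    the resulting errors differ from read to read because of the draw.
    """
    rng = rng or random.Random()
    out = []
    for base, q in zip(seq, quals):
        if rng.random() < phred_to_error_prob(q):
            # Replace with a different base chosen uniformly at random.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)
```

With identical quality strings for all reads, the only per-read variation comes from the `rng.random()` draw at each position, which matches the behaviour Brent observed in the `.qual` file.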



      • #4
        Got it. Thanks for the explanation.


        • #5
          We have just released an updated version of Sherman (v0.1.1) which fixes an issue with the simulation of non-directional paired-end data and improves some other minor aspects.


          • #6
            We have updated Sherman (v0.1.2) so that reads which were simulated from an existing genome carry the genomic coordinates in the sequence ID. This makes it easier to determine the accuracy of different aligners.


            • #7
              We have released a new version of the bisulfite simulator Sherman (v0.1.4). This update fixes the following flaw:

              During context-specific cytosine conversion, Sherman has until now assumed that a C at the last position of a read was in CH context. This caused a weird blip in the M-bias plots (introduced into the Bismark methylation extractor as of v0.8.0) of simulated data at the end of read 1 and at the start of read 2 whenever that position was actually in CpG context. To account for this, Sherman now determines the sequence context of the last position in a read correctly.

              Sherman is available here: https://www.bioinformatics.babraham....jects/sherman/.
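One way to picture the fix described above: the context of a C at the last read position cannot be read from the read itself, so it has to be looked up in the genomic sequence one base further on. The Python sketch below illustrates that idea under stated assumptions. It is not Sherman's code, the function names are hypothetical, and it considers the top strand only.

```python
def cytosine_context(chrom_seq, pos):
    """Return 'CG' or 'CH' for the C at 0-based pos, or None if unknown.

    The following base is taken from the chromosome, not from the read.
    """
    if chrom_seq[pos].upper() != "C":
        raise ValueError("position %d is not a cytosine" % pos)
    if pos + 1 >= len(chrom_seq):
        return None  # C is the last base of the chromosome
    next_base = chrom_seq[pos + 1].upper()
    if next_base == "G":
        return "CG"
    if next_base in "ACT":
        return "CH"
    return None  # e.g. N: context cannot be determined

def last_base_context(chrom_seq, read_start, read_len):
    """Context of the final base of a read, looked up in the genome."""
    return cytosine_context(chrom_seq, read_start + read_len - 1)
```

For a read `AAC` taken from `AACGTT`, the read alone shows no base after the final C, while the genomic lookup reports CG.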


              • #8

                I'm using Sherman to generate sets of 32 bp genomic sequences for use as random control "libraries" to some transcriptome libraries our lab has made. I compare the distribution of these random "reads" in different annotated genomic categories (how many fall within genes, transposons, etc.) to that of the transcriptome libraries.

                So, a question about the --genome_folder option: How random are the sequences generated when this option is chosen? How are, for example, the different 32-mers chosen from the chromosome coordinates given?

                This is the command I use:

                ./Sherman -l 32 -n 51402229 --genome_folder /genome/ZmB73_Refgen/

                Just looking at two simulated files generated by using the identical command, I see they're not the same, but I just wanted to get a sense of how different they are.




                • #9
                  Hi Karl,
                  The starting position in the genome is determined by first concatenating all chromosomes into one long sequence, and then generating random numbers using the Perl rand() function. From this number Sherman then determines which chromosome and starting position it corresponds to, and extracts a 32 bp sequence at this position. So in essence it should be as 'random' as the Perl rand() function is. Hope this helps.
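The procedure described above can be sketched roughly as follows, in Python rather than Perl. This is an illustration, not Sherman's source: the function name `pick_read` is hypothetical, the chromosomes are never physically joined (cumulative lengths play the role of the concatenated sequence), and redrawing when a read would run past a chromosome end is an assumption, since the thread does not say how Sherman handles that case.

```python
import bisect
import random
from itertools import accumulate

def pick_read(chromosomes, read_len, rng=None):
    """chromosomes: list of (name, sequence). Returns (name, start, read)."""
    rng = rng or random.Random()
    lengths = [len(seq) for _, seq in chromosomes]
    if not lengths or max(lengths) < read_len:
        raise ValueError("no chromosome is long enough for this read length")
    ends = list(accumulate(lengths))  # end offset of each chromosome
    total = ends[-1]
    while True:
        # Analogous to int(rand($total)) in Perl: uniform over all bases.
        offset = int(rng.random() * total)
        idx = bisect.bisect_right(ends, offset)  # which chromosome
        start = offset - (ends[idx] - lengths[idx])  # position within it
        name, seq = chromosomes[idx]
        # Assumption: redraw if the read would cross the chromosome end.
        if start + read_len <= len(seq):
            return name, start, seq[start:start + read_len]
```

Because the draw is uniform over the concatenated length, longer chromosomes contribute proportionally more reads, and two runs with the same command differ only through the random number generator's state.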


                  • #10
                    Ah, I was guessing it might be the Perl rand() function generating the coordinates, but wanted to be sure. Thanks very much!

