We have just made available a FastQ simulation script, termed Sherman, for high-throughput bisulfite (or standard genomic) sequencing datasets. It can generate single-end or paired-end data in both nucleotide-/base-space (such as from the Illumina platform) and color-space (such as from the SOLiD platform).
Sherman was designed to assess the influence of common problems observed in many Next-Gen Sequencing libraries on the primary analysis of BS-Seq data. Thus, it allows the user to introduce various 'contaminants' into the simulated libraries, including basecall errors (following an exponential decay model), SNPs, Illumina adapter fragments and more.
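To make the exponential decay idea concrete, here is a minimal Python sketch of how decaying Phred scores can lead to basecall errors. It is not Sherman's actual code: the function names and the `start_quality`, `decay_rate` and `min_quality` parameters are assumptions chosen for illustration. Only the Phred relationship (error probability = 10^(-Q/10)) and the Phred+33 encoding are standard.

```python
import math
import random

def decayed_qualities(read_length, start_quality=40, decay_rate=0.03, min_quality=2):
    """Phred scores falling off exponentially along the read.

    All parameter values are hypothetical, not Sherman's defaults.
    """
    return [
        max(min_quality, round(start_quality * math.exp(-decay_rate * i)))
        for i in range(read_length)
    ]

def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(p_error)."""
    return 10 ** (-q / 10)

def encode_phred33(quals):
    """Sanger / Phred+33 encoding: ASCII character with code Q + 33."""
    return "".join(chr(q + 33) for q in quals)

def add_basecall_errors(seq, quals, rng):
    """Replace each base with a different base with probability p_error(Q)."""
    out = []
    for base, q in zip(seq, quals):
        if base in "ACGT" and rng.random() < phred_to_error_prob(q):
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(1)  # fixed seed so the simulation is reproducible
seq = "".join(rng.choice("ACGT") for _ in range(50))
quals = decayed_qualities(len(seq))
noisy = add_basecall_errors(seq, quals, rng)
quality_string = encode_phred33(quals)
```

Because the qualities drop towards the 3' end, most simulated miscalls land late in the read, which is the pattern this kind of model is meant to mimic.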
These are the main features:
• Generate any number of sequences of any length
• Generate either completely random sequences or sequences drawn from a genome (the genome can be specified)
• Generate single-end or paired-end data with variable fragment sizes
• Adjustable bisulfite conversion rate from 0 to 100%, either for all cytosines or for cytosines in CH and CG context individually
• Generate directional or non-directional libraries
• Generate sequences in base-space or SOLiD color-space format
• Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
• Sequences can have constant Phred qualities throughout the read, or quality scores that follow an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base-space and color-space data)
• Introduce a variable number of random SNPs into each read
• Introduce a fixed amount of adapter sequence at the 3' end of all sequences
• Introduce a variable amount of adapter sequence at various positions at the 3' end of reads
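As an illustration of the context-specific conversion rates listed above, the following Python sketch converts cytosines to thymines with separate probabilities for CG and CH context. It is a simplified example under stated assumptions, not Sherman's implementation: the function name and the `cg_rate` and `ch_rate` parameters are hypothetical, rates are given as fractions (0.0 to 1.0) rather than percentages, only the given strand is converted, and a cytosine at the very end of the read is treated as CH because its context is unknown.

```python
import random

def bisulfite_convert(seq, cg_rate=1.0, ch_rate=1.0, rng=None):
    """Convert C to T with context-dependent probability (hypothetical helper).

    cg_rate: conversion probability for cytosines followed by G
    ch_rate: conversion probability for all other cytosines
    """
    rng = rng or random.Random()
    out = []
    for i, base in enumerate(seq):
        if base == "C":
            in_cg = i + 1 < len(seq) and seq[i + 1] == "G"
            rate = cg_rate if in_cg else ch_rate
            # rng.random() is in [0, 1), so a rate of 1.0 always converts
            # and a rate of 0.0 never does.
            if rng.random() < rate:
                base = "T"
        out.append(base)
    return "".join(out)

# Fully methylated CG sites (no conversion), fully converted CH sites.
converted = bisulfite_convert("ACGTCCA", cg_rate=0.0, ch_rate=1.0)
```

Setting both rates to the same value corresponds to a single conversion rate for all cytosines, while intermediate values such as `ch_rate=0.98` mimic incomplete conversion.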
While adding the paired-end option, we gave Sherman a major overhaul, so it should now run much faster and use less memory. Sherman was initially designed to generate the kinds of library contamination we were interested in, but if you have any ideas or suggestions that could be implemented (_easily_), we would love to hear from you.
Sherman can be found at www.bioinformatics.bbsrc.ac.uk/projects/