
  • kerhard
    replied
    Ah, I was guessing it might be the Perl rand() function generating the coordinates, but wanted to be sure. Thanks very much!


  • fkrueger
    replied
    Hi Karl,
    The starting position in the genome is determined by first concatenating all chromosomes into one long sequence and then generating a random number with the Perl rand() function. Sherman then works out which chromosome and starting position this number corresponds to, and extracts the 32 bp sequence at that position. So in essence it should be as 'random' as the Perl rand() function is. Hope this helps.
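
The procedure described above can be sketched roughly as follows. This is an illustrative Python sketch, not Sherman's Perl code: the function names, the dictionary of chromosome lengths, and the choice to redraw when a read would run past the end of a chromosome are all assumptions made for the example.

```python
import bisect
import random


def build_offsets(chrom_lengths):
    """Return chromosome names, their start offsets in the concatenated
    sequence, and the total concatenated length."""
    names, starts, total = [], [], 0
    for name, length in chrom_lengths.items():
        names.append(name)
        starts.append(total)
        total += length
    return names, starts, total


def locate(offset, names, starts):
    """Map a 0-based offset in the concatenated sequence back to
    (chromosome, 0-based position within that chromosome)."""
    idx = bisect.bisect_right(starts, offset) - 1
    return names[idx], offset - starts[idx]


def random_read_start(chrom_lengths, read_length, rng=random):
    """Pick a random start so that the whole read fits on one chromosome."""
    if not any(length >= read_length for length in chrom_lengths.values()):
        raise ValueError("no chromosome is long enough for the read length")
    names, starts, total = build_offsets(chrom_lengths)
    while True:
        offset = rng.randrange(total)  # uniform over the concatenated genome
        chrom, pos = locate(offset, names, starts)
        # Assumption: redraw if the read would cross a chromosome boundary.
        if pos + read_length <= chrom_lengths[chrom]:
            return chrom, pos


if __name__ == "__main__":
    # Hypothetical chromosome lengths, for illustration only.
    lengths = {"chr1": 1000, "chr2": 500, "chr3": 250}
    print(random_read_start(lengths, 32))
```

Because the offset is drawn uniformly over the concatenated sequence, longer chromosomes receive proportionally more reads, which is what you would want for a genome-wide random control.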


  • kerhard
    replied
    Hello,

    I'm using Sherman to generate sets of 32 bp genomic sequences for use as random control "libraries" for some transcriptome libraries our lab has made. I compare the distribution of these random "reads" across different annotated genomic categories (how many fall within genes, transposons, etc.) to that of the transcriptome libraries.

    So, a question about the --genome_folder option: How random are the sequences generated when this option is chosen? How are, for example, the different 32-mers chosen from the chromosome coordinates given?

    This is the command I use:

    ./Sherman -l 32 -n 51402229 --genome_folder /genome/ZmB73_Refgen/

    Just looking at two simulated files generated by using the identical command, I see they're not the same, but I just wanted to get a sense of how different they are.

    Thanks,

    Karl


  • fkrueger
    replied
    We have released a new version of the bisulfite simulator Sherman (v.0.1.4). This update fixes the following flaw:

    During context-specific cytosine conversion, Sherman previously assumed that a C at the last position of a read was in CH context. This caused a weird blip in the M-bias plots (introduced into the Bismark methylation extractor as of v0.8.0) of simulated data at the end of read 1 and at the start of read 2 whenever that position was actually in CpG context. To account for this, Sherman now determines the sequence context of the last position in a read correctly.

    Sherman is available here: https://www.bioinformatics.babraham....jects/sherman/.
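
To illustrate the kind of flaw described in this post: if the context of a C is decided only from the bases inside the read, the last position has no downstream base to look at. The Python sketch below contrasts that with looking up the next base in the reference. Both function names are hypothetical, only the forward strand is handled, and this is not Sherman's actual code.

```python
def context_within_read(read, i):
    """Context of the C at read[i] using only bases inside the read.
    A C at the last position has no following base, so it falls back to CH."""
    if i + 1 < len(read) and read[i + 1] == "G":
        return "CG"
    return "CH"


def context_from_reference(reference, read_start, i):
    """Context of the same C, using the base that follows it in the reference.
    read_start is the 0-based position of the read in the reference."""
    next_base = reference[read_start + i + 1 : read_start + i + 2]
    return "CG" if next_base == "G" else "CH"


if __name__ == "__main__":
    reference = "AACGTT"      # toy reference, for illustration only
    read = reference[0:3]     # "AAC": the read ends on the C of a CpG
    print(context_within_read(read, 2))             # CH
    print(context_from_reference(reference, 0, 2))  # CG
```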


  • fkrueger
    replied
    We have updated Sherman (v0.1.2) so that reads which were simulated from an existing genome carry the genomic coordinates in the sequence ID. This makes it easier to determine the accuracy of different aligners.


  • fkrueger
    replied
    We have just released an updated version of Sherman (v0.1.1) which fixes an issue with the simulation of non-directional paired-end data and improves some other minor aspects.


  • brentp
    replied
    Got it. Thanks for the explanation.


  • fkrueger
    replied
    Hi Brent,

    It is true that all reads have the same quality values at each position. This is modeled so that on average there is a certain chance (1% in your case) of incorporating a sequencing error, spread over the entire sequence. A certain degree of randomness comes in at the point when an error is actually introduced, because this is decided randomly against the Phred score (= the probability that a basecall is wrong) for each bp individually.

    Hope this isn't too confusing.

    Best,
    Felix
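
A minimal Python sketch of the idea in this reply: the quality values are identical across reads, but whether a given base is miscalled is decided by a random draw against the error probability implied by its Phred score. The function names are made up for the example, and how Sherman itself derives the per-position scores from `--quality` and `--error_rate` is not shown here.

```python
import random


def phred_to_error_prob(q):
    """Standard Phred definition: P(error) = 10^(-Q/10), e.g. Q20 -> 1%."""
    return 10 ** (-q / 10)


def introduce_errors(seq, quals, rng=random):
    """Return (mutated sequence, number of errors). Each base is replaced by
    a different base with the probability implied by its Phred score."""
    out = []
    n_errors = 0
    for base, q in zip(seq, quals):
        if rng.random() < phred_to_error_prob(q):
            base = rng.choice([b for b in "ACGT" if b != base])
            n_errors += 1
        out.append(base)
    return "".join(out), n_errors


if __name__ == "__main__":
    read = "ACGTACGTACGTACGTACGT"
    quals = [20] * len(read)  # the same quality at every position
    for _ in range(3):
        # Identical input qualities, but the errors differ between draws.
        print(introduce_errors(read, quals))
```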


  • brentp
    replied
    identical qualities?

    Hi, this looks to be quite useful.

    I call like:

    Code:
    ./Sherman -n 100000 -l 50 -cr 0 --colorspace --error_rate 1 --genome_folder ~/data/hg19/ --quality 30
    If I do the following, I get only 1 line of output:
    Code:
    $ awk '(NR %2 == 0)' simulated_QV.qual | uniq
    i.e., there is no randomness in the quality values.
    Is this as intended?

    thanks,
    -Brent


  • Simulating FastQ libraries for BS-Seq or normal applications using Sherman

    We have just made available a FastQ simulation script, termed Sherman, for high-throughput bisulfite (or standard genomic) sequencing datasets. It can generate single-end or paired-end data in both nucleotide-/base-space (such as from the Illumina platform) and color-space (such as from the SOLiD platform).

    Sherman was designed to assess the influence of common problems observed in many Next-Gen Sequencing libraries on the primary analysis of BS-Seq data. Thus, it allows the user to introduce various 'contaminants' into the simulated libraries, including basecall errors (following an exponential decay model), SNPs, Illumina adapter fragments and more.

    These are the main features:
    • Generate any number of sequences of any length
    • Generate either completely random sequences or use genomic sequences (genome can be specified)
    • Generate single-end or paired-end data with variable fragment sizes
    • Adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually
    • Generate directional or non-directional libraries
    • Generate sequences in base-space or SOLiD color-space format
    • Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
    • Sequences can have constant Phred qualities throughout the read or can have quality scores following an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base- and color-space data)
    • Introduce a variable number of random SNPs into each read
    • Introduce a fixed amount of adapter sequence at the 3' end of all sequences
    • Introduce a variable amount of adapter sequence at various positions at the 3' end of reads

    While including the paired-end option, Sherman has received a major overhaul so it should now run much quicker and be less memory-intensive. Initially, Sherman was designed to generate the kinds of library contaminations we were interested in, but if you have any ideas or suggestions which could be implemented (_easily_) we would love to hear from you.

    Sherman can be found at www.bioinformatics.bbsrc.ac.uk/projects/
