I'm simulating reads that mimic PacBio data (long reads, roughly 1k-20k bases). Most of it is in good shape, but I'd like to know what quality values these reads typically have, how they are distributed, and what they mean. I'd be very thankful if someone could point me in the right direction or explain at least part of this!
- Which characters are used? (There seem to be many variants of FASTQ, and I can't find anything that covers PacBio's FASTQ directly.)
- Is an indel (or a substitution) more likely when the quality score is bad?
- How often do you generally see particular quality scores, and where is the line between "probably an error" and "most likely fine"?
- How likely is a bad quality score if the read itself is error-free?
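For context, here is a minimal sketch of how quality characters map to error probabilities under the standard Phred+33 (Sanger) FASTQ convention. Whether PacBio's FASTQ actually follows this convention is exactly what the first question asks, so the offset of 33 is an assumption, and the function names are only illustrative.

```python
# Assumption: Sanger-style Phred+33 encoding; not confirmed for PacBio output.
PHRED_OFFSET = 33


def char_to_phred(ch, offset=PHRED_OFFSET):
    """Convert one FASTQ quality character to its integer Phred score."""
    return ord(ch) - offset


def phred_to_error_prob(q):
    """Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=PHRED_OFFSET):
    """Return the per-base error probability for a FASTQ quality line."""
    return [phred_to_error_prob(char_to_phred(c, offset)) for c in qual]


if __name__ == "__main__":
    # Hypothetical quality string, not taken from real PacBio data.
    for ch, p in zip("!+5I", decode_quality_string("!+5I")):
        print(ch, char_to_phred(ch), p)
```

Under this definition a score of 10 means a 10% chance that the base call is wrong and a score of 20 means 1%, which is the scale I would like to calibrate the simulated reads against.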
Thank you in advance!