Originally posted by simonandrews
Question 0: Can you explain the graph a little? For example, in your illustration, the orange plot representing CCCCC has a peak at position 11 and a dip at position 7. Does that mean 100% of sequences have CCCCC at position 11 but only 20% have CCCCC at position 7?
Question 1: How exactly do you calculate the expected counts? What algorithm do you follow to generate a sequence set that has the same base content as the library?
Question 2: What is the difference between Obs/Exp Overall and Obs/Exp Max?
Question 3: I don't understand why your program will not capture this: "If you have a partial sequence which is appearing at a variety of places within your sequence then this won't be seen either by the per base content plot or the duplicate sequence analysis."
Question 4: Have you thought about correcting for multiple testing and perhaps applying some statistics to get a p-value? I understand that right now you are reporting any k-mer with an obs/exp ratio > 3.
Finally, is it possible for you to allow the user to look at 6-mers or even 7-mers (and make that a flexible option)? How would your statistics vary if you increased the k-mer length?
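To make Questions 1 and 4 concrete, here is a minimal sketch of one naive way an obs/exp ratio could be computed. It assumes each base is drawn independently with the library's overall base frequencies. That model is purely an assumption for illustration and is not necessarily what the program does. The function name `kmer_obs_exp` and its parameters are hypothetical.

```python
from collections import Counter


def kmer_obs_exp(reads, k=5):
    """Return {kmer: (observed, expected, ratio)} for every k-mer seen.

    ASSUMPTION: expected counts come from an independent-base model built
    from the library's overall base composition. This is an illustrative
    guess, not the tool's confirmed algorithm.
    """
    # Overall base frequencies across the whole library.
    base_counts = Counter(base for read in reads for base in read)
    total_bases = sum(base_counts.values())
    base_freq = {b: c / total_bases for b, c in base_counts.items()}

    # Observed count of each k-mer over every window in every read.
    observed = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total_windows = sum(observed.values())

    result = {}
    for kmer, obs in observed.items():
        # Probability of this k-mer if the bases were independent.
        prob = 1.0
        for base in kmer:
            prob *= base_freq[base]
        exp = prob * total_windows
        result[kmer] = (obs, exp, obs / exp)
    return result


if __name__ == "__main__":
    demo = kmer_obs_exp(["ACGTACGTAC", "CCCCCACGTA"], k=5)
    for kmer, (obs, exp, ratio) in sorted(demo.items()):
        print(kmer, obs, round(exp, 3), round(ratio, 2))
```

Under this sketch, a k-mer would be reported when its ratio exceeds 3, the threshold mentioned in Question 4. Moving to 6-mers or 7-mers only changes `k`, but the number of distinct k-mers grows as 4^k, so each one is backed by fewer observations.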
Once again, thank you so much for putting this out. I am sure it will be tremendously beneficial to the community at large.
Priyam