Hello!
I'm new to bioinformatics and have run into the following problem.
I'm looking for a fast tool, preferably multithreaded (or at least easy to parallelize), that can find all occurrences of a given short subsequence, allowing several mismatches, in every read of an NGS output. Most of the mismatches are substitutions, not indels.
For example, I have many (thousands of) sequences like this: "fdsjfsjdkdfjSPARjdskfjdskSPAMfddjskdsfjkSPAMdkdsfjk", and I need to get a matrix with the positions of all occurrences of "SPAM" in every sequence.
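To make the desired output concrete, here is a naive sketch of what I mean. It is a plain sliding-window comparison that counts substitutions only (Hamming distance, no indels). The function name find_approx and the max_mismatches parameter are just my own illustration, not any library's API, and this approach is of course too slow for real data:

```python
def find_approx(seq, pattern, max_mismatches):
    """Return 0-based start positions where pattern occurs in seq
    with at most max_mismatches substitutions (no indels)."""
    hits = []
    k = len(pattern)
    for start in range(len(seq) - k + 1):
        mismatches = 0
        for a, b in zip(seq[start:start + k], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window already has too many mismatches
        else:
            # loop finished without break: window is within the limit
            hits.append(start)
    return hits


read = "fdsjfsjdkdfjSPARjdskfjdskSPAMfddjskdsfjkSPAMdkdsfjk"
print(find_approx(read, "SPAM", 1))  # "SPAR" also counts with one mismatch allowed
```

Since each read is processed independently, something like this could presumably be spread over cores with multiprocessing.Pool.map, but I would much rather use an existing, optimized tool.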
I tried matchPattern from R/Bioconductor and the Python package fuzzysearch, but neither is particularly fast. "Motif" from Biopython does not seem well suited to my goal and is also not very fast.
Do you know of a suitable tool?
I believe one exists and I do not need to reinvent the wheel.
Thank you in advance.