**Brian Bushnell** · 09-24-2014, 10:16 AM

BBDuk might work for that. It can bin (or trim) reads by the presence and absence of specific kmers, like this:

bbduk.sh in=reads.fq outm=matching.fq out=unmatching.fq literal=ATGTTACGTCT k=11

However, it looks at all of the kmers in a read, not just the first one. Does this sequence ever occur in the middle of the reads, and if so, what would you want to do with those reads?
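To make the "all of the kmers in a read" point concrete, here is a minimal Python sketch of exact kmer scanning. It is only an illustration of the idea, not BBDuk's implementation: the function names and example reads are hypothetical, and it deliberately ignores reverse complements and any other BBDuk defaults.

```python
def kmers(seq, k):
    """Yield every k-length substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def contains_kmer(read, literal, k):
    """True if any kmer of the read equals any kmer of the literal (exact match only)."""
    targets = set(kmers(literal, k))
    return any(km in targets for km in kmers(read, k))


LITERAL = "ATGTTACGTCT"

# Hypothetical reads: literal at the start, in the middle, and absent.
read_start = "ATGTTACGTCT" + "GGGGCCCCAA"
read_middle = "GGGGC" + "ATGTTACGTCT" + "CCAA"
read_none = "GGGGCCCCAAGGGGCCCCAAG"

for name, read in [("start", read_start), ("middle", read_middle), ("none", read_none)]:
    print(name, contains_kmer(read, LITERAL, k=11))
```

Because every window of the read is checked, a read with the sequence in the middle is binned the same way as one with the sequence at the start.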

**vas72985** · 09-24-2014, 10:17 AM

The sequence most likely always occurs at the beginning of the read, but I suppose if something were to be slightly off with the prep, it could occur later in the read. In that case I would also like to pull out those reads. So if I understand correctly, this might work for that purpose?

**vas72985** · 09-24-2014, 10:21 AM

However, does BBDuk allow for paired data? I.e., if the kmer is in read 1, will it isolate the read 1 reads containing the kmer and also the read 2 mates of the reads it identifies?

**Brian Bushnell** · 09-24-2014, 10:55 AM

1) Yes, it will work perfectly, in this case.
2) BBDuk always keeps pairs together, as long as it knows the input is paired. For twin files, the command would be:

bbduk.sh in1=reads1.fq in2=reads2.fq outm1=matching1.fq outm2=matching2.fq out1=unmatching1.fq out2=unmatching2.fq literal=ATGTTACGTCT k=11

You can later trim the reads with the "ktrim=l" flag.
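As a rough picture of what "keeps pairs together" means for binning, here is a small Python sketch. It assumes a pair goes to the matching output when either mate contains the sequence, and it uses plain exact substring matching; `split_pairs` and the example reads are hypothetical and are not part of BBDuk.

```python
def split_pairs(pairs, literal):
    """Route each (read1, read2) pair as a unit.

    Assumption for this sketch: a pair counts as matching if EITHER mate
    contains the literal, so mates never end up in different outputs.
    """
    matching, unmatching = [], []
    for read1, read2 in pairs:
        if literal in read1 or literal in read2:
            matching.append((read1, read2))
        else:
            unmatching.append((read1, read2))
    return matching, unmatching


LITERAL = "ATGTTACGTCT"

# Hypothetical pairs: hit in read 1, hit in read 2, no hit.
pairs = [
    ("ATGTTACGTCTGGGG", "CCCCAAAATTTTGGG"),
    ("GGGGCCCCAAAATTT", "TTATGTTACGTCTAA"),
    ("GGGGCCCCAAAATTT", "CCCCAAAATTTTGGG"),
]

matching, unmatching = split_pairs(pairs, LITERAL)
print(len(matching), "matching pairs;", len(unmatching), "unmatching pairs")
```

The point of the sketch is only that the decision is made per pair, so both output file sets stay in sync.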

**vas72985** · 09-24-2014, 11:28 AM

So I tried this on a very small test data set where I artificially inserted a specific 12mer (GACCAGCTAGTG) and it found all of the ones that I artificially inserted (as well as one that I didn't realize was there to begin with), but it also output a few read pairs as matches that look like they shouldn't belong. For example the read pair below:

@IRIS:7:32:32:1772#0/1
AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA
+
aaaaaaaaaa`abab`a^aabaaa`ab`a`aaa`]a

@IRIS:7:32:32:1772#0/2
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
+
abbbaab^aaa``_aaa]`^_Z\X`W]^_a_TQ[]Z

Any ideas why it would be making some improper calls?

**vas72985** · 09-24-2014, 11:33 AM

Basically it found all 11 sequences that I know match the 12mer, but it also pulled out an additional 9 sequences that I have no idea why they are being called matching.

**Brian Bushnell** · 09-24-2014, 12:45 PM

Oh - by default, it looks for both a kmer AND its reverse complement, and ignores the middle letter of the kmer to increase sensitivity. To disable both behaviors, add these flags:

rcomp=f mm=f

(where rcomp means 'look for reverse complements of kmers' and mm means 'mask middle').

In this case, the reverse complement of GACCAGCTAGTG is CACTAGCTGGTC, and:

Code:

                     CACTAGCTGGTC
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
                           ^

...the middle base (marked with ^) is masked, so the kmer matches read 2.
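The same check can be reproduced in a few lines of Python. This is a sketch of the two ideas (reverse complement plus one ignored position), not BBDuk's code: the helper names are hypothetical, and the masked position is assumed to be index k // 2, chosen here because it corresponds to the base highlighted in the alignment above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def masked_equal(a, b, masked_index):
    """Compare two equal-length kmers, ignoring the base at masked_index."""
    return len(a) == len(b) and all(
        x == y for i, (x, y) in enumerate(zip(a, b)) if i != masked_index
    )


def find_hits(read, kmer, rcomp=True, mask_middle=True):
    """Return (offset, query) for every window of the read that matches."""
    k = len(kmer)
    queries = [kmer] + ([reverse_complement(kmer)] if rcomp else [])
    # Assumption: mask index k // 2; -1 means no position is ignored.
    masked_index = k // 2 if mask_middle else -1
    hits = []
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        for query in queries:
            if masked_equal(window, query, masked_index):
                hits.append((i, query))
    return hits


KMER = "GACCAGCTAGTG"
READ1 = "AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA"
READ2 = "GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA"

print(reverse_complement(KMER))
print(find_hits(READ2, KMER))                                  # both behaviors on
print(find_hits(READ2, KMER, rcomp=False, mask_middle=False))  # both off
```

With both behaviors on, read 2 hits at offset 21 against the reverse complement; with either one switched off, the hit disappears, which is consistent with the pair no longer being reported once rcomp=f mm=f is set.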

**vas72985** · 09-24-2014, 12:58 PM

Ah, brilliant. Now it works like a charm. Thanks for the help. I'll give it a try on my actual dataset whenever I get it back. Now if only you could make that happen faster

**vas72985** · 09-24-2014, 02:08 PM

Now I may be getting greedy, but is there an option that would allow me to set a threshold for mismatches between my kmer and the read sequence? For example, I know my kmer is exact, but I might want to allow 1 or 2 mismatches in the reads and still have them called "matched". Is there an option for this? It wasn't immediately obvious to me from the usage text.

Thanks

**Brian Bushnell** · 09-24-2014, 02:38 PM

Yes! It is possible. You can set "hdist=1" for one mismatch or "hdist=2" for 2 (that stands for Hamming distance). You can also allow indels but that shouldn't be necessary with Illumina data.
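For anyone unfamiliar with the term, Hamming distance is just the number of positions at which two equal-length sequences differ. A minimal Python sketch of matching within a mismatch threshold follows; the function names and test reads are hypothetical, it says nothing about how BBDuk implements hdist internally, and it ignores reverse complements and middle masking.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def matches_within(read, kmer, hdist):
    """True if any window of the read is within hdist mismatches of the kmer."""
    k = len(kmer)
    return any(
        hamming(read[i:i + k], kmer) <= hdist
        for i in range(len(read) - k + 1)
    )


KMER = "GACCAGCTAGTG"

# Hypothetical reads carrying the kmer with one and with two substitutions.
one_mismatch = "TTTT" + "GACCAGATAGTG" + "AAAA"
two_mismatch = "TTTT" + "GTCCAGATAGTG" + "AAAA"

for hdist in (0, 1, 2):
    print(hdist, matches_within(one_mismatch, KMER, hdist),
          matches_within(two_mismatch, KMER, hdist))
```

Raising the threshold catches more damaged copies of the sequence, at the cost of a higher chance of unrelated reads matching, especially with a short kmer.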

Pulling out paired reads containing a specific sequence in one pair