I am analyzing some published Tn-seq data. There appears to be residual transposon sequence in the reads, which prevents alignment. Unfortunately, this sequence is of variable length. Here are some example reads; the transposon sequence is underlined:
ATTCCGCTCTTCCGATCTAGTCATGCGCGGCCGCATAACATAACCGGTTGGATGATAAGTCCCCGGTCTATAT
ATTCTTCCCTACACGACGCTCTTCCGATCTAGTCATGCGCCGGAGCATTAGGTAACAGGTTGGATGATAAGTC
ATGCAGTCATGCAAATGATAACAGGTTGGATGATAAGTCCCCGGTCTATATTGAGAGTAACTACATTTACCGT
ATTCATCATTGCGGCAGTCATGCCTATTGTTCCTGGTGTAACAGGTTGGATGATAAGTCCCCGGTCTATATTG
I want to search for the last 8 bases of the transposon sequence (AGTCATGC) and remove any sequence to the left of it, while retaining the rest of the read and keeping any read that contains no transposon sequence. I've been trying bbduk using the following command:
Code:
bbduk.sh in=$in.fastq literal=AGTCATGC ktrim=l k=8 rcomp=f out=$out.fastq
But this seems to result in nearly every read being removed (the read count drops from over 4 million to about 20,000).
Can anyone help me with this issue? Thanks!
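For reference, the trimming I am after can be written out in plain Python, independent of bbduk. This is only a minimal sketch under a few assumptions that are mine: the 8-mer itself is removed along with everything to its left, the first (leftmost) occurrence in the read is the one that counts, the match must be exact, and the FASTQ uses strict 4-line records. The function names `trim_read` and `trim_fastq_lines` are made up for illustration.

```python
MOTIF = "AGTCATGC"  # last 8 bases of the transposon sequence, as given in the post


def trim_read(seq, qual, motif=MOTIF):
    """Remove everything up to and including the first exact match of motif.

    Reads that do not contain the motif are returned unchanged.
    Assumption: the motif itself should be removed as well.
    """
    pos = seq.find(motif)
    if pos == -1:
        return seq, qual
    cut = pos + len(motif)
    # Trim the quality string by the same amount so it stays in sync.
    return seq[cut:], qual[cut:]


def trim_fastq_lines(lines, motif=MOTIF):
    """Yield trimmed FASTQ lines from an iterable of input lines.

    Assumption: strict 4-line records (header, sequence, '+', quality).
    A trailing incomplete record is silently dropped.
    """
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        seq, qual = trim_read(seq.rstrip("\n"), qual.rstrip("\n"), motif)
        yield header.rstrip("\n") + "\n"
        yield seq + "\n"
        yield plus.rstrip("\n") + "\n"
        yield qual + "\n"
```

It could be run on a file like this (the file names are placeholders):

```python
with open("reads.fastq") as fin, open("reads.trimmed.fastq", "w") as fout:
    fout.writelines(trim_fastq_lines(fin))
```

Note that a read in which the motif sits at the very end comes out empty, so such reads may need to be filtered before alignment.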