Hi,
I have a library that was constructed using the restriction enzyme ApeKI and sequenced as paired-end 100 bp reads on an Illumina instrument.
The R1 reads (the first read of each pair) consist of a barcode followed by the sequence and look fine. No barcode was added to the R2 reads, so every R2 read should start with the same sequence, C[A/T]GC. In most cases that is true, but about 15% of the reads have an 'N' at position 1 and about 30% have an 'N' at position 4. In my filtering steps I usually remove every read that contains at least one 'N', which currently means losing more than 30% of the reads.
The other reads look good and map against the reference without problems, so there does not seem to be a general problem with the sequencing (otherwise I would not expect the remaining reads to match the reference).
Since I have not yet seen a publication that describes something like this, my question is:
Can I replace the N's at positions 1 and 4 before quality filtering, since I know what the sequence should be? Do you see any problems with that approach?
Some programs need the restriction enzyme cut site for the analysis, which is why I do not want to simply trim the first 4 nucleotides.
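
To make the idea concrete, here is a minimal Python sketch of the replacement I have in mind. It assumes plain four-line FASTQ records and that positions 1 and 4 should both be 'C' (from C[A/T]GC). The function names `repair_cutsite` and `repair_fastq` are made up for illustration. As a precaution, an 'N' is only replaced when the other bases of the first four are consistent with the expected cut site, so reads that start with something else entirely are left alone.

```python
import re

# Expected start of an R2 read: C, then A or T, then G, then C.
# 'N' is tolerated only at positions 1 and 4 (assumption: these are the
# only two positions that should be repaired).
CUTSITE_WITH_N = re.compile(r"^[CN][AT]G[CN]")


def repair_cutsite(seq):
    """Return seq with 'N' at position 1 and/or 4 replaced by 'C'.

    The read is changed only if its first four bases are otherwise
    consistent with C[AT]GC; any other read is returned unchanged.
    """
    if CUTSITE_WITH_N.match(seq) is None:
        return seq
    return "C" + seq[1:3] + "C" + seq[4:]


def repair_fastq(lines):
    """Yield FASTQ lines, repairing the sequence line of each 4-line record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the bases
            body = line.rstrip("\r\n")
            # Re-attach whatever line ending the original line had.
            yield repair_cutsite(body) + line[len(body):]
        else:
            yield line
```

The quality line is deliberately left untouched in this sketch, so a later quality filter would still see whatever score the sequencer assigned to the repaired positions. Whether those scores should be changed as well is part of what I am unsure about.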