I'm trying to trim/filter low quality reads from paired-end exome-seq data, using BBDuk.
I used the command:
```
for ea in $files;
do
R1="$ea"
R2=$(echo $R1 | sed "s/R1/R2/")
/home/shared/programs/bbmap/bbduk.sh -Xmx1g in1=$R1 in2=$R2 \
out1="$(echo $ea | sed s/.fastq.gz/_trimmed_filtered.fastq.gz/)" \
out2="$(echo $(echo $ea | sed s/R1/R2/) | sed s/.fastq.gz/_trimmed_filtered.fastq.gz/)" \
ref=/home/shared/programs/bbmap/resources/adapters.fa \
t=10 ktrim=r k=23 kmin=11 hdist=1 maq=10 minlen=60 tpe tbo
done;
```
After running fastqc on the output of this, I'm seeing that R2 files have some reads with low quality scores (see per sequence quality score), and the overrepresented sequence "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".
Looking at these reads in the fastq:
```
@HISEQ:525:HMFYNBCXX:1:1101:1380:2167 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@HISEQ:525:HMFYNBCXX:1:1101:1276:2219 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@HISEQ:525:HMFYNBCXX:1:1101:1238:2328 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```
Shouldn't these reads have been filtered out?
Any help here would be much appreciated.
I used the command:
```
for ea in $files;
do
R1="$ea"
R2=$(echo $R1 | sed "s/R1/R2/")
/home/shared/programs/bbmap/bbduk.sh -Xmx1g in1=$R1 in2=$R2 \
out1="$(echo $ea | sed s/.fastq.gz/_trimmed_filtered.fastq.gz/)" \
out2="$(echo $(echo $ea | sed s/R1/R2/) | sed s/.fastq.gz/_trimmed_filtered.fastq.gz/)" \
ref=/home/shared/programs/bbmap/resources/adapters.fa \
t=10 ktrim=r k=23 kmin=11 hdist=1 maq=10 minlen=60 tpe tbo
done;
```
After running fastqc on the output of this, I'm seeing that R2 files have some reads with low quality scores (see per sequence quality score), and the overrepresented sequence "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".
Looking at these reads in the fastq:
```
@HISEQ:525:HMFYNBCXX:1:1101:1380:2167 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@HISEQ:525:HMFYNBCXX:1:1101:1276:2219 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@HISEQ:525:HMFYNBCXX:1:1101:1238:2328 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```
Shouldn't these reads have been filtered out?
Any help here would be much appreciated.