Sometimes trimming adapters from two paired read files (with, say, cutadapt) results in unequal trimming for the members of any given pair. Therefore if you subsequently remove short inserts from both readfiles independently, it can throw the pairs out of sync as soon as it removes one but not the other member of a pair.
The following script "nixshorts_PE" will remove a read pair from two paired-end read fastqs when at least one of the two members are below a certain length. The same method can be used for removing short reads from a single-end file, with some adjustments. Just thought some of you might find this handy.
Please post improvements to this script if you think of them. Thanks!
#!/bin/bash
# This removes reads of a below a certain length from paired read files in fastq format (e.g., R1 and R2 from the same library)
# Usage: $ bash nixshorts_PE [input fastqR1] [input fastqR2] [minimum read length to keep]
# PROCESS:
#1. Start with inputs
R1fq=$1
R2fq=$2
minlen=$3
#2. Find all entries with read length less than minimum length and print line numbers, for both R1 and R2
awk -v min=$minlen '{if(NR%4==2) if(length($0)<min) print NR"\n"NR-1"\n"NR+1"\n"NR+2}' $R1fq > temp.lines1
awk -v min=$minlen '{if(NR%4==2) if(length($0)<min) print NR"\n"NR-1"\n"NR+1"\n"NR+2}' $R2fq >> temp.lines1
#3. Combine both line files into one, sort them numerically, and collapse redundant entries
sort -n temp.lines1 | uniq > temp.lines
rm temp.lines1
#4. Remove the line numbers recorded in "lines" from both fastqs
awk 'NR==FNR{l[$0];next;} !(FNR in l)' temp.lines $R1fq > $R1fq.$minlen
awk 'NR==FNR{l[$0];next;} !(FNR in l)' temp.lines $R2fq > $R2fq.$minlen
rm temp.lines
#5. Conclude
echo "Pairs shorter than $minlen bases removed from $R1fq and $R2fq"
The following script "nixshorts_PE" will remove a read pair from two paired-end read fastqs when at least one of the two members are below a certain length. The same method can be used for removing short reads from a single-end file, with some adjustments. Just thought some of you might find this handy.
Please post improvements to this script if you think of them. Thanks!
#!/bin/bash
# This removes reads of a below a certain length from paired read files in fastq format (e.g., R1 and R2 from the same library)
# Usage: $ bash nixshorts_PE [input fastqR1] [input fastqR2] [minimum read length to keep]
# PROCESS:
#1. Start with inputs
R1fq=$1
R2fq=$2
minlen=$3
#2. Find all entries with read length less than minimum length and print line numbers, for both R1 and R2
awk -v min=$minlen '{if(NR%4==2) if(length($0)<min) print NR"\n"NR-1"\n"NR+1"\n"NR+2}' $R1fq > temp.lines1
awk -v min=$minlen '{if(NR%4==2) if(length($0)<min) print NR"\n"NR-1"\n"NR+1"\n"NR+2}' $R2fq >> temp.lines1
#3. Combine both line files into one, sort them numerically, and collapse redundant entries
sort -n temp.lines1 | uniq > temp.lines
rm temp.lines1
#4. Remove the line numbers recorded in "lines" from both fastqs
awk 'NR==FNR{l[$0];next;} !(FNR in l)' temp.lines $R1fq > $R1fq.$minlen
awk 'NR==FNR{l[$0];next;} !(FNR in l)' temp.lines $R2fq > $R2fq.$minlen
rm temp.lines
#5. Conclude
echo "Pairs shorter than $minlen bases removed from $R1fq and $R2fq"