I'd like to introduce a new member of the BBMap package, FilterByTile. It's intended to increase the quality of libraries without incurring bias, or to help salvage libraries with major positional quality problems (like flow-cell bubbles).
*Overview*
The quality of Illumina reads is dependent on location in a flowcell. Areas that are in poor optical focus, have weaker circulation, or contain air bubbles can have very low-quality reads that nonetheless pass Illumina’s filter criteria. While these reads usually have below-average quality scores, it requires very aggressive quality-filtering to remove all of the reads with positionally-related low quality. Aggressive quality filtering and trimming can, in turn, harm downstream analysis because sequence quality is also sequence-dependent; thus, aggressive filtering can incur bias against extreme-GC portions of the genome, or specific motifs. This may yield poor assemblies, incorrect ploidy calls, bad expression quantification, and similar problems.
FilterByTile is designed to filter low-quality reads based on positional information. By removing only a small fraction of reads (those in the lowest-quality areas of the flowcell), the overall quality of the data can be increased without incurring sequence-specific bias. The default settings of FilterByTile typically remove on the order of 2% of the reads, while reducing the overall error rate by substantially more than 2% (on the order of 10%). Essentially, it gets rid of the worst of the worst.
FilterByTile was originally developed after observing spikes in the kmer-uniqueness plot used to calculate library complexity, in what should have been a monotonically-declining exponential decay curve (generated by bbcalcunique.sh); these spikes corresponded to low-quality locations on the flow-cell. Interestingly, the spikes often have a regular period, indicating a structured pattern such as flow-cell edges, tile edges, or a “streak”. The initial goal of FilterByTile was simply to eliminate these spikes to allow better estimation of library size and complexity, but it can be useful for generally improving library quality as well.
*Notes*
How it works:
Illumina read names contain information about each cluster’s lane, tile, and X,Y coordinates. FilterByTile scans all reads in the file and calculates the average quality score for a given position. Additionally, the average kmer-uniqueness rate is calculated by position; for data with sufficient depth, this can be used as a proxy for error-rate, allowing filtering of data with inaccurate quality scores.
To calculate a useful average quality for a position, sufficient reads are needed. So, reads are aggregated by position into rectangular “micro-tiles”; these micro-tiles are iteratively expanded until the average micro-tile contains at least X reads (default 800). Then, the averages are calculated on a per-micro-tile basis, standard deviations are calculated, and all reads in any micro-tile at least Y standard deviations worse than normal are discarded together. Thus, smaller micro-tiles allow more precise positional filtering, but larger micro-tiles yield more accurate quality-score averages. Arbitrary shapes such as circles outlining bubbles would be optimal, but there are no plans for this.
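To make the micro-tile idea concrete, here is a minimal Python sketch of the steps described above. It is not FilterByTile's actual code: the function names, the starting tile size, and the doubling step are assumptions made for illustration, and it looks only at average quality, ignoring kmer uniqueness.

```python
import statistics
from collections import defaultdict


def choose_tile_size(reads, start=100, min_reads=800):
    """Grow square micro-tiles until the average tile holds >= min_reads reads.

    reads: list of (x, y, mean_quality) tuples with non-negative coordinates.
    The starting size and the doubling step are illustrative assumptions.
    """
    size = start
    while True:
        tiles = {(x // size, y // size) for x, y, _ in reads}
        # Stop once tiles are populated enough, or nothing is left to merge.
        if len(tiles) <= 1 or len(reads) / len(tiles) >= min_reads:
            return size
        size *= 2


def flag_bad_microtiles(reads, tile_size, deviations=2.0):
    """Return the micro-tiles whose average quality is more than
    `deviations` standard deviations below the mean of all micro-tiles."""
    totals = defaultdict(lambda: [0.0, 0])  # tile -> [quality sum, read count]
    for x, y, quality in reads:
        entry = totals[(x // tile_size, y // tile_size)]
        entry[0] += quality
        entry[1] += 1
    if not totals:
        return set()
    averages = {tile: total / count for tile, (total, count) in totals.items()}
    values = list(averages.values())
    cutoff = statistics.fmean(values) - deviations * statistics.pstdev(values)
    return {tile for tile, avg in averages.items() if avg < cutoff}
```

Every read that falls in a flagged micro-tile would then be discarded together, regardless of its individual quality score.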
How and when to use, or not:
FilterByTile is applicable to any Illumina HiSeq, MiSeq, or NextSeq data. However, it depends on large volumes of data for statistics; it’s useless to run it on a set of 4000 reads demultiplexed from a much larger run. In that case, it would be better to use the “dump” flag to dump the statistics from all libraries in the run together, then use the “indump” flag to filter the libraries individually. That way, quality statistics gathered from all reads will be applied to each individual library.
The filtering should be beneficial in most cases, particularly when you want to salvage a library that obviously had bubbles or low-flow streaks in the lane, but also for libraries with no dramatic positional quality issues. However, there are some cases, such as complex metagenomes, in which more coverage is strictly beneficial, so throwing away even low-quality reads is a bad idea. In these cases, or any situation where very low coverage is expected, filtering will often lead to inferior results. With high coverage, FilterByTile should be strictly beneficial.
Read names:
FilterByTile depends on read headers to identify flowcell location. It has been validated with HiSeq, MiSeq, and NextSeq data, but different Illumina demultiplexing/base-calling software versions have different naming conventions, so please contact me if you see Illumina names that it can’t parse. Renamed reads (such as those in the SRA) probably won’t work.
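As an illustration of what “parsing the header” involves, here is a small Python sketch that pulls the lane, tile, and X,Y coordinates out of a read name. It assumes the colon-delimited Casava 1.8+ layout (instrument:run:flowcell:lane:tile:x:y, followed by a space and the read/filter/index fields); the function name is hypothetical, and FilterByTile’s own parser may handle more variants than this.

```python
def parse_illumina_header(header):
    """Return (lane, tile, x, y) from a Casava 1.8+ style read name,
    or None if the name does not match that layout."""
    parts = header.lstrip("@").split()
    if not parts:
        return None
    # The first whitespace-delimited token carries the positional fields.
    fields = parts[0].split(":")
    if len(fields) < 7:
        return None
    try:
        lane, tile, x, y = (int(value) for value in fields[3:7])
    except ValueError:
        return None
    return lane, tile, x, y
```

A renamed read such as “@SRR000001.1” has no colon-delimited position fields, so the function returns None for it, which is why renamed reads generally cannot be filtered by position.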
Memory:
FilterByTile should not need too much memory, but if it does run out, it will generally be due to calculating kmer uniqueness for a large genome. In this case, the “usekmers=f” flag will ignore kmers and use only quality scores, so it won’t run out of memory.
*Usage Examples*
Single-ended or paired/interleaved files:
Code:
filterbytile.sh in=reads.fq.gz out=filtered.fq.gz
Paired reads in twin files:
Code:
filterbytile.sh in1=r1.fq in2=r2.fq out1=filtered1.fq out2=filtered2.fq
Filtering using a statistical profile from multiple libraries:
Code:
cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell
Filtering aggressively (when you know there’s a serious problem):
Code:
filterbytile.sh in=x.fq out=y.fq ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5
Disabling kmer uniqueness to increase speed and decrease memory usage:
Code:
filterbytile.sh in=x.fq out=y.fq usekmers=f