The Galaxy team would like to highlight several of the FASTQ manipulation tools currently available in Galaxy (http://usegalaxy.org). Galaxy provides a free and open environment for NGS analysis (previously announced at: http://seqanswers.com/forums/showthread.php?t=4441).
As always, we encourage feature requests, comments/suggestions and bug reports ([email protected]). Additional information about the toolset, including a set of screencasts, can be viewed at: http://main.g2.bx.psu.edu/u/dan/p/fastq.
All of the following tools, unless mentioned otherwise, are found under the NGS: QC and manipulation tool section within Galaxy:
1. Make FASTQ from FASTA and Quality Score files
Some sequencing technologies will produce separate files containing sequences and quality scores (e.g. 454). These two separate files can be merged together to create a single FASTQ file. Specifying a quality score file is optional and, when not specified, quality score values will be filled with the maximal allowed quality value.
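As a rough illustration of the merge described above, the following Python sketch builds one FASTQ record from a sequence and an optional list of decimal quality scores. It is not Galaxy's implementation: the function name `make_fastq_record` is hypothetical, and the sketch assumes the target is the Sanger variant (ASCII offset 33, maximum score 93).

```python
SANGER_OFFSET = 33  # ASCII offset of the Sanger (Phred+33) variant
SANGER_MAX_Q = 93   # highest score representable in Sanger FASTQ


def make_fastq_record(name, sequence, scores=None):
    """Build one four-line Sanger FASTQ record (hypothetical helper)."""
    if scores is None:
        # No quality file supplied: fill with the maximal allowed value.
        scores = [SANGER_MAX_Q] * len(sequence)
    if len(scores) != len(sequence):
        raise ValueError("sequence and quality score lengths differ")
    quality = "".join(chr(q + SANGER_OFFSET) for q in scores)
    return "@%s\n%s\n+\n%s\n" % (name, sequence, quality)


if __name__ == "__main__":
    print(make_fastq_record("read1", "ACGT", [40, 30, 20, 10]), end="")
    print(make_fastq_record("read2", "ACGT"), end="")
```

When no scores are given, every base receives the `~` character, which encodes 93 in the Sanger variant.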
2. FASTQ Groomer
The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. The data created by this tool is guaranteed to conform to the target variant specified by the user, including the enforcement of quality score minimums and maximums. After grooming, the user is presented with some information about the input, such as ASCII character and decimal value ranges and a list of FASTQ variants for which the input data is actually valid. Although the output created by this tool is always valid for the target variant, selecting the wrong input variant can produce score values that do not reflect those intended by the sequencing technology. Users should use the provided summary information as a sanity check before continuing with their analysis.
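To make the conversion and the sanity check concrete, here is a minimal Python sketch, assuming a conversion between two Phred-scaled variants that differ only in ASCII offset (for example Illumina 1.3+ at offset 64 to Sanger at offset 33). The function names are hypothetical, this is not the Groomer's code, and the Solexa variant, which uses a different score scale, is not handled.

```python
def groom_quality(quality, input_offset=64, output_offset=33, min_q=0, max_q=93):
    """Re-encode a quality string, clamping scores to the target range."""
    scores = [ord(char) - input_offset for char in quality]
    # Enforce the minimum and maximum of the target variant.
    clamped = [min(max(score, min_q), max_q) for score in scores]
    return "".join(chr(score + output_offset) for score in clamped)


def ascii_range(quality):
    """Return the lowest and highest ASCII codes seen in the input."""
    codes = [ord(char) for char in quality]
    return min(codes), max(codes)


if __name__ == "__main__":
    illumina = "hhhB"                # decimal scores 40, 40, 40, 2 at offset 64
    print(ascii_range(illumina))     # codes below 64 would rule out offset 64
    print(groom_quality(illumina))
```

If Sanger-encoded input is wrongly declared as offset 64, the output is still a valid quality string, but the scores are shifted down by 31 and clamped at 0, which is the kind of silent error the summary information helps catch.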
3. Quality Statistics
As quality scores can vary along the length of sequencing reads, determining how to trim and filter read data involves calculating summary statistics on a per column basis. The FASTQ Summary Statistics by column tool accomplishes this task. The output of this tool contains read counts, minimums, maximums, sums, means, quartiles with ranges, outliers and nucleotide counts for each base column in a FASTQ file. This statistical summary can be graphed by using the Boxplot tool, found under the Graph/Display Data tool section.
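The per-column calculation can be sketched in Python as follows. This is an illustration only: the function name `column_statistics` is hypothetical, Sanger encoding (offset 33) is assumed, and the sketch reports only counts, minimums, maximums, sums, means and medians, leaving out the quartile ranges, outliers and nucleotide counts that the tool also produces.

```python
from statistics import mean, median


def column_statistics(qualities, offset=33):
    """Summarize decoded quality scores for each base column."""
    columns = []
    for quality in qualities:
        for index, char in enumerate(quality):
            if index == len(columns):
                columns.append([])  # first read long enough to reach this column
            columns[index].append(ord(char) - offset)
    return [
        {
            "column": index + 1,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "sum": sum(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for index, scores in enumerate(columns)
    ]


if __name__ == "__main__":
    for row in column_statistics(["II5", "I5"]):
        print(row)
```

Reads of different lengths are handled by letting the count shrink for later columns, so the count column also shows how many reads reach each position.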
4. Read Trimmer
To prevent otherwise high-quality reads from being rejected during quality filtering or from influencing the mapping or assembly process, it can be beneficial to trim bases from the poor-quality ends of reads. The FASTQ Trimmer by column tool allows trimming either end of a set of reads by using absolute offsets or offsets expressed as a percentage of read length. Offsets begin at 0 for each end and increase towards the opposing end of the read. For example, to trim the outer 3 bases from each end of a 36-base sequencing read, a user can specify absolute 5’ and 3’ offsets of 3 or percentage-based offsets of 8.33 (0.0833 * 36 = 2.9988, rounded to the nearest integer = 3).
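The offset arithmetic in the example above can be checked with a short Python sketch. The function name `trim_read` and its parameters are hypothetical, and the rounding uses Python's built-in `round`, which resolves exact halves to the nearest even integer and may therefore differ from the tool in edge cases.

```python
def trim_read(sequence, quality, left=0, right=0, percent=False):
    """Trim `left` bases from the 5' end and `right` bases from the 3' end."""
    length = len(sequence)
    if percent:
        # Convert percentages of read length into absolute base counts.
        left = int(round(length * left / 100.0))
        right = int(round(length * right / 100.0))
    end = length - right
    if left >= end:
        return "", ""  # offsets overlap, so nothing is left of the read
    return sequence[left:end], quality[left:end]


if __name__ == "__main__":
    sequence = "AAA" + "C" * 30 + "GGG"
    quality = "###" + "I" * 30 + "###"
    print(trim_read(sequence, quality, 3, 3))
    print(trim_read(sequence, quality, 8.33, 8.33, percent=True))
```

Both calls keep the same 30 central bases, since 8.33% of 36 rounds to 3.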
5. Quality Filter
The Filter FASTQ reads by quality score and length tool allows filtering by minimum and maximum read lengths and by minimum and maximum quality score values over the entire read, while allowing a configurable number of bases to fall outside the score range. Complex filters can also be constructed that allow the user to set offsets, just like with the trimmer tool, to use as bounds for performing a selected aggregation action whose result is compared to a user-specified value. Any number of complex filters can be designed and applied to a set of sequencing reads. For example, to only include reads which have no quality score values less than 28 in the first half of a read, a user can use percentage-based offsets of 0 and 50, select the min score aggregation and the greater-than-or-equal-to operator (>=), and set a quality score threshold of 28.
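The complex filter in the example above might be modeled as below. This is a sketch under stated assumptions rather than the tool's logic: `passes_filter` and its parameters are hypothetical names, Sanger encoding (offset 33) is assumed, offsets are interpreted as in the trimmer (measured inward from each end), and a read whose window is empty is rejected.

```python
import operator


def passes_filter(quality, left_pct, right_pct, aggregate, compare, threshold,
                  offset=33):
    """Aggregate the scores inside the offset window and compare to a threshold."""
    scores = [ord(char) - offset for char in quality]
    length = len(scores)
    left = int(round(length * left_pct / 100.0))
    right = int(round(length * right_pct / 100.0))
    window = scores[left:length - right]
    if not window:
        return False  # assumption: an empty window fails the filter
    return compare(aggregate(window), threshold)


if __name__ == "__main__":
    # Keep reads whose minimum score in the first half is at least 28.
    for quality in ("IIII5555", "5IIIIIII"):
        print(quality, passes_filter(quality, 0, 50, min, operator.ge, 28))
```

Passing the aggregation (`min`, `max`, `sum`) and the comparison (`operator.ge`, `operator.lt`, and so on) as arguments mirrors the way a filter is assembled from independent choices, and several such filters can be chained with `all(...)`.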
6. FASTQ Manipulation
Highly configurable complex manipulations can be performed on selected FASTQ reads by using the Manipulate FASTQ reads on various attributes tool. This tool allows the user to define a set of matching criteria that select the reads in a FASTQ file on which to perform a set of manipulations; any number of match directives can be defined, and a read must match every directive to be considered for manipulation. Matching is currently limited to user-specified regular expressions on sequence identifier/name, sequence content and quality score strings, with defaults set to match all (.*); however, additional matching and manipulation options can be easily implemented as needed. A read that does not match is transferred to the output unmodified. Reads which pass all matching criteria are subjected to any number of user-specified manipulations, which can act upon sequence identifier/name, sequence content or quality score strings. Beyond allowing the user to remove matching reads or to perform string translations on any of these attributes, additional manipulations are available for sequence content, including: reverse complementing, reversing (without complementing), complementing (without reversing), trimming, in silico transcription of DNA to RNA and vice versa, as well as changing the adapter base within color space sequences. Additionally, separate tools exist which can convert FASTQ files to and from a tabular format; this allows FASTQ data to be modified using any of the powerful text manipulation tools which are prepackaged with Galaxy.
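The match-then-manipulate flow can be sketched in Python as follows. All names here (`manipulate`, `reverse_complement`, `dna_to_rna`) are hypothetical, reads are represented as simple (name, sequence, quality) tuples, and `re.search` is an assumption about how patterns are applied; removal of matching reads and color space handling are not covered.

```python
import re

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(name, sequence, quality):
    """Reverse complement the bases and reverse the qualities to keep them aligned."""
    return name, sequence.translate(COMPLEMENT)[::-1], quality[::-1]


def dna_to_rna(name, sequence, quality):
    """In silico transcription: replace T with U, leaving qualities untouched."""
    return name, sequence.replace("T", "U").replace("t", "u"), quality


def manipulate(read, matchers, manipulations):
    """Apply every manipulation to a read only if it satisfies every matcher."""
    fields = dict(zip(("name", "sequence", "quality"), read))
    for field, pattern in matchers:
        if re.search(pattern, fields[field]) is None:
            return read  # unmatched reads pass through unmodified
    for manipulation in manipulations:
        read = manipulation(*read)
    return read


if __name__ == "__main__":
    matchers = [("name", r"/1$"), ("sequence", r".*")]
    print(manipulate(("r1/1", "AACG", "II5+"), matchers, [reverse_complement]))
    print(manipulate(("r1/2", "AACG", "II5+"), matchers, [reverse_complement]))
```

Only the first read matches the name pattern, so only it is reverse complemented; the second is returned unchanged.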
7. Paired-End Read Splitting and Joining
FASTQ-formatted paired-end sequencing data can come in two common forms: either a separate file is used for each paired-end component, or a single FASTQ file is used in which the two ends of each paired-end read have been concatenated to form a single entry. Two tools exist to facilitate the use of this data: FASTQ Joiner on paired end reads and FASTQ Splitter on joined paired end reads. The Joiner tool takes two separate FASTQ files that contain paired-end reads and creates a single file. The Splitter tool does the opposite: it takes a single FASTQ file and splits each read in half, creating two separate FASTQ files. When splitting, an identifier suffix is added to each paired end; when joining, these differences in identifiers are taken into account.
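A minimal Python sketch of the two operations follows, with reads again represented as (name, sequence, quality) tuples. The function names are hypothetical, and the `/1` and `/2` identifier suffixes are an assumption made for illustration; the tools' exact suffix handling is not specified here.

```python
def strip_suffix(name, suffix):
    """Remove a trailing pair suffix such as '/1' if it is present."""
    return name[:-len(suffix)] if name.endswith(suffix) else name


def join_pair(read1, read2):
    """Concatenate the two ends of a pair into a single entry."""
    name1, sequence1, quality1 = read1
    name2, sequence2, quality2 = read2
    base1 = strip_suffix(name1, "/1")
    base2 = strip_suffix(name2, "/2")
    if base1 != base2:
        raise ValueError("identifiers do not belong to the same pair")
    return base1, sequence1 + sequence2, quality1 + quality2


def split_joined(read):
    """Split a joined entry in half and add a suffix to each end."""
    name, sequence, quality = read
    if len(sequence) % 2:
        raise ValueError("joined read has an odd length")
    half = len(sequence) // 2
    return (
        (name + "/1", sequence[:half], quality[:half]),
        (name + "/2", sequence[half:], quality[half:]),
    )


if __name__ == "__main__":
    joined = join_pair(("r1/1", "ACGT", "IIII"), ("r1/2", "TTGA", "5555"))
    print(joined)
    print(split_joined(joined))
```

Splitting in half only recovers the original pair when both ends have the same length, which is why the sketch rejects odd-length entries.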