I recently completed a C++ library and command-line utility which can be used to index and rapidly extract sequences and subsequences from FASTA files.
I wanted to publicize the work here as it has been quite useful to me and a number of my fellow lab members.
The utilities can be obtained using git via the project repository on GitHub:
Please let me know if obtaining the repository via git is a problem, and I will make a tarball release.
What follows is the FastaHack README:
fastahack --- *fast* FASTA file indexing, subsequence and sequence extraction
Author: Erik Garrison <[email protected]>, Marth Lab, Boston College
Date: May 7, 2010
Overview:
fastahack is a small application for indexing and extracting sequences and
subsequences from FASTA files. The included Fasta.cpp library provides a FASTA
reader and indexer that can be embedded into applications which would benefit
from directly reading subsequences from FASTA files. The library automatically
handles index file generation and use.
Features:
- FASTA index (.fai) generation for FASTA files
- Sequence extraction
- Subsequence extraction
- Sequence statistics (TODO: currently only length is provided)
Sequence and subsequence extraction use fseek64 to jump directly to the
required byte offset, so extraction is fast and does not require loading the
file into RAM. This makes fastahack a useful tool for bioinformaticists who
need to quickly extract many subsequences from a reference FASTA sequence.
Notes:
The index files generated by this system should be numerically equivalent to
those generated by samtools (http://samtools.sourceforge.net/). However, while
samtools truncates sequence names in the index file, fastahack provides them
completely.
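Because full sequence names may contain spaces, an index reader has to split
on tabs rather than on arbitrary whitespace. A minimal sketch, assuming the
five tab-separated samtools .fai columns (name, length, offset, bases per
line, bytes per line); FaiRecord and parseFaiLine are hypothetical names and
not part of the Fasta.cpp interface:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record type for one line of a .fai index (illustrative names).
struct FaiRecord {
    std::string name;        // full sequence name, may contain spaces
    std::int64_t length;     // number of bases
    std::int64_t offset;     // byte offset of the first base
    std::int64_t lineBases;  // bases per line
    std::int64_t lineBytes;  // bytes per line, including the terminator
};

// Parse one tab-separated index line. Returns false if a column is missing
// or a numeric column cannot be read.
bool parseFaiLine(const std::string& line, FaiRecord& rec) {
    std::istringstream in(line);
    // The name runs up to the first tab, so embedded spaces are preserved.
    if (!std::getline(in, rec.name, '\t')) return false;
    return static_cast<bool>(in >> rec.length >> rec.offset
                                >> rec.lineBases >> rec.lineBytes);
}
```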
To simplify use, sequences can be addressed by first whitespace-separated
field; e.g. "8 SN(Homo sapiens) GA(HG18) URI(NC_000008.9)" can be addressed
simply as "8", provided "8" is a unique first-field name in the FASTA file.
Thus, to extract 20bp starting at 0-based position 323202 in chromosome 8 from
the human reference:

    % fastahack subsequence h.sapiens.fasta 8 323202 20
    ACATTGTAATAGATCTCAGA
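The first-field addressing described above can be sketched as follows. This is
an illustration only; firstField is a hypothetical helper, not a function from
the Fasta.cpp library, and the uniqueness check is left to the caller.

```cpp
#include <cstddef>
#include <string>

// Return the first whitespace-separated field of a FASTA header line.
// A leading '>' is skipped if present.
std::string firstField(const std::string& header) {
    const std::string whitespace = " \t\r\n";
    const std::size_t begin = (!header.empty() && header[0] == '>') ? 1 : 0;
    const std::size_t end = header.find_first_of(whitespace, begin);
    if (end == std::string::npos) return header.substr(begin);
    return header.substr(begin, end - begin);
}
```

With this helper, the header "8 SN(Homo sapiens) GA(HG18) URI(NC_000008.9)"
maps to the short name "8".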
Usage information is provided by running fastahack with no arguments:

    % fastahack
    usage: fastahack <command> [options]
    actions:
        index <fasta reference>
        sequence <fasta reference> <sequence name>
        subsequence <fasta reference> <sequence name> <0-based start> <length>
        stats <fasta reference> <sequence name>   (returns sequence length)

Limitations:
fastahack will only generate indexes for FASTA files in which the sequences
have self-consistent line lengths. Trailing whitespace is allowed at the end
of sequences, but not embedded within the sequence. These limitations are
necessitated by the complexity of indexing sequences whose lines change in
length: each change in line length would require a new entry in the index
file, which defeats the purpose of a compact index.
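A check for this constraint might look like the sketch below. It reflects one
reading of the rules above, not fastahack's own validation code: every line of
a sequence must match the first line's length, only the final line may be
shorter, blank lines are tolerated only at the end, and any space or tab
inside a line is rejected. The name hasConsistentLineLengths is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if the string is empty or consists only of whitespace characters.
static bool isBlank(const std::string& s) {
    for (unsigned char c : s) {
        if (!std::isspace(c)) return false;
    }
    return true;
}

// `lines` holds the lines of ONE sequence, without the header line and
// without line terminators.
bool hasConsistentLineLengths(const std::vector<std::string>& lines) {
    std::size_t width = 0;  // length of the first non-blank line
    bool ended = false;     // set once a short or blank line has been seen
    for (const std::string& line : lines) {
        if (isBlank(line)) {  // trailing whitespace at the end of the sequence
            ended = true;
            continue;
        }
        if (ended) return false;  // sequence data after a short or blank line
        for (unsigned char c : line) {
            if (std::isspace(c)) return false;  // embedded whitespace
        }
        if (width == 0) {
            width = line.size();
        } else if (line.size() > width) {
            return false;  // line longer than the established width
        }
        if (line.size() < width) ended = true;  // only valid as the last line
    }
    return true;
}
```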