FASTA and FASTQ: A Guide to Key File Formats for Sequencing Data





    Performing successful sequencing analysis requires an understanding of different file formats and how they are used for various applications. Scientists interested in completing their own sequencing analysis should learn the purpose and contents of each format. In the next two articles, we will explore some of the essential file formats used in sequencing data analysis and their significance in the field.

    FASTA

    The FASTA file format is one of the most popular formats for storing biological sequence data. These text-based files can be used for storing strings of amino acids (peptides) or nucleotide sequences (DNA or RNA). They are routinely used for sequence annotation, database searches, and multiple sequence alignment.

    The FASTA format was originally created with the development of the FASTP program [1], a platform used for searching amino acid sequence databases. “When the FASTP program was written in the fall of 1983, there were no publicly available protein sequence databases, so there was no standard format for protein sequences,” explained FASTA creator Dr. William R. Pearson, Professor of Biochemistry and Molecular Genetics in the School of Medicine at the University of Virginia. “There were two standard formats for DNA sequence databases, the Genbank format and the EMBL format. Both formats were developed by database people, and had field labels in specific columns with multiple field types, which included a lot more information than simply the sequence; those formats are still in use today.”

    During the FASTP program’s development, Pearson and his colleagues collaborated with Margaret Dayhoff’s group from the Protein Identification Resource (PIR) at Georgetown University. Her group had a relatively simple but important format for protein sequence databases, and that original format looked like the image in Figure 1.


    Figure 1: Original format for storing protein sequence information developed by the PIR (provided by Dr. Pearson)

    “We received a protein sequence database from the PIR group in this format, so it was the first format the FASTP program could read,” explained Pearson. “However, molecular biologists using this format often forgot to include the description line, which meant that the first line of the sequence was lost (because it was read as the description).” This ultimately led to the simpler and more common file bioinformaticians use today. “The FASTA format was invented by putting both the accession information (HAHU) and the description on the line starting with the ‘>’ (greater-than sign),” Pearson explained about the new file example, shown in Figure 2 below.


    Figure 2: An example of the updated and current FASTA layout (provided by Dr. Pearson)


    The file’s new format was rapidly adopted for a number of reasons. “This was very easy for biologists to remember, and, because there were no fixed location fields, it was easy to type in sequences correctly,” said Pearson. When asked if there were any advantages of storing data in this file type, Pearson stated succinctly, “The advantage was simplicity: A line starting with ‘>’ for a description (and to indicate the beginning of a new sequence in a file/database with multiple sequences), everything else is a sequence to be analyzed.”
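    The rule Pearson describes (a line starting with ‘>’ opens a new record, everything else is sequence) is simple enough to sketch in a few lines of Python. The function name `parse_fasta` and the two example records below are made up for illustration; this is a minimal sketch, not a replacement for an established parser.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    description = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A '>' line closes the previous record and opens a new one.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Every other line is sequence; wrapped lines are joined.
            # (Sequence text before the first '>' is skipped in this sketch.)
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical records, invented for this example.
example = """>seq1 made-up protein record
MKTAYIAKQR
QISFVK
>seq2 made-up nucleotide record
ACGTACGTNN
"""

records = list(parse_fasta(example.splitlines()))
for description, sequence in records:
    print(description, len(sequence))
```

    Because there are no fixed-width fields, the parser never needs to know how long the description or the sequence lines are, which reflects the simplicity Pearson highlights.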

    Since its initial development in 1983, the FASTA format has remained largely unchanged. “Since then, different groups have used the information in the description line in different ways, but there were no constraints on either the length of the description line or the length of the sequence line,” added Pearson. “This was another feature that made the format easy to use and easy to incorporate into analysis workflows.”

    Despite the growing number of file types used for sequencing analysis and sequence storage, the FASTA format is still highly utilized to this day. As Pearson explained, “Almost all other bioinformatics file formats involve some kind of field-based format, which in general can be much more powerful and easier to compute on. But the FASTA format allowed biologists to easily enter (and examine) sequence data to create their own sequence sets. It is very information-dense, and is well suited to similarity searching, the purpose it was designed for.”


    FASTA facts:
    • FASTA uses standard IUB/IUPAC amino acid and nucleic acid codes
    • Some of the common file extensions are: “.fasta”, “.fa”, “.ffn”, “.frn”, “.fna”, and “.faa”
    • Pearson clarified that “FASTA” is pronounced “FAST-long-A”, not “FAST-Ah”
    • In-depth details about FASTA organization can be found at: https://blast.ncbi.nlm.nih.gov/doc/blast-topics/
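    As a small illustration of the first point, a nucleotide sequence can be checked against the IUPAC nucleic acid codes before it is written to a FASTA file. The helper name `invalid_nucleotide_codes` is hypothetical, and treating ‘-’ as an allowed gap character is an assumption that may not suit every tool.

```python
from typing import Set

# IUPAC nucleic acid codes: the four DNA bases, U for RNA,
# and the ambiguity codes (R, Y, S, W, K, M, B, D, H, V, N).
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def invalid_nucleotide_codes(sequence: str, allow_gaps: bool = True) -> Set[str]:
    """Return the set of characters that are not IUPAC nucleotide codes."""
    allowed = set(IUPAC_NUCLEOTIDES)
    if allow_gaps:
        allowed.add("-")  # assumption: '-' marks an alignment gap
    return {ch for ch in sequence.upper() if ch not in allowed}


print(invalid_nucleotide_codes("acgtNRY-"))  # empty set: all codes are valid
print(invalid_nucleotide_codes("ACGTZ"))     # contains 'Z'
```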


    From FASTA to FASTQ

    Derived from FASTA, the FASTQ format is a similar text file containing important sequence information. However, FASTQ files contain details related to the sequencing run from which they originated. The main difference between the two files is that the FASTQ format contains raw sequencing information, specifically the quality scores related to the base calls.

    The FASTQ format was created by Dr. Jim Mullikin during his time at the Wellcome Trust Sanger Institute [2], although its widespread use and an official publication on the format didn’t occur until years later. Initially designed for Sanger capillary sequencing, the FASTQ format was later adapted for use with next-generation sequencing. Several variations of FASTQ were created for specific technologies, but the format has since become fairly consistent across platforms.

    It is important to understand the contents of FASTQ files because this format contains raw sequence data along with the quality information needed to evaluate the accuracy of the base calls and filter out low-quality reads and sequencing errors. Additionally, FASTQ files are widely used and fit into many analysis pipelines. Other important file types that contain primary sequencing data that users should be familiar with include FAST5 files and HDF5 files.


    FASTQ Layout

    Unlike the greater-than sign (‘>’) that starts the FASTA description line, each record in the FASTQ format (shown in Figure 3) begins with an ‘@’ followed by a description. The description may include details about the sequence or the sequencing run, such as the instrument the data was generated on. The nucleotide sequence is on the second line of the record, and the third line begins with a ‘+’ (plus sign), which serves as a separator and may also be followed by a brief description. On the fourth line, quality scores for each respective base (from the second line) are represented by American Standard Code for Information Interchange (ASCII) characters, one character per base.

    Figure 3: A typical layout of a FASTQ record containing the ‘@’ and description (line one), the bases (line two), the ‘+’ separator (line three), and the quality scores represented by ASCII characters (line four).
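    The four-line layout can be read with a short Python sketch. It assumes every record occupies exactly four lines (no wrapped sequence or quality lines), which is the common case for modern sequencers but is not guaranteed by every historical variant. The names `FastqRecord` and `parse_fastq` and the example reads are invented for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    description: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming four lines per record."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical reads, invented for this example.
example = """@read1 made-up run details
GATTACA
+
IIIIIII
@read2
ACGTN
+
II5!#
"""

records = list(parse_fastq(example.splitlines()))
for record in records:
    print(record.description, record.sequence, record.quality)
```

    Counting lines rather than searching for ‘@’ matters because ‘@’ is itself a valid quality character and can appear at the start of a quality line.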

    The quality measurements on line four are Phred scores, which assess the reliability of base calls. A Phred quality score is logarithmically related to the estimated probability that a base call is wrong: a score of Q corresponds to an error probability of 10^(-Q/10), so Q20 means roughly a 1 in 100 chance of error and Q30 a 1 in 1,000 chance. The higher the Phred score, the lower the probability that the base is incorrect. Phred scores are generally used as the standard measure of base quality across technologies.
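    The conversion from ASCII characters to Phred scores and error probabilities can be sketched as follows. The sketch assumes the Sanger-style encoding in which a character’s ASCII code minus 33 gives the score; some older Solexa/Illumina variants described in reference 2 use a different offset or scale, so the offset is left as a parameter. The function names are hypothetical.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger-style "Phred+33" encoding


def decode_phred(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Estimated probability that a base call with this score is wrong."""
    return 10 ** (-score / 10)


def mean_error_probability(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Average per-base error probability across a read."""
    scores = decode_phred(quality, offset)
    return sum(error_probability(q) for q in scores) / len(scores)


scores = decode_phred("!5I")
print(scores)                 # Phred scores for '!', '5', and 'I'
print(error_probability(20))  # about 0.01, i.e. 1 error in 100
```

    Averaging error probabilities, as in `mean_error_probability`, is one reasonable way to summarize a read; averaging the scores directly gives a different (more optimistic) number because the scale is logarithmic.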


    FASTQ facts:
    • FASTQ uses the base calls A, C, T, G, and N
    • Common file extensions include: “.fastq” and “.fq” or the gzip-compressed format, “.fastq.gz”
    • Short-read technologies performing paired-end sequencing typically generate two FASTQ files, one for each read of the pair
    • FastQC [3] and NanoPlot [4] are popular tools for quality assessment of FASTQ files
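    Two of the points above, gzip-compressed files and paired-end outputs, can be handled with the Python standard library. This is a minimal sketch that assumes four lines per record and chooses the reader from the file extension; the helper names and any file names passed to them are hypothetical.

```python
import gzip
from pathlib import Path
from typing import IO, Union

PathLike = Union[str, Path]


def open_fastq(path: PathLike) -> IO[str]:
    """Open a FASTQ file as text, handling ".gz" files transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path: PathLike) -> int:
    """Count records, assuming exactly four non-blank lines per record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def pairs_look_consistent(r1_path: PathLike, r2_path: PathLike) -> bool:
    """Basic sanity check: both files of a pair hold the same number of reads."""
    return count_reads(r1_path) == count_reads(r2_path)
```

    For a paired-end run, a call such as `pairs_look_consistent("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (file names invented here) is a quick check before passing the files to a downstream tool; it does not verify that read names match.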

    References:
    1. Lipman D, Pearson W. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435-1441. doi:https://doi.org/10.1126/science.2983426
    2. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2010;38(6):1767-1771. doi:https://doi.org/10.1093/nar/gkp1137
    3. Andrews S. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. www.bioinformatics.babraham.ac.uk. Published 2010. http://www.bioinformatics.babraham.a...rojects/fastqc
    4. De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Berger B, ed. Bioinformatics. 2018;34(15):2666-2669. doi:https://doi.org/10.1093/bioinformatics/bty149

    About the Author

    Benjamin Atha (seqadmin) holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.
