
  • FASTQ must die! Long live SAM/BAM!

    One of the ideas mentioned on the SEQanswers letter thread was about linking blog content and discussion back to SEQanswers, so...

    I've just blogged about why I think we as a community should try to move away from FASTQ as a file format for unaligned reads and use SAM/BAM instead, FASTQ must die! Long live SAM/BAM!, and will suggest people comment on this thread rather than on the blog.

    This is partly because I don't seem to have got my blog comment settings right anyway.

  • #2
    I'm not sure if there is much to say. Fewer formats in bioinformatics would be good. Programs that read and write to all common formats would be good. BAM/SAM is, as far as I can tell, a good enough format. We will have to see if incompatibilities pop up during the next couple of years.



    • #3
      I want to ask a question about bam files.

      I have two sequencing libraries from the same sample, so I have two FASTQ files, with read lengths of 50 bp and 36 bp respectively.
      When I run TopHat I need to specify -r, so I cannot combine the two FASTQ files. But once I have the accepted.bam files, can I combine the BAM files with samtools merge?

      thanks everyone.



      • #4
        Originally posted by camelbbs View Post
        I want to ask a question about bam files.
        I was going to recommend asking in a new thread, but you've done that
        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc



        • #5
          Whilst I appreciate the sentiments of your argument for getting rid of fastq format, I tend to disagree.

          I guess my main objections would be:

          1) I like having a separation of primary data and derived data. FastQ is primary data which is never going to change. BAM/SAM is derived data which might change if you use a different read mapper, genome assembly etc.

          2) I like simple plain text formats. FastQ, for all of its failings (and it certainly has those!), is a simple format which is easy to parse and deal with. SAM/BAM is much harder to get your head around. Realistically you need to use an existing library to do anything with a BAM/SAM file due to the complexities of the format.

          3) FastQ is more future-proof. Because FastQ format makes no assumptions about the structure of your experiments (precisely because it contains no metadata) it makes very few assumptions about what your data is going to look like in the future. If you look at the recent changes to BAM format to get around the previous assumption of only ever having a maximum of two reads per sequence then you can see how this might go wrong in future.

          We use BAM format all the time, but it's not a format I particularly like working with. You mentioned the flag field in your blog which must single-handedly have caused more trouble than any other format design decision ever made in bioinformatics! I can see the appeal of the format, but the field is still undergoing such rapid change I can see that it's probably not finished yet.
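To illustrate point 2 above, here is a minimal sketch of a FASTQ reader in Python. It assumes the common four-line layout (no line-wrapped sequences), and the function name `read_fastq` and the two sample records are made up for illustration. It is not a validated parser.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) tuples from four-line FASTQ records.

    Assumption: every record is exactly four lines. Line-wrapped FASTQ
    is deliberately not handled in this sketch.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        yield header[1:].rstrip("\n"), seq, qual


# Hypothetical two-read input, held in memory for the example.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n"
records = list(read_fastq(io.StringIO(example)))
for name, seq, qual in records:
    print(name, seq, qual)
```

Even this short version has to make a decision about wrapped lines and has nowhere to put metadata other than the read name, which is where the rest of the thread picks up.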



          • #6
            Hi Andrew,

            Thanks for your comments. You raise some good points, but I don't agree with them all.

            (1) Editing of FASTQ files happens already though (quality trimming, filtering, etc) so there is no clear separation between primary data and derived data.

            (2) Given how big sequence data files are getting, it is increasingly impractical to work with them as plain text (not so bad for viruses though). You can do plenty with SAM at the Unix command line; the fact that it is one line per read actually helps. For anything non-trivial, yes, a SAM/BAM library helps.

            (3) From a long-term data archiving perspective, going through all the SAM/BAM format revisions to try to understand what an old file means might be hard, but try extracting the metadata from a FASTQ file when there are 101 different filename, header or read naming conventions, many undocumented.

            (unnumbered 4) I agree the representation of the FLAG in SAM as a single (decimal) integer was probably the worst design choice in the format. Even an eight character string of 0s and 1s would have been easier to understand. However, it is done, and changing it will only break things - and only benefit people working on the files directly with scripts and Unix one-line magic. If you're using a SAM/BAM library this should map the FLAG bits for you.
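As a rough illustration of what "mapping the FLAG bits" means, the sketch below decodes the decimal FLAG into named bits. The bit values are the ones defined in the SAM specification (0x800 was added in a later revision of the spec); the short labels and the function names `decode_flag` and `flag_as_bits` are my own and are not taken from any particular library.

```python
from typing import List

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


def flag_as_bits(flag: int) -> str:
    """Render the FLAG as a fixed-width string of 0s and 1s."""
    return format(flag, "012b")


for value in (4, 77, 99, 141):
    print(value, flag_as_bits(value), decode_flag(value))
```

The values 77 and 141 are the flags typically seen on the two reads of an unaligned pair, and 4 on an unaligned single read.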

            And I agree things will change (e.g. maybe one day we will see SAM/BAM move to HDF5 rather than the homegrown BGZF used now).

            Peter
            Last edited by maubp; 10-22-2011, 04:18 AM. Reason: Typo



            • #7
              The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.
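A small sketch of what "keeping the primary data only" plus metadata can look like in SAM: unaligned records that carry just name, sequence and quality, with a read group tag pointing at an @RG header line. The column layout and FLAG values follow the SAM specification; the sample, library and read names, and the helper name `unaligned_sam_line`, are hypothetical.

```python
def unaligned_sam_line(name, seq, qual, read_group, mate=0):
    """Format one unaligned SAM record.

    mate: 0 for a single-end read, 1 or 2 for the reads of a pair.
    """
    flag = 0x4  # read unmapped
    if mate in (1, 2):
        flag |= 0x1 | 0x8  # paired, mate unmapped
        flag |= 0x40 if mate == 1 else 0x80  # first or second in pair
    fields = [
        name, str(flag),
        "*", "0", "0", "*",  # RNAME, POS, MAPQ, CIGAR: no alignment
        "*", "0", "0",       # RNEXT, PNEXT, TLEN: no mate alignment
        seq, qual,
        "RG:Z:" + read_group,  # ties the read to the @RG header line
    ]
    return "\t".join(fields)


# Hypothetical header: the metadata that FASTQ has no standard place for.
header = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@RG\tID:lib1\tSM:sampleA\tLB:lib1\tPL:ILLUMINA",
]
body = [
    unaligned_sam_line("frag1", "ACGT", "IIII", "lib1", mate=1),
    unaligned_sam_line("frag1", "TTGA", "IIII", "lib1", mate=2),
]
print("\n".join(header + body))
```

Both reads of the pair share one name and are told apart by the FLAG, and the sample and library live in the header rather than in a filename convention.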

              On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain especially in C. I know many will argue that a SAM/BAM library is available to each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one. In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which makes BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.

              Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will long live, if not outlive SAM/BAM.
              Last edited by lh3; 10-23-2011, 08:29 PM. Reason: fixed grammatical errors



              • #8
                sequence storage interface

                One thing that I would like to see is a clear separation between the interface and the implementation of these sequence storage formats - similar to the relationship between graphics and OpenGL, for example. An interface that allows the user to extract certain information from the data with guaranteed time/space complexity bounds would help in hiding some of the details of the low level implementation. For example, as long as one could extract intervals that overlap a certain range, it wouldn't matter if it was done using UCSC binning scheme, augmented intervals, nested-containment lists, or something else with similar complexity behaviors.

                BAM/SAM could act as a model implementation of the interface and serve as a proof-of-concept that such an interface can be satisfied. This way, the tools that people write won't break when the implementation changes or if there is a switch to a new storage format.
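A toy sketch of that separation in Python: one abstract interface for "give me the intervals overlapping this range", and two interchangeable implementations behind it. All names here (`IntervalStore`, `LinearScanStore`, `SortedStore`) are hypothetical, and neither implementation offers the complexity guarantees a real index (binning, interval trees, nested-containment lists) would; they only show that calling code need not care which one it gets.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple

# (start, end, label) with half-open coordinates [start, end)
Interval = Tuple[int, int, str]


class IntervalStore(ABC):
    """Interface: callers depend on this, not on the storage details."""

    @abstractmethod
    def overlapping(self, start: int, end: int) -> List[Interval]:
        """Return all stored intervals overlapping [start, end)."""


class LinearScanStore(IntervalStore):
    """Naive implementation: checks every interval on each query."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self._intervals = list(intervals)

    def overlapping(self, start: int, end: int) -> List[Interval]:
        return [iv for iv in self._intervals if iv[0] < end and start < iv[1]]


class SortedStore(IntervalStore):
    """Keeps intervals sorted by start so a query can stop early."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self._intervals = sorted(intervals)

    def overlapping(self, start: int, end: int) -> List[Interval]:
        hits = []
        for iv in self._intervals:
            if iv[0] >= end:
                break  # every later interval starts after the query ends
            if start < iv[1]:
                hits.append(iv)
        return hits


def labels_in_range(store: IntervalStore, start: int, end: int) -> List[str]:
    """Client code written purely against the interface."""
    return sorted(label for _, _, label in store.overlapping(start, end))


reads = [(300, 350, "r3"), (100, 150, "r1"), (140, 190, "r2")]
for store in (LinearScanStore(reads), SortedStore(reads)):
    print(type(store).__name__, labels_in_range(store, 145, 160))
```

Swapping the storage class changes nothing in `labels_in_range`, which is the property the post is asking for from sequence storage formats.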



                • #9
                  That is like the sequence alignment APIs we were discussing. It is definitely a good thing, but I have never got time to do that for SAM/BAM.



                  • #10
                    Originally posted by lh3 View Post
                    The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.
                    Here we agree. Maybe I should mention the Broad on the blog post too...

                    Originally posted by lh3 View Post
                    On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain especially in C. I know many will argue that a SAM/BAM library is available to each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one.
                    Here I do disagree with you - there is a time and a place for writing your own library functions, but in this example I think using a library for parsing SAM/BAM is very sensible - especially if it lets you spend more time on the core algorithm and less on the file IO.

                    Originally posted by lh3 View Post
                    In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which makes BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.
                    I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.
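To make the random-access point concrete, here is a minimal sketch of the BGZF idea using only the Python standard library. Each block is a complete gzip member whose extra field records the block's own size, so a reader can jump to a known block offset and decompress just that block. The header layout follows the BGZF section of the SAM/BAM specification as I understand it; the helper names and sample records are made up, and the end-of-file marker block and virtual offsets used by real BAM indexes are left out.

```python
import gzip
import struct
import zlib

MAX_INPUT = 0xFF00  # keeps each finished block under the 64 KiB limit


def bgzf_block(data: bytes, level: int = 6) -> bytes:
    """Compress one chunk of data into a single BGZF block."""
    if len(data) > MAX_INPUT:
        raise ValueError("block payload too large")
    deflater = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate
    cdata = deflater.compress(data) + deflater.flush()
    bsize = len(cdata) + 25  # total block length minus one
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B, 8, 4,  # gzip magic, deflate, FEXTRA flag set
        0, 0, 0xFF,        # MTIME, XFL, OS (unknown)
        6,                 # XLEN: length of the extra field
        66, 67, 2,         # subfield 'B', 'C', payload length 2
        bsize,
    )
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + cdata + trailer


def read_block(blob: bytes, offset: int) -> bytes:
    """Decompress only the block that starts at the given byte offset."""
    bsize = struct.unpack_from("<H", blob, offset + 16)[0]
    block = blob[offset:offset + bsize + 1]
    return zlib.decompress(block, 31)  # wbits=31 expects a gzip wrapper


# Hypothetical records, one per block purely to keep the example small.
chunks = [("@read%d\nACGT\n+\nIIII\n" % i).encode("ascii") for i in range(3)]
offsets = []
blob = b""
for chunk in chunks:
    offsets.append(len(blob))
    blob += bgzf_block(chunk)

# Any gzip reader sees an ordinary multi-member gzip stream...
print(gzip.decompress(blob) == b"".join(chunks))
# ...while a BGZF-aware reader can fetch the last block directly.
print(read_block(blob, offsets[2]))
```

The per-block headers and the smaller compression window are where the "almost as good as gzip" cost comes from.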

                    Originally posted by lh3 View Post
                    Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will long live, if not outlive SAM/BAM.
                    I suspect you're right - but I would still like to see FASTQ replaced sooner rather than later
                    Last edited by maubp; 11-23-2011, 07:59 AM. Reason: Fixed autocorrection of Broad to Board.



                    • #11
                      Originally posted by maubp View Post
                      I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.
                      I've looked at this in more detail now, and think BGZF could be much more widely used, see this blog post and forum thread:
                      BAM files are compressed using a variant of GZIP (GNU ZIP) , called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specifica...




                      • #12
                        Where are we today?

                        Where do we stand on this today? If someone were to build a pipeline, what are the data points they should look at to decide between FASTQ and uBAM?

                        Most of all, file size concerns me. I no longer work on FASTQ, but when I did (1.5 years ago), they were 4-5 gigs, gzipped (WGS, 30X). I've never encountered uBAMs, but BAMs are 60+ gigs. Am I wrong to compare BAMs to uBAMs? Are they exponentially different in size? How would a WGS 30X uBAM compare in size to a FASTQ from the same experiment?
                        Ram



                        • #13
                          I think we are right where we were when this thread started. Gzipped fastq files are still the most common deliverable for sequencing AFAIK. I believe PacBio has started moving to a variant of BAM with the new SMRTportal v.3.0 but no change in that direction from Illumina.

                          You are free to choose any format that suits your internal needs.



                          • #14
                            I find gzipped fastq to be the most convenient. The sam/bam specification has a lot of limitations, like read 1 and read 2 having the same name. uBam is just what some random person decided to call "unmapped bam". They're still bam files.

                            Gzipped fastq is smaller and faster to process than unmapped bam. I just ran a test on 100k reads with these commands:

                            reformat.sh in=reads.fq.gz out=100k.fq.gz zl=6 ow reads=100k
                            reformat.sh in=reads.fq.gz out=100k_u.sam.gz zl=6 ow reads=100k
                            reformat.sh in=reads.fq.gz out=100k_u.bam zl=6 ow reads=100k

                            These are the sizes:

                            Code:
                            -rw-rw-r-- 1 bushnell genome 8784821 Nov 29 13:57 100k.fq.gz
                            -rw-rw-r-- 1 bushnell genome 9011991 Nov 29 13:58 100k_u.bam
                            -rw-rw-r-- 1 bushnell genome 8815867 Nov 29 13:57 100k_u.sam.gz
                            Write times:
                            fq.gz: 0.382 seconds
                            sam.gz: 0.400 seconds
                            bam: 1.958 seconds

                            Read times:
                            fq.gz: 0.304 seconds
                            sam.gz: 0.375 seconds
                            bam: 0.470 seconds

                            CPU-time (reading):
                            fq.gz: 1.438s
                            sam.gz: 1.431s
                            bam: 1.814s

                            So in addition to being inconvenient, unmapped bam is universally worse from a performance and space perspective.
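For a sense of scale, the figures above can be expressed relative to gzipped FASTQ with a few lines of Python. The numbers are copied from the listing and timings in this post; the helper name `relative_to` is made up. This is one run on 100k reads with one tool, so the ratios are indicative only.

```python
# Figures copied from the post above (bytes and seconds).
size_bytes = {"fq.gz": 8784821, "sam.gz": 8815867, "bam": 9011991}
write_s = {"fq.gz": 0.382, "sam.gz": 0.400, "bam": 1.958}
read_s = {"fq.gz": 0.304, "sam.gz": 0.375, "bam": 0.470}
cpu_read_s = {"fq.gz": 1.438, "sam.gz": 1.431, "bam": 1.814}


def relative_to(values, baseline="fq.gz"):
    """Express every entry as a multiple of the baseline entry."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


tables = (
    ("file size", size_bytes),
    ("write time", write_s),
    ("read time", read_s),
    ("CPU time (reading)", cpu_read_s),
)
for label, table in tables:
    ratios = relative_to(table)
    summary = ", ".join("%s %.3fx" % (k, v) for k, v in ratios.items())
    print("%-20s %s" % (label, summary))
```

On these numbers the unmapped BAM is about 2.6% larger than the gzipped FASTQ and roughly five times slower to write, while gzipped SAM is within about 0.4% on size.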



                            • #15
                              Sometimes you don't need alignments; you need the raw reads. So long live FASTQ!

