**westerman** · 10-21-2011, 10:52 AM

I'm not sure if there is much to say. Fewer formats in bioinformatics would be good. Programs that read and write to all common formats would be good. BAM/SAM is, as far as I can tell, a good enough format. We will have to see if incompatibilities pop up during the next couple of years.

**camelbbs** · 10-21-2011, 01:17 PM

I want to ask a question about bam files.

I have two sequencing libraries from the same sample, giving two FASTQ files with read lengths of 50 bp and 36 bp respectively.
When I run TopHat I need to specify -r, so I cannot combine the two FASTQ files. But once I have the accepted.bam files, can I combine them (the BAM files) with samtools merge?

thanks everyone.

**maubp** · 10-21-2011, 01:58 PM

Originally posted by camelbbs:

I want to ask a question about bam files.

I was going to recommend asking in a new thread, but you've already done that:

a question about merge bam files - SEQanswers

http://seqanswers.com/forums/showthread.php?t=14952

**simonandrews** · 10-22-2011, 12:12 AM

Whilst I appreciate the sentiments of your argument for getting rid of fastq format, I tend to disagree.

I guess my main objections would be:

1) I like having a separation of primary data and derived data. FastQ is primary data which is never going to change. BAM/SAM is derived data which might change if you use a different read mapper, genome assembly etc.

2) I like simple plain text formats. FastQ, for all of its failings (and it certainly has those!), is a simple format which is easy to parse and deal with. SAM/BAM is much harder to get your head around. Realistically you need to use an existing library to do anything with a BAM/SAM file due to the complexities of the format.

3) FastQ is more future-proof. Because FastQ format makes no assumptions about the structure of your experiments (precisely because it contains no metadata) it makes very few assumptions about what your data is going to look like in the future. If you look at the recent changes to BAM format to get around the previous assumption of only ever having a maximum of two reads per sequence then you can see how this might go wrong in future.

We use BAM format all the time, but it's not a format I particularly like working with. You mentioned the flag field in your blog which must single-handedly have caused more trouble than any other format design decision ever made in bioinformatics! I can see the appeal of the format, but the field is still undergoing such rapid change I can see that it's probably not finished yet.
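To illustrate point 2, here is a minimal sketch of how little code a FASTQ reader needs. It assumes the common four-lines-per-record layout with no line wrapping and no blank lines, which real files do not always guarantee; the function name `read_fastq` and the example reads are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) tuples from a four-line FASTQ stream.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\n"), seq, qual


# Two invented reads, just to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n")
records = list(read_fastq(example))
```

Anything beyond this (wrapped lines, quality encodings, paired files) is where the format's failings start to show.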

**maubp** · 10-22-2011, 04:16 AM

Hi Andrew,

Thanks for your comments. You raise some good points, but I don't agree with them all.

(1) Editing of FASTQ files happens already though (quality trimming, filtering, etc) so there is no clear separation between primary data and derived data.

(2) Given how big sequence data files are getting, it is increasingly impractical to work with them as plain text (not so bad for viruses though). You can do plenty with SAM at the Unix command line, and the fact that it is one line per read actually helps. For anything non-trivial, yes, a SAM/BAM library helps.

(3) From a long-term archiving perspective, going through all the SAM/BAM format revisions to understand what an old file means might be hard, but try extracting the metadata from a FASTQ file when there are 101 different filename, header, and read-naming conventions, many of them undocumented.

(unnumbered 4) I agree the representation of the FLAG in SAM as a single (decimal) integer was probably the worst design choice in the format. Even an eight character string of 0s and 1s would have been easier to understand. However, it is done, and changing it will only break things - and only benefit people working on the files directly with scripts and Unix one-line magic. If you're using a SAM/BAM library this should map the FLAG bits for you.
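For anyone working on the files directly, decoding the FLAG by hand is only a few lines. This is a rough sketch rather than any library's API: the bit values follow the current SAM specification (0x800 was added after this thread started), while the short labels and the function name `decode_flag` are my own.

```python
from typing import List

# (bit value, label) pairs; values from the SAM spec, labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a decimal SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(99))    # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(format(99, "012b"))  # 000001100011, the "string of 0s and 1s" view
```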

And I agree things will change (e.g. maybe one day we will see SAM/BAM move to HDF5 rather than the homegrown BGZF used now).

Peter

**lh3** · 10-22-2011, 07:38 AM

The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.

On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain especially in C. I know many will argue that a SAM/BAM library is available to each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one. In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which makes BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.

Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will long live, if not outlive SAM/BAM.

**BAMseek** · 10-22-2011, 11:09 AM

sequence storage interface

One thing that I would like to see is a clear separation between the interface and the implementation of these sequence storage formats - similar to the relationship between graphics and OpenGL, for example. An interface that allows the user to extract certain information from the data with guaranteed time/space complexity bounds would help in hiding some of the details of the low level implementation. For example, as long as one could extract intervals that overlap a certain range, it wouldn't matter if it was done using UCSC binning scheme, augmented intervals, nested-containment lists, or something else with similar complexity behaviors.

BAM/SAM could act as a model implementation of the interface and serve as a proof-of-concept that such an interface can be satisfied. This way, the tools that people write won't break when the implementation changes or if there is a switch to a new storage format.
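A toy sketch of what such a separation could look like, with everything here invented for illustration: an abstract `IntervalStore` interface, a linear-scan implementation, and a fixed-size-bin implementation (a much simpler scheme than UCSC binning). Coordinates are assumed to be half-open `[start, end)` with `end > start`. Calling code such as `count_overlaps` depends only on the interface, so either store can be swapped in.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int, str]  # (start, end, name), half-open [start, end)


def _overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start < b_end and b_start < a_end


class IntervalStore(ABC):
    """Hypothetical interface: callers see only add() and overlapping()."""

    @abstractmethod
    def add(self, start: int, end: int, name: str) -> None: ...

    @abstractmethod
    def overlapping(self, start: int, end: int) -> List[Interval]: ...


class LinearScanStore(IntervalStore):
    """Simplest implementation: check every stored interval on each query."""

    def __init__(self) -> None:
        self._items: List[Interval] = []

    def add(self, start: int, end: int, name: str) -> None:
        self._items.append((start, end, name))

    def overlapping(self, start: int, end: int) -> List[Interval]:
        return sorted(iv for iv in self._items
                      if _overlaps(iv[0], iv[1], start, end))


class BinnedStore(IntervalStore):
    """Fixed-size bins: a query inspects only the bins it spans."""

    def __init__(self, bin_size: int = 1000) -> None:
        self._bin_size = bin_size
        self._bins: Dict[int, List[Interval]] = defaultdict(list)

    def _bin_range(self, start: int, end: int) -> range:
        return range(start // self._bin_size, (end - 1) // self._bin_size + 1)

    def add(self, start: int, end: int, name: str) -> None:
        for b in self._bin_range(start, end):
            self._bins[b].append((start, end, name))

    def overlapping(self, start: int, end: int) -> List[Interval]:
        hits = set()  # an interval spanning several bins is reported once
        for b in self._bin_range(start, end):
            for iv in self._bins.get(b, ()):
                if _overlaps(iv[0], iv[1], start, end):
                    hits.add(iv)
        return sorted(hits)


def count_overlaps(store: IntervalStore, start: int, end: int) -> int:
    """Tool code written against the interface, not the implementation."""
    return len(store.overlapping(start, end))


stores = [LinearScanStore(), BinnedStore()]
for store in stores:
    store.add(100, 150, "r1")
    store.add(990, 1040, "r2")
    store.add(2500, 2550, "r3")
```

The two stores differ in complexity behaviour, which is exactly what the interface would need to document as a guarantee.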

**lh3** · 10-22-2011, 11:46 AM

That is like the sequence alignment APIs we were discussing. It is definitely a good thing, but I have never got time to do that for SAM/BAM.

**maubp** · 10-24-2011, 09:47 AM

Originally posted by lh3:

The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.

Here we agree. Maybe I should mention the Broad on the blog post too...

Originally posted by lh3:

On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain especially in C. I know many will argue that a SAM/BAM library is available to each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one.

Here I do disagree with you - there is a time and a place for writing your own library functions, but in this example I think using a library for parsing SAM/BAM is very sensible - especially if it lets you spend more time on the core algorithm and less on the file IO.

Originally posted by lh3:

In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which makes BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.

I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.
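The idea behind the random access can be sketched without the real on-disk layout. The snippet below is only a conceptual illustration, not BGZF itself: it compresses fixed-size chunks as independent gzip members and keeps their start offsets in a separate list, whereas real BGZF records each block's size inside the gzip header. The function names, the tiny 64-byte block size, and the fake reads are all assumptions made for the example.

```python
import gzip
from typing import List, Tuple


def write_blocked(data: bytes, block_size: int = 64) -> Tuple[bytes, List[int]]:
    """Compress data as independent gzip members.

    Returns the concatenated blob and the start offset of each member.
    """
    out = bytearray()
    offsets: List[int] = []
    for i in range(0, len(data), block_size):
        offsets.append(len(out))
        out += gzip.compress(data[i:i + block_size])
    return bytes(out), offsets


def read_block(blob: bytes, offsets: List[int], k: int) -> bytes:
    """Decompress only block k, without touching the blocks before it."""
    start = offsets[k]
    end = offsets[k + 1] if k + 1 < len(offsets) else len(blob)
    return gzip.decompress(blob[start:end])


# Twenty invented FASTQ records as a payload.
payload = b"".join(b"@read%d\nACGT\n+\nIIII\n" % i for i in range(20))
blob, offsets = write_blocked(payload)

# The blob is still an ordinary multi-member gzip stream...
assert gzip.decompress(blob) == payload
# ...but any single block can be read on its own.
assert read_block(blob, offsets, 3) == payload[192:256]
```

The cost is the slightly worse compression mentioned above, since each block restarts the compressor and carries its own header.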

Originally posted by lh3:

Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will long live, if not outlive SAM/BAM.

I suspect you're right - but I would still like to see FASTQ replaced sooner rather than later

**maubp** · 11-08-2011, 02:03 PM

Originally posted by maubp:

I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.

I've looked at this in more detail now, and think BGZF could be much more widely used, see this blog post and forum thread:

BGZF - Blocked, Bigger & Better GZIP!

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

BAM files are compressed using a variant of GZIP (GNU ZIP) , called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specifica...

Using BGZF (Blocked GNU Zip Format) for general sequence files - SEQanswers

http://seqanswers.com/forums/showthread.php?t=15347

**RamakrishnanRS** · 11-29-2016, 12:28 PM

Where are we today?

Where do we stand on this today? If someone were to build a pipeline, what are the data points they should look at to decide between FASTQ and uBAM?

Most of all, file size concerns me. I no longer work on FASTQ, but when I did (1.5 years ago), they were 4-5 gigs, gzipped (WGS, 30X). I've never encountered uBAMs, but BAMs are 60+ gigs. Am I wrong to compare BAMs to uBAMs? Are they exponentially different in size? How would a WGS 30X uBAM compare in size to a FASTQ from the same experiment?

**GenoMax** · 11-29-2016, 12:43 PM

I think we are right where we were when this thread started. Gzipped fastq files are still the most common deliverable for sequencing AFAIK. I believe PacBio has started moving to a variant of BAM with the new SMRTportal v.3.0 but no change in that direction from Illumina.

You are free to choose any format that suits your internal needs.

**Brian Bushnell** · 11-29-2016, 02:04 PM

I find gzipped fastq to be the most convenient. The sam/bam specification has a lot of limitations, like read 1 and read 2 having the same name. uBam is just what some random person decided to call "unmapped bam". They're still bam files.

Gzipped fastq is smaller and faster to process than unmapped bam. I just ran a test on 100k reads with these commands:

reformat.sh in=reads.fq.gz out=100k.fq.gz zl=6 ow reads=100k
reformat.sh in=reads.fq.gz out=100k_u.sam.gz zl=6 ow reads=100k
reformat.sh in=reads.fq.gz out=100k_u.bam zl=6 ow reads=100k

These are the sizes:

Code:

-rw-rw-r-- 1 bushnell genome 8784821 Nov 29 13:57 100k.fq.gz
-rw-rw-r-- 1 bushnell genome 9011991 Nov 29 13:58 100k_u.bam
-rw-rw-r-- 1 bushnell genome 8815867 Nov 29 13:57 100k_u.sam.gz

Write times:
fq.gz: 0.382 seconds
sam.gz: 0.400 seconds
bam: 1.958 seconds

Read times:
fq.gz: 0.304 seconds
sam.gz: 0.375 seconds
bam: 0.470 seconds

CPU-time (reading):
fq.gz: 1.438s
sam.gz: 1.431s
bam: 1.814s

So in addition to being inconvenient, unmapped bam is universally worse from a performance and space perspective.

**StackerEd** · 12-14-2016, 11:40 AM

Sometimes you don't need alignments, you need the raw reads, so long live FASTQ.

FASTQ must die! Long live SAM/BAM!