I've just made the latest Gap5 release on SourceForge, as prebuilt Linux binaries (currently 32-bit and 64-bit Intel only). Source is there too, but it's warty and a pain to build right now.
Since the previous release I've put a lot of work into making the file format compact, so that it's now typically smaller than the equivalent BAM file while still retaining editing capability. (Although editing hasn't been rigorously tested yet and there are still many things to add.)
I've made a start at adding library analysis and supporting annotations. They're in the file format now and have minimalistic visualisation interfaces to test they're working, but these will get fleshed out during later 1.2.x releases.
James
Overview
Gap5 is ultimately the replacement for Gap4 in that it aims to be a sequence assembly viewer and editor for finishing experiments. As such it provides tools for comparing, joining and breaking contigs as well as the smaller details of individual base editing.
It is designed to be compact in file size, and generally very low in CPU and memory usage. In CPU/memory it's typically comparable to MapView or samtools tview. In file size (it still needs its own format) it's usually slightly smaller than BAM format.
Right now it is very much work in progress. As far as a viewer goes it's already a useful tool in its own right, but as an editor there are still lots of missing features. It's likely there are some major bugs in the editor too as it's not had a lot of testing yet.
I'd recommend using the "-ro" flag to gap5 (read-only mode) unless you really do need to be editing too.
To get started, you first need a Gap5 database. Gap5 cannot read your old Gap4 ones. You can construct new Gap5 databases out of ACE, MAQ, BAM or BAF format files, or convert your old Gap4 databases via caftools and the supplied caf2baf script. E.g.:
cd /nfs/repository/d0022/bE171C14/
gap2caf -project BE171C14 -version 0 | caf2baf > /tmp/BE171C14.baf
cd /tmp
tg_index -p -B -o BE171C14 BE171C14.baf
gap5 BE171C14
It's worth having an idea of the depth of your data too. If you have a very shallow assembly, try using (for example) the "-z 256k" option to tg_index to speed up processing and reduce file size. See below for the full details.
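As a rough guide to picking a "-z" value, the number of reads landing in each bin can be estimated from the depth and read length. This is only a back-of-envelope sketch: the formula (bin size × depth / read length) and the helper name reads_per_bin are my own assumptions, not something tg_index computes or reports, and it assumes "k" means 1024.

```shell
# Back-of-envelope estimate of how many reads start in each bin.
# Assumption (not taken from tg_index): reads_per_bin ~= bin_size * depth / read_length
reads_per_bin() {
    bin_size=$1   # bin size in bases, e.g. 4096 for "-z 4k" (assuming k = 1024)
    depth=$2      # average depth of coverage
    read_len=$3   # average read length in bases
    echo $(( bin_size * depth / read_len ))
}

reads_per_bin 4096 2 500      # 2x depth, 500bp reads, default bin size: prints 16
reads_per_bin 262144 2 500    # the same data with "-z 256k": prints 1048
```

With shallow data the default bin size leaves only a handful of reads per bin, which is why a larger "-z" value is suggested for such assemblies.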
tg_index
This converts various alignment file formats into a gap5 database file (or pair of files in fact). The input formats currently supported are maq (short/long), bam, ace, baf, and an old "aln" text format. I have a caf2baf conversion tool if people need it too, but CAF is not natively supported by tg_index.
Usage:
tg_index -o dbname [options] -format_code input_filename
format_code is one of:
-b
BAM
-m
MAQ short
-M
MAQ long
-A
ACE
-B
BAF
In addition to this there are a variety of other options:
-a
Append mode. With this the database is appended to instead of overwritten.
-n
Requests that new contigs are made when appending, even if they match the names of existing contigs. (By default it'll merge data into the same contigs, but if the padding is different then this will cause issues.)
-p, -P
Turn read-pairing on (-p) or off (-P). This is on by default, but it uses up memory to identify the pairs (by name). If you know you have single-ended data then using -P will speed up indexing and save memory.
-T
Requests building a B+Tree of sequence names. This permits random access by name rather than position, e.g. to jump specifically to sequence "foo" in the Gap5 editor. The index isn't built by default as it's rather slow. (I have plans to improve this though.)
-o 'db_name'
Specifies the output database name is to be db_name.
-z 'size'
This governs the bin size for the range-query binning system. By default 'size' is 4k, but it's worth increasing this if your coverage is very low. Ideally you want a few thousand sequences per bin to strike a happy balance between speed and I/O efficiency.
So a typical example usage may be:
tg_index -z 64k -o rmdup_g5 -m rmdup.map
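If you index lots of files, picking the format flag by hand gets tedious. Here is a minimal sketch that chooses the flag from the file extension and prints the resulting command rather than running it. The helper names (tg_format_flag, tg_index_cmd) and the extension-to-format mapping are hypothetical; only the flags themselves come from the list above.

```shell
# Hypothetical helper: choose a tg_index format flag from the file extension.
# The extension-to-format mapping is an assumption; the flags are the documented ones.
tg_format_flag() {
    case "$1" in
        *.bam) echo "-b" ;;
        *.ace) echo "-A" ;;
        *.baf) echo "-B" ;;
        *.map) echo "-m" ;;   # assumes a MAQ short map; use -M for MAQ long
        *)     echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the indexing command for one input file.
# $1 = output database name, $2 = input file
tg_index_cmd() {
    flag=$(tg_format_flag "$2") || return 1
    echo "tg_index -o $1 $flag $2"
}

tg_index_cmd rmdup_g5 rmdup.map
```

Printing the command first makes it easy to check the flag before committing to a long indexing run; add "-z" or "-P" by hand as needed.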
gap5
This is the actual viewer or editor. The main displays you'll want to familiarise yourself with are the Contig List, Contig Editor and Template Displays.
Initially you may or may not see the "contig selector" window, depending on how many contigs there are. Note that this is currently bugged when the total contig length goes beyond 2Gb, e.g. whole human alignments. It's probably worth using the Contig List window instead in this case. You can forcibly turn on or turn off displaying the contig selector at startup using the -csel and -no_csel command line options.
-ro
Use this command line option to disable editing abilities. It opens the file in read-only mode guaranteeing that you cannot change the data.
-csel
Forces the contig selector to be shown at startup.
-no_csel
Forces the contig selector to not be shown at startup.
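To illustrate how these options combine, here is a small sketch that prints a read-only gap5 command line, adding -no_csel for very large assemblies. The helper name gap5_ro_cmd and the database name human_g5 are made up for illustration, and putting the options before the database name is an assumption on my part.

```shell
# Hypothetical helper: print a read-only gap5 command line.
# Passing "big" as the second argument adds -no_csel, for assemblies whose
# total contig length exceeds the 2Gb contig selector limit.
gap5_ro_cmd() {
    db=$1
    size=${2:-normal}
    if [ "$size" = "big" ]; then
        echo "gap5 -ro -no_csel $db"
    else
        echo "gap5 -ro $db"
    fi
}

gap5_ro_cmd BE171C14        # prints: gap5 -ro BE171C14
gap5_ro_cmd human_g5 big    # prints: gap5 -ro -no_csel human_g5
```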
Downloads
The executables are distributed via sourceforge at:
Code, for those that really care, is also there via:
Screenshots:
This shows a graphical overview of a mixed assembly. The colours indicate mapping quality and/or template status (single ended, paired but spanning contigs). The Y position indicates the insert size, hence the Solexa and capillary libraries are clearly distinguishable in this plot.
An example of the contig editor. This is a mix of 454 and capillary data made by MIRA. The MIRA tags are visible here as the coloured fragments.
Another editor screenshot, showing grey scales for base quality and mapping quality (in the "names" panel to the left, now just an ASCII art representation of the alignments). Also shown are a couple of traces for capillary sequences, as this is from a mixed capillary/Solexa assembly. It can show 454 traces too, and in theory Solexa ones, but we're no longer keeping processed trace data here (only raw).
James