**seb567** · 05-30-2011, 09:56 AM

Ray 1.4.0: built-in scaffolder & more

Dear Ray users,

Ray 1.4.0 is now available.

The most significant change is the built-in scaffolder.

The second most significant change is the new algorithm that finds
assembly seeds.

Also, I added a lot of output files in Ray.

They are listed here:

Manual

http://denovoassembler.sourceforge.net/manual.html#SECTION00050000000000000000

Finally, our new website http://denovoassembler.sf.net is hopefully
easier to browse.

On the website, there is a manual for Ray.

Manual

http://denovoassembler.sourceforge.net/manual.html

Sébastien

1.4.0
2011-05-30

* A built-in scaffolder is now available -- Thanks to Dr.
Jean-Francois Pombert (University of British Columbia) for the
suggestion.
* The maximum number of libraries is now 499 instead of 250.
* The number of seeds is now divided by 2 to speed up their
extension.
* Fixed a bug in the depth-first search that led to vertices
having no coverage values.
* Removed the configure script, now Ray must be compiled with the
provided Makefile.
* Added a switch to enable the profiler: -run-profiler
* Added a switch to debug seed generation: -debug-seeds
* Added a switch to debug bubble detection: -debug-bubbles
* Added a switch to show memory usage: -show-memory-usage
* Added a switch to show the ending context of extensions:
-show-ending-context
* Devised a new algorithm that finds the peak coverage, minimum
coverage and repeat coverage in distributions.
* Ray now writes the peak, minimum and repeat coverages to a file.
* Ray now writes the statistics for libraries to a file.
* Fixed a bug that disallowed mixing manual and automatic
detection of outer distances.
* Ray now writes the statistics for seed lengths to a file.
* Devised a new algorithm that computes longer seeds to bootstrap
assemblies.
* Slave modes, master modes and MPI tags are generated with macros
for method prototypes, enumerations and assignments in arrays.
* Added some changes for Microsoft Windows compatibility. Thanks
to Hannes Pouseele (Applied Maths, Inc.) for some suggestions.
* Added instructions regarding mpic++ and the CXX environment
variable. Thanks to Dr. Harry Mangalam from UC Irvine for
pointing that out.
* Changed the merger behavior for ends of contigs.
* Added a script to validate scaffolds.

**figure002** · 06-05-2011, 03:15 AM

How to load SFF files?

Dear Sébastien,

I recently started testing Ray on read data of a ~450Mb genome. I can load FASTQ files without a problem, but I can't figure out how to load SFF files. The Instruction Manual only has examples for loading FASTQ files.

I just started a job with the following command,

Code:

mpirun -np 16 time Ray \
-s /home/sp/data/454/shotgun/F0A0H9G01.sff \
-i /home/sp/data/454/pairedend_20k/FPFSKVK01.sff \
-i /home/sp/data/454/pairedend_3k/FO2K76101.sff \
-k 17 -o melon_454_small_test_20110604

But the output contains a lot of the following,

Code:

$ tail mpirun.o22658
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG

So I'm probably doing something wrong. Could you please explain how I should do this?

**seb567** · 06-06-2011, 10:14 AM

Originally posted by figure002

Dear Sébastien,

I recently started testing Ray on read data of a ~450Mb genome. I can load FASTQ files without a problem, but I can't figure out how to load SFF files. The Instruction Manual only has examples for loading FASTQ files.

I just started a job with the following command,

Code:

mpirun -np 16 time Ray \
-s /home/sp/data/454/shotgun/F0A0H9G01.sff \
-i /home/sp/data/454/pairedend_20k/FPFSKVK01.sff \
-i /home/sp/data/454/pairedend_3k/FO2K76101.sff \
-k 17 -o melon_454_small_test_20110604

But the output contains a lot of the following,

Code:

$ tail mpirun.o22658
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG
Not KEY, was GACT expected TCAG

So I'm probably doing something wrong. Could you please explain how I should do this?

In the SFF specification, it is said that the header of the file contains the prefix of sample sequences.

The message you encountered means that your SFF file contains reads whose sequence key does not match the one listed in the header.

Maybe they changed the SFF standard or you are using multiplex identifiers. In either case, I suggest you convert your SFF files to FASTA (or FASTQ) and supply the resulting files to Ray instead of the SFF files.

Sébastien

**flxlex** · 06-07-2011, 04:55 AM

Originally posted by seb567

Maybe they changed the SFF standard or you are using multiplex identifiers. In either case, I suggest you convert your SFF files to FASTA (or FASTQ) and supply the resulting files to Ray instead of the SFF files.

Sébastien, the 454 reads produced today will in many cases have the new key sequence GACT instead of TCAG. New library preparation kits (using the so-called 'Rapid Library' protocol) have this new key in the adaptors. It would be a great advantage if Ray could handle both key sequences!

**kmcarr** · 06-07-2011, 05:43 AM

Originally posted by flxlex

Sébastien, the 454 reads produced today will in many cases have the new key sequence GACT instead of TCAG. New library preparation kits (using the so-called 'Rapid Library' protocol) have this new key in the adaptors. It would be a great advantage if Ray could handle both key sequences!

I believe the problem is due to a bug in an earlier version of the Roche/454 software. With the switch to Rapid Library chemistry Roche switched the keytag to GACT and released new software (?2.3?). The gsRunProcessor produced properly formatted SFF files which reported GACT as the keytag in the common header section of the SFF. However if you used the program sfffile to manipulate those SFFs (e.g. decode MID tags, split files or merge files) the new common header would erroneously report TCAG as the keytag. This bug appears to have been corrected in the latest release (2.5) of sfffile.

What Sébastien seems to be saying is that Ray reads the common header of the SFF to determine what the keytag should be and in this case there is a mismatch between what the header reports the keytag to be and the keytag observed in the reads. It seems that figure002's SFF file(s) have fallen victim to this bug in sfffile.

**seb567** · 06-08-2011, 01:51 PM

Originally posted by flxlex

Sébastien, the 454 reads produced today will in many cases have the new key sequence GACT instead of TCAG. New library preparation kits (using the so-called 'Rapid Library' protocol) have this new key in the adaptors. It would be a great advantage if Ray could handle both key sequences!

Ray simply fetches the key sequence from the SFF header. Ray has no preference for GACT or TCAG.

Originally posted by kmcarr

I believe the problem is due to a bug in an earlier version of the Roche/454 software. With the switch to Rapid Library chemistry Roche switched the keytag to GACT and released new software (?2.3?). The gsRunProcessor produced properly formatted SFF files which reported GACT as the keytag in the common header section of the SFF. However if you used the program sfffile to manipulate those SFFs (e.g. decode MID tags, split files or merge files) the new common header would erroneously report TCAG as the keytag. This bug appears to have been corrected in the latest release (2.5) of sfffile.

What Sébastien seems to be saying is that Ray reads the common header of the SFF to determine what the keytag should be and in this case there is a mismatch between what the header reports the keytag to be and the keytag observed in the reads. It seems that figure002's SFF file(s) have fallen victim to this bug in sfffile.

Exactly my point.

Meanwhile, what do you think would be the best way to deal with these ill-encoded SFF files generated by sfffile <2.5 with the rapid library chemistry ?

I just don't see an easy way.

**kmcarr** · 06-09-2011, 05:27 AM

Originally posted by seb567

Exactly my point.

Meanwhile, what do you think would be the best way to deal with these ill-encoded SFF files generated by sfffile <2.5 with the rapid library chemistry ?

I just don't see an easy way.

I am not a Python guy but it looks to me like it would be fairly straightforward using Biopython's Bio/SeqIO/SffIO module. Read the file in, flip the value of 'key_sequence', write out a new file.

(Still waiting for Bioperl Bio::SeqIO::SFF.)
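For illustration, the header edit described above can also be sketched with nothing but Python's struct module. This assumes the common header layout given in the SFF specification (31 fixed big-endian bytes, then flow_chars, then key_sequence) and a replacement key of the same length as the old one. The function names are invented for this sketch, and it has not been run against real sfffile output, so treat it as a starting point only.

Code:

```python
import struct

# Fixed-size part of the SFF common header (big-endian), per the SFF spec:
# magic_number, version, index_offset, index_length, number_of_reads,
# header_length, key_length, number_of_flows_per_read, flowgram_format_code
_COMMON = struct.Struct(">I4sQIIHHHB")
_MAGIC = 0x2E736666  # the four bytes ".sff"


def key_location(data: bytes):
    """Return (offset, length) of key_sequence inside the common header."""
    fields = _COMMON.unpack_from(data, 0)
    magic, key_len, n_flows = fields[0], fields[6], fields[7]
    if magic != _MAGIC:
        raise ValueError("not an SFF file")
    # flow_chars (n_flows bytes) sits between the fixed fields and the key.
    return _COMMON.size + n_flows, key_len


def read_key(data: bytes) -> str:
    """Key sequence that the common header claims, e.g. 'TCAG'."""
    offset, length = key_location(data)
    return data[offset:offset + length].decode("ascii")


def with_key(data: bytes, new_key: str) -> bytes:
    """Copy of data with the header key replaced; lengths must match."""
    offset, length = key_location(data)
    if len(new_key) != length:
        raise ValueError("replacement key must have the same length")
    return data[:offset] + new_key.encode("ascii") + data[offset + length:]


def patch_sff_key(src_path: str, dst_path: str, new_key: str) -> None:
    """Write a copy of the SFF file whose header reports new_key."""
    with open(src_path, "rb") as src:
        data = src.read()  # whole file in memory; fine for a sketch
    with open(dst_path, "wb") as dst:
        dst.write(with_key(data, new_key))
```

Only the header bytes change; the reads, the index and all offsets stay where they were, which is why the key length must not change. Something like patch_sff_key("in.sff", "fixed.sff", "GACT") would be the intended use (the file names are placeholders).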

**SES** · 06-10-2011, 05:35 PM

Originally posted by kmcarr

(Still waiting for Bioperl Bio::SeqIO::SFF.)

+1

I know if you want something done you should probably take the initiative and contribute, but I have seen several posts where people have said they were working on this. So, like many people, I decided to wait, assuming it was in progress. (Sorry for taking things off track in the thread, though.)

**seb567** · 06-10-2011, 06:58 PM

Originally posted by kmcarr

I am not a Python guy but it looks to me like it would be fairly straightforward using Biopython's Bio/SeqIO/SffIO module. Read the file in, flip the value of 'key_sequence', write out a new file.

(Still waiting for Bioperl Bio::SeqIO::SFF.)

Regardless, I guess it is correct to consider SFF files as containers, just like FASTA or FASTQ files.

Therefore, Ray will no longer try to match the key sequence. Instead, it will *simply* load all sequences in the SFF file and trim them using the clipping values therein.

See http://github.com/sebhtml/ray/commit/15826e290f1
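As a rough illustration of trimming by clipping values, here is a minimal Python sketch of the rule given in the SFF specification, where the four clip fields are 1-based and 0 means "not set". It is not taken from the Ray source; the function name and argument order are made up for the example.

Code:

```python
def trim_read(bases, clip_qual_left, clip_qual_right,
              clip_adapter_left, clip_adapter_right):
    """Trim one read with the SFF clip fields (1-based, 0 = not set)."""
    n = len(bases)
    # Left clip: the larger of the two left values, never below position 1.
    left = max(1, clip_qual_left, clip_adapter_left)
    # Right clip: the smaller of the two right values; 0 means "to the end".
    right = min(clip_qual_right or n, clip_adapter_right or n)
    # Convert the 1-based inclusive range to a Python slice.
    # If left > right the slice is empty, i.e. nothing survives trimming.
    return bases[left - 1:right]


# A read that starts with the key GACT and ends with two low-quality bases.
print(trim_read("GACTACGTACGTNN", 5, 12, 0, 0))
```

With the key kept out by clip_qual_left, the header key no longer matters for what reaches the assembler, which fits the "SFF as a container" view.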

Originally posted by SES

+1

I know if you want something done you should probably take the initiative and contribute, but I have seen several posts where people have said they were working on this. So, like many people, I decided to wait, assuming it was in progress. (Sorry for taking things off track in the thread, though.)

You mean taking the initiative to write code changes to BioPython so that Bio/SeqIO/SffIO can change the key sequence, right ?

Is there a software tool from 454 that allows one to change header information in a SFF file ?

There is also this thing called flower (the code is pretty awesome by the way -- it is in Haskell)

Blog post: http://blog.malde.org/index.php/flower/
Source code: http://malde.org/~ketil/biohaskell/flower/

Also, the Ray git tree is now on github.

http://github.com/sebhtml/ray

Furthermore, Ray can now handle arbitrarily large k-mers.

I am presently running some integration and unit tests on Ray v1.6.0-rc2.

You can download the latest development version of Ray with the following command *provided* that you have git.

Code:

git clone git://github.com/sebhtml/ray.git

To use large k-mers:

Code:

git clone git://github.com/sebhtml/ray.git
cd ray
make MAXKMERLENGTH=64 PREFIX=ray-git-master-kmax=64
make install
mpirun -np 128 ray-git-master-kmax=64/Ray -k 55 \
-p ABCD_1.fastq ABCD_2.fastq -o DeadlyBug,k=55

Enjoy !

**SES** · 06-10-2011, 08:00 PM

Originally posted by seb567

You mean taking the initiative to write code changes to BioPython so that Bio/SeqIO/SffIO can change the key sequence, right ?

No. I was just sympathizing with kmcarr and referring specifically to the need for sff support in bioperl.

**kail** · 06-13-2011, 02:27 PM

Test Ray

Dear all,

I’m trying to test the installation of Ray (and Open MPI) on my cluster. However, the dataset I have is too big (~90.403.198 paired reads).

So, can someone tell me where I can get a smaller dataset to test Ray? The idea is to have a set that can run in 1 or 2 days… or less if possible.

Cluster description:

Itanium II machine with 64 processors (1.6 GHz), 128 GB RAM and an InfiniBand Voltaire 10 Gbps interconnect switch.

Also, does Ray write to the disk while it is running? Where?

Thanks in advance for your help!

PS: There are 16 nodes, each with four cores.

**seb567** · 06-13-2011, 04:16 PM

Originally posted by kail

Dear all,

I’m trying to test the installation of Ray (and Open MPI) on my cluster. However, the dataset I have is too big (~90.403.198 paired reads).

So, can someone tell me where I can get a smaller dataset to test Ray? The idea is to have a set that can run in 1 or 2 days… or less if possible.

Cluster description:

Itanium II machine with 64 processors (1.6 GHz), 128 GB RAM and an InfiniBand Voltaire 10 Gbps interconnect switch.

Also, does Ray write to the disk while it is running? Where?

Thanks in advance for your help!

PS: There are 16 nodes, each with four cores.

E. coli

ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_1.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_2.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_1.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_2.fastq.bz2

If you search http://www.ncbi.nlm.nih.gov/sra, you can probably find a more up-to-date dataset. However, sra files take forever to convert...

When compiling Ray 1.6.0, be sure to turn off data structure packing, because I believe it will produce bus errors on Itanium processors.

Code:

wget http://sourceforge.net/projects/deno...-1.6.0.tar.bz2
tar xjf Ray-1.6.0.tar.bz2
cd Ray-1.6.0
make PREFIX=build-ray-1.6.0 FORCE_PACKING=n
make install
ls build-ray-1.6.0/Ray

Ray does not write any file while running, except result files. For a list, see

Manual

http://denovoassembler.sourceforge.net/manual.html#SECTION00060000000000000000

Why do you say your dataset is too large ?

**seb567** · 06-13-2011, 04:21 PM

Ray now supports arbitrarily large k-mers (MAXKMERLENGTH)

= 1.6.0 =
2011-06-13

* Moved the code tree from subversion to git and from an in-house tree to a github tree -- see http://github.com/sebhtml/ray
* Fixed a compilation problem in Scaffolder.cpp. Thanks to Volker Winkelmann (University of Cologne).
* Changed CC to MPICXX and added lines to compile Ray with Intel's MPI implementation. Thanks to Volker Winkelmann (University of Cologne).
* Implemented a Kmer class for arbitrarily long k-mers (MAXKMERLENGTH).
* Added pack and unpack methods to Kmer to abstract the communication of k-mers -- thanks to Élénie Godzaridis for the idea.
* Output contigs >= 100, not paths >= 100.
* Detailed the warning for unmatched 454 prefix.
* Fixed a bug in the TLE entries in the AMOS file.
* The Makefile can now install Ray somewhere. (make PREFIX=prefix; make install)
* Structures are now packed by default. Set FORCE_PACKING=n to disable it.
* Created subdirectories for code.
* Ray now uses all sequences in an SFF file -- not just those matching the sequence key.
* Ray now estimates the genome length in RayOutput.CoverageDistributionAnalysis.txt.
* Fixed an integer overflow in CoverageDistribution when the number of k-mers occurring once is very large (for Assemblathon-2 datasets).
* Added exit code EXIT_NO_MORE_MEMORY=42 as suggested by Hannes Pouseele (applied-maths.com).
* Fixed an access violation on Windows. Bug reported by Hannes Pouseele (applied-maths.com).
* Fixed compilation errors for Microsoft Visual C++ (xiosbase and stdexcept). Bug reported by Hannes Pouseele (applied-maths.com).
* Ray compiles with Microsoft Visual Studio 10.0 without any change.
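
To give an idea of why a compile-time MAXKMERLENGTH matters, here is a small Python sketch of one common way to store k-mers: 2 bits per nucleotide in a fixed array of 64-bit words. This layout is an assumption made for illustration and is not copied from Ray's Kmer class; the names words_needed, pack_kmer and unpack_kmer are invented here.

Code:

```python
_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
_LETTERS = "ACGT"
_WORD_BITS = 64


def words_needed(max_kmer_length):
    """64-bit words required to hold 2 bits per nucleotide."""
    return (2 * max_kmer_length + _WORD_BITS - 1) // _WORD_BITS


def pack_kmer(kmer, max_kmer_length):
    """Pack a k-mer into a fixed-size list of 64-bit integers."""
    if len(kmer) > max_kmer_length:
        raise ValueError("k-mer longer than the compiled maximum")
    words = [0] * words_needed(max_kmer_length)
    for i, base in enumerate(kmer):
        bit = 2 * i  # always even, so a base never straddles two words
        words[bit // _WORD_BITS] |= _CODES[base] << (bit % _WORD_BITS)
    return words


def unpack_kmer(words, k):
    """Recover the first k nucleotides from the packed words."""
    out = []
    for i in range(k):
        bit = 2 * i
        out.append(_LETTERS[(words[bit // _WORD_BITS] >> (bit % _WORD_BITS)) & 3])
    return "".join(out)
```

Under this assumption, a maximum of 32 fits in one word and a maximum of 64 needs two, so every k-mer stored or sent between processes costs twice the memory. That would be a reason to keep the maximum as small as the chosen k allows.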

Website: http://denovoassembler.sourceforge.net/

**lletourn** · 06-13-2011, 04:24 PM

Very cool seb. I'm anxious to try out the MAXKMERLENGTH!

**kail** · 06-13-2011, 05:31 PM

Originally posted by seb567

E. coli

ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_1.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_2.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_1.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_2.fastq.bz2

If you search http://www.ncbi.nlm.nih.gov/sra, you can probably find a more up-to-date dataset. However, sra files take forever to convert...

When compiling Ray 1.6.0, be sure to turn off data structure packing, because I believe it will produce bus errors on Itanium processors.

Code:

wget http://sourceforge.net/projects/deno...-1.6.0.tar.bz2
tar xjf Ray-1.6.0.tar.bz2
cd Ray-1.6.0
make PREFIX=build-ray-1.6.0 FORCE_PACKING=n
make install
ls build-ray-1.6.0/Ray

Ray does not write any file while running, except result files. For a list, see

Manual

http://denovoassembler.sourceforge.net/manual.html#SECTION00060000000000000000

Why do you say your dataset is too large ?

seb567,

This is the first time I have assembled a genome, so I thought that my set was big because it has MANY sequences. Anyway...

How long will the assembly take if I have the following two sets?

Paired-Ends (500 +- 50)
47.803.856 pairs

Mate-pair (2200 +- 200)
42.599.342 pairs

PS: I'm using Ray 1.3.0
