  • #91
    Ray 1.4.0: built-in scaffolder & more

    Dear Ray users,


    Ray 1.4.0 is now available.


    The most significant change is the built-in scaffolder.



    The second most significant change is the new algorithm that finds
    assembly seeds.

    Also, I added a lot of output files in Ray.

    They are listed here:




    Finally, our new website http://denovoassembler.sf.net is hopefully
    easier to browse.

    On the website, there is a manual for Ray.




    Sébastien

    1.4.0
    2011-05-30

    * A built-in scaffolder is now available -- Thanks to Dr.
    Jean-Francois Pombert (University of British Columbia) for the
    suggestion.
    * The maximum number of libraries is now 499 instead of 250.
    * The number of seeds is now divided by 2 to speed up their
    extension.
    * Fixed a bug in the depth-first search that led to vertices
    having no coverage values.
    * Removed the configure script, now Ray must be compiled with the
    provided Makefile.
    * Added a switch to enable the profiler: -run-profiler
    * Added a switch to debug seed generation: -debug-seeds
    * Added a switch to debug bubble detection: -debug-bubbles
    * Added a switch to show memory usage: -show-memory-usage
    * Added a switch to show the ending context of extensions:
    -show-ending-context
    * Devised a new algorithm that finds the peak coverage, minimum
    coverage and repeat coverage in distributions.
    * Ray now writes the peak, minimum and repeat coverages to a file.
    * Ray now writes the statistics for libraries to a file.
    * Fixed a bug that disallowed mixing manual and automatic
    detection of outer distances.
    * Ray now writes the statistics for seed lengths to a file.
    * Devised a new algorithm that computes longer seeds to bootstrap
    assemblies.
    * Slave modes, master modes and MPI tags are generated with macros
    for method prototypes, enumerations and assignments in arrays.
    * Added some changes for Microsoft Windows compatibility. Thanks
    to Hannes Pouseele (Applied Maths, Inc.) for some suggestions.
    * Added instructions regarding mpic++ and the CXX environment
    variable. Thanks to Dr. Harry Mangalam from UC Irvine for
    pointing that out.
    * Changed the merger behavior for ends of contigs.
    * Added a script to validate scaffolds.
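
The changelog does not describe how the peak and minimum coverages are found. Purely as an illustration of the idea, and explicitly not Ray's algorithm, here is a minimal sketch that assumes a k-mer coverage histogram with a steep error slope at low depth followed by a single genomic peak. The function name and the toy histogram are made up for this example, and the repeat coverage is not attempted.

```python
def find_minimum_and_peak(histogram):
    """Return (minimum_coverage, peak_coverage) for a k-mer coverage histogram.

    histogram maps a coverage depth to the number of k-mers seen at that depth.
    """
    depths = sorted(histogram)
    if not depths:
        raise ValueError("empty histogram")
    # Walk down the initial slope (mostly sequencing errors) until counts stop falling.
    i = 0
    while i + 1 < len(depths) and histogram[depths[i + 1]] < histogram[depths[i]]:
        i += 1
    minimum = depths[i]
    # The peak is the most frequent depth at or after that valley.
    peak = max(depths[i:], key=lambda d: histogram[d])
    return minimum, peak


# Toy data invented for this example.
toy = {1: 1000, 2: 200, 3: 40, 4: 10, 5: 30, 6: 90, 7: 150, 8: 120, 9: 60}
print(find_minimum_and_peak(toy))  # (4, 7)
```

Real distributions are noisier than this toy one, so a practical version would probably need smoothing before looking for the valley.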



    • #92
      How to load SFF files?

      Dear Sébastien,

      I recently started testing Ray on read data of a ~450Mb genome. I can load FASTQ files without a problem, but I can't figure out how to load SFF files. The Instruction Manual only has examples for loading FASTQ files.

      I just started a job with the following command,
      Code:
      mpirun -np 16 time Ray \
      -s /home/sp/data/454/shotgun/F0A0H9G01.sff \
      -i /home/sp/data/454/pairedend_20k/FPFSKVK01.sff \
      -i /home/sp/data/454/pairedend_3k/FO2K76101.sff \
      -k 17 -o melon_454_small_test_20110604
      But the output contains a lot of the following,
      Code:
      $ tail mpirun.o22658
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      Not KEY, was GACT expected TCAG
      So I'm probably doing something wrong. Could you please explain how I should do this?



      • #93
        Originally posted by figure002 View Post
        Dear Sébastien,

        I recently started testing Ray on read data of a ~450Mb genome. I can load FASTQ files without a problem, but I can't figure out how to load SFF files. The Instruction Manual only has examples for loading FASTQ files.

        I just started a job with the following command,
        Code:
        mpirun -np 16 time Ray \
        -s /home/sp/data/454/shotgun/F0A0H9G01.sff \
        -i /home/sp/data/454/pairedend_20k/FPFSKVK01.sff \
        -i /home/sp/data/454/pairedend_3k/FO2K76101.sff \
        -k 17 -o melon_454_small_test_20110604
        But the output contains a lot of the following,
        Code:
        $ tail mpirun.o22658
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        Not KEY, was GACT expected TCAG
        So I'm probably doing something wrong. Could you please explain how I should do this?
        In the SFF specification, it is said that the header of the file contains the prefix of sample sequences.

        The message you encountered means that your SFF file contains reads whose key sequence does not match the one listed in the header.

        Maybe they changed the SFF standard or you are using multiplex identifiers. In either case, I suggest you convert your SFF files to FASTA (or FASTQ) and supply the resulting files to Ray instead of the SFF files.

        Sébastien



        • #94
          Originally posted by seb567 View Post
          Maybe they changed the SFF standard or you are using multiplex identifiers. In either case, I suggest you convert your SFF files to FASTA (or FASTQ) and supply the resulting files to Ray instead of the SFF files.
          Sébastien, the 454 reads produced today will in many cases have the new key sequence GACT instead of TCAG. New library preparation kits (using the so-called 'Rapid Library' protocol) have this new key in the adaptors. It would be a great advantage if Ray could handle both key sequences!



          • #95
            Originally posted by flxlex View Post
            Sébastien, the 454 reads produced today will in many cases have the new key sequence GACT instead of TCAG. New library preparation kits (using the so-called 'Rapid Library' protocol) have this new key in the adaptors. It would be a great advantage if Ray could handle both key sequences!
            I believe the problem is due to a bug in an earlier version of the Roche/454 software. With the switch to Rapid Library chemistry Roche switched the keytag to GACT and released new software (?2.3?). The gsRunProcessor produced properly formatted SFF files which reported GACT as the keytag in the common header section of the SFF. However if you used the program sfffile to manipulate those SFFs (e.g. decode MID tags, split files or merge files) the new common header would erroneously report TCAG as the keytag. This bug appears to have been corrected in the latest release (2.5) of sfffile.

            What Sébastien seems to be saying is that Ray reads the common header of the SFF to determine what the keytag should be and in this case there is a mismatch between what the header reports the keytag to be and the keytag observed in the reads. It seems that figure002's SFF file(s) have fallen victim to this bug in sfffile.
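
To check which key a given SFF file reports in its common header, something like the following could be used. It is a minimal sketch based on my reading of the SFF common header layout (big-endian, 31 fixed bytes, then the flow characters, then the key sequence); the function names are made up for this example and it has not been tested on files produced by sfffile.

```python
import struct

# Fixed-size part of the SFF common header, big-endian:
# magic, version, index_offset, index_length, number_of_reads,
# header_length, key_length, number_of_flows_per_read, flowgram_format_code
COMMON_HEADER = struct.Struct(">4s4sQIIHHHB")


def parse_sff_common_header(data):
    """Return (key_sequence, number_of_reads) from the leading bytes of an SFF file."""
    (magic, _version, _index_offset, _index_length, number_of_reads,
     _header_length, key_length, number_of_flows,
     _format_code) = COMMON_HEADER.unpack_from(data, 0)
    if magic != b".sff":
        raise ValueError("not an SFF file")
    # The flow characters come right after the fixed part, then the key sequence.
    key_start = COMMON_HEADER.size + number_of_flows
    key = data[key_start:key_start + key_length]
    if len(key) != key_length:
        raise ValueError("header is truncated")
    return key.decode("ascii"), number_of_reads


def read_starts_with_key(read_bases, key):
    """True when an untrimmed read begins with the key reported by the header."""
    return read_bases.startswith(key)
```

Passing the first few kilobytes of the file as `data` should be enough, since only the header is inspected. A header key of TCAG together with reads that start with GACT would match the situation described above.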



            • #96
              Originally posted by flxlex View Post
              Sébastien, the 454 reads produced today will in many cases have the new key sequence GACT instead of TCAG. New library preparation kits (using the so-called 'Rapid Library' protocol) have this new key in the adaptors. It would be a great advantage if Ray could handle both key sequences!
              Ray simply fetches the key sequence from the SFF header. Ray has no preference for GACT or TCAG.


              Originally posted by kmcarr View Post
              I believe the problem is due to a bug in an earlier version of the Roche/454 software. With the switch to Rapid Library chemistry Roche switched the keytag to GACT and released new software (?2.3?). The gsRunProcessor produced properly formatted SFF files which reported GACT as the keytag in the common header section of the SFF. However if you used the program sfffile to manipulate those SFFs (e.g. decode MID tags, split files or merge files) the new common header would erroneously report TCAG as the keytag. This bug appears to have been corrected in the latest release (2.5) of sfffile.

              What Sébastien seems to be saying is that Ray reads the common header of the SFF to determine what the keytag should be and in this case there is a mismatch between what the header reports the keytag to be and the keytag observed in the reads. It seems that figure002's SFF file(s) have fallen victim to this bug in sfffile.
              Exactly my point.

              Meanwhile, what do you think would be the best way to deal with these ill-encoded SFF files generated by sfffile <2.5 with the rapid library chemistry ?

              I just don't see an easy way.



              • #97
                Originally posted by seb567 View Post
                Exactly my point.

                Meanwhile, what do you think would be the best way to deal with these ill-encoded SFF files generated by sfffile <2.5 with the rapid library chemistry ?

                I just don't see an easy way.
                I am not a Python guy but it looks to me like it would be fairly straightforward using Biopython's Bio/SeqIO/SffIO module. Read the file in, flip the value of 'key_sequence', write out a new file.

                (Still waiting for Bioperl Bio::SeqIO::SFF )
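
For what it is worth, the "flip the key" idea can also be sketched without Biopython, using only Python's standard `struct` module. This is an untested-on-real-data sketch that assumes the SFF common header layout as I understand it (key_length and number_of_flows_per_read stored as big-endian 16-bit integers at byte offsets 26 and 28, the key sequence right after the flow characters). The function name is made up for this example.

```python
import struct

FIXED_HEADER_SIZE = 31  # bytes before flow_chars in the SFF common header


def replace_header_key(data, new_key):
    """Return a copy of SFF bytes with the common-header key sequence replaced.

    Only the header is touched; reads, clipping values and the index are left
    as they are. The new key must have the same length as the old one so that
    no offset in the file moves.
    """
    if data[:4] != b".sff":
        raise ValueError("not an SFF file")
    # key_length and number_of_flows_per_read are big-endian uint16 at offsets 26 and 28.
    key_length, number_of_flows = struct.unpack_from(">HH", data, 26)
    if len(new_key) != key_length:
        raise ValueError("new key must be %d bases long" % key_length)
    key_start = FIXED_HEADER_SIZE + number_of_flows
    return data[:key_start] + new_key + data[key_start + key_length:]
```

One would read the whole file in binary mode, call `replace_header_key(data, b"GACT")`, and write the result to a new file, keeping the original as a backup. This loads the entire file into memory, which may not be practical for very large runs.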



                • #98
                  Originally posted by kmcarr View Post

                  (Still waiting for Bioperl Bio::SeqIO::SFF )
                  +1

                  I know if you want something done you should probably take the initiative and contribute, but I have seen several posts where people have said they were working on this. So, like many people, I decided to wait, assuming it was in progress. (Sorry for taking things off track in the thread though )



                  • #99
                    Originally posted by kmcarr View Post
                    I am not a Python guy but it looks to me like it would be fairly straightforward using Biopython's Bio/SeqIO/SffIO module. Read the file in, flip the value of 'key_sequence', write out a new file.

                    (Still waiting for Bioperl Bio::SeqIO::SFF )
                    Regardless, I guess it is correct to consider SFF files as containers, just like FASTA or FASTQ files.

                    Therefore, Ray will no longer try to match the key sequence. Instead, it will *simply* load all sequences in the SFF file and trim them using the clipping values therein.

                    See http://github.com/sebhtml/ray/commit/15826e290f1
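
As an illustration of what trimming with the clipping values means, here is a small sketch. It follows the usual reading of the SFF read header (four 1-based, inclusive clip positions, where 0 means "not set") and is not taken from Ray's source code; the function name and the example read are made up.

```python
def trim_read(bases, clip_qual_left, clip_qual_right,
              clip_adapter_left, clip_adapter_right):
    """Apply SFF clipping values (1-based, inclusive; 0 means "not set")."""
    n = len(bases)
    # Keep from the rightmost left clip to the leftmost right clip.
    first = max(1, clip_qual_left, clip_adapter_left)
    last = min(clip_qual_right or n, clip_adapter_right or n)
    # Convert the 1-based inclusive range to a Python slice.
    return bases[first - 1:last]


# A toy read that starts with the GACT key; the quality clip keeps bases 5 to 12.
print(trim_read("GACTACGTACGTNN", 5, 12, 0, 0))  # ACGTACGT
```

Because the left clip normally sits after the key, the key is removed by the trimming itself, so its exact sequence does not need to be checked.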


                    Originally posted by SES View Post
                    +1

                    I know if you want something done you should probably take the initiative and contribute, but I have seen several posts where people have said they were working on this. So, like many people, I decided to wait, assuming it was in progress. (Sorry for taking things off track in the thread though )
                    You mean taking the initiative to write code changes to BioPython so that Bio/SeqIO/SffIO can change the key sequence, right ?


                    Is there a software tool from 454 that allows one to change header information in a SFF file ?


                    There is also this thing called flower (the code is pretty awesome by the way -- it is in Haskell)

                    Blog post: http://blog.malde.org/index.php/flower/
                    Source code: http://malde.org/~ketil/biohaskell/flower/


                    Also, the Ray git tree is now on github.

                    GitHub - sebhtml/ray: Ray -- Parallel genome assemblies for parallel DNA sequencing


                    Furthermore, Ray can now handle arbitrarily large k-mers.

                    I am presently running some integration and unit tests on Ray v1.6.0-rc2.

                    You can download the latest development version of Ray with the following command *provided* that you have git.

                    Code:
                    git clone git://github.com/sebhtml/ray.git
                    To use large k-mers:

                    Code:
                    git clone git://github.com/sebhtml/ray.git
                    cd ray
                    make MAXKMERLENGTH=64 PREFIX=ray-git-master-kmax=64
                    make install
                    mpirun -np 128 ray-git-master-kmax=64/Ray -k 55 \
                    -p ABCD_1.fastq ABCD_2.fastq -o DeadlyBug,k=55
                    Enjoy !



                    • Originally posted by seb567 View Post
                      You mean taking the initiative to write code changes to BioPython so that Bio/SeqIO/SffIO can change the key sequence, right ?
                      No. I was just sympathizing with kmcarr and referring specifically to the need for sff support in bioperl.



                      • Test Ray

                        Dear all,


                        I’m trying to test the installation of Ray (and Open MPI) on my cluster. However, the dataset that I have is too big (~90,403,198 paired reads).

                        So, can someone tell me where I can get a smaller dataset to test Ray? The idea is to have a dataset that can run in 1 or 2 days… or less if possible.

                        Cluster description:

                        64 Itanium II 1.6 GHz processors, 128 GB RAM and an InfiniBand Voltaire 10 Gbps interconnect switch.

                        Also, does Ray write to the disk while it is running? Where?


                        Thanks in advance for your help!



                        PS: There are 16 nodes, each with four cores.



                        • Originally posted by kail View Post
                          Dear all,


                          I’m trying to test the installation of Ray (and Open MPI) on my cluster. However, the dataset that I have is too big (~90,403,198 paired reads).

                          So, can someone tell me where I can get a smaller dataset to test Ray? The idea is to have a dataset that can run in 1 or 2 days… or less if possible.

                          Cluster description:

                          64 Itanium II 1.6 GHz processors, 128 GB RAM and an InfiniBand Voltaire 10 Gbps interconnect switch.

                          Also, does Ray write to the disk while it is running? Where?


                          Thanks in advance for your help!



                          PS: There are 16 nodes, each with four cores.
                          E. coli

                          ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_1.fastq.bz2
                          ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_2.fastq.bz2
                          ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_1.fastq.bz2
                          ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_2.fastq.bz2

                          If you search http://www.ncbi.nlm.nih.gov/sra, you can probably find a more up-to-date dataset. However, sra files take forever to convert...


                          When compiling Ray 1.6.0, be sure to turn off data structure packing, because I believe it will produce bus errors on Itanium processors.

                          wget http://sourceforge.net/projects/deno...-1.6.0.tar.bz2
                          tar xjf Ray-1.6.0.tar.bz2
                          cd Ray-1.6.0
                          make PREFIX=build-ray-1.6.0 FORCE_PACKING=n
                          make install
                          ls build-ray-1.6.0/Ray
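
Some background on why packing matters, as far as I understand it: packing removes the padding a compiler normally inserts between fields, which saves memory but leaves multi-byte fields at unaligned addresses, and some architectures fault on unaligned accesses. The size difference can be illustrated with Python's standard `struct` module. This is only an analogy for a C++ structure with a 1-byte field followed by an 8-byte field, not Ray's actual data structures.

```python
import struct

# One 1-byte field followed by one 8-byte field,
# as in `struct { uint8_t a; uint64_t b; }`.
aligned_size = struct.calcsize("@BQ")  # native alignment: padding may be inserted before b
packed_size = struct.calcsize("=BQ")   # no alignment: fields are laid out back to back

print("aligned:", aligned_size, "bytes; packed:", packed_size, "bytes")
```

The packed layout is 9 bytes; the aligned size depends on the platform and is never smaller than the packed one.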


                          Ray does not write any file while running, except result files. For a list, see





                          Why do you say your dataset is too large ?



                          • Ray now supports arbitrarily large k-mers (MAXKMERLENGTH)

                            = 1.6.0 =
                            2011-06-13
                            • Moved the code tree from Subversion to git and from an in-house tree to a GitHub tree -- see http://github.com/sebhtml/ray
                            • Fixed a compilation problem in Scaffolder.cpp. Thanks to Volker Winkelmann (University of Cologne).
                            • Changed CC to MPICXX and added lines to compile Ray with Intel's MPI implementation. Thanks to Volker Winkelmann (University of Cologne).
                            • Implemented a Kmer class for arbitrarily long k-mers (MAXKMERLENGTH)
                            • Added pack and unpack methods to Kmer to abstract the communication of k-mers -- thanks to Élénie Godzaridis for the idea.
                            • Output contigs >= 100, not paths >= 100
                            • Detailed the warning for unmatched 454 prefix.
                            • Fixed a bug in the TLE entries in the AMOS file.
                            • The Makefile can now install Ray somewhere. (make PREFIX=prefix; make install)
                            • Structures are now packed by default. Set FORCE_PACKING=n to disable it.
                            • Created subdirectories for code.
                            • Ray now uses all sequences in an SFF file -- not just those matching the sequence key.
                            • Ray now estimates the genome length in RayOutput.CoverageDistributionAnalysis.txt.
                            • Fixed an integer overflow in CoverageDistribution when the number of k-mers occurring once is very large (for Assemblathon-2 datasets).
                            • Added exit code EXIT_NO_MORE_MEMORY=42 as suggested by Hannes Pouseele (applied-maths.com).
                            • Fixed an access violation on Windows. Bug reported by Hannes Pouseele (applied-maths.com).
                            • Fixed compilation errors for Microsoft Visual C++ (xiosbase and stdexcept). Bug reported by Hannes Pouseele (applied-maths.com).
                            • Ray compiles with Microsoft Visual Studio 10.0 without any change.
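
To give a rough idea of what the Kmer class and its pack and unpack methods could look like conceptually, here is a sketch that stores a k-mer at 2 bits per base in a fixed number of 64-bit words. It is not Ray's C++ implementation; the encoding, the word-count formula and all names below are assumptions made for illustration.

```python
import math

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
WORD_MASK = (1 << 64) - 1


def words_needed(max_kmer_length):
    """Number of 64-bit words needed at 2 bits per base."""
    return math.ceil(2 * max_kmer_length / 64)


def pack_kmer(kmer, max_kmer_length):
    """Encode a k-mer into a fixed-size list of 64-bit integers."""
    if len(kmer) > max_kmer_length:
        raise ValueError("k-mer is longer than the compile-time maximum")
    value = 0
    for i, base in enumerate(kmer):
        value |= CODE[base] << (2 * i)
    # Split the big integer into fixed-width words, lowest word first.
    return [(value >> (64 * w)) & WORD_MASK for w in range(words_needed(max_kmer_length))]


def unpack_kmer(words, k):
    """Decode the first k bases from a list produced by pack_kmer."""
    value = 0
    for w, word in enumerate(words):
        value |= word << (64 * w)
    return "".join(BASES[(value >> (2 * i)) & 3] for i in range(k))


kmer = "ACGT" * 13 + "ACG"  # 55 bases, as with -k 55
words = pack_kmer(kmer, 64)
print(len(words), unpack_kmer(words, 55) == kmer)  # 2 True
```

With this layout, a fixed-size array of words is what would travel in a message, which may be why a compile-time maximum such as MAXKMERLENGTH is convenient.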


                            Website: http://denovoassembler.sourceforge.net/



                            • Very cool seb. I'm anxious to try out the MAXKMERLENGTH!



                              • Originally posted by seb567 View Post
                                E. coli

                                ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_1.fastq.bz2
                                ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...65_2.fastq.bz2
                                ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_1.fastq.bz2
                                ftp://ftp.ddbj.nig.ac.jp/ddbj_databa...66_2.fastq.bz2

                                If you search http://www.ncbi.nlm.nih.gov/sra, you can probably find a more up-to-date dataset. However, sra files take forever to convert...


                                When compiling Ray 1.6.0, be sure to turn off data structure packing, because I believe it will produce bus errors on Itanium processors.

                                wget http://sourceforge.net/projects/deno...-1.6.0.tar.bz2
                                tar xjf Ray-1.6.0.tar.bz2
                                cd Ray-1.6.0
                                make PREFIX=build-ray-1.6.0 FORCE_PACKING=n
                                make install
                                ls build-ray-1.6.0/Ray


                                Ray does not write any file while running, except result files. For a list, see





                                Why do you say your dataset is too large ?

                                seb567,

                                This is the first time I have assembled a genome, so I thought that my dataset was big because it has MANY sequences. Anyway...

                                How long will the assembly take if I have the following two sets?

                                Paired-end (500 +- 50)
                                47,803,856 pairs

                                Mate-pair (2200 +- 200)
                                42,599,342 pairs

                                PS: I'm using Ray 1.3.0
                                Last edited by kail; 06-13-2011, 06:41 PM.
