  • #31
    Hi,

    Question 1

    Does Ray support Illumina 1.6+ fastq format (with trailing B's)?

    Answer

    No, but it should work as long as the B's are at the end of the quality string and not at the beginning.



    Question 2

    Does Ray have the capability for trimming low-quality bases or should I pre-process my reads beforehand?

    Answer

    Ray does not trim sequences, but random errors are not a problem.
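For readers who still prefer to pre-process their reads, a minimal shell sketch for stripping trailing B's could look like the following. It assumes 4-line FASTQ records and the Illumina 1.5+ convention in which a run of B at the end of the quality string flags unreliable bases. `trim_trailing_b` is a hypothetical helper and is not part of Ray.

```shell
# trim_trailing_b FILE: print FILE with trailing B's removed from each
# quality string and the sequence shortened to the same length.
# Assumption: every record spans exactly 4 lines.
trim_trailing_b() {
    awk 'NR % 4 == 1 { header = $0 }
         NR % 4 == 2 { seq = $0 }
         NR % 4 == 3 { plus = $0 }
         NR % 4 == 0 {
             qual = $0
             sub(/B+$/, "", qual)          # drop the trailing run of B
             keep = length(qual)
             if (keep > 0) {               # reads made only of B are dropped
                 print header
                 print substr(seq, 1, keep)
                 print plus
                 print qual
             }
         }' "$1"
}

# Example: trim_trailing_b reads.fastq > reads.trimmed.fastq
```

Because reads consisting only of B's are dropped, running this separately on the two files of a paired-end library could leave them out of sync; paired data would need both mates filtered together.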



    Question 3

    Should I convert my libraries to Phred/Sanger scores?

    Answer

    No: Ray does not use quality scores.



    Question 4

    Can I run Bambus with Ray's output?

    Answer

    Ray outputs a fasta file -- I have never utilized Bambus, so the question is what Bambus needs.


    -Sebhtml



    • #32
      OpenMPI 1.2.8?

      Hi Sebastien,
      I followed the instructions on the wiki page and launched Ray about 20 hours ago. The processes are still running. Is this normal? When should it finish? I believe we have enough computing power. My only concern is that we have OpenMPI 1.2.8 instead of 1.3.4 or 1.4.1. Does Ray work with 1.2.8? Any suggestions would be appreciated. Thank you very much for your great work, Best, ldong

      == Do-it-yourself examples ==

      === E. coli K-12 MG1655 with Illumina paired-end reads & amos output ===

      * Reads
      ftp://ftp.ncbi.nlm.nih.gov/sra/stati...665_1.fastq.gz
      ftp://ftp.ncbi.nlm.nih.gov/sra/stati...665_2.fastq.gz
      ftp://ftp.ncbi.nlm.nih.gov/sra/stati...666_1.fastq.gz
      ftp://ftp.ncbi.nlm.nih.gov/sra/stati...666_2.fastq.gz

      * Create a file named Commands.Ray (for example with vi) with the following content
      LoadPairedEndReads SRR001665_1.fastq SRR001665_2.fastq 215 20
      LoadPairedEndReads SRR001666_1.fastq SRR001666_2.fastq 215 20
      OutputAmosFile

      * Command line
      mpirun -np 32 Ray Commands.Ray
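The steps above can be scripted. The sketch below writes the command file exactly as quoted and launches Ray only when `mpirun` and `Ray` are found on the PATH. The decompression loop is an assumption: it expects the downloaded `.fastq.gz` files to sit in the current directory.

```shell
#!/bin/sh
# Write the Ray command file with the content quoted from the wiki.
cat > Commands.Ray <<'EOF'
LoadPairedEndReads SRR001665_1.fastq SRR001665_2.fastq 215 20
LoadPairedEndReads SRR001666_1.fastq SRR001666_2.fastq 215 20
OutputAmosFile
EOF

# The command file names uncompressed .fastq files, so decompress any
# downloaded archives first (assumption: they are in the current directory).
for f in SRR00166[56]_[12].fastq.gz; do
    if [ -e "$f" ]; then
        gunzip "$f"
    fi
done

# Launch with 32 processes, as in the wiki example, if the tools exist.
if command -v mpirun >/dev/null 2>&1 && command -v Ray >/dev/null 2>&1; then
    mpirun -np 32 Ray Commands.Ray
fi
```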



      • #33
        Re:

        Hi,

        @ldong

        What is the connectivity? (Infiniband's OK)

        Which step does Ray reach before stalling?

        Do dots keep being printed?

        If the answer's no, then it might be the spin-lock bug in Open-MPI.

        https://svn.open-mpi.org/trac/ompi/ticket/2043 (should be fixed in Milestone 1.4.3.)


        Anyway, 1.2.8's very old. Current release's 1.4.2!

        I also need to optimize communication for "Computing seeds" and later steps in Ray.


        Recommendation: upgrade to 1.4.1 or 1.4.2
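To check which Open-MPI version is installed before a long run, something like the sketch below may help. `version_at_least` is a hypothetical helper, and the `sed` pattern assumes `mpirun --version` prints a line containing `(Open MPI) x.y.z`. The 1.4.3 threshold comes from the ticket milestone mentioned above.

```shell
# version_at_least HAVE WANT: succeeds when dotted version HAVE >= WANT.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0   # missing components count as 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

if command -v mpirun >/dev/null 2>&1; then
    # Assumption: the version line looks like "mpirun (Open MPI) 1.4.2".
    have=$(mpirun --version 2>&1 | sed -n 's/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p' | head -n 1)
    if [ -z "$have" ]; then
        echo "could not parse an Open MPI version from mpirun --version"
    elif version_at_least "$have" 1.4.3; then
        echo "Open MPI $have: at or above 1.4.3"
    else
        echo "Open MPI $have: older than 1.4.3, consider upgrading"
    fi
fi
```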


        Cordially,

        Sébastien



        • #34
          @ldong What is your 'computing power'?



          • #35
            Hi, Seb,

            Thank you very much for your suggestions. Not sure if we can upgrade openmpi. I will check with system administrator.

            We have a few nodes with 16 CPUs and 64G of memory that I can use to test Ray. Here is what I found:

            When I run Ray on two nodes with 32 processes, there is always one process on the first node of the host list that slowly reaches 25G of memory and then gets killed by the system. The other processes never reach 1G.

            It seems like 25G is a system limitation. I will ask our administrator. What do you think? Best, ldong
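One way to look for such a limit is to print the per-process resource limits of the shell that launches the job. This is only a sketch: `report_limit` and `kb_to_gib` are hypothetical helpers, `ulimit -v` and `ulimit -d` are assumed to report values in units of 1024 bytes (as bash does), and a batch scheduler or the kernel's out-of-memory killer could equally be responsible for the kill.

```shell
# Convert a value in KiB to whole GiB.
kb_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

# report_limit LABEL VALUE: VALUE is a number in KiB or the word "unlimited".
report_limit() {
    if [ "$2" = "unlimited" ]; then
        echo "$1: unlimited"
    else
        echo "$1: $(kb_to_gib "$2") GiB"
    fi
}

# Limits of the current shell; run this on the compute node, in the same
# environment the MPI job starts from, for the result to be meaningful.
report_limit "virtual memory" "$(ulimit -v)"
report_limit "data segment" "$(ulimit -d)"
```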



            • #36
              hang? during "Extending seeds"

              I compiled Ray with openmpi 1.4.2, gcc version 4.1.2 20080704 (Red Hat 4.1.2-44), on the x86_64 architecture and ran it with "mpirun -mca btl ^sm". The data are 3 simulated Illumina libraries comprising 52x coverage of a 225MB chromosome: 40x of 500bp PE 95nt reads (inward facing), 8x of 5kb mate-paired 36nt reads (outward facing), and 4x of 10kb mate-paired 36nt reads (outward facing).

              Using 128 cores (16 8-core nodes), it runs fine up until the "Extending seeds" step. After a while the printing of the dots seems to slow down to a glacial pace. I've let it sit for several days with no progress. Do you think this is an Open-MPI problem? Any ideas on getting around it?



              • #37
                If the paired-end reads are put in the same file, can Ray handle it?



                • #38
                  Replies

                  @talioto What is the interconnection?

                  @baihezimu No.
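Since paired reads in a single file are not handled, an interleaved FASTQ file would have to be split into two files first. The sketch below assumes 4-line records with the two mates of each pair stored one after the other. `split_interleaved` is a hypothetical helper, not a Ray tool.

```shell
# split_interleaved IN LEFT RIGHT: write the first mate of every pair to
# LEFT and the second mate to RIGHT.
# Assumption: 4 lines per record, mates alternate (8 lines per pair).
split_interleaved() {
    awk -v left="$2" -v right="$3" '{
        if ((NR - 1) % 8 < 4)
            print > left      # lines 1-4 of each 8-line block
        else
            print > right     # lines 5-8 of each 8-line block
    }' "$1"
}

# Example: split_interleaved pairs.fastq pairs_1.fastq pairs_2.fastq
```

The two resulting files can then be given to LoadPairedEndReads in the same way as the E. coli example earlier in the thread.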



                  • #39
                    Ray paper is finally available

                    Sébastien Boisvert, François Laviolette, Jacques Corbeil.
                    Ray: Simultaneous Assembly of Reads from a Mix of High-Throughput Sequencing Technologies
                    Journal of Computational Biology
                    Volume and issue not available; ahead of print.
                    doi:10.1089/cmb.2009.0238
                    Last edited by seb567; 10-20-2010, 07:56 AM. Reason: removed linebreak.



                    • #40
                      Congrats on e-publication

                      Congratulations on Ray's epub!

                      I learned of your work about 2 weeks ago, familiarized myself with the documentation you've provided, and successfully completed the E. coli assembly with sample data.

                      As the publication is now complete, could you please provide some more details with respect to your processing of Human Chromosome 1 (under the limitations section of the project website).

                      Specifically, could you provide the output text for that run so that I could better ascertain:
                      a. the run time for that assembly on your hardware
                      b. the version of the Open-MPI Library used in that assembly

                      I'd like to use Ray for assembly of sequencing reads for eukaryotes... and would like to know what, if any, potential problems to anticipate.

                      As Open-MPI 1.5 has now been released, I'd like to know if the shared memory problem is still a concern when performing analyses of larger datasets. I believe that this has been fixed in versions > 1.4.1, but would like to know for certain if it is a problem with Ray before spending hours of analysis time on shared hardware.

                      Thanks for both your work and time in addressing these questions-- your efforts are very much appreciated.



                      • #41
                        Response to 'Congrats on e-publication'

                        Congratulations on Ray's epub!
                        Thanks !

                        I learned of your work about 2 weeks ago, familiarized myself with the documentation you've provided, and successfully completed the E. coli assembly with sample data.
                        Now that is reproducible research !

                        As the publication is now complete, could you please provide some more details with respect to your processing of Human Chromosome 1 (under the limitations section of the project website).
                        Well, I used Open-MPI 1.3.4 with shared memory disabled. Ray 0.0.7 was
                        utilized.

                        I simulated reads of length 50 at a depth of 50 for the human chromosome
                        1 (the largest). To do so, I used the simtools provided with Ray. To get
                        them, type 'make simtools'.

                        The wiki is misleading on this, because I actually used an MPI-enabled
                        Infiniband-connected computer. I'll correct that shortly. Precisely, 384
                        cores were used.

                        The Sun Grid Engine script follows.

                        Code:
                        [12@colosse2 0.0.7-run]$ cat Human-chr1-ompi-1.3.4-gcc.sh
                        #!/bin/bash
                        #$ -N Ray
                        #$ -P nne-790-aa
                        #$ -l h_rt=24:00:00
                        #$ -pe mpi 384
                        module load compilers/gcc/4.4.2 mpi/openmpi/1.3.4_gcc
                        /software/MPI/openmpi-1.3.4_gcc/bin/mpirun /home/12/Ray/tags/0.0.7/Ray /home/12/nne-790-aa/colosse.clumeq.ca/qsub/Ray-input.txt 
                        If you ask why Open-MPI 1.3.4, it is because all other versions have
                        shared memory enabled on the said computer, and that Open-MPI 1.4.3 is
                        not available yet to users of the said computer.

                        The content of the command file:

                        Code:
                        [12@colosse2 0.0.7-run]$ cat Ray-input.txt 
                        LoadSingleEndReads /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
                        Specifically, could you provide the output text for that run so that I could better ascertain:
                        a. the run time for that assembly on your hardware
                        b. the version of the Open-MPI Library used in that assembly

                        Code:
                        [12@colosse2 0.0.7-run]$ cat Ray.o876984
                        **************************************************
                        This program comes with ABSOLUTELY NO WARRANTY.
                        This is free software, and you are welcome to redistribute it
                        under certain conditions
                        see "gpl-3.0.txt" for details.
                        **************************************************

                        Ray Copyright (C) 2010  Sébastien Boisvert, Jacques Corbeil, François Laviolette
                        http://denovoassembler.sf.net/

                        AssemblyEngine: Ray 0.0.7
                        NumberOfRanks: 384
                        MPILibrary: Open-MPI 1.3.4
                        OperatingSystem: Linux

                        LoadSingleEndReads
                        Sequences: /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta

                        Loading /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
                        Distributing sequences
                        Counting vertices
                        Loading /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
                        Indexing sequences
                        Connecting vertices
                        MinimumCoverage: 5
                        PeakCoverage: 30
                        Computing seeds
                        Extending seeds
                        Computing fusions
                        Finishing fusions
                        Collecting fusions

                        Writing Ray-Contigs.fasta
                        140101 contigs/175230944 nucleotides
                        Elapsed time: 0 d 13 h 48 min 28 s
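To put the figures in this log in perspective, the short calculation below converts the elapsed time to seconds, then derives the total core-hours for the 384 ranks and the mean contig length. `elapsed_to_seconds` is a hypothetical helper that assumes the "d h min s" layout shown in the log, with no leading zeros.

```shell
# elapsed_to_seconds "D d H h M min S s": convert Ray's elapsed-time
# string into seconds. Assumption: fields appear in exactly this order.
elapsed_to_seconds() {
    set -- $1                 # split the string into words
    echo $(( $1 * 86400 + $3 * 3600 + $5 * 60 + $7 ))
}

seconds=$(elapsed_to_seconds "0 d 13 h 48 min 28 s")
echo "elapsed seconds: $seconds"                               # 49708
echo "core-hours on 384 ranks: $(( 384 * seconds / 3600 ))"    # 5302
echo "mean contig length: $(( 175230944 / 140101 )) nt"        # 1250
```

All divisions are integer divisions, so the core-hours and mean contig length are rounded down.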

                        I'd like to use Ray for assembly of sequencing reads for eukaryotes... and would like to know what, if any, potential problems to anticipate.
                        I have not myself extensively used Ray on eukaryotic sequence reads, so I am not really aware of potential pitfalls.

                        As Open-MPI 1.5 has now been released, I'd like to know if the shared memory problem is still a concern when performing analyses of larger datasets. I believe that this has been fixed in versions > 1.4.1, but would like to know for certain if it is a problem with Ray before spending hours of analysis time on shared hardware.
                        You had better use Open-MPI 1.4.3, as it is a very stable release, whereas Open-MPI 1.5 is a feature release. I only have access to Open-MPI 1.3.4 with shared memory disabled and Open-MPI 1.4.1 with defaults.

                        I should gain access to Open-MPI 1.4.3 with defaults in the next days/weeks.

                        Thanks for both your work and time in addressing these questions-- your efforts are very much appreciated.
                        Thank you also for bringing these questions.

                        Sébastien



                        • #42
                          Ray 0.1.0 is out

                          Dear de novo assembly enthusiasts:

                          Following the publication and some work over the last months, Ray 0.1.0
                          is now available incorporating (some) features requested as well as
                          improvements on speed (Extending seeds).

                          Here is the full list of changes, based on the NEWS file.

                          v. 0.1.0
                          2010-11-03

                          * Moved some code from Machine.cpp to new files. (Ticket #116)
                          * Improved the speed of the extension of seeds by reducing the number of messages sent. (Tickets #164 & #490)
                          Thanks to all the people who reported this on the list !
                          * Ray is now verbose ! (Ticket #167)
                          Feature requested by Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
                          University, AUSTRALIA)
                          * The k-mer size can now be changed. Minimum value is 15 & maximum value is 32. (Tickets #169 & #483)
                          Feature requested by Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
                          University, AUSTRALIA)
                          * Ray should now work on architectures that require addresses to be aligned on 8 bytes, such as Itanium. (Ticket #446)
                          Bug reported by Jordi Camps Puchades (Centre Nacional d'Anàlisi Genòmica/CNAG)
                          * Added reference to the paper in stdout. (Ticket #479)
                          * The coverage distribution is now always written. (Ticket #480)
                          * The code for extracting edges is now in a separate file (Ticket #486)
                          * Messages for paired reads are now grouped with messages for querying sequences in the extension of seeds. (Tickets #487 & #495)
                          * Messages for sequence reads are now done only once, when the read is initially discovered. (Ticket #488)
                          * Messages with tag TAG_HAS_PAIRED_READ are grouped with messages to get sequence reads. (Ticket #491)
                          * Added TimePrinter to print the elapsed time at each step. (Ticket #494)
                          * All generated files (AMOS, Contigs, and coverage distribution) are named following the -o parameter. (Ticket #426)
                          Feature requested by Jordi Camps Puchades (Centre Nacional d'Anàlisi Genòmica/CNAG)
                          * Print an exception if requested memory exceeds CHUNK_SIZE. That should never happen. (r3690)
                          * Print an exception if the system runs out of memory.
                          * Ray informs you on the number of k-mers for a k-mer size. (r3691)
                          * Unique IDs of sequence reads are now unsigned 64-bit integers. (r3710)
                          * The code is now in code/, scripts are now in scripts/. Examples are in scripts/examples/. (r3712)
                          * The compilation is more verbose. (r3714)
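The k-mer size bounds listed above (minimum 15, maximum 32) are easy to check in a wrapper script before submitting a long job. `valid_kmer_size` below is a hypothetical helper and not part of Ray. The comment about the upper bound is an assumption: 32 is what one would expect if each k-mer is packed into a 64-bit word at 2 bits per nucleotide.

```shell
# valid_kmer_size K: succeeds when K is an integer in the range 15..32
# stated in the 0.1.0 changelog.
# Assumption (not from the changelog): 32 nucleotides * 2 bits = 64 bits
# would explain the upper bound.
valid_kmer_size() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 15 ] && [ "$1" -le 32 ]
}

for k in 14 21 32 33; do
    if valid_kmer_size "$k"; then
        echo "k=$k accepted"
    else
        echo "k=$k rejected"
    fi
done
```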


                          Download it:


                          I will update the wiki shortly with improved running times for the E.
                          coli dataset as well as in-depth simulation of paired reads on
                          chromosome 1 (with errors).

                          Thank you !



                          • #43
                            First of all: thanks for providing Ray. I am reading the paper and it sounds very promising.

                            I am testing version 0.1.0 (openMPI 1.4.2, compiled with intel11.1) on 8 nodes with 8 cores and 48GB each. The data are 12 lanes of Illumina PE reads and two 454 runs from a bird species we are sequencing. For the first 14 hours Ray wrote tons of messages to ray.out, but for the past 36 hours it has been quiet while still keeping a 100% load on the nodes and using about 5GB of memory for each job.

                            Is this quiet to be expected, or a manifestation of this "spin-lock" bug mentioned above? Is there any way of checking that Ray is still running OK?
                            Cheers
                            Pallo

                            EDIT: OK, looking closer at the spin-lock bug reports, it only seems to affect GCC, so I'll try to be patient.
                            Last edited by pallo; 11-08-2010, 02:54 AM.



                            • #44
                              Where does it hang?

                              I think the bug 2043 was addressed in Open-MPI 1.4.3.

                              I don't know if ICC can produce the same problem though.




                              Yes, there is a way if you can log on the worker nodes.


                              First, get the pid of the processes associated to Ray

                              ps aux|grep Ray

                              then, attach a gdb instance to one of them.

                              gdb -p <pid of a Ray instance>

                              Finally, print a backtrace in gdb

                              bt

                              You will see which code is currently executed.
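The same procedure can be run non-interactively over every Ray process on a node. This is a sketch, assuming `pgrep` and `gdb` are installed on the worker node and that you have permission to attach to the processes. `backtrace_ray` is a hypothetical helper name.

```shell
# backtrace_ray [NAME]: print a gdb backtrace for every process whose
# name is exactly NAME (default: Ray) on the current node.
backtrace_ray() {
    for pid in $(pgrep -x "${1:-Ray}"); do
        echo "== pid $pid =="
        # -batch exits after running the commands; -ex bt prints the backtrace.
        gdb -p "$pid" -batch -ex bt 2>/dev/null
    done
}

# Example, on a worker node: backtrace_ray Ray
```

If the top frames of several ranks stay the same across repeated calls, that suggests where the run is stuck.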



                              What is your interconnection?

                              Infiniband or gigabit Ethernet?



                              • #45
                                Hi,

                                The job had to be killed for other reasons, but here are the last lines of ray.out:

                                $ tail testrun/ray.out
                                Rank 51 stores an extension, 1354 vertices.
                                Rank 51 starts on a seed, length=106
                                Rank 43 starts on a seed, length=444
                                Rank 4 stores an extension, 1166 vertices.
                                Rank 4 starts on a seed, length=142
                                Rank 59 stores an extension, 152 vertices.
                                Rank 59 starts on a seed, length=211
                                Rank 6 stores an extension, 1095 vertices.
                                Rank 6 starts on a seed, length=89
                                Rank 0 starts on a seed, length=1175
                                The interconnections are Infiniband.
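A log like the one above is easier to read when reduced to the last message from each rank. The sketch below assumes every progress line starts with "Rank <number>", as in the excerpt. `last_message_per_rank` is a hypothetical helper.

```shell
# last_message_per_rank FILE: print the most recent "Rank N ..." line for
# each rank found in FILE, sorted by rank number.
last_message_per_rank() {
    awk '$1 == "Rank" { last[$2] = $0 }
         END { for (r in last) print last[r] }' "$1" | sort -k2,2n
}

# Example: last_message_per_rank testrun/ray.out
```

Comparing two such summaries taken some time apart shows which ranks are still advancing.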

                                I'm rerunning the job on a bigger set of nodes; I'll post the progress.

                                cheers
                                Pallo
