Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #76
    Originally posted by corthay View Post
    Hi seb567,

    I have two questions.

    1) Can Ray handle sanger reads such as WGS, Fosmid end ?
    Good question. Particularly relevant I would say !


    I am currently modifying Ray heuristics to produce assemblies using paired sequences separated by very large physical distances.

    I simulated 4 libraries from E. coli K-12 MG1655 genome sequences. These reads include 0.5% mismatch errors. ANd I used wgsim available in samtools.

    Fragment standard deviation is 10% of the fragment average length.

    These libraries are

    200 +/- 20
    1000 +/- 100
    4000 +/- 400
    10000 +/- 1000



    Code:
    /software/samtools-0.1.7/wgsim -e 0.005 -d 200 -s 20 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Streptococcus-pneumoniae-R6.fasta 200-17_1.fastq 200-17_2.fastq 
    
    /software/samtools-0.1.7/wgsim -e 0.005 -d 1000 -s 100 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 1000_1.fastq 1000_2.fastq 
    
    /software/samtools-0.1.7/wgsim -e 0.005 -d 4000 -s 400 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 4000_1.fastq 4000_2.fastq 
    
    /software/samtools-0.1.7/wgsim -e 0.005 -d 10000 -s 1000 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 10000_1.fastq 10000_2.fastq

    Then, using Ray:

    Code:
    mpirun  -np 31 ~/Ray/trunk/code/Ray \
    -p 200_1.fastq 200_2.fastq -p 1000_1.fastq 1000_2.fastq -p 4000_1.fastq 4000_2.fastq -p 10000_1.fastq 10000_2.fastq | tee log

    Without scaffolding, Ray generates 65 contigs each having at least 500 nucleotides. There is a total of 4625792 assembled bases, all are A, T, C or G. The mean size is 71166, the N50 is 121756 and the largest is 371980.

    This is without any scaffolding.

    I then used MUMmer to validate contigs: 0 misassembly, 0 mismatch error and 0 small indel.

    Code:
            %  & numberOfContigs & bases & meanSize  & n50  & max   & coverage   & misassembled & mismatches & indels
     Ray & 65 &  & 71166 & 121756 &  371980 &  0.9965 & 0 & 0 & 0 \\
    99.65% of the genome is covered by Ray contigs.

    Running time/31 processors:

    Code:
    [1,0]<stdout>: Beginning of computation: 2 seconds
    [1,0]<stdout>: Distribution of sequence reads: 52 seconds
    [1,0]<stdout>: Distribution of vertices & edges: 2 minutes, 56 seconds
    [1,0]<stdout>: Calculation of coverage distribution: 0 seconds
    [1,0]<stdout>: Indexing of sequence reads: 15 seconds
    [1,0]<stdout>: Computation of seeds: 27 seconds
    [1,0]<stdout>: Computation of library sizes: 4 minutes, 45 seconds
    [1,0]<stdout>: Extension of seeds: 4 minutes, 15 seconds
    [1,0]<stdout>: Computation of fusions: 37 seconds
    [1,0]<stdout>: Collection of fusions: 0 seconds
    [1,0]<stdout>: Completion of the assembly: 14 minutes, 9 seconds


    The contig lengths:

    388
    641
    1038
    2164
    2714
    3438
    4735
    5845
    7199
    7939
    9557
    10364
    11883
    16323
    16937
    20213
    24049
    25788
    26056
    26433
    27014
    29449
    30205
    31159
    33620
    34875
    36084
    37445
    38010
    38550
    39509
    43086
    43623
    43841
    46455
    47970
    58692
    59083
    59441
    59889
    64822
    66096
    66767
    66783
    67838
    70713
    82799
    85321
    94029
    95691
    97407
    100961
    102906
    103167
    121756
    121858
    138099
    156439
    161412
    163121
    169912
    177402
    213535
    284796
    318866
    371980


    Let me answer your question now. For seed extension, an algorithmic process that generates contigs, available versions of Ray will not accomodate very large fragment lengths, 400 at most I think.

    It is because for larger distances, the right sequence in a pair most likely starts on a repeated k-mer. As such, constraints on the observed distances most be further enforced -- paired reads not satisfying the constraints must be set free in order to allow them to be used in the adequate time.

    Ray 1.2.3 'cRAYfish' is set to be released soon. For this release, I want to give a running time on 512 processors for a human genome. But, for the time being, I am currently waiting for some compute time on a compute cluster.

    1.2.3 fully utilizes large distances provided by jumping libraries (fosmid ends).

    I also plan to incorporate a scaffolder, but that is not my priority now.

    My priority is a memory utilization reducer that is utilized during the process of building the distributed de Bruijn graph.
    Originally posted by corthay View Post

    2) Does Ray do scaffolding ? The results did not contain N even though
    I used paired end as following.
    Not yet, but it is a planned feature.

    Note however that Ray make full use of paired reads to traverse repeated elements in the genome. And apparently it does that very well.


    Meanwhile, I suggest you use SSPACE.

    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    Bioinformatics paper http://bioinformatics.oxfordjournals...s.btq683.short


    Originally posted by corthay View Post

    The header line of fasta file is modified by SOAPdenovo correction tool.
    Is this problem ?

    Thanks,
    Corthay
    no

    Ray does not read the header in the fasta and fastq files containing sequences.

    Comment


    • #77
      Originally posted by gstitan View Post
      Hey,

      I have uploaded the last version of Ray to can obtain an amos output format. It ran during ten hours but I obtain errors as:
      MPI_Isend(145): MPI_Isend(buf=0x2b196b013ad0, count=1, MPI_UNSIGNED_LONG_LONG, dest=23612, tag=89, MPI_COMM_WORLD, request=0x7fff25927d24) failed
      MPI_Isend(95).: Invalid rank has value 23612 but must be nonnegative and less than 16

      Do you konw this problem (and have you a solution)?
      Yes, there is bug in Ray 1.2.2 concerning the numbering of reads in the output of an AMOS file.

      In Ray 1.2.2, I introduced the following feature:

      Parallel reading of sequence files

      But I forgot to modify the module that writes AMOS files.

      In Ray, each sequence has a unique identifier, encoded in a unsigned integer of 64 bits.

      Before Ray 1.2.2, the rank responsible for the sequence x was computed with x%NumberOfRanks. And the index of that sequence on the rank was computed with x/NumberOfRanks.

      In Ray 1.2.2 and later, this changed.

      Given a sequence numbered x, a number of ranks N and a number of sequences M, the rank of x is x/(M/N). If the computed rank exceeds the number of ranks minus 1, then it is decremented because the last rank will have a little more sequences.

      The index of a sequence on a rank is computed by subtracting from x the total number of sequences stored by ranks having a lower rank number.

      This issue is fixed in 1.2.3, which is set to be released in the next weeks, I guess.

      Originally posted by gstitan View Post

      Best regards

      PS: on the same data with a default output format, it was ok...!
      OK

      Comment


      • #78
        Hi seb567,

        Thank you for the detailed answer.
        I am looking forward to the new version
        Also, it would make me happy if Ray can handle more large kmer
        though it might not be important for Ray.

        Thanks
        Corthay

        Comment


        • #79
          Ray 1.2.3: new heuristics for large distances (jumping libraries)

          Hello,

          Ray 1.2.3 is now available.


          One of the new features is the capability of assembling genomes with
          larger distances.

          See this post for an example:
          Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc



          Other changes:

          1.2.3 'cRAYfish'
          2011-02-05

          • Instruction manual
          • Fixed a bug leading to a segmentation fault when providing an invalid file or no file at all. This bug was introduced in Ray 1.2.2.
          • Fixed a bug leading to a hang when providing invalid arguments.
          • Added reporting of memory usage after loading sequences and after distributing vertices.
          • Removed read simulators because samtools include one called 'wgsim'.
          • Changed the paired-end heuristics to accomodate larger distances.
          • Removed RepeatedVertexWatchdog.
          • Changed the RepeatedThreshold from 255 to 2*peakCoverage (or 255 if it is higher than 255)
          • Now both versions of read pairs are utilized.
          • Paired reads are utilized if the distance falls in mean-3 standard deviation;mean+3 standard deviation. (3 instead of 1 as described in the paper).
          • If a pair of reads does not respect the constraints, the right read is removed from the used read set.
          • Fusions are computed more cautiously.
          • Added reporting of memory usage for sequence reads and for vertices. And for a few other places too.
          • Improved memory usage for the extensions. Now calling destructors instead of clear() because the latter does not free memory.
          • Messages are grouped in the computation of seeds, using VirtualCommunicator.
          • Added a Doxygen configuration file.
          • The paths in the distributed graph are now numbered with an integer of 64 bits. This fixes a segmentation fault occuring with large genomes -- when identifiers overflow on 32 bits. (need to verify this)
          • The partition on sequence reads is outputted.
          • Instead of having the Rank 0 (master) to request the contigs of each every other rank, now each rank appends its fusions to the contigs file and also to the AMOS file.
          • Fixed a bug in the numbering of reads in the AMOS file.
          • Fixed a memory usage problem in the memory consumption reducer
          • Improved the running time for the extension of seeds by removing an inner loop.


          -seb

          Comment


          • #80
            Hello,

            I just began reading this whole thread (and skipped some of the longer posts :P) and I would like to know what is the current status with CS data. In some of the older posts you mention a few problems Ray has with CS data. Does it work correctly now?

            Thank you and greetings
            Leo
            L. Collado Torres, Ph.D. student in Biostatistics.

            Comment


            • #81
              Originally posted by lcollado View Post

              What is the current status with color space data ?
              It is mostly not tested, I would say.

              Comment


              • #82
                Ray 1.3.0 and de novo assembly of Illumina CEO genome in 11.5 h

                Dear all,

                Ray 1.3.0 is now available online.


                The most important change is the correction of a major bug that caused
                parallel infinite loop on the human genome.

                This, along concepts incorporated in Ray 1.2.4, allowed Ray to assemble
                the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
                below for the link).

                What's new?

                1.3.0

                2011-03-22

                * Vertices with less than 1 of coverage are ignored during the
                computation of seeds and during the computation of extensions.
                * Computation of library outer distances relies on the virtual
                communicator.
                * Expiry positions are used to toss away reads that are out-of-range
                * When only one choice is given during the extension and some reads
                are in-range, then the sole choice is picked up.
                * Fixed a bug for empty reads.
                * A read is not added in the active set if it is marked on a
                repeated vertex and its mate was not encountered yet.
                * Grouped messages in the extension of seeds.
                * Reads marked on repeated vertices are cached during the extension.
                * Paths are cached in the computation of fusions.
                * Fixed an infinite loop in the extension of seeds.
                * When fetching read markers for a vertex, send a list of mates to
                meet if the vertex is repeated in order to reduce the communication.
                * Updated the Instruction Manual
                * Added a version of the logo without text.


                I fixed a bug that caused an infinite loop. Now Ray can assemble large
                genomes. See my blog post for more detail about that.
                THE initial aims of our group regarding de novo assembly of genomes were: 1. To assemble genomes using mixes of sequencing technologies si...



                Version 1.2.4 of Ray incorporated also new concepts that I will present
                at RECOMB-Seq 2011.

                The talk is available online:



                Sébastien Boisvert

                Comment


                • #83
                  Cool, I was looking forward for a MPI-based assembler like this. Will have a try now. Thanks a lot!

                  Comment


                  • #84
                    Hi Sebastien,

                    thank you very much for this tool. I've been using several assemblers to process solexa and roche data on 3 bacterial genomes and your tool is one of the best performers. In fact, with my solexa datasets is the best . Anyway, I wonder if you could include more output formats in a next release. More precisely, I really wanted to have FASTQ output files.
                    Indeed, this would not be needed if there was an (almost) universal sequence/assembly format conversion tool. There are several around but are often very targeted.

                    cheers

                    Comment


                    • #85
                      Hi seb567, glad to have met you in Quebec at the Rendez-vous. I'm playing with Ray on some of my "troublesome" species and I'm having very good results.

                      Like corthay, I was wondering if you had any plans to support bigger kmers than 31 either through a compiler switch or something else.

                      One could argue that past a point (kmer > 100) an overlap based assembler might do a very good job, but where speed is concerned graph based assemblers are hard to beat.

                      Also, with illumina that now has high 3' quality long reads, maybe bigger kmer can be justified.

                      Thanks

                      Comment


                      • #86
                        seb, Ray looks like a great assembly tool but I have a question -- could I expect Ray to do a better job than Newbler if the input reads are 454 only (single and paired end)? Or is the advantage mainly seen when mixed platform reads are input (e.g. 454 and Illumina)?

                        Comment


                        • #87
                          Like corthay, I was wondering if you had any plans to support bigger kmers than 31 either through a compiler switch or something else.

                          One could argue that past a point (kmer > 100) an overlap based assembler might do a very good job, but where speed is concerned graph based assemblers are hard to beat.
                          I would also like to put in a request for longer kmers. Assembling paired illumina reads of 100 bp (forward) and 83 bp (reverse, quality-trimmed) and scaffolding with SSPACE, I get 537 scaffolds with an N50 of 19163 when I use the default kmer setting, but 254 scaffolds with an N50 of 52083 when I use -k 31. Velvet Optimiser using the same data picks a kmer length of 67. I can only guess how awesome an assembly Ray could give me with the ability to use longer kmers...

                          Comment


                          • #88
                            Originally posted by ttnguyen View Post
                            Cool, I was looking forward for a MPI-based assembler like this. Will have a try now. Thanks a lot!

                            I hope you enjoy Ray.


                            Originally posted by Pedro View Post
                            Hi Sebastien,

                            thank you very much for this tool. I've been using several assemblers to process solexa and roche data on 3 bacterial genomes and your tool is one of the best performers. In fact, with my solexa datasets is the best . Anyway, I wonder if you could include more output formats in a next release. More precisely, I really wanted to have FASTQ output files.
                            Indeed, this would not be needed if there was an (almost) universal sequence/assembly format conversion tool. There are several around but are often very targeted.

                            cheers
                            Ray is very good when errors are randomly occuring. 454's homopolymers are not random.


                            I guess you would like the nucleotides to be accompanied by quality values in the output. Is that so ?


                            Originally posted by lletourn View Post
                            Hi seb567, glad to have met you in Quebec at the Rendez-vous. I'm playing with Ray on some of my "troublesome" species and I'm having very good results.

                            Like corthay, I was wondering if you had any plans to support bigger kmers than 31 either through a compiler switch or something else.

                            One could argue that past a point (kmer > 100) an overlap based assembler might do a very good job, but where speed is concerned graph based assemblers are hard to beat.

                            Also, with illumina that now has high 3' quality long reads, maybe bigger kmer can be justified.

                            Thanks
                            I agree with you. I will check how easy (or hard) it is for me to change the upper bound for k-mer length in Ray.


                            Originally posted by ssully View Post
                            seb, Ray looks like a great assembly tool but I have a question -- could I expect Ray to do a better job than Newbler if the input reads are 454 only (single and paired end)? Or is the advantage mainly seen when mixed platform reads are input (e.g. 454 and Illumina)?
                            With only 454 reads, Ray will not produce a very good assembly because the errors are not randomly occuring in reads. Ray will go through some of these pesky homopolymer bubbles in the graph.

                            However, should you mix 454 with another sequencing technology with random errors, you will obtain a better that one with only 454 alone.



                            Originally posted by Adjuvant View Post
                            I would also like to put in a request for longer kmers. Assembling paired illumina reads of 100 bp (forward) and 83 bp (reverse, quality-trimmed) and scaffolding with SSPACE, I get 537 scaffolds with an N50 of 19163 when I use the default kmer setting, but 254 scaffolds with an N50 of 52083 when I use -k 31. Velvet Optimiser using the same data picks a kmer length of 67. I can only guess how awesome an assembly Ray could give me with the ability to use longer kmers...
                            I understand, I will see how easy (or hard) it is to change the maximum k-mer length in Ray.




                            Thank you for your interest in Ray !

                            p.s.: I am presently optimising Ray by studying its communicational behavior.



                            x axis: time; blue: number of iterations; red: number of sent messages; green: number of received messages; data are collected 10 times per second

                            I am also in the process of devising a better way to compute seeds -- with are relatively similar to the concept of unipaths observed in classic assemblers.

                            Comment


                            • #89
                              Hi seq567 do you think it's a wonderful or terrible idea to try Ray on RNASeq data since coverage is not uniform?

                              You don't do bubble bursting, but when choosing paths could non-uniform coverage create mis-assemblies?

                              (I could read all the source, but I thought it would be faster this way :-) )

                              Comment


                              • #90
                                Originally posted by lletourn View Post
                                Hi seq567 do you think it's a wonderful or terrible idea to try Ray on RNASeq data since coverage is not uniform?

                                Worth a try although I believe the coverage distribution may have more than one peak because, as you said, non-uniformity is present.


                                Originally posted by lletourn View Post

                                You don't do bubble bursting, but when choosing paths could non-uniform coverage create mis-assemblies?
                                Choosing the next vertex to visit is not performed with the coverage. Instead, pairs of sequences are utilised for that purpose.

                                Bubbles are not bursted, but when Ray encounters one (in general, an heterozygous site), it will attempt to pass through it by selecting one child.

                                In my opinion, bubble bursting and bubble merging are bad approaches because you either lose information (bursting) or create misassemblies that are otherwise not so hard to avoid (merging).

                                Originally posted by lletourn View Post
                                (I could read all the source, but I thought it would be faster this way :-) )
                                However, the bits related to message passing may interfere with your reading.

                                Things are easier to grasp in natural languages than in computer languages !

                                ***
                                Sébastien Boisvert



                                Last edited by seb567; 04-26-2011, 11:30 AM. Reason: corrected a typo

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 12:17 PM
                                0 responses
                                7 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 10:49 AM
                                0 responses
                                18 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-25-2024, 11:49 AM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-24-2024, 08:47 AM
                                0 responses
                                21 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X