  • Originally posted by habm:
    Our longer-insert Illumina mate-pair libraries have significant duplication contamination, i.e. two size peaks: one of inward-facing false PE reads ("innies") at under 300 bp, and one of outward-facing reads ("outies") nearer the desired insert size, e.g. 3000 or 5000 bp.
    How should the MP library size mean and SD be specified to allow Ray to deal with this, please?
    Thanks.

    PS: a run without any insert sizes specified (i.e. automatic detection) suggests that Ray found the innies OK, but not the useful outies:
    LibraryNumber: 1 (nominally 3 kbp, really more like 2200 bp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 457
    StandardDeviation: 441
    DetectionFailure: Yes

    LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 302
    StandardDeviation: 218
    DetectionFailure: Yes

    LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 260
    StandardDeviation: 213
    DetectionFailure: Yes

    Another way of expressing this is to ask whether Ray can disambiguate PE and MP reads, and if so, what input information is needed?
    Here is one way you can use your data:

    For each of your paired libraries, analyze the file PREFIX.LibraryX.txt.

    For example, for LibraryNumber 3:

    less PREFIX.Library3.txt
    Locate your peak near 6300 to find the real value. Let us suppose that it is 6189.

    The columns in these Library files are: observed outer distance and frequency.
    Basically, you want to locate the observed outer distance with the maximum frequency near 6300.

    Do that for all your paired libraries.
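
    If you prefer not to eyeball the files, here is a minimal Python sketch of that lookup, assuming the two-column layout described above (the filename, expected size and search window are placeholders):

    # Sketch: report the observed outer distance with the highest
    # frequency near an expected insert size, from a two-column file.
    def find_peak(path, expected, window=0.5):
        best_distance, best_frequency = None, -1.0
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) != 2 or not fields[0].isdigit():
                    continue  # skip headers or malformed rows
                distance, frequency = int(fields[0]), float(fields[1])
                # only consider distances within +/- 50% of the expected size
                if abs(distance - expected) <= window * expected and frequency > best_frequency:
                    best_distance, best_frequency = distance, frequency
        return best_distance

    print(find_peak("PREFIX.Library3.txt", 6300))  # e.g. 6189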

    Then, edit your Ray command to manually include the average and standard deviation for each of your libraries:

    mpirun -np 999999 /software/ray-version/Ray \
     -p lib3_1.fastq lib3_2.fastq 6189 618 \
     -p ...

    (Replace 6189 with the value you found in the Library file, and let the standard deviation be around 10% of it.)

    Doing so will allow you to use the long-insert pairs, but Ray will ignore the paired information for the short-insert contaminants (within each library) because they won't fall within the specified ranges.
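
    Illustratively (my sketch of that filtering idea, not Ray's actual code; the 3-standard-deviation window is an assumption):

    # Illustration only: a pair contributes paired information only if its
    # observed outer distance falls inside the specified window, here
    # assumed to be the mean +/- 3 standard deviations.
    def pair_is_usable(observed_distance, mean=6189, sd=618, k=3):
        return abs(observed_distance - mean) <= k * sd

    print(pair_is_usable(6200))  # True: a genuine long-insert "outie"
    print(pair_is_usable(250))   # False: a short-insert contaminant ("innie")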

    Still, all the reads will contribute to building the genome graph.

    Hope it helps.


    Sébastien



    • Originally posted by seb567:
      Well, for low memory usage, you definitely want to use Ray v1.6.1 (on its way; presently Ray v1.6.1-rc3, which is available at https://github.com/sebhtml/ray/zipball/v1.6.1-rc3).

      See http://sourceforge.net/mailarchive/m...sg_id=27781099 for more details.

      In your message, you don't report how much memory your compute cores have access to.
      I believe each core has 4 GB of RAM, though that may vary; some have 2 GB, some 1 GB. From what I understand, other programs that are single-threaded, or that have single-threaded bottlenecks, can't use distributed memory and have to be contained on one machine, which would limit me to 64 GB of RAM on our cluster. That's why I brought up the RAM limitations.

      Originally posted by seb567:
      Ray is a peer-to-peer program, that is you can launch it on 2048 compute cores if you want.
      Do you know how it scales? I'm billed linearly, so 2 hours on 128 cores costs the same as 1 hour on 256 cores. But does Ray actually run twice as fast on twice the cores? Or is it simply too variable to know the optimum, depending on the data?

      This isn't a huge concern, as the cost per CPU-hour is low, so long as the speed-up is close to linear and I'm not "wasting" large numbers of cores or getting stuck for long periods with too few cores.

      And I put "wasting" in quotes because I know none will really be wasted. I'm just concerned that at some point the marginal addition of another core isn't offset by the marginal decrease in time to completion. Or the reverse, where the program gets "stuck" due to limited memory per core or whatnot.

      Originally posted by seb567:
      But you should first do a run with k=31, just to quality-control the thing.
      Will do. I was planning on trying a few k-mer sizes anyway and comparing to a few other programs, to see what might work best.


      Originally posted by seb567:
      What is the interconnect between your compute cores?
      It's DDR InfiniBand, so it should be 4 Gbit/s.



      • Originally posted by Wallysb01:
        Do you know how it scales? I'm billed linearly, so 2 hours on 128 cores costs the same as 1 hour on 256 cores. But does Ray actually run twice as fast on twice the cores? Or is it simply too variable to know the optimum, depending on the data?
        After a night of playing around with the number of cores and letting the job run for an hour or two, I can somewhat answer my own question. I thought others might be interested, and I also wanted to confirm that this sounds right.

        Anyway, I ran two jobs with the same 2 lanes of data, totalling about 500M PE reads. One time I ran with 256 processors, the other with 512, both for between 1.5 and 2 hours. What I found is that on our cluster, loading the sequences was very slow, and it seemed to happen on only one processor at a time. So this step will take the same amount of time regardless of the processor count, as the limiting factor is only the amount of data flowing through our DDR InfiniBand, not the processors. By my estimation, loading alone was going to take 20 hours. That's kind of a lot of time for me to take up using 512 or even 256 processors, so I'm going to scale back to 96 or even 64 processors before I try again. But even 20 x 64 CPU-hours is not chump change for me, and that only loads the data.

        Now, once the data is loaded, the processor speed on our cluster should be comparable to the one used for the human data, which was handling about 12M reads per core. If I use 64 cores, I should still be at only 8M reads per core.

        So I'm hoping that, computationally, the cores can analyze the data roughly as quickly as colosse, and that later data transfers are not as much of a bottleneck. Can you confirm that is likely the case, Sébastien?
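
        As a back-of-the-envelope check of that reasoning, here is a small Amdahl-style cost model (my sketch; the 20-hour serial load comes from my estimate above, while the 500 parallel CPU-hours figure is a made-up placeholder):

        # Fixed serial loading phase plus a perfectly parallel phase.
        # Placeholder numbers, for illustration only.
        def wall_hours(cores, serial_hours=20.0, parallel_core_hours=500.0):
            return serial_hours + parallel_core_hours / cores

        for cores in (64, 96, 256, 512):
            hours = wall_hours(cores)
            print(f"{cores:4d} cores: {hours:5.1f} h wall, {cores * hours:7.0f} CPU-hours")

        Past a certain point, adding cores barely reduces wall time but inflates the CPU-hour bill, which is exactly the trade-off I'm worried about.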



        • Originally posted by Wallysb01:
          After a night of playing around with the number of cores and letting the job run for an hour or two, I can somewhat answer my own question. I thought others might be interested, and I also wanted to confirm that this sounds right.

          Anyway, I ran two jobs with the same 2 lanes of data, totalling about 500M PE reads. One time I ran with 256 processors, the other with 512, both for between 1.5 and 2 hours. What I found is that on our cluster, loading the sequences was very slow, and it seemed to happen on only one processor at a time. So this step will take the same amount of time regardless of the processor count, as the limiting factor is only the amount of data flowing through our DDR InfiniBand, not the processors. By my estimation, loading alone was going to take 20 hours. That's kind of a lot of time for me to take up using 512 or even 256 processors, so I'm going to scale back to 96 or even 64 processors before I try again. But even 20 x 64 CPU-hours is not chump change for me, and that only loads the data.

          Now, once the data is loaded, the processor speed on our cluster should be comparable to the one used for the human data, which was handling about 12M reads per core. If I use 64 cores, I should still be at only 8M reads per core.

          So I'm hoping that, computationally, the cores can analyze the data roughly as quickly as colosse, and that later data transfers are not as much of a bottleneck. Can you confirm that is likely the case, Sébastien?
          20 hours to load 250 M reads on InfiniBand: you probably have issues with your file system.

          What is the latency measured by Ray? (PREFIX.NetworkTest.txt)

          It should be around 50-100 microseconds, as far as I know.

          10 Gigabit Ethernet is 400-700 microseconds.


          By the way, Rank 0 computes the partition of the data (it counts entries in the files; no allocation).

          Then, the partition is sent to all MPI ranks.

          Finally, each MPI rank takes its slice.

          This should be fast.
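
          In other words (my reading of the steps above, sketched in Python; not Ray's actual code), each MPI rank ends up with a contiguous slice of the global read range:

          # Rank 0 counts the entries; the totals are sent to every rank,
          # and each rank then derives its own slice independently.
          def my_slice(rank, rank_count, total_entries):
              base, extra = divmod(total_entries, rank_count)
              start = rank * base + min(rank, extra)
              length = base + (1 if rank < extra else 0)
              return start, start + length  # half-open interval [start, end)

          total = 500_000_000  # e.g. ~500M reads
          print(my_slice(0, 512, total))    # (0, 976563)
          print(my_slice(511, 512, total))  # (499023438, 500000000)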

          Is it a Lustre file system?
          If so, do you use special stripe values?

          Sébastien



          • Hi Sébastien, thanks for the reply.

            I don't see the NetworkTest.txt file anywhere. I'll have to look into how to make sure it is part of the outputs.

            I honestly don't know a ton about the cluster. It's the Saguaro cluster at ASU (http://hpc.asu.edu/home) if you want to take a look.

            That said, I've now tried a job with 64 cores and it loaded the sequences in 1:45, which seems pretty reasonable. However, I got a bus error while computing vertices. I've since reinstalled with force packing turned off, which was already mentioned in this thread as a possible cause of this problem, and am rerunning Ray; it will reach the spot where the bus error occurred in about an hour.

            So I'm not sure if the cluster has some sort of bottleneck when trying to use more than a certain number of processors, but I'm happy using 64 so long as it can finish within 4 days.



            • Originally posted by Wallysb01:
              Hi Sébastien, thanks for the reply.

              I don't see the NetworkTest.txt file anywhere. I'll have to look into how to make sure it is part of the outputs.

              I honestly don't know a ton about the cluster. It's the Saguaro cluster at ASU (http://hpc.asu.edu/home) if you want to take a look.

              That said, I've now tried a job with 64 cores and it loaded the sequences in 1:45, which seems pretty reasonable. However, I got a bus error while computing vertices. I've since reinstalled with force packing turned off, which was already mentioned in this thread as a possible cause of this problem.

              So I'm not sure if the cluster has some sort of bottleneck when trying to use more than a certain number of processors, but I'm happy using 64 so long as it can finish within 4 days.
              If you use at least v1.6.1-rc3 (https://github.com/sebhtml/ray/zipball/v1.6.1-rc3), then you should get the file PREFIX.NetworkTest.txt.



              "bus error"

              Recompile Ray with FORCE_PACKING=n.


              Running Ray with FORCE_PACKING=n will be faster on your cluster.

              Basically, FORCE_PACKING=y does not align addresses on 8-byte boundaries. On some processor architectures the code will then run very slowly or throw bus errors at you; on other architectures, it won't.
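
              To illustrate the trade-off (a minimal Python sketch of the general alignment concept, not Ray's memory layout):

              import struct

              # Native alignment ('@'): the 8-byte integer is padded to an
              # 8-byte boundary, so a char followed by an int64 occupies 16
              # bytes on a typical 64-bit machine.
              print(struct.calcsize('@cq'))  # 16

              # Packed ('='): no padding, 9 bytes total, but the int64 now
              # sits at offset 1. Reading such an address as one 64-bit word
              # is what strict-alignment CPUs reject with a bus error.
              print(struct.calcsize('=cq'))  # 9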

              Action point:

              Recompile v1.6.1-rc3 with FORCE_PACKING=n (the default value):

              make clean
              make PREFIX=ray-1.6.1-rc3 FORCE_PACKING=n
              make install


              The whole thing should run faster.

              Sébastien



              • OK, I was just using 1.6.0.

                I'll probably let this run continue on 1.6.0 for now, but I'll install 1.6.1-rc3 and give it a try next time.

                I'll update you on progress, and thanks for the help.



                • Contamination in mate-pair libraries (two peaks)

                  Originally posted by habm:
                  Our longer-insert Illumina mate-pair libraries have significant duplication contamination, i.e. two size peaks: one of inward-facing false PE reads ("innies") at under 300 bp, and one of outward-facing reads ("outies") nearer the desired insert size, e.g. 3000 or 5000 bp.
                  How should the MP library size mean and SD be specified to allow Ray to deal with this, please?
                  Thanks.

                  PS: a run without any insert sizes specified (i.e. automatic detection) suggests that Ray found the innies OK, but not the useful outies:
                  LibraryNumber: 1 (nominally 3 kbp, really more like 2200 bp)
                  InputFormat: TwoFiles,Paired
                  AverageOuterDistance: 457
                  StandardDeviation: 441
                  DetectionFailure: Yes

                  LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
                  InputFormat: TwoFiles,Paired
                  AverageOuterDistance: 302
                  StandardDeviation: 218
                  DetectionFailure: Yes

                  LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
                  InputFormat: TwoFiles,Paired
                  AverageOuterDistance: 260
                  StandardDeviation: 213
                  DetectionFailure: Yes

                  Another way of expressing this is to ask whether Ray can disambiguate PE and MP reads, and if so, what input information is needed?
                  I will modify Ray over the next few days to support many peaks in each library, because the data for Assemblathon 2 also has these artefacts. So I guess it is normal to see them.

                  Plot: http://imgur.com/g657V
                  Data: http://pastebin.com/LCpAa9uv
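
                  For the curious, here is a rough sketch of one way to pick out several peaks from such a distribution (my illustration of the idea, not necessarily what Ray will implement):

                  # points: list of (observed outer distance, frequency),
                  # sorted by distance, e.g. parsed from PREFIX.LibraryX.txt.
                  def find_peaks(points, half_window=5, min_frequency=100):
                      peaks = []
                      for i in range(half_window, len(points) - half_window):
                          distance, frequency = points[i]
                          neighborhood = points[i - half_window:i + half_window + 1]
                          # a peak is the maximum of its local neighborhood,
                          # above a noise floor
                          if frequency >= min_frequency and frequency == max(f for _, f in neighborhood):
                              peaks.append((distance, frequency))
                      return peaks

                  example = [(250, 900), (300, 1500), (350, 800),
                             (6000, 200), (6189, 400), (6400, 150)]
                  print(find_peaks(example, half_window=1))  # [(300, 1500), (6189, 400)]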


                  But first I will release v1.6.1 on sourceforge as soon as I get my hands on my system test results.

                  Assemblathon 2

                  For those who follow Assemblathon 2, here is my last run on my testbed (Illumina data from BGI and from Illumina UK):

                  (all mate-pairs failed detection because of many peaks in each library)


                  Number of contigs: 550764
                  Total length of contigs: 1672750795
                  Number of contigs >= 500 nt: 501312
                  Total length of contigs >= 500 nt: 1656776315
                  Number of scaffolds: 510607
                  Total length of scaffolds: 1681345451
                  Number of scaffolds >= 500 nt: 463741
                  Total length of scaffolds >= 500 nt: 1666464367

                  k-mer length: 31
                  Lowest coverage observed: 1
                  MinimumCoverage: 42
                  PeakCoverage: 171
                  RepeatCoverage: 300
                  Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
                  Estimated genome length: 1226739694 nucleotides
                  Percentage of vertices with coverage 1: 83.7771 %
                  DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt
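
                  As an aside, the estimated genome length above is exactly half the solid k-mer count (my interpretation: each genomic position contributes one k-mer per DNA strand, hence the factor of 2):

                  # Sanity check of the numbers reported above.
                  solid_kmers = 2_453_479_388
                  print(solid_kmers // 2)  # 1226739694, the reported estimate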

                  [1,0]<stdout>: Sequence partitioning: 1 hours, 54 minutes, 47 seconds
                  [1,0]<stdout>: K-mer counting: 5 hours, 47 minutes, 20 seconds
                  [1,0]<stdout>: Coverage distribution analysis: 30 seconds
                  [1,0]<stdout>: Graph construction: 2 hours, 52 minutes, 27 seconds
                  [1,0]<stdout>: Edge purge: 57 minutes, 55 seconds
                  [1,0]<stdout>: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
                  [1,0]<stdout>: Detection of assembly seeds: 16 minutes, 7 seconds
                  [1,0]<stdout>: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
                  [1,0]<stdout>: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
                  [1,0]<stdout>: Merging of redundant contigs: 15 minutes, 45 seconds
                  [1,0]<stdout>: Generation of contigs: 1 minutes, 41 seconds
                  [1,0]<stdout>: Scaffolding of contigs: 54 minutes, 3 seconds
                  [1,0]<stdout>: Total: 18 hours, 3 minutes, 50 seconds

                  # average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
                  # Message passing interface rank Name Latency in microseconds
                  0 r107-n24 138
                  1 r107-n24 140
                  2 r107-n24 140
                  3 r107-n24 140
                  4 r107-n24 141
                  5 r107-n24 141
                  6 r107-n24 140
                  7 r107-n24 140
                  8 r107-n25 140
                  9 r107-n25 139
                  10 r107-n25 138
                  11 r107-n25 139


                  512 compute cores (64 computers * 8 cores/computer = 512)

                  Typical communication profile for one compute core:

                  [1,0]<stdout>:Rank 0: sent 249841326 messages, received 249840303 messages.

                  Yes, each core sends an average of 250 M messages during the 18 hours!

                  Total number of unfiltered Illumina TruSeq v3 sequences: 3 072 136 294, that is ~3 G sequences!


                  Peak memory usage per core: 2.2 GiB

                  Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB

                  The peak occurs around the 3-hour mark, and usage goes down to 1.1 GiB per node immediately afterward because the pool of defragmentation groups for k-mers occurring once is freed.


                  Sébastien



                  • Originally posted by seb567:
                    What is the latency measured by Ray? (PREFIX.NetworkTest.txt)

                    It should be around 50-100 microseconds, as far as I know.

                    10 Gigabit Ethernet is 400-700 microseconds.
                    I finally did another try with a new install, and Ray measured the latency at around 125 microseconds. Is that part of the reason it's loading fairly slowly?

                    Also, I used the FORCE_PACKING=n option on Ray-1.6.1-rc3, and I still encountered a bus error. Any ideas on what else might be causing that?



                    • Originally posted by Wallysb01:
                      I finally did another try with a new install, and Ray measured the latency at around 125 microseconds. Is that part of the reason it's loading fairly slowly?

                      Also, I used the FORCE_PACKING=n option on Ray-1.6.1-rc3, and I still encountered a bus error. Any ideas on what else might be causing that?
                      125 is just fine.

                      Is this your system:

                      Saguaro 2


                      If so, the architecture is

                      Intel EM64T Xeon E54xx (Harpertown) 2830 MHz (11.32 GFlops)



                      Did you run 'make clean' before doing 'make FORCE_PACKING=n', if you had previously built Ray-1.6.1-rc3 with FORCE_PACKING=y?

                      Otherwise, your build will be a mix of both.

                      If that is not the issue, can you send an email to the Ray mailing list and include the point at which you get your bus error?




                      colosse.clumeq.ca also uses Intel Xeon processors, but they are Nehalem-EP, not Harpertown.

                      Thanks.

                      Sébastien



                      • Thanks for your help, Sébastien; doing a fresh install of 1.6.1-rc3 with force packing off worked wonderfully. The starting sequence data totaled about 250M PE 104 bp reads, and Ray finished in 18.5 hours on 96 nodes.

                        I ran the simple default configuration, except for setting the k-mer size to 31, since I'd tried a few other programs with that size and wanted to be able to compare the assemblies, though no other program has finished yet.

                        The assembly was not particularly great, however. N50 = 194, the total contig lengths only added up to 44% of the estimated genome size, and the longest contig was 5383 bp. Do you have any particular recommendations? We should have coverage of almost 30x. Does that mean we should try larger k-mers? Would -debug-bubbles and -debug-seeds help?



                        • Originally posted by Wallysb01:
                          Thanks for your help, Sébastien; doing a fresh install of 1.6.1-rc3 with force packing off worked wonderfully. The starting sequence data totaled about 250M PE 104 bp reads, and Ray finished in 18.5 hours on 96 nodes.

                          I ran the simple default configuration, except for setting the k-mer size to 31, since I'd tried a few other programs with that size and wanted to be able to compare the assemblies, though no other program has finished yet.

                          The assembly was not particularly great, however. N50 = 194, the total contig lengths only added up to 44% of the estimated genome size, and the longest contig was 5383 bp. Do you have any particular recommendations? We should have coverage of almost 30x. Does that mean we should try larger k-mers? Would -debug-bubbles and -debug-seeds help?
                          Well, you have to do some post-assembly quality control.

                          First, check PREFIX.CoverageDistributionAnalysis.txt. You need a high peak coverage (at least 20; higher is better).
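
                          A quick way to pull out that peak (a sketch; it assumes the distribution file has two columns, coverage depth and k-mer count, and it skips coverage 1, which is dominated by sequencing errors):

                          def peak_coverage(path, ignore_below=2):
                              best_coverage, best_count = 0, 0
                              with open(path) as handle:
                                  for line in handle:
                                      fields = line.split()
                                      if len(fields) != 2 or not fields[0].isdigit():
                                          continue  # skip headers or malformed rows
                                      coverage, count = int(fields[0]), int(fields[1])
                                      # coverage-1 k-mers are mostly errors
                                      if coverage >= ignore_below and count > best_count:
                                          best_coverage, best_count = coverage, count
                              return best_coverage

                          print(peak_coverage("PREFIX.CoverageDistribution.txt"))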

                          Do you have paired information?

                          If so, check PREFIX.LibraryStatistics.txt and verify that the averages and standard deviations are OK.

                          What data do you have? Is it Illumina TruSeq v3? (Those are very neat.)

                          -debug-bubbles, -debug-seeds and -show-ending-context just display more information.

                          Sébastien



                          • So, peak coverage was 913.

                            Interestingly, it estimated the genome size at ~1.8 Mb, when it should be closer to ~1.8 Gb.

                            I also specified the insert sizes of both paired-end libraries, which were made with Illumina TruSeq v3. So far the insert sizes were both about 450 bp, with 104 bp paired-end reads. We may end up sequencing more libraries with larger inserts to improve the contigs and actually get scaffolds.

                            I was also planning to use BLAST afterward to filter out contigs aligning to human/bacterial genomes, but that might not happen until I try a couple of different k-mer sizes.



                            • mate-pair libraries

                              Thanks for the helpful responses, Sébastien, and for addressing this problem in the next version.
                              Meanwhile, are the files PREFIX.LibraryX.txt made automatically, please? I cannot see them, whether or not I set a mean and SD for the insert size on the command line.



                              • Hi Sébastien,

                                I have a question: Ray appears to be automatically setting the minimum k-mer coverage to the peak coverage minus 1. Is that true, and what's the reason for this? I apparently have an absurd peak coverage of 626, even with k=61. Could this be why my assembly was poor the first time, with k=31 above?

                                I'm thinking I'm going to cancel my job and drop that min coverage.

