  • Wallysb01
    replied
    Originally posted by seb567 View Post
    What is the latency measured by Ray ? (PREFIX.NetworkTest.txt)

    Should be around 50-100 microseconds for what I know.

    10 GigaEthernet is 400-700 microseconds.
    I finally tried again with a fresh install, and Ray measured the latency at around 125 microseconds. Is that part of the reason it's loading fairly slowly?

    Also, I built Ray-1.6.3-rc3 with FORCE_PACKING=n and still encountered a bus error. Any idea what else might be causing that?



  • seb567
    replied
    Contamination in mate-pair libraries (two peaks)

    Originally posted by habm View Post
    Our longer-insert Illumina mate-pair libraries have significant duplication contamination - ie two size peaks, one of inward facing false pe reads (innies) at under 300bp, and one of the outward facing reads (outies) nearer the desired insert size eg 3000 or 5000 bp.
    How should the mp library size mean and SD be specified to allow Ray to deal with this, please?
    Thanks.

    PS, a run without any insert sizes specified (ie Automatic DetectionType) suggests that Ray found the innies OK, but not the useful outies:
    LibraryNumber: 1 (nominally 3kbp, really more like 2200bp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 457
    StandardDeviation: 441
    DetectionFailure: Yes

    LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 302
    StandardDeviation: 218
    DetectionFailure: Yes

    LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 260
    StandardDeviation: 213
    DetectionFailure: Yes

    Another way of expressing this is to ask whether Ray can disambiguate pe and mp reads, and if so, what input information is needed?
    I will modify Ray over the next days to support many peaks in each library because the data for Assemblathon 2 also has these artefacts. So I guess it is normal to see these.

    Plot: http://imgur.com/g657V
    Data: http://pastebin.com/LCpAa9uv


    But first I will release v1.6.1 on sourceforge as soon as I get my hands on my system test results.

    Assemblathon 2

    For those who follow Assemblathon 2, my last run on my testbed (Illumina data from BGI and from Illumina UK):

    (all mate-pairs failed detection because of many peaks in each library)


    Number of contigs: 550764
    Total length of contigs: 1672750795
    Number of contigs >= 500 nt: 501312
    Total length of contigs >= 500 nt: 1656776315
    Number of scaffolds: 510607
    Total length of scaffolds: 1681345451
    Number of scaffolds >= 500 nt: 463741
    Total length of scaffolds >= 500: 1666464367

    k-mer length: 31
    Lowest coverage observed: 1
    MinimumCoverage: 42
    PeakCoverage: 171
    RepeatCoverage: 300
    Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
    Estimated genome length: 1226739694 nucleotides
    Percentage of vertices with coverage 1: 83.7771 %
    DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt

    [1,0]<stdout>: Sequence partitioning: 1 hours, 54 minutes, 47 seconds
    [1,0]<stdout>: K-mer counting: 5 hours, 47 minutes, 20 seconds
    [1,0]<stdout>: Coverage distribution analysis: 30 seconds
    [1,0]<stdout>: Graph construction: 2 hours, 52 minutes, 27 seconds
    [1,0]<stdout>: Edge purge: 57 minutes, 55 seconds
    [1,0]<stdout>: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
    [1,0]<stdout>: Detection of assembly seeds: 16 minutes, 7 seconds
    [1,0]<stdout>: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
    [1,0]<stdout>: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
    [1,0]<stdout>: Merging of redundant contigs: 15 minutes, 45 seconds
    [1,0]<stdout>: Generation of contigs: 1 minutes, 41 seconds
    [1,0]<stdout>: Scaffolding of contigs: 54 minutes, 3 seconds
    [1,0]<stdout>: Total: 18 hours, 3 minutes, 50 seconds
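To see which steps dominate a run like the one above, the timing lines can be converted to seconds and expressed as shares of the summed step time. This is a minimal sketch; `parse_duration` and `step_shares` are hypothetical helper names, and the line layout is assumed to match the log excerpt in this post.

```python
import re

# Seconds per unit; the regex below accepts both singular and plural forms.
_UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def parse_duration(text):
    """Convert e.g. '1 hours, 54 minutes, 47 seconds' to a number of seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s+(hour|minute|second)s?", text):
        total += int(value) * _UNIT_SECONDS[unit]
    return total


def step_shares(lines):
    """Return {step name: fraction of the summed step time}, skipping 'Total'."""
    steps = {}
    for line in lines:
        # Drop a leading '[1,0]<stdout>:' prefix if it is present.
        body = line.split("<stdout>:")[-1].strip()
        name, _, duration = body.partition(":")
        name = name.strip()
        if not name or name == "Total":
            continue
        steps[name] = parse_duration(duration)
    grand_total = sum(steps.values())
    if grand_total == 0:
        return {}
    return {name: secs / grand_total for name, secs in steps.items()}


if __name__ == "__main__":
    log = [
        "[1,0]<stdout>: K-mer counting: 5 hours, 47 minutes, 20 seconds",
        "[1,0]<stdout>: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds",
        "[1,0]<stdout>: Total: 18 hours, 3 minutes, 50 seconds",
    ]
    for name, share in step_shares(log).items():
        print(f"{name}: {share:.1%} of the listed steps")
```

Feeding it the full list of step lines from the log gives each step's share of the run.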

    # average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
    # Message passing interface rank Name Latency in microseconds
    0 r107-n24 138
    1 r107-n24 140
    2 r107-n24 140
    3 r107-n24 140
    4 r107-n24 141
    5 r107-n24 141
    6 r107-n24 140
    7 r107-n24 140
    8 r107-n25 140
    9 r107-n25 139
    10 r107-n25 138
    11 r107-n25 139


    512 compute cores (64 computers * 8 cores/computer = 512)

    Typical communication profile for one compute core:

    [1,0]<stdout>:Rank 0: sent 249841326 messages, received 249840303 messages.

    Yes, each core sends an average of 250 M messages during the 18 hours !
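As a rough sanity check on that figure, the message count for rank 0 can be divided by the total running time reported above (18 hours, 3 minutes, 50 seconds). The function name is only illustrative.

```python
def messages_per_second(messages, hours, minutes, seconds):
    """Average send rate over the whole run."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return messages / elapsed


# Numbers taken from the log excerpt above (rank 0).
rate = messages_per_second(249_841_326, 18, 3, 50)
print(f"about {rate:.0f} messages per second on average")
```

This is an average over the whole run; the actual rate presumably varies from step to step.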

    Total number of unfiltered Illumina TruSeq v3 sequences: Total: 3 072 136 294, that is ~3 G sequences !


    Peak memory usage per core: 2.2 GiB

    Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB

    The peak occurs at around 3 hours, and usage then drops immediately to 1.1 GiB per node because the pool of defragmentation groups for k-mers occurring once is freed.


    Sébastien



  • Wallysb01
    replied
    Ok, I was just using 1.6.0.

    I'll probably let this run go on 1.6.0 for now, but will install 1.6.3-rc3 and give it a try next time.

    Will update you on progress and thanks for the help.



  • seb567
    replied
    Originally posted by Wallysb01 View Post
    Hi Sébastien, thanks for the reply.

    I don't see the NetworkTest.txt file anywhere. I'll have to look into how to make sure it is part of the outputs.

    I honestly don't know a ton about the cluster. It's the Saguaro cluster at ASU (http://hpc.asu.edu/home) if you want to take a look.

    Though, I've now tried a job with 64 cores and it loaded the sequences in 1:45. That seems pretty reasonable. However, I got a bus error while computing vertices. I've since reinstalled with force packing turned off, which was already mentioned in this thread as a possible cause of this problem.

    So, I'm not sure if the cluster has some sort of bottleneck when using more than a certain number of processors, but I'm happy using 64 so long as it can finish inside 4 days.
    If you use at least v1.6.1-rc3 https://github.com/sebhtml/ray/zipball/v1.6.1-rc3
    then you should get the file PREFIX.NetworkTest.txt



    "bus error"

    Recompile Ray with FORCE_PACKING=n.


    Ray compiled with FORCE_PACKING=n will run faster on your cluster.

    Basically, FORCE_PACKING=y does not align addresses on 8-byte boundaries.

    On some architectures, the code will run very slowly or throw bus errors at you.

    On other architectures, it won't.

    As I explained, FORCE_PACKING=y causes bus errors on some processor architectures.

    Action Point:

    Recompile v1.6.1-rc3 with FORCE_PACKING=n (default value)

    make PREFIX=ray-1.6.3-rc3
    make install


    The whole thing should run faster.

    Sébastien



  • Wallysb01
    replied
    Hi Sébastien, thanks for the reply.

    I don't see the NetworkTest.txt file anywhere. I'll have to look into how to make sure it is part of the outputs.

    I honestly don't know a ton about the cluster. It's the Saguaro cluster at ASU (http://hpc.asu.edu/home) if you want to take a look.

    Though, I've now tried a job with 64 cores and it loaded the sequences in 1:45. That seems pretty reasonable. However, I got a bus error while computing vertices. I've since reinstalled with force packing turned off, which was already mentioned in this thread as a possible cause of this problem, and am running Ray again. It will reach the spot where the bus error occurred in about an hour.

    So, I'm not sure if the cluster has some sort of bottleneck when using more than a certain number of processors, but I'm happy using 64 so long as it can finish inside 4 days.
    Last edited by Wallysb01; 07-13-2011, 12:31 PM.



  • seb567
    replied
    Originally posted by Wallysb01 View Post
    After a night of playing around with the number of cores and letting the job run for an hour or two, I can somewhat answer my own question. I thought others might be interested, and I also wanted to confirm that this sounds right.

    Anyway, I ran two jobs with the same 2 lanes of data, which totaled about 500M PE reads. One time I ran with 256 processors, the other with 512, both for between 1.5 and 2 hours. What I found is that on our cluster, loading the sequences was very slow, and it seemed to occur on only one processor at a time. So this step will take the same amount of time regardless of the number of processors, as the limiting factor is only the amount of data passing through our DDR InfiniBand, not the processors. By my estimate this was going to take 20 hours. That's kind of a lot of time for me to take up using 512 or even 256 processors. So I'm going to scale back to 96 or even 64 processors before I try it again. But even 20*64 CPU hours is not chump change for me, and that will only load the data.

    Now, once the data is loaded, the processor speed on our cluster should be comparable to the one used for the Human data, which was handling about 12M reads per core. If I use 64 cores, I should still be at only 8M reads per core.

    So, I'm hoping that computationally the cores can analyze the data roughly as quickly as colosse, and that later data transfers are not as much of a bottleneck. Can you confirm that this is likely to be the case, Sébastien?
    20 hours to load 250 M reads on Infiniband -- you probably have issues with your file system.

    What is the latency measured by Ray ? (PREFIX.NetworkTest.txt)

    Should be around 50-100 microseconds for what I know.

    10 GigaEthernet is 400-700 microseconds.


    By the way, Rank 0 computes the partition on the data (count entries in files, no allocation).

    Then, the partition is sent to all MPI ranks.

    Finally, each MPI rank takes its slice.

    This should be fast.
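The three steps above can be sketched as follows. This is only an illustration of the idea, assuming an even split of a global sequence index range; Ray's actual partitioning code is not shown in this thread, and `partition` is a hypothetical name.

```python
def partition(counts_per_file, n_ranks):
    """Split the global index range [0, total) into n_ranks contiguous slices.

    counts_per_file: number of sequences counted in each input file
    (the counting step that rank 0 performs, with no allocation).
    Returns a list of (start, end) pairs, one per rank, end exclusive.
    """
    total = sum(counts_per_file)
    base, extra = divmod(total, n_ranks)
    slices = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional sequence each.
        size = base + (1 if rank < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    # Two files with 10 and 5 sequences, shared among 4 ranks.
    for rank, (start, end) in enumerate(partition([10, 5], 4)):
        print(f"rank {rank} loads sequences {start}..{end - 1}")
```

Because only the small table of slices is broadcast, each rank can then read its own slice independently, which is why this step should not be a bottleneck on a healthy file system.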

    Is it a Lustre file system ?
    If so, do you use special stripe values ?

    Sébastien



  • Wallysb01
    replied
    Originally posted by Wallysb01 View Post
    Do you know how it scales? I'm billed linearly. So 2 hours on 128 cores is the same as 1 hour on 256 cores. But does Ray actually run twice as fast on twice the cores? Or is it simply too variable to know the optimum, depending on the data?
    After a night of playing around with the number of cores and letting the job run for an hour or two, I can somewhat answer my own question. I thought others might be interested, and I also wanted to confirm that this sounds right.

    Anyway, I ran two jobs with the same 2 lanes of data, which totaled about 500M PE reads. One time I ran with 256 processors, the other with 512, both for between 1.5 and 2 hours. What I found is that on our cluster, loading the sequences was very slow, and it seemed to occur on only one processor at a time. So this step will take the same amount of time regardless of the number of processors, as the limiting factor is only the amount of data passing through our DDR InfiniBand, not the processors. By my estimate this was going to take 20 hours. That's kind of a lot of time for me to take up using 512 or even 256 processors. So I'm going to scale back to 96 or even 64 processors before I try it again. But even 20*64 CPU hours is not chump change for me, and that will only load the data.

    Now, once the data is loaded, the processor speed on our cluster should be comparable to the one used for the Human data, which was handling about 12M reads per core. If I use 64 cores, I should still be at only 8M reads per core.

    So, I'm hoping that computationally the cores can analyze the data roughly as quickly as colosse, and that later data transfers are not as much of a bottleneck. Can you confirm that this is likely to be the case, Sébastien?
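The arithmetic behind these estimates, as a small sketch. It assumes the linear billing described in this post; the function names are illustrative.

```python
def core_hours(cores, wall_hours):
    """Billed core-hours under linear billing."""
    return cores * wall_hours


def reads_per_core(total_reads, cores):
    """Average number of reads each core has to handle."""
    return total_reads / cores


# 20 hours of loading on 64 cores, and about 500M reads spread over 64 cores.
print(core_hours(64, 20), "core-hours just to load the data")
print(f"{reads_per_core(500e6, 64) / 1e6:.1f}M reads per core")
```

With these inputs the load step alone costs 1280 core-hours, and each core handles roughly 7.8M reads, in line with the "8M reads per core" estimate.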



  • Wallysb01
    replied
    Originally posted by seb567 View Post
    Well, for low memory usage, you definitely want to use Ray v1.6.1 (on its way, presently Ray v1.6.1-rc3 which is available at https://github.com/sebhtml/ray/zipball/v1.6.1-rc3 ).

    See http://sourceforge.net/mailarchive/m...sg_id=27781099 for more details.

    In your message, you don't report how much memory your compute cores have access to.
    I believe each core has 4GB RAM, but that might be variable: some 2GB, some 1GB. From what I understand, other programs that are single-threaded, or that have single-threaded bottlenecks, can't use distributed memory and have to be confined to one machine, which would limit me to 64 GB of RAM on our cluster. That is why I brought up the RAM limitations.

    Originally posted by seb567 View Post
    Ray is a peer-to-peer program, that is you can launch it on 2048 compute cores if you want.
    Do you know how it scales? I'm billed linearly. So 2 hours on 128 cores is the same as 1 hour on 256 cores. But does Ray actually run twice as fast on twice the cores? Or is it simply too variable to know the optimum, depending on the data?

    This isn't a huge concern, as the cost per CPU hour is low, so long as the speed is close to linear and I'm not "wasting" large numbers of cores nor getting it stuck for long periods with too low a number of cores.

    And I put wasting in quotes because I know none will really be wasted. I'm just concerned that at some point the marginal addition of another core isn't offset by the marginal decrease in time to completion. Or the reverse, where the program gets "stuck" due to limited memory per core or whatnot.

    Originally posted by seb567 View Post
    But, you should first do a run with k=31 just to quality-control the thing first.
    Will do. I was planning on trying a few Kmers anyway and comparing to a few other programs, to see what might work best.


    Originally posted by seb567 View Post
    What is the interconnect between your compute cores ?
    It's a DDR InfiniBand, so it should be 4Gbits/sec.
    Last edited by Wallysb01; 07-12-2011, 02:45 PM.



  • seb567
    replied
    Originally posted by habm View Post
    Our longer-insert Illumina mate-pair libraries have significant duplication contamination - ie two size peaks, one of inward facing false pe reads (innies) at under 300bp, and one of the outward facing reads (outies) nearer the desired insert size eg 3000 or 5000 bp.
    How should the mp library size mean and SD be specified to allow Ray to deal with this, please?
    Thanks.

    PS, a run without any insert sizes specified (ie Automatic DetectionType) suggests that Ray found the innies OK, but not the useful outies:
    LibraryNumber: 1 (nominally 3kbp, really more like 2200bp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 457
    StandardDeviation: 441
    DetectionFailure: Yes

    LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 302
    StandardDeviation: 218
    DetectionFailure: Yes

    LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 260
    StandardDeviation: 213
    DetectionFailure: Yes

    Another way of expressing this is to ask whether Ray can disambiguate pe and mp reads, and if so, what input information is needed?
    One way you can use your data follows:

    For each of your paired libraries, analyze the file PREFIX.LibraryX.txt.

    For example, for LibraryNumber 3:

    less PREFIX.Library3.txt
    And locate your peak near 6300 to find the real value. Let us suppose that it is 6189.

    The columns in these Library files are: observed outer distance and frequency.
    Basically, you want to locate the observed outer distance with the maximum frequency near 6300.

    Do that for all your paired libraries.

    Then, edit your Ray command to manually include the average and standard deviation for each of your libraries:

    mpirun -np 999999 /software/ray-version/Ray \
    -p lib3_1.fastq lib3_2.fastq 6189 618 \
    -p ...

    (Replace 6189 with what you found in the Library file and let the standard deviation be around 10 %.)

    Doing so lets Ray use the long-insert pairs, but it will ignore the paired information for the short-insert pairs (within each library) because they fall outside the specified range.
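The manual procedure above (find the most frequent outer distance near the expected insert size, then pass it with a standard deviation of about 10 %) can be sketched in a few lines. The helper names and the 25 % search window are assumptions for illustration; the two-column layout is the one described for the PREFIX.LibraryX.txt files.

```python
def find_peak(lines, expected, window=0.25):
    """Return the outer distance with the highest frequency near `expected`.

    lines: iterable of 'distance frequency' text lines.
    window: fraction of `expected` searched on each side (an arbitrary choice).
    Returns None if no row falls inside the window.
    """
    low, high = expected * (1 - window), expected * (1 + window)
    best_distance, best_frequency = None, -1
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        distance, frequency = int(fields[0]), int(fields[1])
        if low <= distance <= high and frequency > best_frequency:
            best_distance, best_frequency = distance, frequency
    return best_distance


def library_arguments(left, right, peak, sd_percent=10):
    """Build the '-p left right average sd' part of the Ray command."""
    sd = peak * sd_percent // 100
    return f"-p {left} {right} {peak} {sd}"


if __name__ == "__main__":
    # Toy histogram with a short-insert peak and a long-insert peak.
    rows = ["250 900", "300 1000", "6100 40", "6189 75", "6400 20"]
    peak = find_peak(rows, expected=6300)
    print(library_arguments("lib3_1.fastq", "lib3_2.fastq", peak))
```

To use it on a real file, pass an open file handle, for example `find_peak(open("PREFIX.Library3.txt"), 6300)`. The short-insert peak at 300 is ignored here simply because it lies outside the search window.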

    Still, all the reads will contribute to building the genome graph.

    Hope it helps.


    Sébastien



  • seb567
    replied
    Originally posted by Wallysb01 View Post
    So, I'm interested to try Ray, as I have access to a cluster with ungodly numbers of cores but very real limitations in RAM that make other programs difficult to run.

    Anyway, I have 2 Illumina lanes with 104bp PE reads, totaling about 250M PE reads, from a vertebrate with a genome size of roughly 2Gbp. Do you have any suggestions on how many cores I should try using and for how long?

    I was also thinking of trying a fairly large Kmer first, around maybe ~65. Any suggestions on that?
    Well, for low memory usage, you definitely want to use Ray v1.6.1 (on its way, presently Ray v1.6.1-rc3 which is available at https://github.com/sebhtml/ray/zipball/v1.6.1-rc3 ).

    See http://sourceforge.net/mailarchive/m...sg_id=27781099 for more details.

    In your message, you don't report how much memory your compute cores have access to.

    Ray is a peer-to-peer program, that is you can launch it on 2048 compute cores if you want.

    But, you should first do a run with k=31 just to quality-control the thing first.

    You'll get something like this:

    cat parrot-BGI-Assemblathon2-k31-20110711.CoverageDistributionAnalysis.txt

    k-mer length: 31
    Lowest coverage observed: 1
    MinimumCoverage: 31
    PeakCoverage: 133
    RepeatCoverage: 235
    Number of k-mers with at least MinimumCoverage: 2462747440 k-mers
    Estimated genome length: 1231373720 nucleotides
    Percentage of vertices with coverage 1: 82.8132 %
    DistributionFile: parrot-BGI-Assemblathon2-k31-20110711.CoverageDistribution.txt
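In the report above, the estimated genome length is exactly half the number of k-mers with at least MinimumCoverage. A plausible reading, which is an assumption and not something stated in this thread, is that each k-mer is counted together with its reverse complement. The relationship can be checked directly:

```python
def estimated_genome_length(kmers_at_min_coverage):
    """Halve the k-mer count, matching the figures in the report above."""
    return kmers_at_min_coverage // 2


# Values copied from the CoverageDistributionAnalysis output above.
print(estimated_genome_length(2_462_747_440), "nucleotides")
```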



    In Ray, k-mers from 15 to 31 are stored on one 64-bit integer.

    K-mers from 33 to 63 are stored on 2 64-bit integers.

    K-mers from 65 to 95 are stored on 3 64-bit integers.
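These ranges are what one would expect if each nucleotide takes 2 bits, which is the usual encoding for A, C, G and T. The 2-bit figure is an assumption here, not something stated above. A short sketch of the calculation:

```python
def words_per_kmer(k, bits_per_nucleotide=2, word_bits=64):
    """Number of 64-bit integers needed to store one k-mer (rounded up)."""
    return (k * bits_per_nucleotide + word_bits - 1) // word_bits


for k in (15, 31, 33, 63, 65, 95):
    print(f"k={k}: {words_per_kmer(k)} x 64-bit")
```

So moving from k=31 to a k-mer length around 65, as asked in the quoted post, would triple the storage needed per k-mer under this assumption.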




    Example for the memory usage with Illumina TruSeq 3 chemistry

    Ray v1.6.1-rc3 compiled with FORCE_PACKING=y MAXKMERLENGTH=32

    (FORCE_PACKING=y causes bus errors on some architectures such as UltraSparc and Itanium)


    k=31

    2 386 063 326 Illumina TruSeq 3 sequences, length is 90 or 151

    data for the Parrot dataset of Assemblathon 2

    Data generated by the BGI.

    Running time:

    [1,0]<stdout>: Sequence partitioning: 2 hours, 30 minutes, 21 seconds
    [1,0]<stdout>: K-mer counting: 2 hours, 33 minutes, 44 seconds
    [1,0]<stdout>: Coverage distribution analysis: 3 minutes, 51 seconds
    [1,0]<stdout>: Graph construction: 1 hours, 36 minutes, 47 seconds
    [1,0]<stdout>: Edge purge: 48 minutes, 20 seconds
    [1,0]<stdout>: Selection of optimal read markers: 1 hours, 5 minutes, 30 seconds
    [1,0]<stdout>: Detection of assembly seeds: 12 minutes, 15 seconds
    [1,0]<stdout>: Estimation of outer distances for paired reads: 4 minutes, 51 seconds
    [1,0]<stdout>: Bidirectional extension of seeds: 2 hours, 11 minutes, 46 seconds
    [1,0]<stdout>: Merging of redundant contigs: 13 minutes, 31 seconds
    [1,0]<stdout>: Generation of contigs: 1 minutes, 24 seconds
    [1,0]<stdout>: Scaffolding of contigs: 34 minutes, 46 seconds
    [1,0]<stdout>: Total: 11 hours, 57 minutes, 30 seconds


    Peak memory usage:

    ~800 GiB, distributed on 512 compute cores uniformly by Ray's peer-to-peer scheme.

    Each compute core utilises on average ~ 1.5 GiB maximum.
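The per-core figure follows from dividing the peak by the number of cores. The sketch below also compares it with a per-core budget; the 4 GiB budget is only the estimate given earlier in the thread for the other cluster, not a measured value.

```python
def per_core_gib(total_gib, cores):
    """Average memory per core when usage is spread uniformly."""
    return total_gib / cores


usage = per_core_gib(800, 512)
budget_gib = 4  # assumed per-core RAM, taken from the estimate earlier in the thread
print(f"{usage:.2f} GiB per core, fits in {budget_gib} GiB: {usage <= budget_gib}")
```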


    Measured network latency is ~150 microseconds, this figure includes software overheads.


    head parrot-BGI-Assemblathon2-k31-20110711.NetworkTest.txt
    # average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
    # Message passing interface rank Name Latency in microseconds
    0 r104-n7 153
    1 r104-n7 156
    2 r104-n7 155
    3 r104-n7 155
    4 r104-n7 154
    5 r104-n7 155
    6 r104-n7 155
    7 r104-n7 155


    What is the interconnect between your compute cores ?

    Sébastien



  • seb567
    replied
    Originally posted by habm View Post
    Our longer-insert Illumina mate-pair libraries have significant duplication contamination - ie two size peaks, one of inward facing false pe reads (innies) at under 300bp, and one of the outward facing reads (outies) nearer the desired insert size eg 3000 or 5000 bp.
    How should the mp library size mean and SD be specified to allow Ray to deal with this, please?
    Thanks.

    PS, a run without any insert sizes specified (ie Automatic DetectionType) suggests that Ray found the innies OK, but not the useful outies:
    LibraryNumber: 1 (nominally 3kbp, really more like 2200bp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 457
    StandardDeviation: 441
    DetectionFailure: Yes

    LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 302
    StandardDeviation: 218
    DetectionFailure: Yes

    LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 260
    StandardDeviation: 213
    DetectionFailure: Yes

    Another way of expressing this is to ask whether Ray can disambiguate pe and mp reads, and if so, what input information is needed?
    Presently, Ray cannot disambiguate two paired libraries with different outer distances that are pooled in the same files.


    Do you also observe two peaks in those files:

    PREFIX.Library0.txt
    PREFIX.Library1.txt
    PREFIX.Library2.txt
    PREFIX.Library3.txt

    (replace PREFIX by what you have given to the -o switch.)

    Example of such a file for MiSeq data:

    cat ecoli-MiSeq.Library0.txt

    52 1
    53 1
    56 1
    58 1
    61 1
    62 2
    63 2
    64 2
    65 1
    66 2
    67 2
    68 2
    69 1
    71 1
    72 1
    73 2
    74 1
    75 1
    76 2
    77 2
    78 1
    79 3
    80 3
    82 3
    83 3
    84 2
    85 3
    86 2
    87 6
    89 8
    90 1
    91 7
    92 5
    93 5
    94 8
    95 4
    96 6
    97 7
    98 6
    99 10
    100 5
    101 4
    102 10
    103 4
    104 5
    105 3
    106 12
    107 5
    108 10
    109 8
    110 5
    111 10
    112 14
    113 13
    114 11
    115 11
    116 8
    117 10
    118 11
    119 14
    120 23
    121 12
    122 16
    123 14
    124 17
    125 20
    126 18
    127 20
    128 24
    129 21
    130 19
    131 20
    132 14
    133 28
    134 24
    135 34
    136 31
    137 24
    138 29
    139 25
    140 25
    141 42
    142 34
    143 32
    144 36
    145 34
    146 40
    147 39
    148 38
    149 50
    150 33
    151 50
    152 141
    153 141
    154 154
    155 150
    156 142
    157 153
    158 146
    159 173
    160 163
    161 176
    162 147
    163 147
    164 148
    165 140
    166 162
    167 159
    168 136
    169 139
    170 132
    171 149
    172 149
    173 159
    174 165
    175 151
    176 168
    177 167
    178 144
    179 149
    180 153
    181 151
    182 145
    183 150
    184 148
    185 143
    186 159
    187 148
    188 128
    189 140
    190 144
    191 139
    192 156
    193 127
    194 106
    195 149
    196 117
    197 118
    198 130
    199 134
    200 134
    201 139
    202 149
    203 156
    204 138
    205 140
    206 139
    207 146
    208 164
    209 173
    210 153
    211 153
    212 167
    213 167
    214 136
    215 143
    216 187
    217 159
    218 202
    219 157
    220 173
    221 210
    222 187
    223 204
    224 229
    225 230
    226 240
    227 255
    228 261
    229 298
    230 292
    231 313
    232 383
    233 408
    234 465
    235 500
    236 567
    237 660
    238 710
    239 774
    240 958
    241 1074
    242 1191
    243 1296
    244 1494
    245 1590
    246 1924
    247 2021
    248 2269
    249 2456
    250 2668
    251 2966
    252 3229
    253 3439
    254 3777
    255 3940
    256 4350
    257 4597
    258 5087
    259 5409
    260 5743
    261 6285
    262 6936
    263 7585
    264 8592
    265 9747
    266 11368
    267 13517
    268 16384
    269 20030
    270 24454
    271 29734
    272 35766
    273 42817
    274 49781
    275 57815
    276 65042
    277 72012
    278 78801
    279 84473
    280 90123
    281 93640
    282 97052
    283 100139
    284 101610
    285 103303
    286 103863
    287 104660
    288 104885
    289 104595
    290 104248
    291 104087
    292 104276
    293 103179
    294 102572
    295 102231
    296 101174
    297 100269
    298 100211
    299 99244
    300 98984
    301 97880
    302 97111
    303 95995
    304 94988
    305 94205
    306 92844
    307 91975
    308 91365
    309 89238
    310 89011
    311 87040
    312 85224
    313 84300
    314 82654
    315 80892
    316 79329
    317 77002
    318 74527
    319 71500
    320 68193
    321 64387
    322 60527
    323 55998
    324 51459
    325 46082
    326 41498
    327 36553
    328 32011
    329 27418
    330 23535
    331 19623
    332 16185
    333 13669
    334 11028
    335 9012
    336 7569
    337 6332
    338 5236
    339 4454
    340 3676
    341 3071
    342 2564
    343 2167
    344 1875
    345 1538
    346 1278
    347 935
    348 766
    349 591
    350 412
    351 305
    352 253
    353 151
    354 119
    355 76
    356 51
    357 37
    358 32
    359 22
    360 24
    361 10
    362 7
    363 5
    364 4
    365 3
    367 3
    369 2
    370 1
    373 1
    378 1
    390 1
    398 1
    399 1
    414 1
    417 1
    421 1
    431 1
    497 1
    501 1
    516 1
    528 1
    668 1
    1200 1
    1230 1
    1373 1
    1478 1
    1886 1
    2028 1
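In the listing above the most frequent outer distance is 288 (frequency 104885), and there is a single peak. To summarise such a file, the frequency-weighted mean and standard deviation can be computed next to the mode. This is a generic sketch and not Ray's own estimation method, which is not shown in this thread. For a library with two peaks, the mean falls between the peaks and the standard deviation becomes large, which would be consistent with values such as 457 and 441 in the quoted report.

```python
import math


def histogram_stats(rows):
    """rows: iterable of (outer_distance, frequency) pairs.

    Returns (weighted mean, weighted standard deviation, mode).
    """
    rows = list(rows)
    n = sum(frequency for _, frequency in rows)
    mean = sum(distance * frequency for distance, frequency in rows) / n
    variance = sum(frequency * (distance - mean) ** 2 for distance, frequency in rows) / n
    mode = max(rows, key=lambda row: row[1])[0]
    return mean, math.sqrt(variance), mode


def read_rows(lines):
    """Parse 'distance frequency' text lines, skipping anything else."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            yield int(fields[0]), int(fields[1])


if __name__ == "__main__":
    # Toy two-peak library: many short "innies", fewer long "outies".
    toy = ["250 300", "300 500", "350 300", "2900 50", "3000 100", "3100 50"]
    mean, sd, mode = histogram_stats(read_rows(toy))
    print(f"mean={mean:.0f} sd={sd:.0f} mode={mode}")
```

On real data, `read_rows(open("PREFIX.Library0.txt"))` would supply the rows.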



  • Wallysb01
    replied
    So, I'm interested to try Ray, as I have access to a cluster with ungodly numbers of cores but very real limitations in RAM that make other programs difficult to run.

    Anyway, I have 2 Illumina lanes with 104bp PE reads, totaling about 250M PE reads, from a vertebrate with a genome size of roughly 2Gbp. Do you have any suggestions on how many cores I should try using and for how long?

    I was also thinking of trying a fairly large Kmer first, around maybe ~65. Any suggestions on that?



  • habm
    replied
    Our longer-insert Illumina mate-pair libraries have significant duplication contamination - ie two size peaks, one of inward facing false pe reads (innies) at under 300bp, and one of the outward facing reads (outies) nearer the desired insert size eg 3000 or 5000 bp.
    How should the mp library size mean and SD be specified to allow Ray to deal with this, please?
    Thanks.

    PS, a run without any insert sizes specified (ie Automatic DetectionType) suggests that Ray found the innies OK, but not the useful outies:
    LibraryNumber: 1 (nominally 3kbp, really more like 2200bp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 457
    StandardDeviation: 441
    DetectionFailure: Yes

    LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 302
    StandardDeviation: 218
    DetectionFailure: Yes

    LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 260
    StandardDeviation: 213
    DetectionFailure: Yes

    Another way of expressing this is to ask whether Ray can disambiguate pe and mp reads, and if so, what input information is needed?
    Last edited by habm; 07-04-2011, 03:23 PM.



  • seb567
    replied
    Originally posted by arelouse View Post
    Hi, a simple question: how do u achieve parallelism in Ray?

    I took a quick look at this thread and also read the paper and the slides describing Ray, but ending with almost nothing.
    Sorry if I missed some points! It would be appreciated if u can provide a little basic ideas behind Ray.
    For example, it's easy (at least from the description) to understand AByss' distribution strategy.
    Read my blog:

    More on virtual communication with the message-passing interface.
    The message-passing interface (MPI) is a standard that allows numerous computers to communicate in order to achieve a large-scale peer-to-p...


    Also, a silly story:
    IT WAS a wintry day of January, in a coldly-tempered land. On this island lived peculiar citizens whose main everyday whereabouts involved p...



  • gringer
    replied
    Just as a heads-up about memory consumption in upcoming Ray releases, I've just finished a Ray transcriptome run on my desktop computer (using 10 processor cores). This was done using a bleeding-edge git version of Ray (post Kmer academy). Here are some statistics:

    Input files:

    2 paired-end Illumina files, each 7.6GB
    1 454 file, 2.1MB
    1 SOLiD colour-space file, converted to base-space, 2.7GB

    These input files were masked and filtered to eliminate sequences < Q20 (so Ray got no 'unknown' bases, which would be converted to 'A'). I presume this is why the 454 input file was so small, about 1/20 of its original size.

    Total memory consumption was about 21GB [my desktop computer has 24GB], which was similar to memory consumption using the paired-end files alone. I presume this is because the consumption is based on the number of unique k-mers, rather than the input sequence length.

    Elapsed time for each step, Thu Jun 30 12:44:37 2011

    Sequence partitioning: 15 minutes, 53 seconds
    K-mer counting: 34 minutes, 34 seconds
    Coverage distribution analysis: 13 seconds
    Graph construction: 1 hours, 5 minutes, 28 seconds
    Edge purge: 3 minutes, 48 seconds
    Selection of optimal read markers: 46 minutes, 56 seconds
    Detection of assembly seeds: 4 minutes, 18 seconds
    Estimation of outer distances for paired reads: 6 minutes, 1 seconds
    Bidirectional extension of seeds: 11 minutes, 42 seconds
    Merging of redundant contigs: 52 seconds
    Generation of contigs: 4 seconds
    Scaffolding of contigs: 2 minutes, 44 seconds
    Total: 3 hours, 12 minutes, 33 seconds

    Rank 8: assembler memory usage: 2215932 KiB
    Rank 0: assembler memory usage: 2027504 KiB
    Rank 2: assembler memory usage: 2031596 KiB
    Rank 6: assembler memory usage: 2027500 KiB
    Rank 4: assembler memory usage: 2105324 KiB
    Rank 7: assembler memory usage: 2084844 KiB
    Rank 3: assembler memory usage: 2035692 KiB
    Rank 9: assembler memory usage: 2187264 KiB
    Rank 1: assembler memory usage: 2027500 KiB
    Rank 5: assembler memory usage: 2125804 KiB
    Number of contigs: 48078
    Total length of contigs: 9535608
    Number of contigs >= 500 nt: 1746
    Total length of contigs >= 500 nt: 1208359
    Number of scaffolds: 47705
    Total length of scaffolds: 9564818
    Number of scaffolds >= 500 nt: 1894
    Total length of scaffolds >= 500: 1390923
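The per-rank figures above can be added up to cross-check the overall memory consumption. The ten values sum to 20868960 KiB, which is about 19.9 GiB, or roughly 21.4 GB in decimal units. A small sketch, with a hypothetical helper name, that does this from the log lines:

```python
import re


def total_memory_kib(lines):
    """Sum the 'assembler memory usage' values (in KiB) over all ranks."""
    total = 0
    for line in lines:
        match = re.search(r"assembler memory usage:\s*(\d+)\s*KiB", line)
        if match:
            total += int(match.group(1))
    return total


if __name__ == "__main__":
    log = [
        "Rank 8: assembler memory usage: 2215932 KiB",
        "Rank 0: assembler memory usage: 2027504 KiB",
    ]
    kib = total_memory_kib(log)
    print(f"{kib} KiB = {kib / 1024 ** 2:.2f} GiB")
```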
