  • Originally posted by habm:
    Our longer-insert Illumina mate-pair libraries have significant duplication contamination, i.e. two size peaks: one of inward-facing false PE reads ("innies") at under 300 bp, and one of outward-facing reads ("outies") nearer the desired insert size, e.g. 3000 or 5000 bp.
    How should the MP library size mean and SD be specified to allow Ray to deal with this, please?
    Thanks.

    PS: a run without any insert sizes specified (i.e. automatic detection) suggests that Ray found the innies OK, but not the useful outies:
    LibraryNumber: 1 (nominally 3 kbp, really more like 2200 bp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 457
    StandardDeviation: 441
    DetectionFailure: Yes

    LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 302
    StandardDeviation: 218
    DetectionFailure: Yes

    LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
    InputFormat: TwoFiles,Paired
    AverageOuterDistance: 260
    StandardDeviation: 213
    DetectionFailure: Yes

    Another way of expressing this is to ask whether Ray can disambiguate PE and MP reads, and if so, what input information is needed?
    Here is one way you can use your data:

    For each of your paired libraries, analyze the file PREFIX.LibraryX.txt.

    For example, for LibraryNumber 3:

    less PREFIX.Library3.txt
    Locate your peak near 6300 to find the real value. Let us suppose that it is 6189.

    The columns in these Library files are: observed outer distance and frequency.
    Basically, you want to locate the observed outer distance with the maximum frequency near 6300.

    Do that for all your paired libraries.
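
    If you prefer not to eyeball the files, here is a minimal Python sketch of that lookup, assuming the two-column layout described above (the filename, expected size and search window are placeholders):

    # Sketch: report the observed outer distance with the highest
    # frequency near an expected insert size, from a two-column file.
    def find_peak(path, expected, window=0.5):
        best_distance, best_frequency = None, -1.0
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) != 2 or not fields[0].isdigit():
                    continue  # skip headers or malformed rows
                distance, frequency = int(fields[0]), float(fields[1])
                # only consider distances within +/- 50% of the expected size
                if abs(distance - expected) <= window * expected and frequency > best_frequency:
                    best_distance, best_frequency = distance, frequency
        return best_distance

    print(find_peak("PREFIX.Library3.txt", 6300))  # e.g. 6189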

    Then, edit your Ray command to manually include the average and standard deviation for each of your libraries:

    mpirun -np 999999 /software/ray-version/Ray \
     -p lib3_1.fastq lib3_2.fastq 6189 618 \
     -p ...

    (Replace 6189 with the value you found in the Library file, and let the standard deviation be around 10% of it.)

    Doing so will allow you to use the long-insert pairs, but Ray will ignore the paired information for the short-insert contaminants (within each library) because they won't fall within the specified ranges.
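
    Illustratively (my sketch of that filtering idea, not Ray's actual code; the 3-standard-deviation window is an assumption):

    # Illustration only: a pair contributes paired information only if its
    # observed outer distance falls inside the specified window, here
    # assumed to be the mean +/- 3 standard deviations.
    def pair_is_usable(observed_distance, mean=6189, sd=618, k=3):
        return abs(observed_distance - mean) <= k * sd

    print(pair_is_usable(6200))  # True: a genuine long-insert "outie"
    print(pair_is_usable(250))   # False: a short-insert contaminant ("innie")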

    Still, all the reads will contribute to building the genome graph.

    Hope it helps.


    Sébastien



    • Originally posted by seb567:
      Well, for low memory usage, you definitely want to use Ray v1.6.1 (on its way; presently Ray v1.6.1-rc3, which is available at https://github.com/sebhtml/ray/zipball/v1.6.1-rc3).

      See http://sourceforge.net/mailarchive/m...sg_id=27781099 for more details.

      In your message, you don't report how much memory your compute cores have access to.
      I believe each core has 4 GB of RAM, though that may vary; some have 2 GB, some 1 GB. From what I understand, other programs that are single-threaded, or that have single-threaded bottlenecks, can't use distributed memory and have to be contained on one machine, which would limit me to 64 GB of RAM on our cluster. That's why I brought up the RAM limitations.

      Originally posted by seb567:
      Ray is a peer-to-peer program, that is you can launch it on 2048 compute cores if you want.
      Do you know how it scales? I'm billed linearly, so 2 hours on 128 cores costs the same as 1 hour on 256 cores. But does Ray actually run twice as fast on twice the cores? Or is it simply too variable to know the optimum, depending on the data?

      This isn't a huge concern, as the cost per CPU-hour is low, so long as the speed-up is close to linear and I'm not "wasting" large numbers of cores or getting stuck for long periods with too few cores.

      And I put "wasting" in quotes because I know none will really be wasted. I'm just concerned that at some point the marginal addition of another core isn't offset by the marginal decrease in time to completion. Or the reverse, where the program gets "stuck" due to limited memory per core or whatnot.

      Originally posted by seb567:
      But you should first do a run with k=31, just to quality-control the thing.
      Will do. I was planning on trying a few k-mer sizes anyway and comparing to a few other programs, to see what might work best.


      Originally posted by seb567:
      What is the interconnect between your compute cores?
      It's DDR InfiniBand, so it should be 4 Gbit/s.



      • Originally posted by Wallysb01:
        Do you know how it scales? I'm billed linearly, so 2 hours on 128 cores costs the same as 1 hour on 256 cores. But does Ray actually run twice as fast on twice the cores? Or is it simply too variable to know the optimum, depending on the data?
        After a night of playing around with the number of cores and letting the job run for an hour or two, I can somewhat answer my own question. I thought others might be interested, and I also wanted to confirm that this sounds right.

        Anyway, I ran two jobs with the same 2 lanes of data, totalling about 500M PE reads. One time I ran with 256 processors, the other with 512, both for between 1.5 and 2 hours. What I found is that on our cluster, loading the sequences was very slow, and it seemed to happen on only one processor at a time. So this step will take the same amount of time regardless of the processor count, as the limiting factor is only the amount of data flowing through our DDR InfiniBand, not the processors. By my estimation, loading alone was going to take 20 hours. That's kind of a lot of time for me to take up using 512 or even 256 processors, so I'm going to scale back to 96 or even 64 processors before I try again. But even 20 x 64 CPU-hours is not chump change for me, and that only loads the data.

        Now, once the data is loaded, the processor speed on our cluster should be comparable to the one used for the human data, which was handling about 12M reads per core. If I use 64 cores, I should still be at only 8M reads per core.

        So I'm hoping that, computationally, the cores can analyze the data roughly as quickly as colosse, and that later data transfers are not as much of a bottleneck. Can you confirm that is likely the case, Sébastien?
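
        As a back-of-the-envelope check of that reasoning, here is a small Amdahl-style cost model (my sketch; the 20-hour serial load comes from my estimate above, while the 500 parallel CPU-hours figure is a made-up placeholder):

        # Fixed serial loading phase plus a perfectly parallel phase.
        # Placeholder numbers, for illustration only.
        def wall_hours(cores, serial_hours=20.0, parallel_core_hours=500.0):
            return serial_hours + parallel_core_hours / cores

        for cores in (64, 96, 256, 512):
            hours = wall_hours(cores)
            print(f"{cores:4d} cores: {hours:5.1f} h wall, {cores * hours:7.0f} CPU-hours")

        Past a certain point, adding cores barely reduces wall time but inflates the CPU-hour bill, which is exactly the trade-off I'm worried about.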



        • Originally posted by Wallysb01:
          After a night of playing around with the number of cores and letting the job run for an hour or two, I can somewhat answer my own question. I thought others might be interested, and I also wanted to confirm that this sounds right.

          Anyway, I ran two jobs with the same 2 lanes of data, totalling about 500M PE reads. One time I ran with 256 processors, the other with 512, both for between 1.5 and 2 hours. What I found is that on our cluster, loading the sequences was very slow, and it seemed to happen on only one processor at a time. So this step will take the same amount of time regardless of the processor count, as the limiting factor is only the amount of data flowing through our DDR InfiniBand, not the processors. By my estimation, loading alone was going to take 20 hours. That's kind of a lot of time for me to take up using 512 or even 256 processors, so I'm going to scale back to 96 or even 64 processors before I try again. But even 20 x 64 CPU-hours is not chump change for me, and that only loads the data.

          Now, once the data is loaded, the processor speed on our cluster should be comparable to the one used for the human data, which was handling about 12M reads per core. If I use 64 cores, I should still be at only 8M reads per core.

          So I'm hoping that, computationally, the cores can analyze the data roughly as quickly as colosse, and that later data transfers are not as much of a bottleneck. Can you confirm that is likely the case, Sébastien?
          20 hours to load 250 M reads on InfiniBand: you probably have issues with your file system.

          What is the latency measured by Ray? (PREFIX.NetworkTest.txt)

          It should be around 50-100 microseconds, as far as I know.

          10 Gigabit Ethernet is 400-700 microseconds.


          By the way, Rank 0 computes the partition of the data (it counts entries in the files; no allocation).

          Then, the partition is sent to all MPI ranks.

          Finally, each MPI rank takes its slice.

          This should be fast.
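
          In other words (my reading of the steps above, sketched in Python; not Ray's actual code), each MPI rank ends up with a contiguous slice of the global read range:

          # Rank 0 counts the entries; the totals are sent to every rank,
          # and each rank then derives its own slice independently.
          def my_slice(rank, rank_count, total_entries):
              base, extra = divmod(total_entries, rank_count)
              start = rank * base + min(rank, extra)
              length = base + (1 if rank < extra else 0)
              return start, start + length  # half-open interval [start, end)

          total = 500_000_000  # e.g. ~500M reads
          print(my_slice(0, 512, total))    # (0, 976563)
          print(my_slice(511, 512, total))  # (499023438, 500000000)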

          Is it a Lustre file system?
          If so, do you use special stripe values?

          Sébastien



          • Hi Sébastien, thanks for the reply.

            I don't see the NetworkTest.txt file anywhere. I'll have to look into how to make sure it is part of the outputs.

            I honestly don't know a ton about the cluster. It's the Saguaro cluster at ASU (http://hpc.asu.edu/home) if you want to take a look.

            That said, I've now tried a job with 64 cores and it loaded the sequences in 1:45, which seems pretty reasonable. However, I got a bus error while computing vertices. I've since reinstalled with force packing turned off, which was already mentioned in this thread as a possible cause of this problem, and am rerunning Ray; it will reach the spot where the bus error occurred in about an hour.

            So I'm not sure if the cluster has some sort of bottleneck when trying to use more than a certain number of processors, but I'm happy using 64 so long as it can finish within 4 days.



            • Originally posted by Wallysb01:
              Hi Sébastien, thanks for the reply.

              I don't see the NetworkTest.txt file anywhere. I'll have to look into how to make sure it is part of the outputs.

              I honestly don't know a ton about the cluster. It's the Saguaro cluster at ASU (http://hpc.asu.edu/home) if you want to take a look.

              That said, I've now tried a job with 64 cores and it loaded the sequences in 1:45, which seems pretty reasonable. However, I got a bus error while computing vertices. I've since reinstalled with force packing turned off, which was already mentioned in this thread as a possible cause of this problem.

              So I'm not sure if the cluster has some sort of bottleneck when trying to use more than a certain number of processors, but I'm happy using 64 so long as it can finish within 4 days.
              If you use at least v1.6.1-rc3 (https://github.com/sebhtml/ray/zipball/v1.6.1-rc3), then you should get the file PREFIX.NetworkTest.txt.



              "bus error"

              Recompile Ray with FORCE_PACKING=n.


              Running Ray with FORCE_PACKING=n will be faster on your cluster.

              Basically, FORCE_PACKING=y does not align addresses on 8-byte boundaries. On some processor architectures the code will then run very slowly or throw bus errors at you; on other architectures, it won't.
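
              To illustrate the trade-off (a minimal Python sketch of the general alignment concept, not Ray's memory layout):

              import struct

              # Native alignment ('@'): the 8-byte integer is padded to an
              # 8-byte boundary, so a char followed by an int64 occupies 16
              # bytes on a typical 64-bit machine.
              print(struct.calcsize('@cq'))  # 16

              # Packed ('='): no padding, 9 bytes total, but the int64 now
              # sits at offset 1. Reading such an address as one 64-bit word
              # is what strict-alignment CPUs reject with a bus error.
              print(struct.calcsize('=cq'))  # 9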

              Action point:

              Recompile v1.6.1-rc3 with FORCE_PACKING=n (the default value):

              make clean
              make PREFIX=ray-1.6.1-rc3 FORCE_PACKING=n
              make install


              The whole thing should run faster.

              Sébastien



              • OK, I was just using 1.6.0.

                I'll probably let this run continue on 1.6.0 for now, but I'll install 1.6.1-rc3 and give it a try next time.

                I'll update you on progress, and thanks for the help.



                • Contamination in mate-pair libraries (two peaks)

                  Originally posted by habm:
                  Our longer-insert Illumina mate-pair libraries have significant duplication contamination, i.e. two size peaks: one of inward-facing false PE reads ("innies") at under 300 bp, and one of outward-facing reads ("outies") nearer the desired insert size, e.g. 3000 or 5000 bp.
                  How should the MP library size mean and SD be specified to allow Ray to deal with this, please?
                  Thanks.

                  PS: a run without any insert sizes specified (i.e. automatic detection) suggests that Ray found the innies OK, but not the useful outies:
                  LibraryNumber: 1 (nominally 3 kbp, really more like 2200 bp)
                  InputFormat: TwoFiles,Paired
                  AverageOuterDistance: 457
                  StandardDeviation: 441
                  DetectionFailure: Yes

                  LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
                  InputFormat: TwoFiles,Paired
                  AverageOuterDistance: 302
                  StandardDeviation: 218
                  DetectionFailure: Yes

                  LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
                  InputFormat: TwoFiles,Paired
                  AverageOuterDistance: 260
                  StandardDeviation: 213
                  DetectionFailure: Yes

                  Another way of expressing this is to ask whether Ray can disambiguate PE and MP reads, and if so, what input information is needed?
                  I will modify Ray over the next few days to support many peaks in each library, because the data for Assemblathon 2 also has these artefacts. So I guess it is normal to see them.

                  Plot: http://imgur.com/g657V
                  Data: http://pastebin.com/LCpAa9uv
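
                  For the curious, here is a rough sketch of one way to pick out several peaks from such a distribution (my illustration of the idea, not necessarily what Ray will implement):

                  # points: list of (observed outer distance, frequency),
                  # sorted by distance, e.g. parsed from PREFIX.LibraryX.txt.
                  def find_peaks(points, half_window=5, min_frequency=100):
                      peaks = []
                      for i in range(half_window, len(points) - half_window):
                          distance, frequency = points[i]
                          neighborhood = points[i - half_window:i + half_window + 1]
                          # a peak is the maximum of its local neighborhood,
                          # above a noise floor
                          if frequency >= min_frequency and frequency == max(f for _, f in neighborhood):
                              peaks.append((distance, frequency))
                      return peaks

                  example = [(250, 900), (300, 1500), (350, 800),
                             (6000, 200), (6189, 400), (6400, 150)]
                  print(find_peaks(example, half_window=1))  # [(300, 1500), (6189, 400)]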


                  But first I will release v1.6.1 on sourceforge as soon as I get my hands on my system test results.

                  Assemblathon 2

                  For those who follow Assemblathon 2, here is my last run on my testbed (Illumina data from BGI and from Illumina UK):

                  (all mate-pairs failed detection because of many peaks in each library)


                  Number of contigs: 550764
                  Total length of contigs: 1672750795
                  Number of contigs >= 500 nt: 501312
                  Total length of contigs >= 500 nt: 1656776315
                  Number of scaffolds: 510607
                  Total length of scaffolds: 1681345451
                  Number of scaffolds >= 500 nt: 463741
                  Total length of scaffolds >= 500 nt: 1666464367

                  k-mer length: 31
                  Lowest coverage observed: 1
                  MinimumCoverage: 42
                  PeakCoverage: 171
                  RepeatCoverage: 300
                  Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
                  Estimated genome length: 1226739694 nucleotides
                  Percentage of vertices with coverage 1: 83.7771 %
                  DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt
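
                  As an aside, the estimated genome length above is exactly half the solid k-mer count (my interpretation: each genomic position contributes one k-mer per DNA strand, hence the factor of 2):

                  # Sanity check of the numbers reported above.
                  solid_kmers = 2_453_479_388
                  print(solid_kmers // 2)  # 1226739694, the reported estimate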

                  [1,0]<stdout>: Sequence partitioning: 1 hours, 54 minutes, 47 seconds
                  [1,0]<stdout>: K-mer counting: 5 hours, 47 minutes, 20 seconds
                  [1,0]<stdout>: Coverage distribution analysis: 30 seconds
                  [1,0]<stdout>: Graph construction: 2 hours, 52 minutes, 27 seconds
                  [1,0]<stdout>: Edge purge: 57 minutes, 55 seconds
                  [1,0]<stdout>: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
                  [1,0]<stdout>: Detection of assembly seeds: 16 minutes, 7 seconds
                  [1,0]<stdout>: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
                  [1,0]<stdout>: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
                  [1,0]<stdout>: Merging of redundant contigs: 15 minutes, 45 seconds
                  [1,0]<stdout>: Generation of contigs: 1 minutes, 41 seconds
                  [1,0]<stdout>: Scaffolding of contigs: 54 minutes, 3 seconds
                  [1,0]<stdout>: Total: 18 hours, 3 minutes, 50 seconds

                  # average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
                  # Message passing interface rank Name Latency in microseconds
                  0 r107-n24 138
                  1 r107-n24 140
                  2 r107-n24 140
                  3 r107-n24 140
                  4 r107-n24 141
                  5 r107-n24 141
                  6 r107-n24 140
                  7 r107-n24 140
                  8 r107-n25 140
                  9 r107-n25 139
                  10 r107-n25 138
                  11 r107-n25 139


                  512 compute cores (64 computers * 8 cores/computer = 512)

                  Typical communication profile for one compute core:

                  [1,0]<stdout>:Rank 0: sent 249841326 messages, received 249840303 messages.

                  Yes, each core sends an average of 250 M messages during the 18 hours!

                  Total number of unfiltered Illumina TruSeq v3 sequences: 3 072 136 294, that is ~3 G sequences!


                  Peak memory usage per core: 2.2 GiB

                  Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB

                  The peak occurs around the 3-hour mark, and usage goes down to 1.1 GiB per node immediately afterward because the pool of defragmentation groups for k-mers occurring once is freed.


                  Sébastien



                  • Originally posted by seb567:
                    What is the latency measured by Ray? (PREFIX.NetworkTest.txt)

                    It should be around 50-100 microseconds, as far as I know.

                    10 Gigabit Ethernet is 400-700 microseconds.
                    I finally did another try with a new install, and Ray measured the latency at around 125 microseconds. Is that part of the reason it's loading fairly slowly?

                    Also, I used the FORCE_PACKING=n option on Ray-1.6.1-rc3, and I still encountered a bus error. Any ideas on what else might be causing that?



                    • Originally posted by Wallysb01:
                      I finally did another try with a new install, and Ray measured the latency at around 125 microseconds. Is that part of the reason it's loading fairly slowly?

                      Also, I used the FORCE_PACKING=n option on Ray-1.6.1-rc3, and I still encountered a bus error. Any ideas on what else might be causing that?
                      125 is just fine.

                      Is this your system:

                      Saguaro 2


                      If so, the architecture is

                      Intel EM64T Xeon E54xx (Harpertown) 2830 MHz (11.32 GFlops)



                      Did you run 'make clean' before doing 'make FORCE_PACKING=n', if you had previously built Ray-1.6.1-rc3 with FORCE_PACKING=y?

                      Otherwise, your build will be a mix of both.

                      If that is not the issue, can you send an email to the Ray mailing list and include the point at which you get your bus error?




                      colosse.clumeq.ca also uses Intel Xeon processors, but they are Nehalem-EP, not Harpertown.

                      Thanks.

                      Sébastien



                      • Thanks for your help, Sébastien; doing a fresh install of 1.6.1-rc3 with force packing off worked wonderfully. The starting sequence data totaled about 250M PE 104 bp reads, and Ray finished in 18.5 hours on 96 nodes.

                        I ran the simple default configuration, except for setting the k-mer size to 31, since I'd tried a few other programs with that size and wanted to be able to compare the assemblies, though no other program has finished yet.

                        The assembly was not particularly great, however. N50 = 194, the total contig lengths only added up to 44% of the estimated genome size, and the longest contig was 5383 bp. Do you have any particular recommendations? We should have coverage of almost 30x. Does that mean we should try larger k-mers? Would -debug-bubbles and -debug-seeds help?



                        • Originally posted by Wallysb01:
                          Thanks for your help, Sébastien; doing a fresh install of 1.6.1-rc3 with force packing off worked wonderfully. The starting sequence data totaled about 250M PE 104 bp reads, and Ray finished in 18.5 hours on 96 nodes.

                          I ran the simple default configuration, except for setting the k-mer size to 31, since I'd tried a few other programs with that size and wanted to be able to compare the assemblies, though no other program has finished yet.

                          The assembly was not particularly great, however. N50 = 194, the total contig lengths only added up to 44% of the estimated genome size, and the longest contig was 5383 bp. Do you have any particular recommendations? We should have coverage of almost 30x. Does that mean we should try larger k-mers? Would -debug-bubbles and -debug-seeds help?
                          Well, you have to do some post-assembly quality control.

                          First, check PREFIX.CoverageDistributionAnalysis.txt. You need a high peak coverage (at least 20; higher is better).
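
                          A quick way to pull out that peak (a sketch; it assumes the distribution file has two columns, coverage depth and k-mer count, and it skips coverage 1, which is dominated by sequencing errors):

                          def peak_coverage(path, ignore_below=2):
                              best_coverage, best_count = 0, 0
                              with open(path) as handle:
                                  for line in handle:
                                      fields = line.split()
                                      if len(fields) != 2 or not fields[0].isdigit():
                                          continue  # skip headers or malformed rows
                                      coverage, count = int(fields[0]), int(fields[1])
                                      # coverage-1 k-mers are mostly errors
                                      if coverage >= ignore_below and count > best_count:
                                          best_coverage, best_count = coverage, count
                              return best_coverage

                          print(peak_coverage("PREFIX.CoverageDistribution.txt"))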

                          Do you have paired information?

                          If so, check PREFIX.LibraryStatistics.txt and verify that the averages and standard deviations are OK.

                          What data do you have? Is it Illumina TruSeq v3? (Those are very neat.)

                          -debug-bubbles, -debug-seeds and -show-ending-context just display more information.

                          Sébastien



                          • So, peak coverage was 913.

                            Interestingly, it estimated the genome size at ~1.8 Mb, when it should be closer to ~1.8 Gb.

                            I also specified the insert sizes of both paired-end libraries, which were made with Illumina TruSeq v3. So far the insert sizes were both about 450 bp, with 104 bp paired-end reads. We may end up sequencing more libraries with larger inserts to improve the contigs and actually get scaffolds.

                            I was also planning to use BLAST afterward to filter out contigs aligning to human/bacterial genomes, but that might not happen until I try a couple of different k-mer sizes.



                            • mate-pair libraries

                              Thanks for the helpful responses, Sébastien, and for addressing this problem in the next version.
                              Meanwhile, are the files PREFIX.LibraryX.txt made automatically, please? I cannot see them, whether or not I set a mean and SD for the insert size on the command line.



                              • Hi Sébastien,

                                I have a question: Ray appears to be automatically setting the minimum k-mer coverage to the peak coverage minus 1. Is that true, and what's the reason for this? I apparently have an absurd peak coverage of 626, even with k=61. Could this be why my assembly was poor the first time, with k=31 above?

                                I'm thinking I'm going to cancel my job and drop that min coverage.

