Isaac2 genome index creation

craczy replied

07-24-2015, 09:02 AM
Originally posted by GenoMax View Post

@Semyon/Come: Can one of you confirm if the following files represent the correct isaac2 index for hg19 genome? My isaac-sort-reference job appeared to have finished (no errors) but these are the only files I see in the top level directory (Temp directory is still there with files within)

Code:

1.1G  2uniqueness.16bpb.gz
47G   kmer-positions-32-0.dat
50K   sorted-reference.xml

This looks correct, but surprising. Did you specify something like "-w 1" on the command line by any chance?

All the kmers are indexed in a single data file (kmer-positions-32-0.dat), which is not ideal, as it prevents parallelisation when searching for mapping candidates.

You can use "isaac-pack-reference" and then "isaac-unpack-reference -w 6" to split the index into smaller files without having to redo the reference sorting.
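
A minimal sketch, in Python, of how one might check whether an index ended up in a single k-mer file like the listing above. The file-name pattern `kmer-positions-<seed length>-<mask>.dat` is an assumption inferred from the one file name shown in this thread, and `count_mask_files` is a hypothetical helper, not part of isaac.

```python
import re
from collections import Counter

# Assumed naming scheme, inferred from "kmer-positions-32-0.dat" in this thread:
# kmer-positions-<seed length>-<mask index>.dat
KMER_FILE = re.compile(r"^kmer-positions-(\d+)-(\d+)\.dat$")

def count_mask_files(file_names):
    """Return {seed_length: number of k-mer position files} for a directory listing."""
    counts = Counter()
    for name in file_names:
        match = KMER_FILE.match(name)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)

# File names taken from the listing quoted above.
listing = ["2uniqueness.16bpb.gz", "kmer-positions-32-0.dat", "sorted-reference.xml"]
for seed, n in count_mask_files(listing).items():
    note = "single file, candidate search cannot be parallelised" if n == 1 else "split"
    print(f"seed length {seed}: {n} file(s) ({note})")
```

For a real index directory, passing `os.listdir(index_dir)` as `file_names` should be enough.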
sklages replied

07-24-2015, 02:14 AM
Originally posted by sklages View Post

Well, .. for now .. the server crashed overnight, just three hours ago ..
We now have to investigate what event caused this crash. Maybe it is just "Murphy's Law" .. we'll see.

Well, .. it was indeed Murphy's law :-)
We had a failure on a network interface .. that sent at least one process into a frenzy and pushed the load beyond 1000...

So I'll restart indexing today.
sklages replied

07-23-2015, 09:52 PM
Originally posted by sklages View Post

OK .. index creation is running for hg19 ... I'll report back tomorrow.

Well, .. for now .. the server crashed overnight, just three hours ago ..
We now have to investigate what event caused this crash. Maybe it is just "Murphy's Law" .. we'll see.
GenoMax replied

07-23-2015, 04:18 PM
@Semyon/Come: Can one of you confirm if the following files represent the correct isaac2 index for hg19 genome? My isaac-sort-reference job appeared to have finished (no errors) but these are the only files I see in the top level directory (Temp directory is still there with files within)

Code:

1.1G  2uniqueness.16bpb.gz
47G   kmer-positions-32-0.dat
50K   sorted-reference.xml
sklages replied

07-23-2015, 06:30 AM
I haven't either .. should use 32.
But .. I am optimistic :-)
GenoMax replied

07-23-2015, 06:23 AM
I did not specify a value for seed-length, so the process is creating all possible combinations [--annotation-seed-lengths arg (=16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80)]. It looks like the end may be in sight today for the process I am running, since the files for 80 are being made now.
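
As a rough illustration of how many seed lengths that default covers: the values quoted from the help text run from 16 to 80 in steps of 4. The short Python sketch below only counts them; it makes no claim about how run time scales with that count.

```python
# Default --annotation-seed-lengths values as quoted from the help text above.
default_seed_lengths = list(range(16, 81, 4))

print(default_seed_lengths)
print(len(default_seed_lengths))   # number of seed lengths processed by default
print(32 in default_seed_lengths)  # 32 is the seed length mentioned in this thread
```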

@sven: Expect a multi-day turnaround.
craczy replied

07-22-2015, 12:12 PM
Originally posted by sklages View Post

But the format of the index files has not changed from version 1 to 2?

Unfortunately it has. The index contains extra information about the reference, and with isaac2 that information has changed. Specifically, the isaac2 index keeps track, for each position in the reference genome, of whether there are similar sequences elsewhere in the reference.
sklages replied

07-22-2015, 11:42 AM
Thanks for the clarification ... good news on isaac-align ;-)

And as for isaac-sort-reference, I still do not think that this is the right way; but you are probably right: as we only run it once for each reference, it might not be too much of a problem for an experienced user for the moment. Nevertheless, you should consider changing or extending this behavior so that a user is able to restrict resources on a single node.
But the format of the index files has not changed from version 1 to 2?
craczy replied

07-22-2015, 10:06 AM
Originally posted by sklages View Post

Sure, .. the job will be run on a single node. Nevertheless I need to know roughly what resources my job will use, and I should be able to restrict resources as well, even if it runs on a non-cluster server.

These are only workarounds ... I do see the problem with the software being designed that way.
IMHO there is no reason to leave the user without control over the resources a piece of software uses ... there is always the argument "speed" and "efficiency" .. maybe. But sometimes it is not only speed that is important ..

Roman mentioned on github that at least the aligner may be restricted to a certain number of CPUs, but that this is not recommended for the sake of "efficiency of processing". Again, "efficiency" does not always mean "speed of single job". But that's just my 2p ;-)

First of all sorry for the confusion around the meaning of the "-j" option across the different tools, and about the inconvenience that you experienced. To clarify:

- isaac-align: this is a single node and single process application and the "-j" option controls the maximum number of compute threads, which would effectively enable the user to control the CPU load on the node (the recommendation is to let the application figure out and use the available resources). As the user can also control the amount of memory used by the process, this should work fairly well in a cluster environment. If it causes trouble with your job scheduler, we would really like to better understand the issue so that we can effectively resolve it (it is a really important feature!).

- isaac-sort-reference: this is a multiprocess application. It can be distributed on multiple nodes, but that requires explicit specification of the qrsh (or other) command line. The option "-j" sets the number of parallel operations (processes as opposed to threads). The recommendation is to execute it on a single node and to use "-j 1". At the moment, this application does not give the user any control over CPU and memory usage. Hopefully this inconvenience is mitigated by the fact that it needs to run at most once per reference. If there really is a need to restrict resource usage, doing it with modern solutions like virtualization might be a good option.

Regarding the time and resources required to run "isaac-sort-reference", a server with 150GB memory is required. A dual CPU (mid-range or better) is recommended. It is also useful to have a reasonably good file system, as the operation does quite a bit of I/O. With a mid-range server it should take about half a day. Again, the "-j 1" option is important on a single node; otherwise the processes will compete for CPU, memory, swap, etc. If it takes much longer than that, it might be worth checking that the node is not stuck on I/O waits or busy swapping.
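
A minimal pre-flight sketch in Python for the memory figure above, assuming a Linux node where `/proc/meminfo` is available. The 150 GB threshold comes from this post; `mem_total_gb` and `enough_memory` are hypothetical helpers, not isaac tools, and the value is computed in GiB and treated as roughly GB.

```python
REQUIRED_GB = 150  # figure quoted above for isaac-sort-reference

def mem_total_gb(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / 1024 ** 2
    raise ValueError("MemTotal not found")

def enough_memory(meminfo_text, required_gb=REQUIRED_GB):
    """True if the node's total memory meets the required amount."""
    return mem_total_gb(meminfo_text) >= required_gb

# Made-up sample text in the /proc/meminfo layout, for illustration only.
sample = "MemTotal:       264000000 kB\nMemFree:         1000000 kB\n"
print(round(mem_total_gb(sample), 1), enough_memory(sample))
```

On a real node the text would come from `open("/proc/meminfo").read()`.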

Thanks a lot for your feedback on github!

Come
GenoMax replied

07-22-2015, 05:48 AM
I am with you all the way. Core infrastructure providers are left to fend for themselves with this sort of thing, which the end-users don't appreciate/care about.

This was one of the reasons I started this conversation so @semyon can take the real world observations back for internal discussion/improvements, especially if they want more users to use their software.

Last edited by GenoMax; 07-22-2015, 05:51 AM.
sklages replied

07-22-2015, 05:38 AM
Originally posted by GenoMax View Post

isaac is not meant to be used across a cluster (just on a single node in the cluster).

Sure, .. the job will be run on a single node. Nevertheless I need to know roughly what resources my job will use, and I should be able to restrict resources as well, even if it runs on a non-cluster server.

You have to be creative. Request exclusive access to a node in your scheduler/limit the I/O. It will also involve conversations with your cluster admins so they don't have a heart attack on seeing those kinds of loads on a single server in the cluster.

These are only workarounds ... I do see the problem with the software being designed that way.
IMHO there is no reason to leave the user without control over the resources a piece of software uses ... there is always the argument "speed" and "efficiency" .. maybe. But sometimes it is not only speed that is important ..

Roman mentioned on github that at least the aligner may be restricted to a certain number of CPUs, but that this is not recommended for the sake of "efficiency of processing". Again, "efficiency" does not always mean "speed of single job". But that's just my 2p ;-)
GenoMax replied

07-22-2015, 05:10 AM
Originally posted by sklages View Post

Ha, .. but as you said .. that makes it completely unusable for cluster environments, if you have no control over cpu usage / machine load. Bad people may call that "error by design". ;-)

isaac is not meant to be used across a cluster (just on a single node in the cluster).

If this behaviour is not changed, I will never be able to test the aligner itself :-)

You have to be creative. Request exclusive access to a node in your scheduler/limit the I/O. It will also involve conversations with your cluster admins so they don't have a heart attack on seeing those kinds of loads on a single server in the cluster.

Last edited by GenoMax; 07-22-2015, 05:12 AM.
sklages replied

07-22-2015, 04:57 AM
Originally posted by GenoMax View Post

HiSeq Analysis Software (HAS), which isaac was a part of, always did this. It seems to pay no attention to the -j directive (as I said yesterday, the HAS documentation does say that it will take over the node).

I suggest watching I/O on your RAID (especially if it is shared with some other users/nodes). HAS/Isaac can do some interesting things to storage too.

EDIT: Just saw your update. That kind of load is "normal". It is only periodic (and partly related to storage). Isaac will also not use all the cores all the time so that part is "normal" too.

Ha, .. but as you said .. that makes it completely unusable for cluster environments, if you have no control over cpu usage / machine load. Bad people may call that "error by design". ;-)

I have written a ticket on github as I consider this a bug.

If this behaviour is not changed, I will never be able to test the aligner itself :-)

In the past I never used HAS as it was shipped with a very old version of the aligner ..
GenoMax replied

07-22-2015, 04:47 AM
Originally posted by sklages View Post

Using "--jobs 32" on a 48 core machine results in 48 threads. So you are right ...
But this is a bug. Otherwise "--jobs N" does not make any sense.
So let's see how long it takes to build hg19 index ..

HiSeq Analysis Software (HAS), which isaac was a part of, always did this. It seems to pay no attention to the -j directive (as I said yesterday, the HAS documentation does say that it will take over the node).

I suggest watching I/O on your RAID (especially if it is shared with some other users/nodes). HAS/Isaac can do some interesting things to storage too.

EDIT: Just saw your update. That kind of load is "normal". It is only periodic (and partly related to storage). Isaac will also not use all the cores all the time so that part is "normal" too.
sklages replied

07-22-2015, 04:22 AM
I started to build a new index for hg19 on 32 cores on a fast (local) RAID.

Using "--jobs 32" on a 48 core machine results in 48 threads. So you are right ...
But this is a bug. Otherwise "--jobs N" does not make any sense.
So let's see how long it takes to build hg19 index ..

UPDATE:
For now I have cancelled building the index.
The average load on that machine rose beyond 52, with peaks over 128 ... I need to investigate first :-)
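
To put those load figures in relation to the 48 cores, here is a small Python sketch. `load_per_core` and `current_load_per_core` are hypothetical helpers; on Linux the load average also counts processes waiting on I/O, so a high ratio may point at storage rather than CPU.

```python
import os

def load_per_core(load_average, cores):
    """Ratio of load average to core count; values well above 1 suggest contention."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores

def current_load_per_core():
    """1-minute load per core on the local machine (os.getloadavg is Unix-only)."""
    one_min, _, _ = os.getloadavg()
    return load_per_core(one_min, os.cpu_count() or 1)

# Figures from this post: 48-core machine, average load 52, peaks of 128.
print(round(load_per_core(52, 48), 2))
print(round(load_per_core(128, 48), 2))
```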

Last edited by sklages; 07-22-2015, 04:39 AM.
