**maasha** · 02-13-2013, 01:44 AM

GNU parallel is brilliant for executing command line tools in a Unix/Linux setup with multiple servers/CPUs. It works very well with Biopieces. See the HowTo.

**turnersd** · 02-13-2013, 05:11 AM

I use it all the time in place of xargs.

**yaximik** · 02-13-2013, 06:15 AM

Yep, Biopieces is one example, although the HowTo carefully says it can be used for some tasks. As the examples for parallel do not include many bioinformatics tasks, I wonder if there is a general idea of which tasks can benefit from it. More specifically, could compute-intensive, long-running jobs like BLAST, alignment, or de novo assembly benefit from parallel?

**turnersd** · 02-13-2013, 06:20 AM

Parallel won't parallelize an intrinsically serial job, but it makes it very easy to launch many serial jobs in parallel. I use it all the time to run an operation on lots of files, using something like:

Code:

find *.fq | parallel fastqc {} --outdir .  # run fastqc on all .fq files
find *.bam | parallel samtools index {}    # index all bam files

**Richard Finney** · 02-13-2013, 06:27 AM

This looks perfect. I've got my own homebrewed program I called "tetris" which does the same thing but I'll definitely switch to this.

Note the --max-procs parameter, which caps the number of jobs running at once and so limits how many CPUs are used.

Anybody hooked this thing up to "gnu niceload"? Any examples?

**yaximik** · 02-13-2013, 07:06 AM

Interesting. A little off topic, but I encountered a strange difference in CPU use with fastqc. I made a small script to process 10 files at once (the box has two quad-core processors with multithreading enabled, that is, 8 physical cores and 16 threads), like
fastqc -t 10 [file1 ... file10]
When I launched the script, CPU usage only reached about 26% us in top. But when I copied the same command to the command line, it jumped to about 85% us. What could be the reason for the difference? Did you notice anything like that with parallel?

**tange** · 02-13-2013, 10:51 AM

If used for research please remember:

Code:

parallel --bibtex

**maasha** · 02-13-2013, 11:03 AM

@yaximik I see these major benefits of parallel: 1) use parallel instead of a for; do & done; loop to execute some command in parallel in a way that optimizes the CPU usage (parallel cleverly decides to wait for jobs to complete before starting new jobs without flooding the machine). 2) use the parallel --pipe to parallelize the processing of huge files. 3) combine 1) and 2). And then there are all the other things that parallel can do for you.

**ersgupta** · 02-15-2013, 10:37 AM

I have been using it for the past 6-8 months. I feel very happy when I am able to run my jobs using parallel, because it saves a hell of a lot of time. It really helps you make the best use of the computational facilities you have.

Here is an example of the time I normally save:
If I have to convert around 8 SAM files to BAM files, and one conversion generally takes 8 min, running them serially would take 64 min. When I run them on a cluster using GNU parallel, it takes just ~8 min.
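The arithmetic behind those numbers can be sketched as below. This is a back-of-the-envelope estimate only: it assumes every job takes the same time and ignores I/O contention and startup overhead, and estimate_minutes is a made-up helper name:

```shell
# Estimated wall time in minutes: ceil(n_jobs / slots) * minutes_per_job
estimate_minutes() {
    n_jobs=$1
    slots=$2
    per_job=$3
    echo $(( (n_jobs + slots - 1) / slots * per_job ))
}

serial=$(estimate_minutes 8 1 8)   # one job at a time: 64
full=$(estimate_minutes 8 8 8)     # all eight at once: 8
half=$(estimate_minutes 8 4 8)     # four at a time: 16
echo "$serial $full $half"
```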

**maasha** · 02-15-2013, 10:40 AM

Over at Biostars there is this tool description.

**yaximik** · 02-15-2013, 10:55 AM

Originally posted by maasha:

Over at Biostars there is this tool description.

Oh, that is a cool set of examples. Tnx!

**rflrob** · 02-15-2013, 11:33 AM

Another nice thing about parallel is that it makes it easy to generate filenames in an intelligent way. Say you want to convert a bunch of bam files to sam files, you can easily do:

parallel 'samtools view -h -o {.}.sam {}' ::: *.bam

which does exactly what you want, instead of potentially ending up with .bam.sam or the like. That's just a trivial example (and possibly not correct, I never exactly remember the syntax), and there's a lot more you can do with it.

**yaximik** · 03-11-2013, 02:10 PM

I tried to run a conversion between two assembly formats using parallel and amos2ace, but got an error:

Code:

$ cat /home/yaximik/AssRefMap/SC/Ray/RayOutput/AMOS.afg | parallel --block 100M -k --pipe --recstart '{' --recend '}' amos2ace > /home/yaximik/AssRefMap/SC/Ray/RayOutput/AMOS.ace
substr outside of string at /usr/bin/parallel line 333.

Any idea what this means and how to fix the problem?

**maasha** · 03-12-2013, 01:06 AM

@yaximik

New questions in new threads. Do your homework first:

read this:

Ten Simple Rules for Getting Help from Online Scientific Communities

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002202

and then

man parallel

Notice the section:

Your bug report should always include:

· The error message you get (if any).

· The output of parallel --version. If you are not running
the latest released version you should specify why you
believe the problem is not fixed in that version.

· A complete example that others can run that shows the
problem. This should preferably be small and simple. A
combination of seq, cat, echo, and sleep can reproduce most
errors. If your example requires large files, see if you
can make them by something like seq 1000000 > file.

· The output of your example. If your problem is not easily
reproduced by others, the output might help them figure out
the problem.

If you suspect the error is dependent on your environment or
distribution, please see if you can reproduce the error on
one of these VirtualBox images:

http://sourceforge.net/projects/virtualboximage/files/

Specifying the name of your distribution is not enough as you
may have installed software that is not in the VirtualBox
images.

If you cannot reproduce the error on any of the VirtualBox
images above, you should assume the debugging will be done
through you. That will put more burden on you and it is extra
important you give any information that help.

GNU parallel - any benefits?
