  • SGE and ncbi-blast-2.2.28+

    Hello,

    I've predicted genes from metagenomic assemblies with FragGeneScan. The next step is to query the predicted peptides against NCBI's nr database. My cluster consists of sixteen Intel(R) Xeon(R) E5-2670 CPUs @ 2.60GHz, which adds up to 256 threads altogether. One option would be to use blast's '-num_threads' flag. However, in my experience, this doesn't parallelize the task entirely.

    So my plan is to run blast through SGE using the script below (after adapting it for blast-2.2.28+). Here's more info.

    Code:
    #!/bin/bash
    #
    #$ -cwd
    #$ -S /bin/bash
    #$ -j y
    
    export BLASTDB=/share/bio/ncbi/db/
    export BLASTMAT=/opt/Bio/ncbi/data/
    
    export PATH=$PATH:/opt/Bio/ncbi/bin
    
    blastall -d patnt -p blastn -i $HOME/test.txt -o $HOME/result.txt
    I have no previous experience with SGE (all I know is that it's set up on the cluster I'm using). So my question is, should I omit the '-num_threads' flag from my blast command entirely?
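
For reference, a hedged sketch of how the legacy call in the script above might translate to BLAST+ option names. It assumes the same database name (patnt) and file paths as the old script, and that the BLAST+ binaries are on PATH. `-task blastn` is added because BLAST+ `blastn` defaults to megablast. `DRY_RUN` and `cmd_line` are hypothetical names introduced only for this sketch, so that the command can be inspected before it is run.

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -j y

export BLASTDB=/share/bio/ncbi/db/

# BLAST+ ships one binary per program, so "blastall -p blastn" becomes "blastn",
# and -d / -i / -o become -db / -query / -out.
# Assumption: the BLAST+ bin directory is already on PATH.
set -- blastn -task blastn -db patnt -query "$HOME/test.txt" -out "$HOME/result.txt"
cmd_line="$*"

# DRY_RUN is a hypothetical switch for this sketch, not an SGE or BLAST option:
# by default the command is only printed; set DRY_RUN=0 to execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd_line"
else
    "$@"
fi
```

For a protein search against nr the same pattern would apply with `blastp` in place of `blastn -task blastn`.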
    savetherhino.org

  • #2
    Have a look at this as another option: http://www3.imperial.ac.uk/bioinfsup...ing_array_jobs

    Even though you have 16 CPUs, how much memory do you have available for each? You may need roughly 10G per job if you are going to search against "nr".



    • #3
      Originally posted by GenoMax View Post
      Have a look at this as another option: http://www3.imperial.ac.uk/bioinfsup...ing_array_jobs

      Even though you have 16 CPUs, how much memory do you have available for each? You may need roughly 10G per job if you are going to search against "nr".
      Thanks, the link is very helpful. The cluster has 256G RAM. So, I suppose a good solution would be to run 16 independent tasks with 16 threads in each.


      • #4
        Originally posted by rhinoceros View Post
        Thanks, the link is very helpful. The cluster has 256G RAM. So, I suppose a good solution would be to run 16 independent tasks with 16 threads in each.
        What kind of a cluster is this?

        Most commodity clusters have nodes with a fixed amount of local RAM (e.g. on a cluster I access there are blades with dual quad-core Xeon CPUs accessing 72GB of local RAM), and then there are clusters with "shared" memory access (e.g. NUMA). I have not seen clusters of the latter kind in common use of late.

        Is your cluster the latter type when you say that you have 256G RAM? Or do you actually have 256G RAM on each node (not completely unlikely nowadays)?

        Unless you are the only person using this cluster, you may not be able to spawn that many jobs simultaneously. Performance will also depend to some extent on the type/speed of storage.
        Last edited by GenoMax; 04-12-2013, 09:21 AM.



        • #5
          Originally posted by GenoMax View Post
          What kind of a cluster is this?

          Most commodity clusters have nodes with a fixed amount of local RAM (e.g. on a cluster I access there are blades with dual quad-core Xeon CPUs accessing 72GB of local RAM), and then there are clusters with "shared" memory access (e.g. NUMA).

          Is your cluster the latter type when you say that you have 256G RAM? Or do you actually have 256G RAM on each node (not completely unlikely nowadays)?
          I'm not 100% sure, but I think the cluster consists of 16 Dell R620s, i.e. 16 GB RAM in each node.

          Code:
          cat /proc/meminfo
          MemTotal:       264635596 kB
          ..
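
To put that MemTotal figure into more familiar units, here is a small sketch that converts the kB value into whole GiB. The sample line is copied from the output above; the variable names are illustrative. Keep in mind that /proc/meminfo only describes the machine on which the command is run.

```shell
# Convert a MemTotal line (reported in kB) into whole GiB.
# On a live machine the line could be read with: grep MemTotal /proc/meminfo
meminfo_line="MemTotal:       264635596 kB"

# The second whitespace-separated field is the size in kB.
mem_kb=$(echo "$meminfo_line" | awk '{print $2}')

# Integer division: kB -> MiB -> GiB.
mem_gib=$((mem_kb / 1024 / 1024))
echo "${mem_gib} GiB"   # prints "252 GiB"
```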


          • #6
            Originally posted by rhinoceros View Post
            I'm not 100% sure, but I think the cluster consists of 16 Dell R620s, i.e. 16 GB RAM in each node.

            Code:
            cat /proc/meminfo
            MemTotal:       264635596 kB
            ..
            So you do have a cluster of the first type and the cluster head-node does seem to have 256GB RAM (assuming that is where you ran the cat command).

            Not sure if your sys admins allow you to run jobs on head-node ....

            If the worker nodes have only 16GB RAM each, you will probably not be able to run more than one job per node (you could, but then things will use swap/tmp and everything will be slow). I suggest experimenting with test jobs that allocate different amounts of memory to see if you could squeeze in two jobs per node.
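
The reasoning above can be written out as a quick back-of-the-envelope calculation. This is only a sketch using the numbers mentioned in the thread (16 nodes, a guessed 16 GB per worker node, roughly 10 GB per nr search); the real per-job footprint has to be measured with test jobs.

```shell
# Figures taken from the thread; all three are estimates, not measurements.
nodes=16          # worker nodes in the cluster
node_ram_gb=16    # RAM per worker node (the poster's guess)
job_ram_gb=10     # rough memory need of one blast job against nr

# Integer division gives the number of jobs that fit in RAM without swapping.
jobs_per_node=$((node_ram_gb / job_ram_gb))
total_jobs=$((jobs_per_node * nodes))

echo "jobs per node: $jobs_per_node, cluster-wide: $total_jobs"
# prints "jobs per node: 1, cluster-wide: 16"
```

A second job per node would only fit if a job turned out to need 8 GB or less.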



            • #7
              Hello again,

              Will the following result in 16 parallel instances of blast, with each instance running 16 threads? The original input.fasta has been divided into 16 files named input.1 - input.16.

              qsub -t 1-16:1 blastp-sge.sh

              Code:
              #!/bin/bash
              #$ -N blastp
              #$ -j y
              #$ -cwd
              #$ -l h_vmem=2G -pe smp 8
              #$ -R y
              /path/to/ncbi-blast/2.2.28+/bin/blastp -query input.${SGE_TASK_ID} -db /path/to/db/nr -seg yes -soft_masking true -use_sw_tback -evalue 1e-5 -outfmt "6 qseqid sseqid sgi staxids pident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_threads 16 -out ${SGE_TASK_ID}.tsv
              Output would be 1.tsv - 16.tsv, which could be merged easily. I'm having a particularly hard time understanding the '#$ -l h_vmem=2G -pe smp 8' line.
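
The post does not say how input.fasta was divided. One possible way, sketched here with awk, is to deal the records out round-robin into input.1 .. input.N. The tiny demo file and the value of `parts` are stand-ins so that the snippet runs on its own; the thread uses 16 parts. Round-robin balances the number of sequences per file, not their total length.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Demo input: five short records standing in for the real input.fasta.
for i in 1 2 3 4 5; do
    printf '>seq%s\nMKV\nLLA\n' "$i"
done > input.fasta

parts=2   # the thread uses 16

# Every header line (">") starts a new record and picks the next part in turn;
# all lines of that record are written to the same file.
awk -v n="$parts" '
    /^>/ { part = (rec++ % n) + 1 }
    { print > ("input." part) }
' input.fasta

# Count the records that ended up in each part.
grep -c '>' input.1 input.2
```

With five records and two parts, input.1 receives seq1, seq3 and seq5, and input.2 receives seq2 and seq4.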
              Last edited by rhinoceros; 04-13-2013, 08:25 AM.


              • #8
                Originally posted by rhinoceros View Post
                I'm having a particularly hard time understanding the '#$ -l h_vmem=2G -pe smp 8' line.
                The h_vmem parameter sets a hard limit on the virtual memory the job may use. This page has info about this parameter: http://www.biostat.jhsph.edu/bit/clu...e.html#MemSpec

                The "-pe" part requests a parallel environment (if one is set up on your cluster) together with a number of slots. The slot count should correspond to the "-num_threads" value of your blast jobs, as described here: http://www3.imperial.ac.uk/bioinfsup..._parallel_jobs

                You may want to confer with your local SGE admin about the right parameters to set for the queues you have access to.
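
One way to keep the slot request and the thread count from drifting apart is to take the thread count from the scheduler. Grid Engine exports NSLOTS inside a job that was granted a parallel environment, and SGE_TASK_ID inside an array job. The following is only a sketch: the PE name `smp`, the slot count and the paths are taken from the posts above or are placeholders, and the fallbacks to 1 exist only so the script can be tried outside the scheduler. On many installations h_vmem is counted per slot, so a request of 2G with 8 slots would allow 16G in total, but that is site-specific and worth confirming with the admin.

```shell
#!/bin/bash
#$ -N blastp
#$ -j y
#$ -cwd
#$ -pe smp 16

# NSLOTS = slots granted by the parallel environment,
# SGE_TASK_ID = index of this task within the array job.
# The ":-1" fallbacks are only for trying the script outside SGE.
threads="${NSLOTS:-1}"
task="${SGE_TASK_ID:-1}"

# Placeholder install path and database location, as in the posts above.
set -- /path/to/ncbi-blast/2.2.28+/bin/blastp \
    -query "input.${task}" -db /path/to/db/nr \
    -evalue 1e-5 -outfmt 6 \
    -num_threads "$threads" -out "${task}.tsv"

echo "$*"    # show the command that would be run
# "$@"       # uncomment to actually run it
```

Submitted with `qsub -t 1-16:1`, each task would then run blastp with as many threads as it was given slots.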



                • #9
                  Everything is working now. My script blastp.sh is as follows:

                  Code:
                  #!/bin/bash
                  #$ -V
                  #$ -N blastp
                  #$ -j y
                  #$ -cwd
                  #$ -pe orte 16
                  /path/to/ncbi-blast/2.2.28+/bin/blastp -query input.${SGE_TASK_ID} -db /path/to/db/nr -lotsOfFlags -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv
                  The input is a fasta file that I have split into 20 parts with fastasplitn (input.1, input.2, .., input.20). I call the script from the same dir as follows: qsub -t 1-20:1 blastp.sh

                  So in this case I'm running 20 blast tasks with 16 threads each (though some of them are actually still waiting in the queue). Output is 1.tsv, 2.tsv, .., 20.tsv, which I'll merge with

                  cat 1.tsv 2.tsv .. 20.tsv > blast_result.tsv

                  And that's that. I hope others might find this useful.
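
The merge step can also be written as a loop, sketched below. It concatenates the parts in task order and stops if a part is missing, whereas a plain `cat *.tsv` would sort the names lexically (1, 10, 11, .., 2) and would pick up blast_result.tsv itself on a rerun. The demo files created at the top are stand-ins for real BLAST output so that the snippet runs on its own.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
n=20   # number of array tasks, as in the post

# Demo stand-ins for the per-task outputs 1.tsv .. 20.tsv.
i=1
while [ "$i" -le "$n" ]; do
    printf 'query%s\thit%s\n' "$i" "$i" > "${i}.tsv"
    i=$((i + 1))
done

# Concatenate in numeric task order; abort if any part was never written.
: > blast_result.tsv
i=1
while [ "$i" -le "$n" ]; do
    if [ ! -e "${i}.tsv" ]; then
        echo "missing output: ${i}.tsv" >&2
        exit 1
    fi
    cat "${i}.tsv" >> blast_result.tsv
    i=$((i + 1))
done

wc -l < blast_result.tsv
```

An empty part is accepted on purpose, since a chunk without hits produces an empty file.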