Bowtie and Clustering question.

  • Bowtie and Clustering question.

    Hi Group,
    I am a relative newbie trying to come up to speed. I managed to assemble a 20-core cluster and am just beginning to figure out how to work the bioinformatics assembly algorithms. My scenario is this:

    1) I currently have a WES raw data file measuring 5 GB. I have a quality score file which is approximately 12 GB.

    2) I have a four node AMD cluster with 32 GB RAM. I installed and configured Rocks software on the same.

    3) I have been looking into Bowtie to do the analysis on this cluster.

    Some questions which come to my mind are as follows

    1) How and where do I start?

    2) Is it possible to install bowtie on the ROCKS cluster such that I can use the 4 nodes to run the analysis in parallel?

    3) For this single massive file of 5 GB raw reads, how do I go about doing the assembly?

    4) With bowtie, am I restricted to using only ONE node on which to run the analysis on?

    5) Or can I split my raw reads file into four parts, farm out one part to each node for assembly, and then do a final assembly of the four assembled files?

    6) Has anyone installed Galaxy tools on a ROCKS cluster? Could you share your experiences of the same?

    I realize these are very basic and fundamental questions. But I would highly appreciate an answer. Hopefully I will be able to answer these questions on the forum in the near future.
    Regards
    Quantrix

  • #2
    Howdy, I'm new here, but I do parallel for a living. (hpc type)

    I can't speak to some of the things you've asked, but I have installed pMap and bowtie for customers for things such as this. I'd recommend pMap for simplicity. Either way you should be able to get every core on every node working with bowtie in parallel. IO will more than likely be your limiting factor then.

    http://bmi.osu.edu/hpc/software/pmap/pmap.html

    http://bowtie-bio.sourceforge.net/crossbow/index.shtml

    pMap is MPI-based, so if you have an interconnect (Ethernet, InfiniBand, Quadrics, Myrinet, etc.) and some type of MPI installed, you should be good. pMap supports BWA, SOAP, Bowtie, GSNAP, MAQ and RMAP.

    Crossbow is Hadoop-based. I can't say I've seen Hadoop on Rocks (not a fan of Rocks myself, but it is an excellent way to start with clusters), but it is possible. I'd be REALLY surprised if no one has ever done it, as there are some rather decent-sized clusters out there (TACC, PNNL) using Rocks. I'd search for a Hadoop roll. I'd be willing to bet it's out there.

    hpc



    • #3
      There is some discussion of running Galaxy on ROCKS in this Galaxy-dev thread from this January.



      • #4
        Hi hpcguy and Tnab,
        Thanks for the replies. I shall look into pMap right away. It sounds like one possible solution for me to start exploring.

        @hpcguy,
        You say you are not a fan of Rocks. I have had to wrestle with quite a few issues in getting it up to speed due to a combination of factors. However, it is running smoothly now. I was wondering whether I should go ahead and use something like plain CentOS and install other stuff separately. What is your take on this? Do you have a favorite, and why? I was also looking into Ubuntu with Kerrighed as one option. (Ubuntu enterprise maybe?)
        The problem is that there is not very much out there, if anything at all, in terms of leads on how to go about clustering.



        • #5
          The following is an example of how to run bowtie on multiple nodes. It requires dividing the reads in the .fastq file between jobs, then joining the .sam outputs at the end.
          First, see how many reads you have:

          "echo $(( $(wc -l < yourfile.fastq) / 4 ))"

          The result was 14901431 reads, so in this case I created two jobs to run on two different nodes of the Rocks cluster. I created a few .sh scripts and just keep editing them for each different job. Run "nano bowtie_script_1.sh", then edit as follows:

          #!/bin/bash
          #
          #$ -S /bin/bash
          bowtie -m 1 -S -p 4 -s 0 --qupto 7450715 /share/apps/bowtie-1.0.0/indexes/hg19 yourfile.fastq

          The second job has a different start and finish. Split as many times as the number of nodes you want to run on; this example uses 2 nodes.
          Second script: run "nano bowtie_script_2.sh", then edit as follows:
          #!/bin/bash
          #
          #$ -S /bin/bash
          bowtie -m 1 -S -p 4 -s 7450715 --qupto 14901431 /share/apps/bowtie-1.0.0/indexes/hg19 yourfile.fastq
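
For more than two nodes, the split points in the two scripts above can be computed rather than edited by hand. The following is only a rough sketch: the function names (count_reads, split_ranges, write_job_scripts) are made up for illustration, the bowtie options are copied from the scripts above, and it assumes that --qupto is counted after the reads skipped by -s, which is my reading of the bowtie manual and worth checking against your version.

```shell
#!/bin/bash
# Sketch: divide a FASTQ file into N read ranges and write one SGE job
# script per range. All function names here are illustrative only.

# Number of reads in an uncompressed FASTQ file (assumes 4 lines per read).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Print one "skip count" pair per chunk; the last chunk absorbs the remainder.
# Usage: split_ranges TOTAL_READS CHUNKS
split_ranges() {
    local total="$1" chunks="$2"
    local base=$(( total / chunks ))
    local i=0 skip count
    while [ "$i" -lt "$chunks" ]; do
        skip=$(( i * base ))
        if [ "$i" -eq $(( chunks - 1 )) ]; then
            count=$(( total - skip ))
        else
            count=$base
        fi
        echo "$skip $count"
        i=$(( i + 1 ))
    done
}

# Write bowtie_script_<k>.sh for each range into the current directory.
# Assumption: --qupto is the number of reads to align after the -s skip.
# Usage: write_job_scripts FASTQ INDEX_BASENAME CHUNKS
write_job_scripts() {
    local fastq="$1" index="$2" chunks="$3"
    local total k=1 skip count
    total=$(count_reads "$fastq")
    split_ranges "$total" "$chunks" | while read -r skip count; do
        cat > "bowtie_script_${k}.sh" <<EOF
#!/bin/bash
#
#\$ -S /bin/bash
bowtie -m 1 -S -p 4 -s ${skip} --qupto ${count} ${index} ${fastq}
EOF
        k=$(( k + 1 ))
    done
}
```

For example, "write_job_scripts yourfile.fastq /share/apps/bowtie-1.0.0/indexes/hg19 4" would write bowtie_script_1.sh through bowtie_script_4.sh, each of which can then be submitted with qsub. The sketch does not guard against asking for more chunks than there are reads.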

          If you have bowtie installed correctly, you can then run the following:

          qsub bowtie_script_1.sh
          qsub bowtie_script_2.sh

          This will result in two output files in SAM format (## is the job ID assigned by SGE):

          bowtie_script_1.sh.o##
          bowtie_script_2.sh.o##

          You then need to join the two outputs into one .sam file, keeping the header lines (those starting with @) from the first file only:

          "cat bowtie_script_1.sh.o## <(grep -v '^@' bowtie_script_2.sh.o##) > merged_sam.sam"
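
The same join extends to any number of job outputs. Here is a minimal sketch, assuming every output file holds the SAM header followed by alignments and nothing else; merge_sam is a made-up helper name, and sed '/^@/d' is used in place of the grep -v above only because it exits cleanly when a chunk has no alignments.

```shell
#!/bin/bash
# Sketch: concatenate several SAM files, keeping the @ header lines from
# the first file only. merge_sam is an illustrative name, not a real tool.
# Usage: merge_sam OUTPUT.sam FIRST.sam [MORE.sam ...]
merge_sam() {
    local out="$1"
    shift
    local first="$1"
    shift
    local f
    {
        # First file is copied whole, header included.
        cat "$first"
        # Remaining files contribute only their non-header lines.
        for f in "$@"; do
            sed '/^@/d' "$f"
        done
    } > "$out"
}
```

With the two jobs above, this would be called as "merge_sam merged_sam.sam bowtie_script_1.sh.o## bowtie_script_2.sh.o##", with ## replaced by the actual job IDs. The merged file keeps the reads in job order; it is not sorted by coordinate.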

          Installing bowtie:

          To make it available to all of your compute nodes, install it into the /export/apps/ folder on the frontend; Rocks shares that folder with the compute nodes as /share/apps/.

          Then edit the PATH in "/etc/skel/.bash_profile" to include ":/share/apps/bowtie-1.0.0".
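
One thing to keep in mind is that /etc/skel is only copied when a new account is created, so accounts that already exist will not pick up the change. For those, something like the sketch below could go into the user's own ~/.bash_profile. The add_to_path name is made up for illustration; the directory is the one from the post.

```shell
#!/bin/bash
# Sketch: append a directory to PATH only if it is not already present.
# add_to_path is an illustrative helper name.
add_to_path() {
    local dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;                # already on PATH, nothing to do
        *) PATH="$PATH:$dir" ;;
    esac
    export PATH
}

# Directory taken from the post; adjust to wherever bowtie was installed.
add_to_path /share/apps/bowtie-1.0.0
```

After logging in again, "which bowtie" on a compute node should then print the /share/apps path if the install and the shared mount are both in place.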

          If a job submitted with qsub errors out, it will create an error file in your home directory, which will point you in the right direction.
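
A quick way to look at those error files is sketched below. It assumes the jobs were submitted without -cwd, so SGE wrote the <script>.e<jobid> files to the home directory; show_job_errors is a made-up helper name. As far as I know bowtie also prints its run summary to stderr, so a non-empty file does not by itself mean the job failed.

```shell
#!/bin/bash
# Sketch: print the last few lines of every SGE stderr file
# (<script>.sh.e<jobid>) found directly in a directory.
# show_job_errors is an illustrative name; the directory defaults to $HOME.
show_job_errors() {
    local dir="${1:-$HOME}"
    local f
    for f in "$dir"/*.sh.e*; do
        # Skip the unexpanded pattern when no file matches.
        [ -f "$f" ] || continue
        echo "== $f =="
        tail -n 5 "$f"
    done
}
```

Running "show_job_errors" with no argument checks the home directory; pass another directory if the jobs were submitted with -cwd from somewhere else.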

          good luck.



          • #6
            Originally posted by hpcguy
            I suppose pMap will work flawlessly on a Rocks cluster based on SGE, right?
            It supports bowtie; does it also support bowtie2?

            Thanks.



            • #7
              Howdy. To all the folks that have sent me Private Messages about this: please set up your mailbox such that I can reply. I cannot answer your questions without a way to reach you. thanks.

              H



              • #8
                Rocks is fantastic when a group/person/dept is starting out. No bones about it. Fantastic. You can roll it out on a single rack in 10 min if you just give it a go, and be up and running apps in 15 min (with data being available). Not much beats this. Even AWS takes more work to configure. I've personally installed it and had a 2-rack cluster up and running batch jobs in under 30 minutes from power-on. But that cluster was NEVER supposed to run another application ever again.

                The problem starts as soon as there is a move into more intermediate needs. Rocks is not flexible enough to keep advanced work simple. Moving to stock CentOS, Scientific Linux, RHEL, Ubuntu LTS, etc. is a large step that can be intimidating, but long term most folks that I've spoken or worked with look back and say they were glad they made the move.

                I would recommend making the change to something else when you feel Rocks is just too restrictive or you need more than you can find in the normal Rolls, etc.

                Originally posted by quantrix
