  • #16
    I have two clusters. One has 8 machines, each with 16 CPUs and 128 GB of memory, all connected to a fast disk. However, I can only run the command-line Bioscope on it. With that much machine power I do not worry about running out of memory.

    My other cluster also has 8 machines: 4 with 4 CPUs and 8 GB of memory each, and the other 4 with 8 CPUs and 32 GB of memory each. I have been trying to run WT-Bioscope on these machines, but with less success: I am running out of memory and sometimes getting kernel warnings. My current parameters are:

    mapping.np.per.node=4
    mapping.number.of.nodes=10
    mapping.memory.size=3

    In other words, 4 CPUs per node and 10 nodes (I am splitting my 8-CPU machines into 2 nodes each, so in theory I should have 12 nodes, but I wanted to leave some processing power free).

    The memory parameter is 3 GB, but I am unsure what this really means. Does Bioscope start one 4-CPU job per node using 3 GB in total, or four 1-CPU jobs per node using 4 x 3 GB? It appears to do the latter, since my 8 GB machines have to use virtual memory at times.

    I really hesitate to go below 3 GB since my genome reference is ~2 Gbases. As far as I can tell, Bioscope is chopping the matching portion of its pipeline into many small chunks in order to accommodate this small memory allocation.
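
    If the per-process interpretation is right (that is only my guess from the swapping behaviour), a back-of-the-envelope sketch for the 4-CPU / 8 GB nodes would look something like this:

    # Assuming mapping.memory.size is per mapping process (not per node), the
    # current settings ask for 4 x 3 GB = 12 GB on an 8 GB machine, hence the swapping.
    # Two processes per small node would stay within physical RAM
    # (2 x 3 GB = 6 GB, leaving ~2 GB for the OS):
    mapping.np.per.node=2
    mapping.memory.size=3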

    Anyway, I would say the more memory you have, the better off you are. It makes more sense to run a few jobs with lots of memory than many jobs each starved for memory.

    Once I get Bioscope running on my small cluster using all 8 machines, I will try it on the same cluster using just the 4 large-memory machines. Our small cluster is something of a 'recycled' cluster (i.e., some of the machines were given to us) and we would like to use it if possible. I hate to think that a 4-CPU, 8 GB machine is just so much junk that we should re-gift it, but for Bioscope at least, those machines may indeed be worthless.



    • #17
      Originally posted by westerman View Post
      Apparently the minimum requirement is not just 2 GB per core but at least 16 GB of RAM per node, and 24 GB of RAM is recommended for human mapping.
      I have been wrestling with ABI to try to make it work, but they are less responsive once told I am working with 8 GB machines.
      I am just trying to map mouse transcriptome reads at this time, and so far the 'big'-memory jobs complete.
      It's the small 2 GB jobs that fail, possibly because of temporary network glitches, which Bioscope isn't written to handle; I was advised to restart the job.

      Do drop me a PM or a reply here if you get the 8 GB machines working.
      Otherwise, I think they would be good enough for BWA or Bowtie mapping.
      http://kevin-gattaca.blogspot.com/



      • #18
        Haven't had success with the 8 GB nodes yet. Will keep trying as time permits.

        I'll agree that Bioscope does not handle temporary network glitches. While these should not occur, I find that my disk appliance and network do get overwhelmed -- rarely, but certainly -- when lots of SOLiD processes hit them, to the point where a request gets shunted off to the side and Bioscope goes belly up. :-( It is not that hard to write software that can handle temporary glitches; one retry is all I am asking for.
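
        Something as simple as the wrapper below would do. This is my own sketch, not anything Bioscope provides; the bioscope.sh line is just a stand-in for whatever step fails, and the path is a placeholder:

        run_with_retry() {
            "$@" && return 0                                      # succeed on the first try if possible
            echo "first attempt failed, retrying once in 60 s" >&2
            sleep 60
            "$@"                                                  # the single retry being asked for
        }
        run_with_retry /path/to/bioscope/bin/bioscope.sh -l workflow.log analysis.plan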



        • #19
          Originally posted by clariet View Post
          Just saw this post. We were able to use the whole-transcriptome pipeline of BioScope (1.0.1-42) on an RNA-seq dataset. A note about its mapping statistics: I confirmed with their specialists that the current version of BioScope has a bug in those numbers, so it will be fixed in the next release, hopefully very soon.

          We have a feeling that a large proportion of reads are wasted for SOLiD data compared to Solexa. For example, for a current ChIP-seq dataset, we have seen an average of 80M reads generated per sample (quad). However, after filtering out low-quality alignments and non-unique hits, only ~4% of reads could be used for further peak detection. Has anyone had a similar experience? Does this sound normal?
          Hi there,

          I'm currently using BioScope v1.0.1. May I know what the bug in the statistics is?



          • #20
             Bioscope sounds like a complex system that is memory-hungry and CPU-intensive. Although I must say that I've used Corona Lite in the past, and that seemed a lot more difficult to work with, especially given the amount of computational time required for the alignment stage.

             We've been developing a new aligner for AB colorspace, novoalignCS, featuring:

             1. Mate-pair alignment (F3 & R3) of csfasta/csfastq. If reads are in bead order for F3/R3 mates, then pairs are identified and mapped accordingly.
             2. Gapped alignment with mismatches, by default.
             3. SAM output (supporting RG). We've been using samtools and Picard to validate our SAM records.
             4. Requires < 10 GB for matching against human/mouse/chimp, etc.
             5. Multithreaded (and MPI cluster-aware in the near future).
             6. Polyclonal and color-error filtering based on the SOPRA method (Sasson & Michael, 2010).
             7. Calculates the mate-pair fragment-length distribution given an initial estimate, e.g. a 5 kb library with SD=500.

             We are still busy with testing and comparison against other aligners, e.g. BFAST and BWA. At this point we do welcome feedback from beta testers. If anybody is interested in obtaining a version, please PM me or visit our site.



            • #21
               I've used Bioscope 1.1 and now have Bioscope 1.2. They did away with a lot of the temporary files, but I haven't noticed much improvement otherwise. I got a few of my RNA samples to run, but half of them crashed. When I restarted the pipeline it finished, so I'm not exactly sure why it crashed, but I suspect NFS delays.

               I have a ChIP dataset that I tried to run through Bioscope and it flat-out failed. ABI recommended I continue using the old version of Bioscope until they have a fix... over a week now.

               At this point I'm not using Bioscope anymore. It looks like BWA or BFAST for color-space reads.



              • #22
                 Recently I obtained SOLiD RNA-Seq data. I have been creating the transcriptome library using the annotation from the UCSC hg19 refFlat file. However, when I align using BWA, the mapping rate is around 7%. I built the color-space index of the transcriptome library using the command:
                bwa index -a bwtsw -c hg19_transcript.fa
                then,
                bwa aln -c hg19_transcript.fa reads.fastq > align-reads.sai

                 I was wondering what could be the reason for such a low mapping rate. I know that some reads map to exon junctions, but I have created a junction library as well, and the increase in mapping rate is very small (less than 0.1%).

                 Has anybody had a similar experience? Do I need to tweak certain parameters in bwa aln?

                Any input will be highly appreciated. Thanks!



                • #23
                  Low mapping rate for SOLiD data with BWA

                  Originally posted by win804 View Post
                   Maybe your reads are low quality, especially towards the end? Mapping to a transcriptome library might also not be such a good idea; usually you align to the whole genome and afterwards assign the genomic regions to genes.
                   Considering that a single SNP in nucleotide space produces two mismatches in color space, the default mismatch limit is too strict. Using bwa aln defaults, I got a 34% mapping rate to the genome. Allowing a higher mismatch rate with the options -l 25 -n 8, as suggested somewhere in the forum, improved this to 51% mapped, but the runtime increased more than 5-fold. BWA is great for nucleotide space but not optimized for color space.
                   From my recent experience, I'd recommend BFAST: its mapping rate was 69% with defaults and, by smartly piping the commands, the runtime was even lower than for BWA with -l 25 -n 8.
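
                   For reference, a sketch of the full BWA command sequence those settings imply, using the older colorspace-capable 0.5.x releases (file names are placeholders, and aligning to the whole genome rather than the transcriptome is my substitution, per the advice above):

                   bwa index -a bwtsw -c hg19.fa                            # build a color-space index of the genome
                   bwa aln -c -l 25 -n 8 hg19.fa reads.fastq > reads.sai    # relaxed seed length and mismatch limit
                   bwa samse hg19.fa reads.sai reads.fastq > reads.sam      # convert the .sai hits to SAM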



                  • #24
                    Originally posted by epigen View Post
                     Thank you very much, epigen. I will try what you recommended and see how it goes. I am also considering BFAST, but building its index is very slow; I am still building it now. Once the index is finished, I will try the alignment with BFAST.

                    Thanks again for your input.



                    • #25
                       win804, if you are mapping SOLiD reads to the whole genome, perhaps you can try novoalignCS (available from www.novocraft.com). The whole-genome colorspace index takes about 6-8 minutes to build, and you can map csfasta/csqual or csfastq straight away.

                      PM me for more info if you would like some help.



                      • #26
                        Originally posted by rdeborja View Post
                        Is anyone using/testing Bioscope as a replacement for corona lite and the whole transcriptome pipeline? I've recently installed it on our cluster and was curious to find other opinions/experiences with it.
                         You may wish to try NextGENe's tool for this; it is quite robust.



                        • #27
                          Hi all,

                          I'm new to this kind of data handling, but now I need to try BioScope.

                          Our system consists of 128 nodes. Each node contains two 64-bit Intel quad-core Nehalem processors at 2.53 GHz and 32 GB of RAM.

                          I installed bioscope_1.3 at /home/guo/bioscope_cm1, the example folder at /file2/guo/examples, and the output folder at /file2/guo/bioscope... is this where the problem happened?

                          As I'm testing the ReseqFrag workflow, I entered the example folder and simply ran something like this:
                          nohup /home/guo/bioscope_cml/bioscope/bin/bioscope.sh -l workflow1.log analysis.plan &

                          It turns out that nothing happens at all.

                          Then I submitted via a qsub script like this (I use the PBS scheduler):

                          #!/bin/sh
                          #PBS -N workflow_ReseqFrag
                          # request the queue (enter the possible names, if omitted, serial is the default)
                          #PBS -q parallel
                          #PBS -l nodes=3:ppn=8
                          #PBS -l walltime=10:00:00
                          # By default, PBS scripts execute in your home directory, not the
                          # directory from which they were submitted. The following line
                          # places you in the directory from which the job was submitted.
                          cd /file13/chengguo/examples/workflows/ReseqFrag
                          # run the program
                          /home/guo/bioscope_cml/bioscope/bin/bioscope.sh -l workflow1.log analysis.plan
                          exit 0

                          This job terminated after 10 hours with the same result as the command line above!
                          I know this post might be too long, but I sincerely want to make the problem clear. Anyone, please help!
                          Thanks!!



                          • #28
                            Hi guo,

                            Making Bioscope run gives even system administrators a hard time (ours complained a lot...). When you installed it, did you tell it you have 128 nodes? It's better to reduce that number drastically; otherwise Bioscope will think it's allowed to use the whole cluster, split your data into too many jobs trying to use all the nodes, and most probably fail.

                            There's also quite a bit of hardcoding in the scripts the Bioscope wrapper runs. First, check whether the .ini files that your analysis.plan calls contain all the required paths. Normally Bioscope complains if something it needs does not exist, but you write that nothing happened at all. Did you use the example analysis.plan and .ini files? And is analysis.plan in the folder you call bioscope.sh from? You may want to try again specifying the full path, as in the sketch below.
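
                            A minimal sketch of what I mean (the paths are the ones you quoted, so adjust them to wherever analysis.plan and your output folder actually live; the log location is just an illustration):

                            /home/guo/bioscope_cml/bioscope/bin/bioscope.sh \
                                -l /file2/guo/bioscope/workflow1.log \
                                /file2/guo/examples/workflows/ReseqFrag/analysis.plan

                            Also check what the run actually wrote: with your nohup invocation, stdout/stderr go to nohup.out and the -l option names the log file, so 'tail nohup.out workflow1.log' (run from the directory you started in) should show whether Bioscope even got going.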



                            • #29
                              Originally posted by guo View Post

                              Did you install this as a user, or as root? It would make your life a lot easier to install as root, following the install docs' guidelines (you will run into all sorts of path and permission issues otherwise). Check /bioscope/etc/conf and look at the files there to be sure your configuration is correct (I think the default is to limit the usable cluster to 10 nodes and 8 cores per node; as mentioned, there are rapidly diminishing returns from using more nodes than that). Is the Bioscope queue set up correctly for the queue you are submitting to? Have the paths in the *.ini files for the demo been edited to reflect your current environment settings? Is JMS running on the cluster? Did the examples install hg18 in /bioscope/etc/files?

                              Did you try running any of the verification scripts or stress test scripts before running an example analysis? Those scripts are included when you buy a cluster with BioScope pre-installed, but they may be a separate download from somewhere on the ABI web site.

                              My initial suggestion would be to reinstall as root, or have your sysadmin install it as root; it makes life simpler, as there is far less fussing needed with the configuration, example, and shared files.
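
                              A quick sketch of the checks above, assuming a root install under /bioscope as described (the exact file names under etc/ vary, so treat these as illustrative):

                              ls /bioscope/etc/conf                  # per-cluster limits: node count, cores per node, queue name
                              grep -rn queue /bioscope/etc/conf      # confirm the configured queue matches the one you submit to
                              qstat -Q                               # list the PBS queues actually available on the cluster
                              ls /bioscope/etc/files                 # the demo's hg18 reference should have been installed here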
                              Michael Black, Ph.D.
                              ScitoVation LLC. RTP, N.C.



                              • #30
                                Thank you, all. I will start over with the re-installation. Thanks!

