  • #16
    I have two clusters. One has 8 machines, each with 16 CPUs and 128 GB of memory, all connected to a fast disk. However, I can only run the command-line Bioscope on it. With that much machine power I do not worry about running out of memory.

    My other cluster also has 8 machines: 4 with 4 CPUs and 8 GB of memory each, and the other 4 with 8 CPUs and 32 GB of memory each. I have been trying to run WT-Bioscope on these machines, but with less success: I am running out of memory and sometimes getting kernel warnings. My current parameters are:

    mapping.np.per.node=4
    mapping.number.of.nodes=10
    mapping.memory.size=3

    In other words, 4 CPUs per node and 10 nodes (I am splitting my 8-CPU machines into 2 nodes each, so in theory I should have 12 nodes, but I wanted to leave some processing power free).

    The memory parameter is 3 GB, but I am unsure what this really means. Does Bioscope start one 4-CPU job per node using 3 GB in total, or four 1-CPU jobs per node using 4 x 3 GB? It appears to do the latter, since my 8 GB machines have to use virtual memory at times.

    I really hesitate to go below 3 GB since my genome reference is ~2 Gbases. As far as I can tell, Bioscope is chopping the matching portion of its pipeline into many small chunks in order to accommodate this small memory allocation.
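
    If the per-process interpretation is right (that is only my guess from the swapping behaviour), a back-of-the-envelope sketch for the 4-CPU / 8 GB nodes would look something like this:

    # Assuming mapping.memory.size is per mapping process (not per node), the
    # current settings ask for 4 x 3 GB = 12 GB on an 8 GB machine, hence the swapping.
    # Two processes per small node would stay within physical RAM
    # (2 x 3 GB = 6 GB, leaving ~2 GB for the OS):
    mapping.np.per.node=2
    mapping.memory.size=3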

    Anyway, I would say the more memory you have, the better off you are. It makes more sense to run a few jobs with lots of memory than many jobs each starved for memory.

    Once I get Bioscope running on my small cluster using all 8 machines, I will try it on the same cluster using just the 4 large-memory machines. Our small cluster is something of a 'recycled' cluster (i.e., some of the machines were given to us) and we would like to use it if possible. I hate to think that a 4-CPU, 8 GB machine is just so much junk that we should re-gift it, but for Bioscope at least, those machines may indeed be worthless.



    • #17
      Originally posted by westerman View Post
      Apparently the minimum requirement is not just 2 GB per core but at least 16 GB of RAM per node, and 24 GB of RAM is recommended for human mapping.
      I have been wrestling with ABI to try to make it work, but they are less responsive once told I am working with 8 GB machines.
      I am just trying to map mouse transcriptome reads at this time, and so far the 'big'-memory jobs complete.
      It's the small 2 GB jobs that fail, possibly because of temporary network glitches, which Bioscope isn't written to handle; I was advised to restart the job.

      Do drop me a PM or a reply here if you get the 8 GB machines working.
      Otherwise, I think they would be good enough for BWA or Bowtie mapping.
      http://kevin-gattaca.blogspot.com/



      • #18
        Haven't had success with the 8 GB nodes yet. Will keep trying as time permits.

        I'll agree that Bioscope does not handle temporary network glitches. While these should not occur, I find that my disk appliance and network do get overwhelmed -- rarely, but certainly -- when lots of SOLiD processes hit them, to the point where a request gets shunted off to the side and Bioscope goes belly up. :-( It is not that hard to write software that can handle temporary glitches; one retry is all I am asking for.
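
        Something as simple as the wrapper below would do. This is my own sketch, not anything Bioscope provides; the bioscope.sh line is just a stand-in for whatever step fails, and the path is a placeholder:

        run_with_retry() {
            "$@" && return 0                                      # succeed on the first try if possible
            echo "first attempt failed, retrying once in 60 s" >&2
            sleep 60
            "$@"                                                  # the single retry being asked for
        }
        run_with_retry /path/to/bioscope/bin/bioscope.sh -l workflow.log analysis.plan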



        • #19
          Originally posted by clariet View Post
          Just saw this post. We were able to use the whole-transcriptome pipeline of BioScope (1.0.1-42) on an RNA-seq dataset. A note about its mapping statistics: I confirmed with their specialists that the current version of BioScope has a bug in those numbers, so it will be fixed in the next release, hopefully very soon.

          We have a feeling that a large proportion of reads are wasted for SOLiD data compared to Solexa. For example, for a current ChIP-seq dataset, we have seen an average of 80M reads generated per sample (quad). However, after filtering out low-quality alignments and non-unique hits, only ~4% of reads could be used for further peak detection. Has anyone had a similar experience? Does this sound normal?
          Hi there,

          I'm currently using BioScope v1.0.1. May I know what the bug in the statistics is?



          • #20
             Bioscope sounds like a complex system that is memory-hungry and CPU-intensive. Although I must say that I've used Corona Lite in the past, and that seemed a lot more difficult to work with, especially given the amount of computational time required for the alignment stage.

             We've been developing a new aligner for AB colorspace, novoalignCS, featuring:

             1. Mate-pair alignment (F3 & R3) of csfasta/csfastq. If reads are in bead order for F3/R3 mates, then pairs are identified and mapped accordingly.
             2. Gapped alignment with mismatches, by default.
             3. SAM output (supporting RG). We've been using samtools and Picard to validate our SAM records.
             4. Requires < 10 GB for matching against human/mouse/chimp, etc.
             5. Multithreaded (and MPI cluster-aware in the near future).
             6. Polyclonal and color-error filtering based on the SOPRA method (Sasson & Michael, 2010).
             7. Calculates the mate-pair fragment-length distribution given an initial estimate, e.g. a 5 kb library with SD=500.

             We are still busy with testing and comparison against other aligners, e.g. BFAST and BWA. At this point we do welcome feedback from beta testers. If anybody is interested in obtaining a version, please PM me or visit our site.



            • #21
               I've used Bioscope 1.1 and now have Bioscope 1.2. They did away with a lot of the temporary files, but I haven't noticed much improvement otherwise. I got a few of my RNA samples to run, but half of them crashed. When I restarted the pipeline it finished, so I'm not exactly sure why it crashed, but I suspect NFS delays.

               I have a ChIP dataset that I tried to run through Bioscope and it flat-out failed. ABI recommended I continue using the old version of Bioscope until they have a fix... over a week now.

               At this point I'm not using Bioscope anymore. It looks like BWA or BFAST for color-space reads.



              • #22
                 Recently I obtained SOLiD RNA-Seq data. I have been creating the transcriptome library using the annotation from the UCSC hg19 refFlat file. However, when I align using BWA, the mapping rate is around 7%. I built the color-space index of the transcriptome library using the command:
                bwa index -a bwtsw -c hg19_transcript.fa
                then,
                bwa aln -c hg19_transcript.fa reads.fastq > align-reads.sai

                 I was wondering what could be the reason for such a low mapping rate. I know that some reads map to exon junctions, but I have created a junction library as well, and the increase in mapping rate is very small (less than 0.1%).

                 Has anybody had a similar experience? Do I need to tweak certain parameters in bwa aln?

                Any input will be highly appreciated. Thanks!



                • #23
                  Low mapping rate for SOLiD data with BWA

                  Originally posted by win804 View Post
                   Maybe your reads are low quality, especially towards the end? Mapping to a transcriptome library might also not be such a good idea; usually you align to the whole genome and afterwards assign the genomic regions to genes.
                   Considering that a single SNP in nucleotide space produces two mismatches in color space, the default mismatch limit is too strict. Using bwa aln defaults, I got a 34% mapping rate to the genome. Allowing a higher mismatch rate with the options -l 25 -n 8, as suggested somewhere in the forum, improved this to 51% mapped, but the runtime increased more than 5-fold. BWA is great for nucleotide space but not optimized for color space.
                   From my recent experience, I'd recommend BFAST: its mapping rate was 69% with defaults and, by smartly piping the commands, the runtime was even lower than for BWA with -l 25 -n 8.
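
                   For reference, a sketch of the full BWA command sequence those settings imply, using the older colorspace-capable 0.5.x releases (file names are placeholders, and aligning to the whole genome rather than the transcriptome is my substitution, per the advice above):

                   bwa index -a bwtsw -c hg19.fa                            # build a color-space index of the genome
                   bwa aln -c -l 25 -n 8 hg19.fa reads.fastq > reads.sai    # relaxed seed length and mismatch limit
                   bwa samse hg19.fa reads.sai reads.fastq > reads.sam      # convert the .sai hits to SAM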



                  • #24
                    Originally posted by epigen View Post
                     Thank you very much, epigen. I will try what you recommended and see how it goes. I am also considering BFAST, but building its index is very slow; I am still building it now. Once the index is finished, I will try the alignment with BFAST.

                    Thanks again for your input.



                    • #25
                       win804, if you are mapping SOLiD reads to the whole genome, perhaps you can try novoalignCS (available from www.novocraft.com). The whole-genome colorspace index takes about 6-8 minutes to build, and you can map csfasta/csqual or csfastq straight away.

                      PM me for more info if you would like some help.



                      • #26
                        Originally posted by rdeborja View Post
                        Is anyone using/testing Bioscope as a replacement for corona lite and the whole transcriptome pipeline? I've recently installed it on our cluster and was curious to find other opinions/experiences with it.
                         You may wish to try NextGENe's tool for this; it is quite robust.



                        • #27
                          Hi all,

                          I'm new to this kind of data handling, but now I need to try BioScope.

                          Our system consists of 128 nodes. Each node contains two 64-bit Intel quad-core Nehalem processors at 2.53 GHz and 32 GB of RAM.

                          I installed bioscope_1.3 at /home/guo/bioscope_cm1, the example folder at /file2/guo/examples, and the output folder at /file2/guo/bioscope... is this where the problem happened?

                          As I'm testing the ReseqFrag workflow, I entered the example folder and simply ran something like this:
                          nohup /home/guo/bioscope_cml/bioscope/bin/bioscope.sh -l workflow1.log analysis.plan &

                          It turns out that nothing happens at all.

                          Then I submitted via a qsub script like this (I use the PBS scheduler):

                          #!/bin/sh
                          #PBS -N workflow_ReseqFrag
                          # request the queue (enter the possible names, if omitted, serial is the default)
                          #PBS -q parallel
                          #PBS -l nodes=3:ppn=8
                          #PBS -l walltime=10:00:00
                          # By default, PBS scripts execute in your home directory, not the
                          # directory from which they were submitted. The following line
                          # places you in the directory from which the job was submitted.
                          cd /file13/chengguo/examples/workflows/ReseqFrag
                          # run the program
                          /home/guo/bioscope_cml/bioscope/bin/bioscope.sh -l workflow1.log analysis.plan
                          exit 0

                          This job terminated after 10 hours with the same result as the command line above!
                          I know this post might be too long, but I sincerely want to make the problem clear. Anyone, please help!
                          Thanks!!



                          • #28
                            Hi guo,

                            Making Bioscope run gives even system administrators a hard time (ours complained a lot...). When you installed it, did you tell it you have 128 nodes? It's better to reduce that number drastically; otherwise Bioscope will think it's allowed to use the whole cluster, split your data into too many jobs trying to use all the nodes, and most probably fail.

                            There's also quite a bit of hardcoding in the scripts the Bioscope wrapper runs. First, check whether the .ini files that your analysis.plan calls contain all the required paths. Normally Bioscope complains if something it needs does not exist, but you write that nothing happened at all. Did you use the example analysis.plan and .ini files? And is analysis.plan in the folder you call bioscope.sh from? You may want to try again specifying the full path, as in the sketch below.
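
                            A minimal sketch of what I mean (the paths are the ones you quoted, so adjust them to wherever analysis.plan and your output folder actually live; the log location is just an illustration):

                            /home/guo/bioscope_cml/bioscope/bin/bioscope.sh \
                                -l /file2/guo/bioscope/workflow1.log \
                                /file2/guo/examples/workflows/ReseqFrag/analysis.plan

                            Also check what the run actually wrote: with your nohup invocation, stdout/stderr go to nohup.out and the -l option names the log file, so 'tail nohup.out workflow1.log' (run from the directory you started in) should show whether Bioscope even got going.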



                            • #29
                              Originally posted by guo View Post

                              Did you install this as a user, or as root? It would make your life a lot easier to install as root, following the install docs' guidelines (you will run into all sorts of path and permission issues otherwise). Check /bioscope/etc/conf and look at the files there to be sure your configuration is correct (I think the default is to limit the usable cluster to 10 nodes and 8 cores per node; as mentioned, there are rapidly diminishing returns from using more nodes than that). Is the Bioscope queue set up correctly for the queue you are submitting to? Have the paths in the *.ini files for the demo been edited to reflect your current environment settings? Is JMS running on the cluster? Did the examples install hg18 in /bioscope/etc/files?

                              Did you try running any of the verification scripts or stress test scripts before running an example analysis? Those scripts are included when you buy a cluster with BioScope pre-installed, but they may be a separate download from somewhere on the ABI web site.

                              My initial suggestion would be to reinstall as root, or have your sysadmin install it as root; it makes life simpler, as there is far less fussing needed with the configuration, example, and shared files.
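
                              A quick sketch of the checks above, assuming a root install under /bioscope as described (the exact file names under etc/ vary, so treat these as illustrative):

                              ls /bioscope/etc/conf                  # per-cluster limits: node count, cores per node, queue name
                              grep -rn queue /bioscope/etc/conf      # confirm the configured queue matches the one you submit to
                              qstat -Q                               # list the PBS queues actually available on the cluster
                              ls /bioscope/etc/files                 # the demo's hg18 reference should have been installed here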
                              Michael Black, Ph.D.
                              ScitoVation LLC. RTP, N.C.



                              • #30
                                Thank you, all. I will start over with the re-installation. Thanks!

