**ondovb** · 05-14-2010, 08:02 AM

Hi jinghanna,

Did you get the bases for the read by directly translating from color space to base space? If you compare the color space sequences:

T30301232232233211102303212320130230200112020001113
T30301232232233211102303212320100230200112020021113

There is most likely a sequencing error at color 31. SOLiD errors change every base to the right of them if you translate from left to right (in this case changing them to their complements). That's why SOLiD aligners do alignment in color space. This allows errors to be distinguished, since it's very unlikely that these color space sequences were the same (except for one color) just by chance. The chance gets higher for color space mismatches close to the end of the read, but in this case you can be pretty sure that the reference sequence is actually what the base space sequence of the read is.
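To make the error propagation concrete, here is a minimal Python sketch of naive left-to-right translation. It assumes the standard SOLiD two-base encoding (A=0, C=1, G=2, T=3, with each color equal to the XOR of the two adjacent base codes) and that the leading letter is the primer base. The function names `decode` and `color_mismatches` are made up for illustration and are not part of SOCS.

```python
# Sketch only: naive color-space decoding, not how SOCS works internally.
# Assumes colors are the digits 0-3 (no "." missing calls).
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode(csread):
    """Translate a color-space read (primer base + colors) left to right."""
    code = BASES.index(csread[0])  # primer base, not part of the read itself
    out = []
    for color in csread[1:]:
        code ^= int(color)  # next base code = previous base code XOR color
        out.append(BASES[code])
    return "".join(out)


def color_mismatches(a, b):
    """1-based color positions (primer excluded) where two reads differ."""
    return [i for i, (x, y) in enumerate(zip(a[1:], b[1:]), start=1) if x != y]


# Toy example: one wrong color (position 2) changes every base from there on.
good = "T0123"
bad = "T0323"
print(decode(good), decode(bad))
print(color_mismatches(good, bad))

# The two color-space sequences quoted above.
seq_a = "T30301232232233211102303212320130230200112020001113"
seq_b = "T30301232232233211102303212320100230200112020021113"
print(color_mismatches(seq_a, seq_b))
```

In the toy pair the two translations agree only on the first base, even though just one color differs, which is the effect described above. The last line lists the color positions at which the two quoted sequences differ.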

By default, SOCS will not give you a translation, since it assumes it's just the reference sequence (I did this to keep the output files small). If you tell it to look for short variants, alignments.txt will show translations of the reads with any variants detected.

**jinghanna** · 05-14-2010, 11:04 AM

Thanks a lot, ondovb. Your reply completely resolved my puzzle.

Earlier I did not realize that one error in color space could make every base after it wrong when translated to base space. So the alignment needs to be done in color space.

One more question: if I want to run SOCS on a cluster, do I simply need to add the option -N and then specify the number of nodes to be used, like this?

socs -N 5

Thanks a lot for your help!

**ondovb** · 05-14-2010, 12:53 PM

You also need to tell each node which one it is with -n, i.e.:

socs -N 5 -n 1 ...
socs -N 5 -n 2 ...
socs -N 5 -n 3 ...
...

**jinghanna** · 05-14-2010, 12:53 PM

Got it, thanks again!

**Haneko** · 06-29-2010, 06:20 PM

Hi there,

I'm sorry, could you please elaborate on how to run the program on a cluster? I installed it on a cluster with about 40 nodes (I intend to use only 5 or 10 as a test).

As an example, let's say I have a test set of approximately 100,000 reads. I want to run SOCS across 10 nodes, using all 8 processors on each node. How do I edit the socs.pref file to achieve this? How do I know which nodes the process was allocated to? Perhaps you could give a sample .pref file for reference?

I'm trying to map to large genomes such as human or mouse. Do you have an estimate of the running time?

Thanks!

**jinghanna** · 06-29-2010, 09:28 PM

run SOCS on computer clusters

Below is what I did to run SOCS on a computer cluster:

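First, create a template script containing the "socs" command, with placeholders for the values that change per node: "-d [datagram1]" for the output directory and "-n [datagram2]" for the node number. The template script should look something like this: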
input1 = [datagram1]
input2 = [datagram2]
socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d [datagram1] -N 3 -n [datagram2]

Do not forget the parameter -p, which is necessary for batch or cluster runs.

Then create the datagram file. In this case, each line gives an output directory and a node number from 1 to N:
~~~
output1 1
output2 2
output3 3
~~~

Finally, you will need a general cluster submission script, which should contain all environment settings and refer to your template script, to submit the jobs to the cluster. Something like:

submitjobs.sh --script template_script --datagrams datagram_file
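As a rough illustration of what the expansion step amounts to, the Python sketch below fills the `[datagram1]` and `[datagram2]` placeholders from each line of the datagram file. The helper name `expand` is hypothetical, `submitjobs.sh` is specific to my cluster and may work differently on yours, and the socs options are only the ones quoted in this thread.

```python
# Sketch only: mimics substituting datagram values into the template script.
TEMPLATE = ("socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual "
            "-d [datagram1] -N 3 -n [datagram2]")

DATAGRAMS = """\
output1 1
output2 2
output3 3
"""


def expand(template, datagram_text):
    """Return one command line per non-empty datagram line."""
    commands = []
    for line in datagram_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        cmd = template
        # [datagramK] is replaced by the K-th whitespace-separated field.
        for k, value in enumerate(fields, start=1):
            cmd = cmd.replace(f"[datagram{k}]", value)
        commands.append(cmd)
    return commands


for cmd in expand(TEMPLATE, DATAGRAMS):
    print(cmd)
```

Each printed line is the command one node would run, differing only in the output directory and the value passed to -n.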

Hope this helps.

**jinghanna** · 06-29-2010, 09:35 PM

For an estimate of running time, please refer to this paper published by the SOCS authors:

Brian D. Ondov, Anjana Varadarajan, Karla D. Passalacqua, and Nicholas H. Bergman, "Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications," Bioinformatics 2008 December 1; 24(23): 2776–2777.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2639273/pdf/btn512.pdf

**zee** · 06-29-2010, 09:45 PM

Haneko, we have an MPI version of novoalign that is able to map color space reads using as many nodes as you like. If you would like to give it a run then PM me. I have been running these sorts of tests on large reference genomes such as human and mouse.

**Haneko** · 06-29-2010, 10:09 PM

Hi jinghanna,

Thanks! Just to make sure I've really understood, could I simply have 3 scripts:

script1 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output1 -N 3 -n 1
script2 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output2 -N 3 -n 2
script3 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output3 -N 3 -n 3

Then queue them separately on the cluster? They don't necessarily have to run in parallel (as in, at the exact same time), right?

Hi zee,

I actually want to use the new bisulfite mapping algorithm from SOCS, so I don't think novoalign fits my needs. But thanks for the suggestion!

**jinghanna** · 06-29-2010, 10:13 PM

Hi Haneko,

I believe you can do that. After all the jobs are done, you will need to run combineAlignments.pl to join the results from different output directories.

**Haneko** · 06-29-2010, 10:15 PM

Hi jinghanna,

Thanks a lot for your help!!

**zee** · 06-29-2010, 10:24 PM

FYI, and just for clarification, novoalign does bisulfite alignment, but currently not for SOLiD reads.
In fact, I'm not aware of anybody who is doing bisulfite sequencing with SOLiD as yet.

**Haneko** · 06-29-2010, 10:39 PM

Hi zee,

Oh ok! But I'm dealing with SOLiD reads now, unfortunately.

**ondovb** · 06-30-2010, 05:59 AM

jinghanna, thanks for answering Haneko's questions.

A couple of other notes:

- The output directories can be the same for each node, since they will each include their node # in their output file names. If your nodes have a shared file system, this can save you some copying.

- Running times for bisulfite are a lot longer than for the standard algorithm. For reference, we aligned ~55M bisulfite reads to Arabidopsis in about 30 hours using 16 threads (with sensitivity=3).
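For a back-of-the-envelope feel for those numbers, the short Python calculation below turns them into a throughput. It uses only the figures quoted above. Applying the rate to another genome, read set, sensitivity setting, or thread count would assume linear scaling, which is an assumption and not something measured here.

```python
# Figures quoted above: ~55M bisulfite reads, ~30 hours, 16 threads.
reads = 55_000_000
hours = 30
threads = 16

reads_per_hour = reads / hours
reads_per_thread_hour = reads_per_hour / threads

print(f"{reads_per_hour:,.0f} reads/hour overall")
print(f"{reads_per_thread_hour:,.0f} reads/hour per thread")
```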

question on running SOCS program