  • question on running SOCS program

    I tried to use SOCS to map SOLiD reads (50 bp each) to a set of reference sequences (of varying length, each 50 bp or longer). I used the default parameter settings: tolerance and mismatch sensitivity both set to 2.

    In the output file "alignments.txt", I found one alignment as follows,

    one of the reads:
    TAATTGATCTAGATAGTGTTCGGCTGATCCATTCGGAAACAGGAAAACACG

    is aligned to the reference sequence:
    TAATTGATCTAGATAGTGTTCGGCTGATCCAAAGCCTTTGTCCTTTCACATG

    The first 31 nt of the read and the aligned reference sequence are identical, but the rest of the read is the complement of the reference sequence. It seems that only the first part of the read was used for the alignment. Is this result reasonable? Any suggestions are highly appreciated!

  • #2
    Hi jinghanna,

    Did you get the bases for the read by directly translating from color space to base space? If you compare the color space sequences:

    T30301232232233211102303212320130230200112020001113
    T30301232232233211102303212320100230200112020021113

    There is most likely a sequencing error at color 31. SOLiD errors change every base to the right of them if you translate from left to right (in this case changing them to their complements). That's why SOLiD aligners do alignment in color space. This allows errors to be distinguished, since it's very unlikely that these color space sequences were the same (except for one color) just by chance. The chance gets higher for color space mismatches close to the end of the read, but in this case you can be pretty sure that the reference sequence is actually what the base space sequence of the read is.
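To make the comparison above concrete, here is a minimal sketch that translates the two color-space sequences back to base space and lists the colors where they differ. It assumes the standard SOLiD two-base encoding (each color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3); the function and variable names are illustrative and are not part of SOCS.

```python
# Sketch only: standard SOLiD two-base encoding, not SOCS code.
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = code(prev) XOR code(next)

READ_CS = "T30301232232233211102303212320130230200112020001113"
REF_CS = "T30301232232233211102303212320100230200112020021113"


def decode(colors: str) -> str:
    """Translate primer base + colors to base space, left to right."""
    prev = BASES.index(colors[0])
    out = [colors[0]]
    for c in colors[1:]:
        prev ^= int(c)  # each color tells how the next base differs
        out.append(BASES[prev])
    return "".join(out)


def color_mismatches(a: str, b: str) -> list[int]:
    """1-based positions of differing colors (the primer base is skipped)."""
    return [i for i, (x, y) in enumerate(zip(a[1:], b[1:]), start=1) if x != y]


if __name__ == "__main__":
    print(decode(READ_CS))
    print(decode(REF_CS))
    print(color_mismatches(READ_CS, REF_CS))
```

Decoding the first sequence reproduces the read from the first post, and decoding the second reproduces the reference up to its last base. The comparison reports differences at colors 31 and 46, which fits within the tolerance of 2 mentioned in the first post. Everything to the right of color 31 changes in base space even though only those two colors differ.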

    By default, SOCS will not give you a translation, since it assumes it's just the reference sequence (I did this to keep the output files small). If you tell it to look for short variants, alignments.txt will show translations of the reads with any variants detected.
    Last edited by ondovb; 05-14-2010, 08:03 AM. Reason: misspelled jinghanna...



    • #3
      Thanks a lot, ondovb. Your reply completely resolved my puzzle.

      Earlier I did not realize that a single color-space error makes every base after it wrong once the read is translated to base space. The alignment needs to be done in color space.

      One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

      socs -N 5

      Thanks a lot for your help!



      • #4
        Originally posted by jinghanna
        One more question, if I want to run SOCS on a cluster, do I simply need to add the option -N, and then specify the number of nodes to be used, just like

        socs -N 5
        You also need to tell each node which one it is with -n, i.e.:

        socs -N 5 -n 1 ...
        socs -N 5 -n 2 ...
        socs -N 5 -n 3 ...
        ...



        • #5
          Got it, thanks again!



          • #6
            Hi there,

            I'm sorry, could you please elaborate on how to run the program on a cluster? I installed it on a cluster with about 40 nodes (I intend to use only 5 or 10 as a test).

            As an example, let's say I have a test set with approximately 100,000 reads. I want to run SOCS across 10 nodes, each using all 8 processors on the node. How do I go about editing the socs.pref file to achieve this? How do I know which nodes the process was allocated to? Perhaps you could give a sample .pref file for reference?

            I'm trying to map to large genomes such as the human or mouse genome. Do you have any estimate of the running time?

            Thanks!
            Last edited by Haneko; 06-29-2010, 07:02 PM. Reason: Added question



            • #7
              run SOCS on computer clusters

              Below is what I did to run SOCS on a computer cluster:

              First create a template script with the command "socs" and add "-n [datagram]" to the command. The template script should look something like this:
              input1 = [datagram1]
              input2 = [datagram2]
              socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d [datagram1] -N 3 -n [datagram2]

              Do not forget the parameter -p, which is necessary for batch or cluster runs.

              Then create the datagram file. In this case, it will be the numbers from 1 to N:
              ~~~
              output1 1
              output2 2
              output3 3
              ~~~

              Finally, you will need a general cluster submission script, which should contain all environment settings and your template script, to submit the jobs to the cluster, something like:

              submitjobs.sh --script template_script --datagrams datagram_file
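The datagram file from the steps above can be generated rather than typed by hand. This is a small sketch that assumes the same "output directory, node number" layout shown in this post; the function name and the `output` prefix are illustrative choices, and the submission script itself is site-specific and not covered here.

```python
# Sketch only: builds datagram lines like "output1 1" for nodes 1..N.
def datagram_lines(num_nodes: int, prefix: str = "output") -> list[str]:
    """One line per node: '<output directory> <node number>'."""
    return [f"{prefix}{node} {node}" for node in range(1, num_nodes + 1)]


if __name__ == "__main__":
    # Write the file that is later passed to the submission script.
    with open("datagram_file", "w") as handle:
        handle.write("\n".join(datagram_lines(3)) + "\n")
```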

              Hope this helps.



              • #8
                For an estimate of the running time, please refer to this paper published by the original authors:

                Brian D. Ondov, Anjana Varadarajan, Karla D. Passalacqua, and Nicholas H. Bergman, "Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications," Bioinformatics 2008 December 1; 24(23): 2776–2777.



                • #9
                  Haneko, we have an MPI version of novoalign that is able to map color space reads using as many nodes as you like. If you would like to give it a run then PM me. I have been running these sorts of tests on large reference genomes such as human and mouse.


                  • #10
                    Hi jinghanna,

                    Thanks! Just to make sure I've really understood, could I simply have 3 scripts:

                    script1 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output1 -N 3 -n 1
                    script2 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output2 -N 3 -n 2
                    script3 : socs -p -r ref_seq.fa -c xxx.csfasta -q xxx.qual -d output3 -N 3 -n 3

                    Then separately queue them into the cluster? They don't necessarily have to run in parallel (as in, at the exact same time), right?
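The per-node command lines can also be produced programmatically, which helps when the node count grows beyond three. This is a sketch that only uses the options already shown in this thread (`-p`, `-r`, `-c`, `-q`, `-d`, `-N`, `-n`); the function name and the `output` directory prefix are assumptions for illustration, and how each command is queued depends on the cluster's scheduler.

```python
# Sketch only: prints one socs command per node, to be queued separately.
import shlex


def socs_commands(num_nodes: int, ref: str, csfasta: str, qual: str,
                  out_prefix: str = "output") -> list[str]:
    """Build the socs command line for each node number 1..num_nodes."""
    commands = []
    for node in range(1, num_nodes + 1):
        args = [
            "socs", "-p",
            "-r", ref,
            "-c", csfasta,
            "-q", qual,
            "-d", f"{out_prefix}{node}",
            "-N", str(num_nodes),  # total number of nodes
            "-n", str(node),       # which node this job is
        ]
        commands.append(shlex.join(args))  # quotes arguments if needed
    return commands


if __name__ == "__main__":
    for command in socs_commands(3, "ref_seq.fa", "xxx.csfasta", "xxx.qual"):
        print(command)
```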

                    Hi zee,

                    I actually want to use the new bisulfite mapping algorithm from SOCS, so I don't think novoalign fits my needs. But thanks for the suggestion!



                    • #11
                      Hi Haneko,

                      I believe you can do that. After all the jobs are done, you will need to run combineAlignments.pl to join the results from different output directories.



                      • #12
                        Hi jinghanna,

                        Thanks a lot for your help!!



                        • #13
                          FYI, and just for clarification: novoalign does bisulfite alignment, but currently not for SOLiD reads.
                          In fact, I'm not aware of anybody who is doing bisulfite sequencing with SOLiD yet.


                          • #14
                            Hi zee,

                            Oh ok! But I'm dealing with SOLiD reads now, unfortunately.



                            • #15
                              jinghanna, thanks for answering Haneko's questions.

                              A couple of other notes:

                              - The output directories can be the same for each node, since they will each include their node # in their output file names. If your nodes have a shared file system, this can save you some copying.

                              - Running times for bisulfite are a lot longer than for the standard algorithm. For reference, we aligned ~55M bisulfite reads to Arabidopsis in about 30 hours using 16 threads (with sensitivity=3).
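As a rough reading of the bisulfite figures above, the throughput works out as follows. This is back-of-the-envelope arithmetic only; it assumes nothing beyond the numbers quoted in this post, and the rate should not be expected to carry over to other genomes, sensitivity settings, or hardware.

```python
# Sketch only: throughput implied by ~55M reads in ~30 hours on 16 threads.
def throughput(reads: float, hours: float, threads: int) -> tuple[float, float]:
    """Return (reads per hour, reads per thread-hour)."""
    per_hour = reads / hours
    return per_hour, per_hour / threads


if __name__ == "__main__":
    per_hour, per_thread_hour = throughput(55e6, 30, 16)
    print(f"{per_hour:,.0f} reads/hour, {per_thread_hour:,.0f} reads/thread-hour")
```

That is roughly 1.8 million reads per hour overall, or about 115,000 reads per thread-hour.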
