
  • Requesting advice regarding CAP3

    Hello everybody,

    These days I am working with the CAP3 program, which assembles sequences into contigs and singletons. When I run it on a small data set, i.e. an input file of up to 50,000 sequences, it works perfectly and gives me the output files. But when I run it on a large data set, i.e. an input file of >2,50,000 sequences, the command prompt displays

    $ ran out of memory, -195789398 bytes requested.

    Could somebody please explain this message and suggest a possible solution, as I will have to use even larger data sets?

    I am using the RH Linux operating system with a 1 TB hard disk, 8 GB RAM and a quad-core Xeon processor.

  • #2
    Obviously CAP3 is running out of memory with your data, too many input reads at once. What kind of data is it, genomic or transcriptome ?

    Either way, you might want to try some cleaning and clustering first (with a package like TGICL) so you assemble the clusters instead of the whole thing at once (unless all those reads are supposed to assemble into a single chromosome). Also, I hope you are using the 64-bit version of CAP3, which is able to make use of that memory (check for the latest version for your platform; if you haven't done it already, download the version for 64-bit Linux systems with an Intel processor).
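One way to check which build of CAP3 is actually installed is to look at the executable's ELF header: on Linux, the fifth byte of an ELF file is 1 for a 32-bit binary and 2 for a 64-bit one. The following is a minimal sketch; the helper name `elf_class` is made up for illustration, and the lookup of `cap3` on the PATH is an assumption about how it is installed.

```python
import shutil


def elf_class(path):
    """Return '32-bit' or '64-bit' for an ELF file, or None if it is not ELF."""
    with open(path, "rb") as fh:
        header = fh.read(5)
    # An ELF file starts with the magic bytes 0x7f 'E' 'L' 'F';
    # the next byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    return {1: "32-bit", 2: "64-bit"}.get(header[4])


if __name__ == "__main__":
    # Assumption: the binary is called "cap3" and is on the PATH.
    binary = shutil.which("cap3")
    if binary is None:
        print("cap3 not found on PATH")
    else:
        print(binary, elf_class(binary))
```

The same check should be repeated for any second copy of the binary, for example one bundled inside another package's bin/ directory, since the copy found first on the PATH is the one that runs.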


    • #3
      What type of data is it? 454 I assume, maybe Sanger? There are several options out there for WGS. I don't have much experience with CAP3; Newbler or Celera Assembler could prove to be viable alternatives.
      Justin H. Johnson | Twitter: @BioInfo | LinkedIn: | EdgeBio


      • #4
        I am using transcriptomic datasets. I have treated my data with "seqclean", which removed the vector contaminants from the set of sequences. I have also tried TGICL. It generates the ACE file, cluster file and contig file, but it is unable to generate the singletons file. The error log ends with the following message:

        "The clusters are stored in file 'My_data_set.fasta_cl_clusters'

        >>> --- ASSEMBLE [My_data_set.fasta] started at Jan 14 12:06:58 2010

        Process terminated with an error, at step 'ASSEMBLE'!
        tgicl (My_data_set.fasta) encountered an error at step ASSEMBLE
        Working directory was /root/Desktop/Software_Collection/Assembly/tgicl_linux."

        I think there is some problem in the last step, which is unable to make the singletons file.
        Secondly, this message is displayed with large as well as small data sets.

        I get a similar log and similar results when I switch to a machine running openSUSE.

        I would also like to know: is a matrix constructed over all n input sequences when we use TGICL and CAP3, and is that why they take so much time?
        Last edited by Bharat; 01-22-2010, 08:39 PM.


        • #5
          Would it be a good approach to divide my data set into files of 50,000 sequences each, run CAP3 on each file, and then concatenate all the resulting singleton and contig files?

          Please advise me.


          • #6
            Dividing the input arbitrarily like that doesn't sound right, you're likely to get redundant or incomplete contigs etc.
            TGICL or another clustering tool should be used to partition the input data if the assembler is not able to do it by itself.
            I can provide some limited assistance for TGICL even though I haven't used it in a while and I thought it would be deprecated by now (I wrote those scripts many years ago when I was working on EST clustering myself). So if you can't find a better assembly solution for your data I would suggest you could try fixing TGICL to make it work for you. You can start by looking at all the err_* files left around by TGICL (not only in the main working directory, but also look into the asm_* subdirectories), look in there for any suspicious error messages, perhaps you'll find the exact cause of TGICL failure and address that (or let me know what errors you see there and perhaps I can help with fixing them).
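The search described above (err_* files in the main working directory and in the asm_* subdirectories) can be sketched in a few lines of Python. This is only an illustration: the function name `collect_error_logs` is made up, and the directory layout is assumed from the description in this post rather than from the TGICL documentation.

```python
from pathlib import Path


def collect_error_logs(workdir):
    """Map each non-empty err_* file under workdir to its non-blank lines.

    Looks in workdir itself and in its asm_* subdirectories, as suggested
    in the post above. The layout is an assumption, not a TGICL guarantee.
    """
    root = Path(workdir)
    candidates = sorted(root.glob("err_*")) + sorted(root.glob("asm_*/err_*"))
    report = {}
    for path in candidates:
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        lines = [line for line in text.splitlines() if line.strip()]
        if lines:
            report[str(path.relative_to(root))] = lines
    return report
```

Calling `collect_error_logs(".")` from the TGICL working directory and printing the result gives one place to scan for suspicious messages instead of opening each file by hand.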
            Last edited by gpertea; 01-23-2010, 06:45 PM.


            • #7

              You may try one of the sequence clustering programs:


              I have used uclust on a rather small set (30k GSS NCBI FASTA sequences), and it ran without any problems.

              Darek Kedra


              • #8
                There are still two things unclear:
                a) cap3 32/64bit?
                b) what type of input data?

                Does this " >2,50,000" mean 250K or 2.5M input reads?

                Another recommendation for a clustering program may be "wcd".
                For 454-generated data (as well as for Sanger data) I use MIRA3.



                • #9
                  Thanks to all for the help and suggestions.


                  The "err_log" file in the asm_1 folder shows:
                  sh: line 1: 17975 Aborted cap3 CL4 -p 93 > CL4.align
                  Error! cap3 failure detected (code=34304) on: CL4
                  sh: line 1: 18941 Aborted cap3 CL71 -p 93 > CL71.align
                  Error! cap3 failure detected (code=34304) on: CL71
                  sh: line 1: 19795 Aborted cap3 CL131 -p 93 > CL131.align
                  Error! cap3 failure detected (code=34304) on: CL131
                  sh: line 1: 19842 Aborted cap3 CL135 -p 93 > CL135.align
                  Error! cap3 failure detected (code=34304) on: CL135
                  sh: line 1: 23712 Aborted cap3 CL409 -p 93 > CL409.align
                  Error! cap3 failure detected (code=34304) on: CL409
                  sh: line 1: 24016 Aborted cap3 CL431 -p 93 > CL431.align
                  Error! cap3 failure detected (code=34304) on: CL431
                  sh: line 1: 24613 Aborted cap3 CL474 -p 93 > CL474.align
                  Error! cap3 failure detected (code=34304) on: CL474
                  sh: line 1: 28899 Aborted cap3 CL779 -p 93 > CL779.align
                  Error! cap3 failure detected (code=34304) on: CL779
                  sh: line 1: 29114 Aborted cap3 CL795 -p 93 > CL795.align
                  Error! cap3 failure detected (code=34304) on: CL795
                  sh: line 1: 4258 Aborted cap3 CL1326 -p 93 > CL1326.align
                  Error! cap3 failure detected (code=34304) on: CL1326
                  sh: line 1: 6597 Aborted cap3 CL1492 -p 93 > CL1492.align
                  Error! cap3 failure detected (code=34304) on: CL1492
                  sh: line 1: 7594 Aborted cap3 CL1563 -p 93 > CL1563.align
                  Error! cap3 failure detected (code=34304) on: CL1563
                  sh: line 1: 7865 Aborted cap3 CL1583 -p 93 > CL1583.align
                  Error! cap3 failure detected (code=34304) on: CL1583
                  sh: line 1: 7954 Aborted cap3 CL1590 -p 93 > CL1590.align
                  Error! cap3 failure detected (code=34304) on: CL1590
                  whereas "err_tgicl_My_data_set.fasta.log" in the main tgicl folder shows:

                  >>> --- Initialization [My_data_set.fasta] started at Jan 14 11:59:28 2010
                  tgicl running options:
                  tgicl My_data_set.fasta
                  Standard log file: tgicl_My_data_set.fasta.log
                  Error log file: err_tgicl_My_data_set.fasta.log
                  Using 1 CPUs for clustering and assembly
                  Path is : /root/Desktop/Software_Collection/Assembly/tgicl_linux/bin:/root/Desktop/Software_Collection/Assembly/tgicl_linux:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
                  -= Rebuilding My_data_set.fasta indices =-
                  34071 entries from file My_data_set.fasta were indexed in file My_data_set.fasta.cidx
                  >>> --- clustering [My_data_set.fasta] started at Jan 14 11:59:30 2010
                  Launching distributed clustering:
                  psx -p 1 -n 1000 -i My_data_set.fasta -d cluster -C '/root/Desktop/Software_Collection/Assembly/tgicl_linux/My_data_set.fasta:94:30:40:' -c '/root/Desktop/Software_Collection/Assembly/tgicl_linux/bin/tgicl_cluster.psx'
                  WAITING for all children to finish before starting last child!
                  WAITING for all children to finish!
                  <<< --- clustering [My_data_set.fasta] finished at Jan 14 12:06:37 2010
                  Running transitive closure command: gzip -cd My_data_set.fasta_cl_tabhits_*.Z | tclust PID=94 OVL=40 OVHANG=30 -o My_data_set.fasta_cl_clusters

                  Total t-clusters: 2889
                  Largest cluster has 12681 nodes
                  *** all done ***
                  The clusters are stored in file 'My_data_set.fasta_cl_clusters'.

                  >>> --- ASSEMBLE [My_data_set.fasta] started at Jan 14 12:06:58 2010
                  WAITING for all children to finish before starting last child!
                  WAITING for all children to finish!

                  Process terminated with an error, at step 'ASSEMBLE'!
                  tgicl (My_data_set.fasta) encountered an error at step ASSEMBLE
                  Working directory was /root/Desktop/Software_Collection/Assembly/tgicl_linux.
                  Is there anything wrong within the script? tgicl is a binary file and I am unable to read it. What sort of modification is required in this script?



                  I am using 32-bit CAP3 on a 64-bit machine.

                  As I noted above, the program works fine with a small data set and gives me the error " ran out of memory!" when I use large data sets.

                  I am using 2,50,000 EST sequences in a file.
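The err_log lines quoted above follow a regular pattern, so the list of clusters on which cap3 failed can be extracted mechanically. The sketch below assumes only the line format visible in this post; the function name `failed_clusters` is made up for illustration.

```python
import re

# Matches lines such as:
#   Error! cap3 failure detected (code=34304) on: CL4
FAILURE_RE = re.compile(r"cap3 failure detected \(code=(\d+)\) on: (\S+)")


def failed_clusters(log_text):
    """Return a list of (cluster_name, code) pairs found in an err_log."""
    return [(m.group(2), int(m.group(1))) for m in FAILURE_RE.finditer(log_text)]


sample = """sh: line 1: 17975 Aborted cap3 CL4 -p 93 > CL4.align
Error! cap3 failure detected (code=34304) on: CL4
sh: line 1: 18941 Aborted cap3 CL71 -p 93 > CL71.align
Error! cap3 failure detected (code=34304) on: CL71
"""
print(failed_clusters(sample))  # [('CL4', 34304), ('CL71', 34304)]
```

With the cluster names in hand, the failing clusters can be compared against the cluster sizes reported by the clustering step to see whether the failures line up with the largest clusters.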


                  • #10
                    Somebody please help me regarding TGICL


                    • #11
                      You have a problem with your assembly.

                      First use a 64bit cap3 on a 64bit system, update your cap3 if necessary.

                      You get an error on assembly of clustered data, " ran out of memory!".
                      That's pretty clear. You probably have one or more very deep clusters or preprocessing
                      of your data (adaptors, barcodes, vector?) didn't work very well ( as a consequence you get "deep clusters").

                      Have a look at the TGICL clustering results; are there very deep
                      clusters which might cause cap3 to fail?

                      I am still not sure if you mean 2.5 mio. sequences for the large dataset?

                      You have removed vector contaminants (did you?) so I assume you are using
                      sanger based data? Did you just remove vector reads or did you end-clip your
                      data? For Sanger data 'lucy' is doing a good job.

                      Really, try to find out if preprocessing worked well, try to use different cluster/
                      assembly programs and see if you get different results.

                      Last but not least, be aware that 8G is not really much for assembly of
                      huge transcriptome datasets.

                      You don't supply enough info to effectively help you ..
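A side note on the original message, "-195789398 bytes requested": a negative byte count is what an allocation size looks like when it exceeds the range of a signed 32-bit integer and wraps around. This is one possible reading, not something confirmed from the CAP3 source, but it fits the advice to move to a 64-bit build. The arithmetic, as a small sketch:

```python
def as_unsigned_32(value):
    """Reinterpret a signed 32-bit integer as the unsigned value with the same bits."""
    return value & 0xFFFFFFFF


reported = -195789398  # the figure from the CAP3 error message
requested = as_unsigned_32(reported)

print(requested)          # 4099177898
print(requested / 2**30)  # size in GiB, a little under 4
```

Under that assumption the program was asking for roughly 3.8 GiB in a single request, which a 32-bit process cannot address regardless of how much RAM the machine has.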



                      • #12
                        I agree with Sven, the OP keeps asking for help while not paying attention to good suggestions (like updating CAP3) or to simple questions (what is that number? 2.5 million or 250K ?! Can't you see how confusing that formatting is ?)

                        However, the TGICL error logs supplied above show that CAP3 still fails on many of the clusters, probably due to out of memory issues again (not sure what exit code 34304 is, you could ask the author of CAP3 about that). Again, I think you should upgrade CAP3 to the latest 64bit version if you haven't done it yet (make sure you replace or delete the CAP3 binary that comes with the tgicl package in the tgicl/bin/ subdirectory, that's very old).
                        The largest cluster reported there has 12,681 reads, which is still very large, and I see that you used 93% identity. You might want to try increasing the stringency of the clustering and assembly process to reduce cluster noise or unwanted expansion. There is an entire section with advice for dealing with larger clusters in the README file that comes with TGICL; make sure you read that.
                        If you did all these and you still have errors with TGICL, please contact me privately so we don't turn this public discussion thread into a specific TGICL debugging exercise.

                        However as Sven and others suggested, you could also try other clustering/assembly packages, they might be more user friendly than TGICL and work better on your configuration.
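On the code 34304 in the logs: if that number is a raw wait status (as Perl's `$?` would be after `system`, which is an assumption about how the TGICL scripts report it), then the exit code is the status shifted right by 8 bits. 34304 >> 8 is 134, and a shell reports 128 + N when a child is killed by signal N, so 134 corresponds to signal 6, SIGABRT. That matches the "Aborted" text on the preceding log lines. A minimal sketch of the decoding:

```python
import signal


def decode_wait_status(status):
    """Split a raw wait status into (exit_code, signal_number_or_None).

    Assumes the child was run through a shell, which reports death by
    signal N as exit code 128 + N.
    """
    exit_code = status >> 8
    signal_number = exit_code - 128 if exit_code > 128 else None
    return exit_code, signal_number


exit_code, sig = decode_wait_status(34304)
print(exit_code, sig)  # 134 6
if sig is not None:
    print(signal.Signals(sig).name)
```

An abort by itself does not say why CAP3 gave up, but it is consistent with the program calling abort() after a failed memory allocation.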


                        • #13
                          Thank you all for your kind suggestions.

                          CAP3 is now working fine for my data set of 250K sequences, as I am using the 64-bit version of CAP3.
                          But my workstation seems to hang; I mean processing becomes very slow. Maybe the program uses almost all of the RAM.

                          Another question: what would be the ideal configuration of a workstation for whole genome assembly, annotation and other analyses? In the near future I will have to deal with very large data.

                          Please give me your valuable suggestions.
                          Last edited by Bharat; 01-31-2010, 08:59 PM.


                          • #14
                            Hello everybody,

                            Can somebody tell me how to run cap3 on multiple processors? Mine is an 8-core machine and I want to use 4-5 cores of it.

                            Thanks in advance.


                            • #15
                              You can't. cap3 is not multithreaded. If you are going for EST assemblies, cluster the data and then assemble the distinct clusters.
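Since each cluster is assembled independently, several cap3 processes can run side by side, one per cluster file, even though a single cap3 run uses one core. The sketch below is one way to do that from Python. The command form `cap3 <file> -p 93` with output redirected to `<file>.align` is copied from the err_log earlier in this thread; the function names and the default of 4 workers are illustrative assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def cap3_command(fasta_path):
    """Build the command line; mirrors the invocation seen in the err_log."""
    return ["cap3", str(fasta_path), "-p", "93"]


def run_one(fasta_path, build_command=cap3_command):
    """Run one assembly, writing stdout to <fasta_path>.align."""
    with open(f"{fasta_path}.align", "w") as out:
        result = subprocess.run(build_command(fasta_path), stdout=out)
    return str(fasta_path), result.returncode


def run_all(fasta_paths, workers=4, build_command=cap3_command):
    """Run up to `workers` assemblies at a time; return (path, returncode) pairs."""
    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_one(p, build_command), fasta_paths))
```

For example, `run_all(["CL4", "CL71", "CL131"], workers=4)` would keep up to four cores busy. Memory use adds up across the concurrent runs, so on a machine with limited RAM the number of workers may need to be lower than the number of cores.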


