
  • pacBioToCa .spec file options

    Hi All,

    I have some PacBio long reads that I am attempting to error correct using the Celera 7 pipeline. Essentially, it is a 2 Mb genome. I have ~50x coverage in long reads in a .fastq, and I'm using ~50x Circular Consensus reads to error correct the long reads. I'm running this on a single server with 16 logical cores and 72 GB of RAM. I'm using the default "high memory" spec file found at:

    the only thing I changed in it was the "merylMemory" variable from 128,000 to 72,000 (commas added by me here for readability, not included in the .spec file). I was able to follow along the sourceforge wiki:
    Download Whole-Genome Shotgun Assembler for free. Celera Assembler (CA) is a whole-genome shotgun (WGS) assembler for the reconstruction of genomic DNA sequence from WGS sequencing data.

    and it's up and running, but it has been now for 8 days. Additionally, the memory usage on the machine is only ~3,300/72,000 and hasn't really moved at all (although all 16 processors have been running at 100% this entire time). I feel like I'm underutilizing the system resources, and that this process shouldn't take as long as it has on this system.

    Has anyone run a data set similar to this, on a machine like this? Or does it seem reasonable that it is taking as long as it is to complete this process?


  • #2
    I'm also running into speed issues. My 50x coverage run was estimated to take 4 weeks to complete.

    I would be surprised if all the people using it are ready to wait that long :-)


    • #3
      pacBioToCA is quite a time sink, but I did find the results useful.

      I do wonder if it would be more efficient to assemble the short read (or CCS) data alone very conservatively and then feed the contigs plus unused reads into PacBioToCA. An awful lot of the assembly is not altered by the PacBio reads, but those regions must eat a lot of time.


      • #4
        If you have short reads (Illumina reads, for example) for your sample besides the PacBio data, then consider the following.

        PacBio just released a new version (v.1.3.1) of their SMRTAnalysis suite. A new error correction module (P_ErrorCorrection) is included.

        The description for P_errorCorrection module from the manual says: "This module takes as input long reads and short reads in standard formats, aligns the short reads to the long reads, and outputs a corrected version of the long reads."

        If your data used v.2 chemistry (and you have short read data) then it may be worthwhile to re-analyze your data using the new version of SMRTAnalysis package with the new error correction module.


        • #5
          We will be upgrading our SMRT Portal from 1.3.0 to 1.3.1 soon (in a few days), but I would really like to have pacbioToCA work in a reasonable amount of time, for 2 reasons:

          1- Many people that do not have access to smart portal use it successfully. Being a sequencing center, it's a good thing to be able to propose this open solution (on another note since others don't complain too much about speed, I'm thinking it's a problem on our end, a setting say, but I don't know what it is).

          2- PacBio will be adding pacbioToCA in 1.3.3, so might as well get familiar with it now.


          • #6
            I was eventually able to get a pipeline for error correction working using pacbioToCA. Basically (with the help of the PacBio folks) what solved my issue was an updated pacbio.spec file. Once I had exchanged mine for the one Pacific Biosciences had modified, the error correction took *way* less time, using the machine I originally posted about it was done in under an hour. IIRC less than half an hour. Much improved. Assembly using Celera actually took longer than the error correction. Again, this is a pretty small genome (2 Mb) so YMMV.

            I have also heard that the error correction pipeline with the update to the pacbio software works very well for some people. I've heard that it only works for small genomes, i.e. <10MB whereas the Celera pipeline can handle much larger data... I don't know when/if that is going to change but it was recommended to us to use the Celera pipeline if we were ever going to sequence "big" genomes. Since we do plan on it, Celera it was.

            If people are still interested I can post the new pacbio.spec that worked well for me.



            • #7
              Yes, yes please do!


              • #8
                Actually here it is:

                # original asm settings
                utgErrorRate = 0.25
                utgErrorLimit = 4.5
                cnsErrorRate = 0.25
                cgwErrorRate = 0.25
                ovlErrorRate = 0.25
                merylMemory = 128000
                merylThreads = 16
                ovlStoreMemory = 8192
                # grid info
                useGrid = 0 
                scriptOnGrid = 0
                frgCorrOnGrid = 0
                ovlCorrOnGrid = 0
                sge = -A assembly
                sgeScript = -pe threads 16
                sgeConsensus = -pe threads 1
                sgeOverlap = -pe threads 2
                sgeFragmentCorrection = -pe threads 2
                sgeOverlapCorrection = -pe threads 1
                #ovlMemory=8GB --hashload 0.7
                ovlHashBits = 25
                ovlThreads = 2
                ovlHashBlockLength = 20000000
                ovlRefBlockSize =  50000000
                # for mer overlapper
                merCompression = 1
                merOverlapperSeedBatchSize = 500000
                merOverlapperExtendBatchSize = 250000
                frgCorrThreads = 2
                frgCorrBatchSize = 100000
                ovlCorrBatchSize = 100000
                # non-Grid settings, if you set useGrid to 0 above these will be used
                merylMemory = 128000
                merylThreads = 4
                ovlStoreMemory = 8192
                ovlConcurrency = 8
                cnsConcurrency = 8
                merOverlapperThreads = 3 
                merOverlapperSeedConcurrency = 3
                merOverlapperExtendConcurrency = 3
                frgCorrConcurrency = 2
                ovlCorrConcurrency = 4 
                cnsConcurrency = 4

                A lot of this is Greek to me. I tried going through and wrestling it out of the documentation, but the documentation won. Basically, because I have 16 logical processors on that machine, that's what I used for several of the "thread" options. Other than that... *shrug* I'm sure there are Celera experts here who can parse this.
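
For anyone else trying to read the spec file above: a few keys (merylMemory, merylThreads, ovlStoreMemory, cnsConcurrency) are assigned more than once, sometimes with different values. Here is a minimal Python sketch that lists such repeats. It assumes the file is plain `key = value` lines with `#` comments, as in the listing; the helper names are made up for illustration, and which of the repeated values the assembler actually uses is not stated in this thread.

```python
from collections import defaultdict


def parse_spec(text):
    """Collect every value assigned to each key, in file order."""
    values = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (including commented-out settings).
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()].append(value.strip())
    return dict(values)


def repeated_keys(values):
    """Return only the keys that are assigned more than once."""
    return {k: v for k, v in values.items() if len(v) > 1}


# Excerpt of the spec posted above (only the lines relevant to the repeats).
SPEC_EXCERPT = """
merylMemory = 128000
merylThreads = 16
ovlStoreMemory = 8192
# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4
ovlStoreMemory = 8192
ovlConcurrency = 8
cnsConcurrency = 8
cnsConcurrency = 4
"""

if __name__ == "__main__":
    for key, vals in repeated_keys(parse_spec(SPEC_EXCERPT)).items():
        print(key, vals)
```

Running it over a full spec file (read the file and pass its text to `parse_spec`) gives a quick list of settings worth double-checking before a long run.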


                • #9
                  8+ days to less than an hour is pretty spectacular

                  Originally posted by jpearl01
                  Once I had exchanged mine for the one Pacific Biosciences had modified, the error correction took *way* less time, using the machine I originally posted about it was done in under an hour. IIRC less than half an hour.



                  • #10
                    I have to admit, I was pretty skeptical at first when they said the time to do the error correction pipeline could be vastly reduced (8 days was when I first posted, I let it run for another week before I finally cancelled it). I assume that it wasn't actually doing anything, or rather whatever it was doing was not progressing the pipeline (it was running 100% on all 16 cpus during the entire time, so whatever "nothing" it was doing, it was doing a lot of it). Anyway half-hour error correction was pretty much beyond my dreams at that point so I was rather pleased.

                    @lletourn Could you let us know if the .spec file worked for you, and how long it took you to error-correct?


                    • #11
                      Of course! I am really looking forward to running this this weekend.


                      • #12
                        I just noticed, why the stopAfter=Overlaper?

                        Did they tell you to run something manually?


                        • #13
                          I don't believe so, unless the program would normally go directly into the assembler, which I did run manually. Other than that, I just let it do its thing. At the end of this process I think there is a 9_terminator folder that holds the results. But I didn't enter anything else manually into the error correction analysis. Just that spec file.


                          • #14
                            I also tried to run the pacbioToCa pipeline, but in our case it was initially going to take 200 days to complete (extrapolated, of course!). It turned out that we had a huge E. coli contamination, making the coverage of that genome over 5000x. Once we got rid of the E. coli Illumina reads it ran in 14 hrs. (I also modified the pacbio.spec file, somewhat like the one shown by jpearl01.)
                            But the results were not promising:
                            Input: (for a 100MB 'genome')
                            400MB pacbio data (est. 4x coverage)
                            100M reads Illumina (est. 100x coverage)
                            43MB clean pacbio data (est. <1x coverage)
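
The PacBio figures above work out as follows. This is only a sketch of the arithmetic, assuming "MB" means megabases of sequence (not file size) and taking the 100 Mb target size at face value; the variable names are illustrative.

```python
def coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by target size."""
    return total_bases / genome_size


GENOME_SIZE = 100e6   # the 100 Mb 'genome' from the post
RAW_PACBIO = 400e6    # raw PacBio bases (assumed megabases)
CLEAN_PACBIO = 43e6   # corrected PacBio bases (assumed megabases)

raw_cov = coverage(RAW_PACBIO, GENOME_SIZE)
clean_cov = coverage(CLEAN_PACBIO, GENOME_SIZE)
retained = CLEAN_PACBIO / RAW_PACBIO

if __name__ == "__main__":
    print(f"raw coverage:   {raw_cov:.2f}x")
    print(f"clean coverage: {clean_cov:.2f}x")
    print(f"bases retained: {retained:.1%}")
```

So roughly a tenth of the raw PacBio bases survived correction, which is where the "<1x" estimate comes from.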

                            Is there something we can look into?

                            I also want to try the P_ErrorCorrection module of the smrtportal software, but I read in this thread that it might not be capable of handling a 100mb genome

                            P.S. Our 100 Mb genome is not a real genome, but this should be the total amount of scattered (1000 pieces) sequenced genomic area.
                            Last edited by HenrivdGeest; 03-19-2013, 02:48 AM.


                            • #15
                              I was able to lower the time to about 24 hours:
                              1- I only use 50x of Illumina or CCS reads.
                              2- I modified the .spec file a bit. We have large-memory machines, so I changed the settings to load as much as possible.

                              Also try to launch as many processes as you can and limit the number of threads. BUT each process uses up its own set amount of memory, whereas threads share memory. The problem is that some steps of pacBioToCA are single-threaded, so having more processes goes faster.
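
The processes-versus-threads trade-off can be sketched as a small budget calculation. This is a rough illustration, not anything taken from the Celera scripts: the function name is made up, the 72000 MB and 16 cores are the machine from the first post, and the 8192 MB per process is a hypothetical figure chosen only for the example.

```python
def max_concurrency(total_mem_mb, mem_per_process_mb, cores, threads_per_process):
    """How many processes can run at once without exceeding RAM or cores.

    Each process needs its own memory; threads inside a process share it.
    """
    by_memory = total_mem_mb // mem_per_process_mb
    by_cpu = cores // threads_per_process
    return max(1, min(by_memory, by_cpu))


if __name__ == "__main__":
    # Many small processes: limited by both memory and cores.
    print(max_concurrency(72000, 8192, 16, 2))
    # One wide process: a single-threaded step would leave 15 cores idle.
    print(max_concurrency(72000, 8192, 16, 16))
```

The idea is to pick the concurrency and thread settings so that processes times threads stays near the core count while processes times per-process memory stays under the RAM available.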

                              One thing that happened to us was that I forgot to change the LEN_BIT in the Celera sources, so any PacBio read longer than 2 kb got thrown out. We had less than 0.1% of our reads after correction.

                              After the change we kept about 70% of our bases.
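
A quick way to see whether a read-length limit will bite before starting a long run is to count the reads and bases above it. The sketch below assumes the limit is a bit width, so the longest storable read is 2^bits - 1 (11 bits would give 2047 bp, which matches the "2 kb" above); the exact constant name and its default are not given in this thread, so `len_bits` here is a hypothetical parameter. It also assumes plain 4-line FASTQ records.

```python
import io


def max_read_length(len_bits):
    """Longest read that fits in a length field of len_bits bits (assumed)."""
    return (1 << len_bits) - 1


def length_summary(fastq_handle, limit):
    """Count reads and bases above `limit` in a 4-line-per-record FASTQ."""
    total_reads = reads_over = total_bases = bases_over = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        n = len(line.strip())
        total_reads += 1
        total_bases += n
        if n > limit:
            reads_over += 1
            bases_over += n
    return {
        "total_reads": total_reads,
        "reads_over": reads_over,
        "total_bases": total_bases,
        "bases_over": bases_over,
    }


# Toy input: one 10 bp read and one 3000 bp read.
DEMO_FASTQ = (
    "@short\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@long\n" + "A" * 3000 + "\n+\n" + "I" * 3000 + "\n"
)

if __name__ == "__main__":
    limit = max_read_length(11)
    print(limit, length_summary(io.StringIO(DEMO_FASTQ), limit))
```

For real data, open the FASTQ file and pass the file handle instead of the `io.StringIO` demo.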

