
  • PBJelly

    Hi,

    I have the SMRT Analysis and PBJelly packages installed, and I was able to run PBJelly with the test data provided with the software. However, when I ran it on a bacterial genome (~10 Mb, ~350 scaffolds with ~3000 gaps), using PacBio long reads for gap filling, I got a memory corruption error:

    Command:
    Code:
    nohup Jelly.py assembly Protocol.xml -x "--nproc=8" &
    Error:
    Code:
    ======= Backtrace: =========
    /lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fe21cf7cb96]
    make-consensus[0x40ce91]
    make-consensus[0x427e89]
    make-consensus[0x409d1c]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fe21cf1f76d]
    make-consensus[0x406439]
    ======= Memory map: ========
    00400000-0047c000 r-xp 00000000 08:21 113928                             /data2/smart/smrtanalysis-1.4.0/analysis/bin/make-consensus
    0067b000-0067d000 r--p 0007b000 08:21 113928                             /data2/smart/smrtanalysis-1.4.0/analysis/bin/make-consensus
    0067d000-0067e000 rw-p 0007d000 08:21 113928                             /data2/smart/smrtanalysis-1.4.0/analysis/bin/make-consensus
    0067e000-0067f000 rw-p 00000000 00:00 0 
    01b28000-01b8e000 rw-p 00000000 00:00 0                                  [heap]
    7fe21cce8000-7fe21ccfd000 r-xp 00000000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21ccfd000-7fe21cefc000 ---p 00015000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21cefc000-7fe21cefd000 r--p 00014000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21cefd000-7fe21cefe000 rw-p 00015000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21cefe000-7fe21d0b3000 r-xp 00000000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d0b3000-7fe21d2b2000 ---p 001b5000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d2b2000-7fe21d2b6000 r--p 001b4000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d2b6000-7fe21d2b8000 rw-p 001b8000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d2b8000-7fe21d2bd000 rw-p 00000000 00:00 0 
    7fe21d2bd000-7fe21d3b8000 r-xp 00000000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d3b8000-7fe21d5b7000 ---p 000fb000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d5b7000-7fe21d5b8000 r--p 000fa000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d5b8000-7fe21d5b9000 rw-p 000fb000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d5b9000-7fe21d69b000 r-xp 00000000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d69b000-7fe21d89a000 ---p 000e2000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d89a000-7fe21d8a2000 r--p 000e1000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d8a2000-7fe21d8a4000 rw-p 000e9000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d8a4000-7fe21d8ba000 rw-p 00000000 00:00 0 
    7fe21d8c0000-7fe21d8d8000 r-xp 00000000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21d8d8000-7fe21dad7000 ---p 00018000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21dad7000-7fe21dad8000 r--p 00017000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21dad8000-7fe21dad9000 rw-p 00018000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21dad9000-7fe21dafb000 r-xp 00000000 08:01 933924                     /lib/x86_64-linux-gnu/ld-2.15.so
    7fe21dcf4000-7fe21dcfb000 rw-p 00000000 00:00 0 
    7fe21dcfb000-7fe21dcfc000 r--p 00022000 08:01 933924                     /lib/x86_64-linux-gnu/ld-2.15.so
    7fe21dcfc000-7fe21dcfe000 rw-p 00023000 08:01 933924                     /lib/x86_64-linux-gnu/ld-2.15.so
    7fffcf60a000-7fffcf62c000 rw-p 00000000 00:00 0                          [stack]
    7fffcf78b000-7fffcf78c000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
    *** glibc detected *** make-consensus: free(): invalid next size (fast): 0x0000000000a0e5f0 ***
    ======= Backtrace: =========
    /lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7f166a931b96]
    make-consensus[0x43e287]
    make-consensus[0x43a1fc]
    make-consensus[0x409cb1]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f166a8d476d]
    make-consensus[0x406439]
    ======= Memory map: ========
    Any suggestions to resolve this? I know PBJelly is not designed to be memory-efficient, but any suggestions regarding memory or parameter optimization would be helpful.

    Thanks

  • #2
    Well,

    Replying to my own thread, since it might be useful for others. It seems PBJelly is not yet optimized to make full use of the available memory; there may be a memory bug causing the above error.

    Despite the errors above, the gap-filling process completed and produced decent results. If you hit the same error, be patient and let it proceed.

    Thanks

    • #3
      Hi,
      I also got this error a few times.
      It is still a problem, though: the errors you see come from an individual gap-closure assembly operation that failed. You can see these failed ones in assembly.out as well.
      I tried different operating systems, including Ubuntu 8 and 12 and CentOS 5 and 6. On all of them, PBJelly either failed to run at all or failed with these same messages.
      I also emailed the author of PBJelly, and it seems make-consensus crashes in 90% of these cases, so you can try to manually restart the failed jobs. In my experience, however, that's a hassle. To be clear: normally >90% of the gap-filling assemblies succeed, but the ones that fail tend to fail almost every time you re-run PBJelly.

      These errors also make your run take a very long time. Because these assemblies fail abnormally, PBJelly doesn't see them as finished, so it waits 1440 minutes before the stalled ones are killed. I changed the timeout in the code to 2 minutes (individual assemblies only take a few seconds in my case), and the whole PBJelly run was done in 12 minutes instead of 2 hours. However, you have to set the timeout back to normal before running the mapping/blasr step; that will certainly take more than 2 minutes.
      code change:
      /PBJelly_12.9.14/CommandRunner.py #line 10 #def exe(cmd, timeout=2): #was 1440 (in minutes)
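      For reference, here is a minimal, hypothetical sketch of what such a timeout wrapper can look like. The name exe and the minutes-based timeout follow the line quoted above, but this is illustrative, not PBJelly's actual implementation:

```python
import subprocess

def exe(cmd, timeout=2):
    """Run a shell command, killing it after `timeout` minutes.

    Illustrative stand-in for CommandRunner.py's exe(); the real code differs.
    """
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout * 60)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""  # stalled job: report failure instead of hanging
```

      A stalled make-consensus call then returns after the timeout instead of blocking the whole run.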

      edit: make-consensus is the one with the problems, I think, and it's part of the open-source AMOS package, not something that belongs to PBJelly.

      • #4
        Originally posted by HenrivdGeest

        These errors also make your run take a very long time. Because these assemblies fail abnormally, PBJelly doesn't see them as finished, so it waits 1440 minutes before the stalled ones are killed. I changed the timeout in the code to 2 minutes (individual assemblies only take a few seconds in my case), and the whole PBJelly run was done in 12 minutes instead of 2 hours. However, you have to set the timeout back to normal before running the mapping/blasr step; that will certainly take more than 2 minutes.
        code change:
        /PBJelly_12.9.14/CommandRunner.py #line 10 #def exe(cmd, timeout=2): #was 1440 (in minutes)
        Thanks for this suggestion. But I guess PBJelly always runs the blasr/mapping step in order to fill the gaps, so it may not be feasible to change the timeout.
        Last edited by sagarutturkar; 02-27-2013, 06:19 AM.

        • #5
          Originally posted by sagarutturkar
          Thanks for this suggestion. But I guess PBJelly always runs the blasr/mapping step in order to fill the gaps, so it may not be feasible to change the timeout.
          If you run PBJelly, you run the mapping and the assembly steps separately, so you can change the code in between. Or add an if statement in the code itself.

          • #6
            Oh yes! That's true. I just made a generic shell script to run PBJelly and forgot about the different steps. You are right!

            • #7
              Modified CommandRunner.py worked; what exactly did it do?

              This change allowed me to complete a PBJelly run; thanks! I kept a copy of the original CommandRunner.py script as well as a modified copy. I copied the modified version (with the 2-minute timeout) over the script before the assembly step, then copied the original back afterwards.
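              The swap described above can be scripted. A rough sketch, with hypothetical file names and a throwaway temporary directory standing in for the PBJelly install:

```python
import shutil
import tempfile
from pathlib import Path

# Throwaway directory standing in for the PBJelly install; the names of the
# two saved copies are hypothetical.
root = Path(tempfile.mkdtemp())
(root / "CommandRunner.orig.py").write_text("def exe(cmd, timeout=1440): ...\n")
(root / "CommandRunner.fast.py").write_text("def exe(cmd, timeout=2): ...\n")

live = root / "CommandRunner.py"
shutil.copy(root / "CommandRunner.fast.py", live)   # before the assembly step
# ... run the PBJelly assembly step here ...
shutil.copy(root / "CommandRunner.orig.py", live)   # restore for mapping/blasr
print(live.read_text().strip())
```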

              However, how can I know whether my assemblies were behaving like yours, either succeeding within 2 minutes or destined to fail? What were you watching, and in which output/log file? Or were you tracking individual files? I'm concerned that PBJelly seems to have shrunk a fraction of the scaffolding gaps I had in a CLC assembly, but not others. I haven't yet found a gap that was filled with PacBio sequence where there were simply N's before. The most common thing I'm seeing is that scaffolding gaps flanked by low-complexity sequence (single-base repeats) have been closed to consist of just the low-complexity sequence, without the gap (I'm guessing PacBio reads were found to support this). My output.err file says this:

              2013-03-18 09:49:00,594 [INFO] Number of Gaps Addressed 824
              2013-03-18 09:49:00,594 [INFO] No Filling Metrics 349
              2013-03-18 09:49:00,594 [INFO] Filled 473
              2013-03-18 09:49:00,595 [INFO] Reduced 2
              2013-03-18 09:49:00,595 [INFO] Overfilled 1
              2013-03-18 09:49:00,595 [INFO] Loading Fasta/Qual Files
              2013-03-18 09:49:17,344 [INFO] Finished Loading Files
              2013-03-18 09:49:17,371 [INFO] Closed ref0000087_0_1
              2013-03-18 09:49:17,372 [INFO] Didn't Pass Assembly ref0000086_0_1
              ...


              Would what I'm describing above fall into the "Filled" category? Or "Reduced" (though I saw more than 2 such conversions)? So, can I be confident that every gap that could be filled or shrunk has been? I also see later comments like:


              Trimmed 1 from 5' end of ref0000085_5
              or
              Closed ref0000085_10_11


              Do these relate to gap coordinates (seems unlikely), or to gap "numbers" assigned by PBJelly?

              Any guidance would be appreciated.
              ~Joe

              • #8
                I think you should look in that output log file for "Didn't Pass Assembly"; those are the assemblies that failed, whether due to the crash or to another issue I'm not aware of. At least, that's how I interpreted those messages.
                The ones with a closed gap have a message like "[INFO] Closed ref0001480_0_1". You could try visualizing a closed gap in some viewer (not easy: map the reads to the scaffold with blasr, map the new contig with blasr as well, and generate a SAM/BAM file from this).

                You could also try to re-run PBJelly on the output PBJelly generated and see what comes out. I asked the author of PBJelly (Adam English) about that, and he replied:

                So, yes, you can run PBJelly iteratively. However! You should be aware that when you use the same dataset multiple times, you run the risk that a read which closed one gap could -- somehow -- map elsewhere and close another gap. I have the script exclude.py that removes all reads that previously supported gaps, but this won't be incredibly helpful to you because you're trying to fix gaps that failed assembly - but may succeed a second time. You see, the reads that support these gaps will be excluded by the script. I have plans of making better read accounting for multiple iterations, but this isn't implemented yet.

                I still don't know exactly why this iterative approach wouldn't work, since my initial assembly could easily be equal to one of the PBJelly outputs. The risk of reads closing multiple gaps is always an issue, I would say.

                About the 2-minute timeout: it's easy to see whether that's enough. In top or htop (F5, tree view) you can see the CPU time the process takes up; in my case it was at zero percent after 20 seconds. If it's still at 100% CPU after 2 minutes, you should clearly make the timeout longer.

                • #9
                  PBJelly Fix

                  Hello,

                  The way to fix this glibc 'memory corruption' problem is to change one small piece of the code. When PBJelly 12.9.14 was being developed, blasr couldn't handle IUB nucleotide codes, so PBJelly removed them from the local assembly 'seed' contigs; however, when removing the 'N's it didn't remove the corresponding quality scores. This created a problem -- for example:
                  @fasta
                  ATCGGACT
                  +
                  )#()@(!)#(!))(0

                  You can see they're of different lengths, and this mismatch caused AMOS to break.
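                  The mismatch is easy to check programmatically. A small sanity check, using the illustrative record above:

```python
def record_is_valid(seq, qual):
    """A well-formed FASTQ record needs a quality string exactly as long
    as its sequence."""
    return len(seq) == len(qual)

# The record from the example above: 'N's stripped from the sequence, but
# their quality scores left behind.
seq = "ATCGGACT"           # 8 bases
qual = ")#()@(!)#(!))(0"   # 15 quality characters
print(record_is_valid(seq, qual))  # the state that broke make-consensus
```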

                  To fix this:
                  At line 178 in Extraction.py, change the cleanSequence method so that it no longer strips the 'N's:
                  Code:
                      
                      def cleanSequence(self,seq):
                          """
                          Remove IUB characters from the reference since
                          we use contigs as part of the input reference for
                          the local assembly and blasr doesn't like them
                          """
                          seq = seq.replace('M','C')
                          seq = seq.replace('R','A')
                          seq = seq.replace('W','T')
                          seq = seq.replace('S','G')
                          seq = seq.replace('Y','C')
                          seq = seq.replace('K','T')
                          seq = seq.replace('V','G')
                          seq = seq.replace('H','A')
                          return seq  # <- Don't replace the 'N's since blasr is fixed
                          seq = seq.replace('N','')
                          return seq
                  Obviously, you could also just delete
                  Code:
                  seq = seq.replace('N','')
                  as well; I just left it in as a point of reference.
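                  To see the effect of the fixed method, here is a standalone rendition (the class wrapper is dropped and the replacements are written as a loop; the behavior matches the code above):

```python
def clean_sequence(seq):
    """Replace IUB ambiguity codes with a concrete base, but keep the 'N's
    so the sequence stays the same length as its quality string."""
    for iub, base in [('M', 'C'), ('R', 'A'), ('W', 'T'), ('S', 'G'),
                      ('Y', 'C'), ('K', 'T'), ('V', 'G'), ('H', 'A')]:
        seq = seq.replace(iub, base)
    return seq

print(clean_sequence("ATMRNKW"))  # -> ATCANTT (same length, 'N' kept)
```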
                  I won't be releasing a fix for this issue in PBJelly12.9.14 because I'm about to finish a completely refactored version of PBJelly that doesn't make the same mistake but does fill Inter-Scaffold gaps.
                  Last edited by ACEnglish; 05-20-2013, 12:59 PM.

                  • #10
                    Originally posted by ACEnglish
                    I won't be releasing a fix for this issue in PBJelly12.9.14 because I'm about to finish a completely refactored version of PBJelly that doesn't make the same mistake but does fill Inter-Scaffold gaps.
                    This sounds very interesting. Do I understand correctly that PBJelly will then be able to act as a 'scaffolder' using the PacBio reads?

                    Do you envisage this being released in 2013?

                    • #11
                      Originally posted by SLB
                      ...PBJelly will then be able to act as a 'scaffolder' using the pacbio reads.

                      Do you envisage this being released in 2013?
                      Correct. Inter-scaffold gap filling - pretty much a scaffolder. I hope to have it finished by August at the latest.

                      • #12
                        What type of PacBio reads is recommended for PBJelly? So far, I have been using the native PacBio files, i.e. filtered.subreads.fasta. However, would using corrected PacBio reads be beneficial? Has anybody used corrected reads and gotten better results?

                        • #13
                          I've never tried PBJelly with corrected reads. Since those reads will usually be shorter, the chance of finding an overlap is also smaller; if they aren't shorter, I don't see any drawback to using corrected reads.
                          In case they are shorter, I would suggest letting PBJelly fill gaps with the raw reads and then running Quiver to perform base-error correction on the new consensus.


                          The code change helped a lot: after filling 7500 gaps, I got only a single "Error in `make-consensus': double free or corruption" error. Previously I got many more.

                          However, I am still wondering whether PBJelly could fire up multiple make-consensus processes at once. Right now it runs them one by one, and nproc=n does not seem to make any difference. A cluster would help here, but I think PBJelly could also just launch multiple gap-filling processes in parallel. Does anyone know how to do this?

                          • #14
                            My first attempt to address 90K gaps wasn't so successful.

                            2013-07-23 14:26:43,585 [INFO] Number of Gaps Addressed 31998
                            2013-07-23 14:26:43,585 [INFO] No Filling Metrics 31376
                            2013-07-23 14:26:43,585 [INFO] Filled 208
                            2013-07-23 14:26:43,585 [INFO] Reduced 414
                            2013-07-23 14:26:43,585 [INFO] Overfilled 84

                            What are the main reasons for
                            (1) [INFO] Skipping Singleton ref0015196
                            (2) [INFO] Didn't Address ref0002885_0_1
                            (3) [INFO] Didn't Pass Assembly ref0002885_3_4

                            I guess (1) is due to a single PacBio read addressing the gap, but I'm not sure about the other two. Does anyone know?

                            • #15
                              "Skipping Singleton" means a scaffold without a gap.
                              "Didn't Address" means no reads mapped in a way that supports the gap (into/over the gap).
                              "Didn't Pass Assembly" means a good assembly couldn't be created.
                              Those numbers look much lower than what I'd expect.

                              With 90K gaps and only ~30K addressed, you must have low input PacBio coverage, yes? If not, there's an issue worth investigating.
                              I would also check the logs of some of the "Didn't Pass Assembly" gaps; maybe there are clues as to why so many failed.

                              PBJelly2 will be out at or around September 13th.
