
  • PBJelly

    Hi,

    I have the SMRT Analysis and PBJelly packages installed, and I was able to run PBJelly with the test data provided with the software. However, when I ran it on a bacterial genome (~10 Mb, ~350 scaffolds with ~3000 gaps), using PacBio long reads for gap filling, I got a memory corruption error:

    Command:
    Code:
    nohup Jelly.py assembly Protocol.xml -x "--nproc=8" &
    Error:
    Code:
    ======= Backtrace: =========
    /lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fe21cf7cb96]
    make-consensus[0x40ce91]
    make-consensus[0x427e89]
    make-consensus[0x409d1c]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fe21cf1f76d]
    make-consensus[0x406439]
    ======= Memory map: ========
    00400000-0047c000 r-xp 00000000 08:21 113928                             /data2/smart/smrtanalysis-1.4.0/analysis/bin/make-consensus
    0067b000-0067d000 r--p 0007b000 08:21 113928                             /data2/smart/smrtanalysis-1.4.0/analysis/bin/make-consensus
    0067d000-0067e000 rw-p 0007d000 08:21 113928                             /data2/smart/smrtanalysis-1.4.0/analysis/bin/make-consensus
    0067e000-0067f000 rw-p 00000000 00:00 0 
    01b28000-01b8e000 rw-p 00000000 00:00 0                                  [heap]
    7fe21cce8000-7fe21ccfd000 r-xp 00000000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21ccfd000-7fe21cefc000 ---p 00015000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21cefc000-7fe21cefd000 r--p 00014000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21cefd000-7fe21cefe000 rw-p 00015000 08:01 933932                     /lib/x86_64-linux-gnu/libgcc_s.so.1
    7fe21cefe000-7fe21d0b3000 r-xp 00000000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d0b3000-7fe21d2b2000 ---p 001b5000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d2b2000-7fe21d2b6000 r--p 001b4000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d2b6000-7fe21d2b8000 rw-p 001b8000 08:01 933906                     /lib/x86_64-linux-gnu/libc-2.15.so
    7fe21d2b8000-7fe21d2bd000 rw-p 00000000 00:00 0 
    7fe21d2bd000-7fe21d3b8000 r-xp 00000000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d3b8000-7fe21d5b7000 ---p 000fb000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d5b7000-7fe21d5b8000 r--p 000fa000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d5b8000-7fe21d5b9000 rw-p 000fb000 08:01 933918                     /lib/x86_64-linux-gnu/libm-2.15.so
    7fe21d5b9000-7fe21d69b000 r-xp 00000000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d69b000-7fe21d89a000 ---p 000e2000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d89a000-7fe21d8a2000 r--p 000e1000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d8a2000-7fe21d8a4000 rw-p 000e9000 08:01 748647                     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
    7fe21d8a4000-7fe21d8ba000 rw-p 00000000 00:00 0 
    7fe21d8c0000-7fe21d8d8000 r-xp 00000000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21d8d8000-7fe21dad7000 ---p 00018000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21dad7000-7fe21dad8000 r--p 00017000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21dad8000-7fe21dad9000 rw-p 00018000 08:21 119775                     /data2/smart/smrtanalysis-1.4.0/common/lib/libz.so.1
    7fe21dad9000-7fe21dafb000 r-xp 00000000 08:01 933924                     /lib/x86_64-linux-gnu/ld-2.15.so
    7fe21dcf4000-7fe21dcfb000 rw-p 00000000 00:00 0 
    7fe21dcfb000-7fe21dcfc000 r--p 00022000 08:01 933924                     /lib/x86_64-linux-gnu/ld-2.15.so
    7fe21dcfc000-7fe21dcfe000 rw-p 00023000 08:01 933924                     /lib/x86_64-linux-gnu/ld-2.15.so
    7fffcf60a000-7fffcf62c000 rw-p 00000000 00:00 0                          [stack]
    7fffcf78b000-7fffcf78c000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
    *** glibc detected *** make-consensus: free(): invalid next size (fast): 0x0000000000a0e5f0 ***
    ======= Backtrace: =========
    /lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7f166a931b96]
    make-consensus[0x43e287]
    make-consensus[0x43a1fc]
    make-consensus[0x409cb1]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f166a8d476d]
    make-consensus[0x406439]
    ======= Memory map: ========
    Any suggestions to resolve this? I know PBJelly is not designed to be memory-efficient, but any suggestions regarding memory or parameter optimization would be helpful.

    Thanks

  • #2
    Well,

    Replying to my own thread, since it might be useful for others. It seems PBJelly is not yet optimized to make full use of the available memory; there may be a memory bug causing the above error.

    Despite the errors above, the gap-filling process completed and produced decent results. If you hit the same error, be patient and let it proceed.

    Thanks

    • #3
      Hi,
      I also got this error a few times.
      It is still a problem, though: the errors you see come from an individual gap-closure assembly operation that failed. You can see these failed ones in assembly.out as well.
      I tried different operating systems, including Ubuntu 8 and 12 and CentOS 5 and 6. On all of them, PBJelly either failed to run at all or failed with these same messages.
      I also emailed the author of PBJelly, and it seems make-consensus crashes in 90% of these cases, so you can try to manually restart the failed jobs. In my experience, however, that's a hassle. To be clear: normally >90% of the gap-filling assemblies succeed, but the ones that fail tend to fail almost every time you re-run PBJelly.

      These errors also make your run take a very long time. Because these assemblies fail abnormally, PBJelly doesn't see them as finished, so it waits 1440 minutes before the stalled ones are killed. I changed the timeout in the code to 2 minutes (individual assemblies only take a few seconds in my case), and the whole PBJelly run was done in 12 minutes instead of 2 hours. However, you have to set the timeout back to normal before running the mapping/blasr step; that will certainly take more than 2 minutes.
      code change:
      /PBJelly_12.9.14/CommandRunner.py #line 10 #def exe(cmd, timeout=2): #was 1440 (in minutes)
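      For reference, here is a minimal, hypothetical sketch of what such a timeout wrapper can look like. The name exe and the minutes-based timeout follow the line quoted above, but this is illustrative, not PBJelly's actual implementation:

```python
import subprocess

def exe(cmd, timeout=2):
    """Run a shell command, killing it after `timeout` minutes.

    Illustrative stand-in for CommandRunner.py's exe(); the real code differs.
    """
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout * 60)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""  # stalled job: report failure instead of hanging
```

      A stalled make-consensus call then returns after the timeout instead of blocking the whole run.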

      edit: make-consensus is the one with the problems, I think, and it's part of the open-source AMOS package, not something that belongs to PBJelly.

      • #4
        Originally posted by HenrivdGeest

        These errors also make your run take a very long time. Because these assemblies fail abnormally, PBJelly doesn't see them as finished, so it waits 1440 minutes before the stalled ones are killed. I changed the timeout in the code to 2 minutes (individual assemblies only take a few seconds in my case), and the whole PBJelly run was done in 12 minutes instead of 2 hours. However, you have to set the timeout back to normal before running the mapping/blasr step; that will certainly take more than 2 minutes.
        code change:
        /PBJelly_12.9.14/CommandRunner.py #line 10 #def exe(cmd, timeout=2): #was 1440 (in minutes)
        Thanks for this suggestion. But I guess PBJelly always runs the blasr/mapping step in order to fill the gaps, so it may not be feasible to change the timeout.
        Last edited by sagarutturkar; 02-27-2013, 06:19 AM.

        • #5
          Originally posted by sagarutturkar
          Thanks for this suggestion. But I guess PBJelly always runs the blasr/mapping step in order to fill the gaps, so it may not be feasible to change the timeout.
          If you run PBJelly, you run the mapping and the assembly steps separately, so you can change the code in between. Or add an if statement in the code itself.

          • #6
            Oh yes! That's true. I just made a generic shell script to run PBJelly and forgot about the different steps. You are right!

            • #7
              Modified CommandRunner.py worked; what exactly did it do?

              This change allowed me to complete a PBJelly run; thanks! I kept a copy of the original CommandRunner.py script as well as a modified copy. I copied the modified version (with the 2-minute timeout) over the script before the assembly step, then copied the original back afterwards.
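              The swap described above can be scripted. A rough sketch, with hypothetical file names and a throwaway temporary directory standing in for the PBJelly install:

```python
import shutil
import tempfile
from pathlib import Path

# Throwaway directory standing in for the PBJelly install; the names of the
# two saved copies are hypothetical.
root = Path(tempfile.mkdtemp())
(root / "CommandRunner.orig.py").write_text("def exe(cmd, timeout=1440): ...\n")
(root / "CommandRunner.fast.py").write_text("def exe(cmd, timeout=2): ...\n")

live = root / "CommandRunner.py"
shutil.copy(root / "CommandRunner.fast.py", live)   # before the assembly step
# ... run the PBJelly assembly step here ...
shutil.copy(root / "CommandRunner.orig.py", live)   # restore for mapping/blasr
print(live.read_text().strip())
```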

              However, how can I know whether my assemblies were behaving like yours, either succeeding within 2 minutes or destined to fail? What were you watching, and in which output/log file? Or were you tracking individual files? I'm concerned that PBJelly seems to have shrunk a fraction of the scaffolding gaps I had in a CLC assembly, but not others. I haven't yet found a gap that was filled with PacBio sequence where there were simply N's before. The most common thing I'm seeing is that scaffolding gaps flanked by low-complexity sequence (single-base repeats) have been closed to consist of just the low-complexity sequence, without the gap (I'm guessing PacBio reads were found to support this). My output.err file says this:

              2013-03-18 09:49:00,594 [INFO] Number of Gaps Addressed 824
              2013-03-18 09:49:00,594 [INFO] No Filling Metrics 349
              2013-03-18 09:49:00,594 [INFO] Filled 473
              2013-03-18 09:49:00,595 [INFO] Reduced 2
              2013-03-18 09:49:00,595 [INFO] Overfilled 1
              2013-03-18 09:49:00,595 [INFO] Loading Fasta/Qual Files
              2013-03-18 09:49:17,344 [INFO] Finished Loading Files
              2013-03-18 09:49:17,371 [INFO] Closed ref0000087_0_1
              2013-03-18 09:49:17,372 [INFO] Didn't Pass Assembly ref0000086_0_1
              ...


              Would what I'm describing above fall into the "Filled" category? Or "Reduced" (though I saw more than 2 such conversions)? So, can I be confident that every gap that could be filled or shrunk has been? I also see later comments like:


              Trimmed 1 from 5' end of ref0000085_5
              or
              Closed ref0000085_10_11


              Do these relate to gap coordinates (seems unlikely), or to gap "numbers" assigned by PBJelly?

              Any guidance would be appreciated.
              ~Joe

              • #8
                I think you should look in that output log file for "Didn't Pass Assembly"; those are the assemblies that failed, whether due to the crash or to another issue I'm not aware of. At least, that's how I interpreted those messages.
                The ones with a closed gap have a message like "[INFO] Closed ref0001480_0_1". You could try visualizing a closed gap in some viewer (not easy: map the reads to the scaffold with blasr, map the new contig with blasr as well, and generate a SAM/BAM file from this).

                You could also try to re-run PBJelly on the output PBJelly generated and see what comes out. I asked the author of PBJelly (Adam English) about that, and he replied:

                So, yes, you can run PBJelly iteratively. However! You should be aware that when you use the same dataset multiple times, you run the risk that a read which closed one gap could -- somehow -- map elsewhere and close another gap. I have the script exclude.py that removes all reads that previously supported gaps, but this won't be incredibly helpful to you because you're trying to fix gaps that failed assembly - but may succeed a second time. You see, the reads that support these gaps will be excluded by the script. I have plans of making better read accounting for multiple iterations, but this isn't implemented yet.

                I still don't know exactly why this iterative approach wouldn't work, since my initial assembly could easily be equal to one of the PBJelly outputs. The risk of reads closing multiple gaps is always an issue, I would say.

                About the 2-minute timeout: it's easy to see whether that's enough. In top or htop (F5, tree view) you can see the CPU time the process takes up; in my case it was at zero percent after 20 seconds. If it's still at 100% CPU after 2 minutes, you should clearly make the timeout longer.

                • #9
                  PBJelly Fix

                  Hello,

                  The way to fix this glibc 'memory corruption' problem is to change one small piece of the code. When PBJelly 12.9.14 was being developed, blasr couldn't handle IUB nucleotide codes, so PBJelly removed them from the local assembly 'seed' contigs; however, when removing the 'N's it didn't remove the corresponding quality scores. This created a problem -- for example:
                  @fasta
                  ATCGGACT
                  +
                  )#()@(!)#(!))(0

                  You can see they're of different lengths, and this mismatch caused AMOS to break.
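                  The mismatch is easy to check programmatically. A small sanity check, using the illustrative record above:

```python
def record_is_valid(seq, qual):
    """A well-formed FASTQ record needs a quality string exactly as long
    as its sequence."""
    return len(seq) == len(qual)

# The record from the example above: 'N's stripped from the sequence, but
# their quality scores left behind.
seq = "ATCGGACT"           # 8 bases
qual = ")#()@(!)#(!))(0"   # 15 quality characters
print(record_is_valid(seq, qual))  # the state that broke make-consensus
```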

                  To fix this:
                  At line 178 in Extraction.py, change the cleanSequence method so that it no longer strips the 'N's:
                  Code:
                      
                      def cleanSequence(self,seq):
                          """
                          Remove IUB characters from the reference since
                          we use contigs as part of the input reference for
                          the local assembly and blasr doesn't like them
                          """
                          seq = seq.replace('M','C')
                          seq = seq.replace('R','A')
                          seq = seq.replace('W','T')
                          seq = seq.replace('S','G')
                          seq = seq.replace('Y','C')
                          seq = seq.replace('K','T')
                          seq = seq.replace('V','G')
                          seq = seq.replace('H','A')
                          return seq  # <- Don't replace the 'N's since blasr is fixed
                          seq = seq.replace('N','')
                          return seq
                  Obviously, you could also just delete
                  Code:
                  seq = seq.replace('N','')
                  as well; I just left it in as a point of reference.
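                  To see the effect of the fixed method, here is a standalone rendition (the class wrapper is dropped and the replacements are written as a loop; the behavior matches the code above):

```python
def clean_sequence(seq):
    """Replace IUB ambiguity codes with a concrete base, but keep the 'N's
    so the sequence stays the same length as its quality string."""
    for iub, base in [('M', 'C'), ('R', 'A'), ('W', 'T'), ('S', 'G'),
                      ('Y', 'C'), ('K', 'T'), ('V', 'G'), ('H', 'A')]:
        seq = seq.replace(iub, base)
    return seq

print(clean_sequence("ATMRNKW"))  # -> ATCANTT (same length, 'N' kept)
```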
                  I won't be releasing a fix for this issue in PBJelly12.9.14 because I'm about to finish a completely refactored version of PBJelly that doesn't make the same mistake but does fill Inter-Scaffold gaps.
                  Last edited by ACEnglish; 05-20-2013, 12:59 PM.

                  • #10
                    Originally posted by ACEnglish
                    I won't be releasing a fix for this issue in PBJelly12.9.14 because I'm about to finish a completely refactored version of PBJelly that doesn't make the same mistake but does fill Inter-Scaffold gaps.
                    This sounds very interesting. Do I understand correctly that PBJelly will then be able to act as a 'scaffolder' using the PacBio reads?

                    Do you envisage this being released in 2013?

                    • #11
                      Originally posted by SLB
                      ...PBJelly will then be able to act as a 'scaffolder' using the pacbio reads.

                      Do you envisage this being released in 2013?
                      Correct. Inter-scaffold gap filling - pretty much a scaffolder. I hope to have it finished by August at the latest.

                      • #12
                        What type of PacBio reads is recommended for PBJelly? So far, I have been using the native PacBio files, i.e. filtered.subreads.fasta. However, would using corrected PacBio reads be beneficial? Has anybody used corrected reads and gotten better results?

                        • #13
                          I've never tried PBJelly with corrected reads. Since those reads will usually be shorter, the chance of finding an overlap is also smaller; if they aren't shorter, I don't see any drawback to using corrected reads.
                          In case they are shorter, I would suggest letting PBJelly fill gaps with the raw reads and then running Quiver to perform base-error correction on the new consensus.


                          The code change helped a lot: after filling 7500 gaps, I got only a single "Error in `make-consensus': double free or corruption" error. Previously I got many more.

                          However, I am still wondering whether PBJelly could fire up multiple make-consensus processes at once. Right now it runs them one by one, and nproc=n does not seem to make any difference. A cluster would help here, but I think PBJelly could also just launch multiple gap-filling processes in parallel. Does anyone know how to do this?

                          • #14
                            My first attempt to address 90K gaps wasn't so successful.

                            2013-07-23 14:26:43,585 [INFO] Number of Gaps Addressed 31998
                            2013-07-23 14:26:43,585 [INFO] No Filling Metrics 31376
                            2013-07-23 14:26:43,585 [INFO] Filled 208
                            2013-07-23 14:26:43,585 [INFO] Reduced 414
                            2013-07-23 14:26:43,585 [INFO] Overfilled 84

                            What are the main reasons for
                            (1) [INFO] Skipping Singleton ref0015196
                            (2) [INFO] Didn't Address ref0002885_0_1
                            (3) [INFO] Didn't Pass Assembly ref0002885_3_4

                            I guess (1) is due to a single PacBio read addressing the gap, but I'm not sure about the other two. Does anyone know?

                            • #15
                              "Skipping Singleton" means a scaffold without a gap.
                              "Didn't Address" means no reads mapped in a way that supports the gap (into/over the gap).
                              "Didn't Pass Assembly" means a good assembly couldn't be created.
                              Those numbers look much lower than what I'd expect.

                              With 90K gaps and only ~30K addressed, you must have low input PacBio coverage, yes? If not, there's an issue worth investigating.
                              I would also check the logs of some of the "Didn't Pass Assembly" gaps; maybe there are clues as to why so many failed.

                              PBJelly2 will be out at or around September 13th.
