**gringer** · 11-08-2011, 01:01 AM

It might be possible to shoehorn Ray into doing something like the 'Inchworm' part of Trinity:

http://trinityrnaseq.sourceforge.net/

I've had a bit of a hiatus from work on Ray due to additional projects, but I'm interested in seeing if this will work because the current transcriptome assembly programs have really high memory requirements. The memory requirements are odd because the transcript graphs should be simpler (fewer repeats because you're making things like proteins, so branches should be mostly due to different isoforms), and the transcriptome size is smaller than the genome size.

My guess is trying something like disabling the genome coverage graph functions -- with RNASeq the mean coverage is per-transcript, but there can be within-transcript bias -- and writing out sequences that have some minimum coverage level based on the average coverage for each disconnected graph.

**santiagosnchez** · 02-11-2012, 10:49 AM

Ray error message: Fatal error

Hi Sébastien,

I've been using Ray to assemble a 30-50 Mb fungal genome from 454 and PE Illumina reads. When I was testing the software with raw reads I had no trouble and the assembly carried on correctly. The problem arose when I quality filtered all the reads and created new fasta and fastq files. I'm pasting the error message here:

What could the problem be?

Cheers,
Santiago

Rank 5: gathering scaffold links [1/3559] [1/28971]
Rank 2: gathering scaffold links [1/3854] [1/30494]
Rank 4: gathering scaffold links [1/3726] [1/56682]
Fatal Error: ReadIndex: 18854336 but Reads: 18635750
Ray: code/communication/MessageProcessor.cpp:127: void MessageProcessor::call_RAY_MPI_TAG_GET_READ_MARKERS(Message*): Assertion `readId<(int)m_myReads->size()' failed.
[ipara:21878] *** Process received signal ***
[ipara:21878] Signal: Aborted (6)
[ipara:21878] Signal code: (-6)
[ipara:21878] [ 0] /lib/libpthread.so.0 [0x7ff0190d3a80]
[ipara:21878] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7ff018da3ed5]
[ipara:21878] [ 2] /lib/libc.so.6(abort+0x183) [0x7ff018da53f3]
[ipara:21878] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7ff018d9cdc9]
[ipara:21878] [ 4] Ray(_ZN16MessageProcessor33call_RAY_MPI_TAG_GET_READ_MARKERSEP7Message+0x454) [0x43fa74]
[ipara:21878] [ 5] Ray(_ZN7Machine10runVanillaEv+0x99) [0x454f19]
[ipara:21878] [ 6] Ray(_ZN7Machine5startEv+0x1031) [0x456c51]
[ipara:21878] [ 7] Ray(main+0x3c) [0x4c0abc]
[ipara:21878] [ 8] /lib/libc.so.6(__libc_start_main+0xe6) [0x7ff018d901a6]
[ipara:21878] [ 9] Ray(__gxx_personality_v0+0x201) [0x42cd09]
[ipara:21878] *** End of error message ***
mpiexec noticed that job rank 0 with PID 21872 on node ipara exited on signal 15 (Terminated).
6 additional processes aborted (not shown)

**seb567** · 02-15-2012, 02:03 PM

Originally posted by gringer View Post

It might be possible to shoehorn Ray into doing something like the 'Inchworm' part of Trinity:

http://trinityrnaseq.sourceforge.net/

I've had a bit of a hiatus from work on Ray due to additional projects, but I'm interested in seeing if this will work because the current transcriptome assembly programs have really high memory requirements. The memory requirements are odd because the transcript graphs should be simpler (fewer repeats because you're making things like proteins, so branches should be mostly due to different isoforms), and the transcriptome size is smaller than the genome size.

My guess is trying something like disabling the genome coverage graph functions -- with RNASeq the mean coverage is per-transcript, but there can be within-transcript bias -- and writing out sequences that have some minimum coverage level based on the average coverage for each disconnected graph.

Hello,

I don't think we can assume that each transcript will be a disconnected-from-the-rest component in the graph.

Also, I think you should work with the mode k-mer coverage, not the mean k-mer coverage because the mean will be artificially increased by repeats.
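To illustrate the mode-versus-mean point, here is a minimal Python sketch. The histogram values are made up for illustration, and `mean_and_mode` is a hypothetical helper, not part of Ray. On real data the k-mers at depth 1 or 2 (mostly sequencing errors) would probably need to be excluded before taking the mode.

```python
def mean_and_mode(histogram):
    """histogram maps k-mer coverage depth -> number of k-mers seen at that depth."""
    total = sum(histogram.values())
    mean = sum(depth * count for depth, count in histogram.items()) / total
    # The mode is the depth with the largest number of k-mers.
    mode = max(histogram, key=histogram.get)
    return mean, mode


# Toy histogram (invented numbers): most k-mers sit near depth 20,
# while a small set of repeated k-mers sits at depth 400.
toy = {18: 100, 19: 300, 20: 500, 21: 300, 22: 100, 400: 50}
mean, mode = mean_and_mode(toy)
print(f"mean = {mean:.1f}, mode = {mode}")
```

The few repeated k-mers pull the mean well above the depth at which most k-mers sit, while the mode stays at the bulk of the distribution.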

We tested Ray on the Schizosaccharomyces pombe dataset from the Trinity paper.

Ray is quite good but presently we are focusing on assembly of metagenomes and biological abundances using virtual colors.

Sébastien

**seb567** · 02-15-2012, 02:10 PM

Originally posted by santiagosnchez View Post

Hi Sébastien,

I've been using Ray to assemble a 30-50 Mb fungal genome from 454 and PE Illumina reads. When I was testing the software with raw reads I had no trouble and the assembly carried on correctly. The problem arose when I quality filtered all the reads and created new fasta and fastq files. I'm pasting the error message here:

What could the problem be?

Cheers,
Santiago

Rank 5: gathering scaffold links [1/3559] [1/28971]
Rank 2: gathering scaffold links [1/3854] [1/30494]
Rank 4: gathering scaffold links [1/3726] [1/56682]
Fatal Error: ReadIndex: 18854336 but Reads: 18635750
Ray: code/communication/MessageProcessor.cpp:127: void MessageProcessor::call_RAY_MPI_TAG_GET_READ_MARKERS(Message*): Assertion `readId<(int)m_myReads->size()' failed.
[ipara:21878] *** Process received signal ***
[ipara:21878] Signal: Aborted (6)
[ipara:21878] Signal code: (-6)
[ipara:21878] [ 0] /lib/libpthread.so.0 [0x7ff0190d3a80]
[ipara:21878] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7ff018da3ed5]
[ipara:21878] [ 2] /lib/libc.so.6(abort+0x183) [0x7ff018da53f3]
[ipara:21878] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7ff018d9cdc9]
[ipara:21878] [ 4] Ray(_ZN16MessageProcessor33call_RAY_MPI_TAG_GET_READ_MARKERSEP7Message+0x454) [0x43fa74]
[ipara:21878] [ 5] Ray(_ZN7Machine10runVanillaEv+0x99) [0x454f19]
[ipara:21878] [ 6] Ray(_ZN7Machine5startEv+0x1031) [0x456c51]
[ipara:21878] [ 7] Ray(main+0x3c) [0x4c0abc]
[ipara:21878] [ 8] /lib/libc.so.6(__libc_start_main+0xe6) [0x7ff018d901a6]
[ipara:21878] [ 9] Ray(__gxx_personality_v0+0x201) [0x42cd09]
[ipara:21878] *** End of error message ***
mpiexec noticed that job rank 0 with PID 21872 on node ipara exited on signal 15 (Terminated).
6 additional processes aborted (not shown)

Paired reads are usually stored in two files. For any pair of files, both files must contain the same number of sequences.

I suspect that the resulting fastq files you generated (after filtering) don't have a coherent number of sequences.

This is due to the fact that for any pair of sequences, 0, 1 or 2 sequences can be filtered out. In the 0 and 2 cases, there is no problem because it is a 'remove all' or a 'keep all' scenario.

But when only 1 sequence is filtered out, its twin should also be filtered out or perhaps put aside in a file containing 'alone' sequences.

The problem arises because Ray utilises Unique Sequencer Identifiers, which are computed from the initial partition (fastq identifiers are not utilised at all).

The problem will go away should you provide Ray with a coherent sequence count for each file.
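A quick way to check this before launching an assembly is to count the records in each file of a pair. The sketch below is not part of Ray; it assumes plain (uncompressed) FASTQ with exactly four lines per record, and the function names are hypothetical.

```python
def count_fastq_records(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    with open(path) as handle:
        lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def check_pair(left_path, right_path):
    """Return (left count, right count, whether the two counts match)."""
    n_left = count_fastq_records(left_path)
    n_right = count_fastq_records(right_path)
    return n_left, n_right, n_left == n_right
```

If the third value is `False`, the two files are out of sync and at least one read has lost its mate during filtering.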

Sébastien

**santiagosnchez** · 02-15-2012, 02:20 PM

Thanks for replying Sebastien,

I figured out the problem right after my post. Do you recommend a way to exclude / delete unpaired filtered reads from each file? I've been trying to find some scripts, but no luck.

By the way, excellent program(!), by far the best assembler I've used.

Cheers,
Santiago

**jtladner** · 03-06-2012, 12:48 PM

Ray - Coverage too high

Hello, I have been using Ray for the de novo assembly of several bacterial genomes. Overall it seems to be a really good program that has been giving me longer contigs than SOAPdenovo.

However, recently I ran into an error that seems to be due to genome coverage that is too high:

Rank 0: the minimum coverage is 2
Rank 0: the peak coverage is 2
Rank 0: Assembler panic: no peak observed in the k-mer coverage distribution.
Rank 0: to deal with the sequencing error rate, try to lower the k-mer length (-k)

At first I thought that I had the opposite problem, not enough coverage. I tried to lower the k as suggested, but I kept getting the same error. The only way I have been able to get Ray to run on this dataset is to either decrease the number of sequences that I am inputting into the program (in which case I get very good contigs) or increase the k-mer length to very high values (e.g., 63).

If possible, could you explain why high coverage would result in this type of error?

And can you provide guidelines for the optimal genome coverage for Ray?

Thank you.

Jason

**santiagosnchez** · 03-23-2012, 06:36 AM

Sébastien,

Is there a way to reuse some of Ray's output files in order to avoid some of the initial computations on the same data?

Cheers,
Santiago

**seb567** · 04-05-2012, 07:47 AM

Originally posted by santiagosnchez View Post

Thanks for replying Sebastien,

I figured out the problem right after my post. Do you recommend a way to exclude / delete unpaired filtered reads from each file? I've been trying to find some scripts, but no luck.

By the way, excellent program(!), by far the best assembler I've used.

Cheers,
Santiago

I don't know any particularly good program for this precise task.
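For anyone needing this, a small script can do it. The following Python sketch is only one possible approach, under these assumptions: both files are plain four-line FASTQ, mates share the same read name up to a trailing `/1` or `/2` (or up to the first whitespace), read names are unique within a file, and the filtering kept the original read order. All function names and output file names are hypothetical.

```python
def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from four-line FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def pair_key(header):
    """Drop '@', anything after whitespace, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def collect_keys(path):
    """First pass: the set of read names present in one file."""
    with open(path) as handle:
        return {pair_key(record[0]) for record in read_fastq(handle)}


def split_file(path, shared, paired_out, single_out):
    """Second pass: reads whose mate survived go to paired_out, others to single_out."""
    kept = 0
    with open(path) as src, open(paired_out, "w") as paired, open(single_out, "w") as single:
        for record in read_fastq(src):
            if pair_key(record[0]) in shared:
                paired.write("\n".join(record) + "\n")
                kept += 1
            else:
                single.write("\n".join(record) + "\n")
    return kept


def resync(left, right, prefix):
    """Write <prefix>_1/_2 .paired.fastq and .single.fastq; return paired counts."""
    shared = collect_keys(left) & collect_keys(right)
    n_left = split_file(left, shared, f"{prefix}_1.paired.fastq", f"{prefix}_1.single.fastq")
    n_right = split_file(right, shared, f"{prefix}_2.paired.fastq", f"{prefix}_2.single.fastq")
    return n_left, n_right
```

Calling `resync("filtered_1.fastq", "filtered_2.fastq", "synced")` (file names are placeholders) should return two equal counts. The two `.paired.fastq` files then hold the same number of sequences in the same order, and the orphaned reads end up in the `.single.fastq` files. The name sets are held in memory, so very large datasets may need a streaming approach instead.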

**seb567** · 04-05-2012, 08:17 AM

Originally posted by jtladner View Post

Hello, I have been using Ray for the de novo assembly of several bacterial genomes. Overall it seems to be a really good program that has been giving me longer contigs than SOAPdenovo.

However, recently I ran into an error that seems to be due to genome coverage that is too high:

Rank 0: the minimum coverage is 2
Rank 0: the peak coverage is 2
Rank 0: Assembler panic: no peak observed in the k-mer coverage distribution.
Rank 0: to deal with the sequencing error rate, try to lower the k-mer length (-k)

This limitation was removed in the Release of Ray 2.0-Release Candidate 5.

You can try Ray 2.0-rc5.

We modified this to enable metagenome assemblies.

Originally posted by jtladner View Post

At first I thought that I had the opposite problem, not enough coverage. I tried to lower the k as suggested, but I kept getting the same error. The only way I have been able to get Ray to run on this dataset is to either decrease the number of sequences that I am inputting into the program (in which case I get very good contigs) or increase the k-mer length to very high values (e.g., 63).

If you plot the coverage distribution, I am sure you will see something that is not smooth, yet I am sure you will see a sizable peak.

To plot your data, enter these commands in your terminal:

Code:

cd Place-Where-My-Assembly-Is-Located
ls CoverageDistribution.txt # make sure you are in the right place
R --vanilla

# the next commands will be given to R
data=read.table('CoverageDistribution.txt',header=TRUE)
pdf('MyCoverageFrequencies.pdf')
plot(data[,1],data[,2],xlab='k-mer coverage depth',ylab='Frequency',log='xy',type='l')
dev.off()

There is also a fancy script that ships with Ray that does that automatically.

Code:

~/git-clones/ray/scripts/plot-coverage-distribution.R CoverageDistribution.txt
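If a number is more convenient than a plot, the valley and the peak of the distribution can also be located with a few lines of Python. This is only a rough sketch and not Ray's own peak-finding code. It assumes the file has a header line followed by two whitespace-separated integer columns (coverage depth, frequency), as the `read.table` call above suggests; the function names are hypothetical.

```python
def load_distribution(lines):
    """Parse (depth, frequency) rows, skipping the header and blank lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), int(fields[1])))
    return sorted(rows)


def valley_and_peak(rows):
    """Return (valley depth, peak depth).

    The valley is the first depth at which the frequency stops decreasing
    (the end of the low-coverage error k-mers); the peak is the most
    frequent depth at or after the valley.
    """
    if not rows:
        raise ValueError("empty coverage distribution")
    valley = 0
    while valley + 1 < len(rows) and rows[valley + 1][1] < rows[valley][1]:
        valley += 1
    peak = max(range(valley, len(rows)), key=lambda i: rows[i][1])
    return rows[valley][0], rows[peak][0]


if __name__ == "__main__":
    import os
    if os.path.exists("CoverageDistribution.txt"):
        with open("CoverageDistribution.txt") as handle:
            print(valley_and_peak(load_distribution(handle)))
```

When both returned values are equal, the frequencies never rise again after the initial drop, which means this simple scan found no peak. A jagged distribution can also fool it, so the plot remains the more reliable check.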

Originally posted by jtladner View Post

If possible, could you explain why high coverage would result in this type of error?

We bought an Illumina HiSeq 1000 at our institution.

One of the acceptance tests was to do a whole lane of PhiX, a virus whose genome has just 5386 nucleotides.

The coverage distribution was ridiculous:

If we zoom in, we can see that the peak is not smooth.

This *may* be caused by cluster complexity on the flow cell.

*Maybe* your data look like this also, maybe not.

Originally posted by jtladner View Post

And can you provide guidelines for the optimal genome coverage for Ray?

As the saying goes, "the more, the better."

You should plot your distributions to assess the quality of your data.
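For a rough idea of how much coverage a dataset holds before plotting anything, the usual back-of-the-envelope formulas are base coverage C = N × L / G and k-mer coverage C × (L − k + 1) / L, for N reads of length L over a genome of size G. The Python sketch below just encodes that arithmetic. It is not taken from Ray, the function names are hypothetical, and the read count in the example is invented; only the PhiX genome size comes from this thread.

```python
import math


def base_coverage(n_reads, read_length, genome_size):
    """Expected per-base coverage: C = N * L / G."""
    return n_reads * read_length / genome_size


def kmer_coverage(base_cov, read_length, k):
    """Expected k-mer coverage: each read of length L holds L - k + 1 k-mers."""
    return base_cov * (read_length - k + 1) / read_length


def reads_for_target(target_cov, read_length, genome_size):
    """Number of reads to keep when subsampling down to a target base coverage."""
    return math.ceil(target_cov * genome_size / read_length)


# Hypothetical example: one million 100-nt reads on the 5386-nt PhiX genome.
cov = base_coverage(1_000_000, 100, 5386)
print(f"base coverage ~ {cov:.0f}x, k-mer coverage at k=21 ~ {kmer_coverage(cov, 100, 21):.0f}x")
```

`reads_for_target` can help when subsampling a very deep dataset, as in the workaround of feeding fewer sequences to the assembler.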

Originally posted by jtladner View Post

Thank you.

Jason

**seb567** · 04-05-2012, 08:23 AM

Greetings !

Originally posted by santiagosnchez View Post

Sébastien,

Is there a way to reuse some of Ray's output files in order to avoid some of the initial computations on the same data?

Cheers,
Santiago

Yes, they are called checkpoints.

You just have to add -read-write-checkpoints

However, note that checkpoint files (they are binary and have the .ray extension) are only valid with the same command using the same data with the same number of MPI ranks.

This mechanism is a checkpointing facility.

Code:

mpiexec -n 1 Ray -help | less

  Checkpointing

       -write-checkpoints
              Write checkpoint files

       -read-checkpoints
              Read checkpoint files

       -read-write-checkpoints
              Read and write checkpoint files

**santiagosnchez** · 04-05-2012, 12:46 PM

So this could be achieved by typing something like:

mpiexec -n <#> Ray -o <$$$$> -read-checkpoints
(after you did a run with -write-checkpoints)

Is it possible to change the k-mer size for instance?

Thanks,

Santiago

**Anelda** · 04-05-2012, 05:17 PM

RAY on colourspace

Hi there,

Do you have any news on the colourspace issue? We ran RAY today for the first time and were very impressed, except that we mostly deal with SOLiD data and would need the contigs in base space eventually :-)

Thanks!

Anelda

**steph** · 06-05-2012, 04:37 AM

Problem at compilation with latest GCC version

Hi everyone,

I encountered a problem when trying to build the latest stable version of Ray (1.7) with the latest version of GCC (v4.7.0).

The problem occurred at the make step.

With GCC v4.7.0, I got the following errors:

Code:

code/communication/MessageProcessor.cpp: In member function 'void MessageProcessor::call_RAY_MPI_TAG_ASK_VERTEX_PATH(Message*)':
code/communication/MessageProcessor.cpp:1685:7: error: redeclaration of 'int i'
code/communication/MessageProcessor.cpp:1675:10: error: 'int i' previously declared here
make: *** [code/communication/MessageProcessor.o] Error

However, when I used GCC v4.1.2 (which was also installed on this machine) instead, the installation finished correctly.

**gringer** · 06-05-2012, 05:00 AM

That's because the more recent versions of GCC do more code checking. Redeclaring variables introduces some scoping issues, and usually means that the coder hasn't realised there's an ambiguity. Luckily, these redeclaration errors are usually easily fixed, for example by changing the name of the inner loop variable to j instead of i.

**seb567** · 06-05-2012, 07:00 AM

Originally posted by santiagosnchez View Post

So this could be achieved by typing something like:

mpiexec -n <#> Ray -o <$$$$> -read-checkpoints
(after you did a run with -write-checkpoints)

Is it possible to change the k-mer size for instance?

Thanks,

Santiago

No, you cannot change the k-mer size if you use the same checkpoint files.

There is also the option -read-write-checkpoints, which both reads and writes these checkpoints.
