Originally posted by corthay
View Post
I am currently modifying Ray heuristics to produce assemblies using paired sequences separated by very large physical distances.
I simulated 4 libraries from E. coli K-12 MG1655 genome sequences. These reads include 0.5% mismatch errors. ANd I used wgsim available in samtools.
Fragment standard deviation is 10% of the fragment average length.
These libraries are
200 +/- 20
1000 +/- 100
4000 +/- 400
10000 +/- 1000
Code:
/software/samtools-0.1.7/wgsim -e 0.005 -d 200 -s 20 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Streptococcus-pneumoniae-R6.fasta 200-17_1.fastq 200-17_2.fastq /software/samtools-0.1.7/wgsim -e 0.005 -d 1000 -s 100 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 1000_1.fastq 1000_2.fastq /software/samtools-0.1.7/wgsim -e 0.005 -d 4000 -s 400 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 4000_1.fastq 4000_2.fastq /software/samtools-0.1.7/wgsim -e 0.005 -d 10000 -s 1000 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 10000_1.fastq 10000_2.fastq
Then, using Ray:
Code:
mpirun -np 31 ~/Ray/trunk/code/Ray \ -p 200_1.fastq 200_2.fastq -p 1000_1.fastq 1000_2.fastq -p 4000_1.fastq 4000_2.fastq -p 10000_1.fastq 10000_2.fastq | tee log
Without scaffolding, Ray generates 65 contigs each having at least 500 nucleotides. There is a total of 4625792 assembled bases, all are A, T, C or G. The mean size is 71166, the N50 is 121756 and the largest is 371980.
This is without any scaffolding.
I then used MUMmer to validate contigs: 0 misassembly, 0 mismatch error and 0 small indel.
Code:
% & numberOfContigs & bases & meanSize & n50 & max & coverage & misassembled & mismatches & indels Ray & 65 & & 71166 & 121756 & 371980 & 0.9965 & 0 & 0 & 0 \\
Running time/31 processors:
Code:
[1,0]<stdout>: Beginning of computation: 2 seconds [1,0]<stdout>: Distribution of sequence reads: 52 seconds [1,0]<stdout>: Distribution of vertices & edges: 2 minutes, 56 seconds [1,0]<stdout>: Calculation of coverage distribution: 0 seconds [1,0]<stdout>: Indexing of sequence reads: 15 seconds [1,0]<stdout>: Computation of seeds: 27 seconds [1,0]<stdout>: Computation of library sizes: 4 minutes, 45 seconds [1,0]<stdout>: Extension of seeds: 4 minutes, 15 seconds [1,0]<stdout>: Computation of fusions: 37 seconds [1,0]<stdout>: Collection of fusions: 0 seconds [1,0]<stdout>: Completion of the assembly: 14 minutes, 9 seconds
The contig lengths:
388
641
1038
2164
2714
3438
4735
5845
7199
7939
9557
10364
11883
16323
16937
20213
24049
25788
26056
26433
27014
29449
30205
31159
33620
34875
36084
37445
38010
38550
39509
43086
43623
43841
46455
47970
58692
59083
59441
59889
64822
66096
66767
66783
67838
70713
82799
85321
94029
95691
97407
100961
102906
103167
121756
121858
138099
156439
161412
163121
169912
177402
213535
284796
318866
371980
Let me answer your question now. For seed extension, an algorithmic process that generates contigs, available versions of Ray will not accomodate very large fragment lengths, 400 at most I think.
It is because for larger distances, the right sequence in a pair most likely starts on a repeated k-mer. As such, constraints on the observed distances most be further enforced -- paired reads not satisfying the constraints must be set free in order to allow them to be used in the adequate time.
Ray 1.2.3 'cRAYfish' is set to be released soon. For this release, I want to give a running time on 512 processors for a human genome. But, for the time being, I am currently waiting for some compute time on a compute cluster.
1.2.3 fully utilizes large distances provided by jumping libraries (fosmid ends).
I also plan to incorporate a scaffolder, but that is not my priority now.
My priority is a memory utilization reducer that is utilized during the process of building the distributed de Bruijn graph.
Originally posted by corthay
View Post
Note however that Ray make full use of paired reads to traverse repeated elements in the genome. And apparently it does that very well.
Meanwhile, I suggest you use SSPACE.
Bioinformatics paper http://bioinformatics.oxfordjournals...s.btq683.short
Originally posted by corthay
View Post
Ray does not read the header in the fasta and fastq files containing sequences.
Comment