
**seb567** · 06-10-2011, 07:13 PM

Originally posted by pmiguel View Post

Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

Opinions?

--
Phillip

Ray (since v1.4.0) now includes a scaffolder (it is pretty good).

See http://denovoassembler.sourceforge.net/ (open source and well-documented !)

p.s.: I am the author of Ray (I am a PhD student).

**boetsie** · 06-11-2011, 04:52 AM

Originally posted by pmiguel View Post

Has anyone used SSPACE to scaffold Abyss data? Abyss already produces a .adj and a .dot file which might be as good as the scaffold is going to get.

Opinions?

--
Phillip

I've only tested ABYSS contigs myself for the E. coli dataset, and there it gave very good results. I do recommend filtering out small contigs first (e.g. keeping only contigs larger than 100 or 200 bp), since smaller contigs are likely to be repeats or misassemblies.
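A minimal sketch of that filtering step, assuming the contigs are in plain FASTA format. The function names `read_fasta` and `filter_contigs` and the toy records are made up for illustration; this is not part of SSPACE or ABySS.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=200):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


# Toy input (hypothetical contigs): c1 = 240 bp, c2 = 40 bp, c3 = 250 bp.
example = [">c1", "ACGT" * 60, ">c2", "ACGT" * 10, ">c3", "A" * 150, "C" * 100]
kept = filter_contigs(example, min_length=200)
kept_names = [header for header, _ in kept]
print(kept_names)
```

For a real assembly the same functions can be fed an open file handle instead of the toy list.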

For E. coli, scaffolding contigs with a minimum length of 100 bp reduced 595 contigs to 127 scaffolds. In addition, the N50 went from 18k to 94k. I checked these scaffolds with MUMmer and all were valid.
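For anyone unfamiliar with the N50 figure quoted here, a short sketch using the common definition: the length of the sequence at which the running total, taken from the longest sequence downwards, first reaches half of the total assembly size. The function name `n50` and the toy lengths are illustrative only, and individual tools may differ in small details.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Toy lengths (hypothetical): total is 290, so half is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Here 80 + 70 = 150 is the first running sum to reach 145, so the toy N50 is 70.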

I must say, I am the developer of SSPACE, so I'm a bit biased.

Another post I found about ABYSS and SSPACE:

Illumina MP library for scaffolding only

http://groups.google.com/group/abyss-users/msg/fc505cff5cb974bd

Kind regards,
Boetsie

**pmiguel** · 06-14-2011, 03:43 AM

Originally posted by seb567 View Post

Ray (since v1.4.0) now includes a scaffolder (it is pretty good).

See http://denovoassembler.sourceforge.net/ (open source and well-documented !)

p.s.: I am the author of Ray (I am a PhD student).

Hi Seb567,
We did try Ray. Maybe we did not configure the Ray assembly correctly, but our Abyss results looked much better. For instance the following command:
/programs/Ray-1.4.0/code/Ray \
-k \
43 \
-i \
../FastQ/000617_TL3360_both.fastq \
-o \
000617_TL3360

produced ~3400 contigs ranging from 130 bp to 8.6 kb, whereas Abyss produced 137 contigs ranging from 41 to 450165 bp using a similar kmer size (41).

These were 2x100 bp reads from ~350bp fragment PEs -- about 200x coverage. The DNA was from the bacterium Salmonella.

--
Phillip

**pmiguel** · 06-14-2011, 04:06 AM

Originally posted by boetsie View Post

I've only tested ABYSS contigs myself for the E. coli dataset, and there it gave very good results. I do recommend filtering out small contigs first (e.g. keeping only contigs larger than 100 or 200 bp), since smaller contigs are likely to be repeats or misassemblies.

For E. coli, scaffolding contigs with a minimum length of 100 bp reduced 595 contigs to 127 scaffolds. In addition, the N50 went from 18k to 94k. I checked these scaffolds with MUMmer and all were valid.

I must say, I am the developer of SSPACE, so I'm a bit biased.

[...]

Kind regards,
Boetsie

Hi Boetsie,

Yes, I should try it.

After Abyss alone, our N50 for contigs >200 bases is already 17.5kb. (77 contigs, range 214-389830 bases, mean 58691 bases.) This was with setting the kmer higher (63) than the example I gave in the post above.

I will post here the results after SSPACE.

--
Phillip

**pmiguel** · 06-14-2011, 11:32 AM

Hi Boetsie,
Okay I ran SSPACE. Only one mysterious glitch in getting it to run (described below). I filtered my contigs by removing any shorter than 200 bases prior to running. Here are the initial and final results:

Inserted contig file;
Total number of contigs = 77
Sum (bp) = 5456937
Max contig size = 389830
Min contig size = 214
Average contig size = 70869
N50 = 225952

After extension;
Total number of contigs = 77
Sum (bp) = 5456953
Max contig size = 389830
Min contig size = 222
Average contig size = 70869
N50 = 225952

After scaffolding lib1:
Total number of scaffolds = 69
Sum (bp) = 5457073
Max scaffold size = 389830
Min scaffold size = 680
Average scaffold size = 79088
N50 = 226679

Overall an increase of >10% in the average scaffold length over the initial contigs. Not bad! Actually I think I am likely coming up against a hard limit imposed by our library insert size.
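The arithmetic behind that figure can be checked from the SSPACE summaries above. This is only a sketch: the numbers are copied from the output in this post, and `percent_change` is an illustrative helper name.

```python
def percent_change(before, after):
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


# Values taken from the "Inserted contig file" and "After scaffolding lib1" blocks.
avg_size_change = percent_change(70869, 79088)   # average contig vs. scaffold size
n50_change = percent_change(225952, 226679)      # N50
count_change = percent_change(77, 69)            # number of sequences

print(f"average size: {avg_size_change:+.1f}%")
print(f"N50:          {n50_change:+.2f}%")
print(f"count:        {count_change:+.1f}%")
```

The gain of a bit over 10% shows up in the average size and in the number of sequences, while the N50 changes by well under 1%.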

Also it ran fast -- just a minute or two with -x 1 set.

I did have one problem getting it to run. It took me about 30 minutes with the perl debugger to track down the issue. So I'll describe it and the simple solution for anyone googling the warning SSPACE gave. The warning was:

Bowtie-build error; -1 at /bin/SSPACE/SSPACE-1.1_linux-x86_64/bin/mapWithBowtie.pl line 37.
WARNING: No scaffolding, because no reads found on contigs

Turns out to be because mapWithBowtie.pl was getting a permissions error when it attempted to run bowtie-build via a sys call. So

chmod +x /bin/SSPACE/SSPACE-1.1_linux-x86_64/bowtie/bow*

fixed the issue. That is, the programs in the bowtie subdirectory needed to be given execute permission.
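A quick way to check for the same condition before running, sketched in Python under the assumption of a Linux-style filesystem. The helper names are made up, and the demo uses a temporary directory with a dummy `bowtie-build` file rather than a real SSPACE install.

```python
import os
import stat
import tempfile


def non_executable_files(directory, prefix="bow"):
    """List files in 'directory' starting with 'prefix' that lack execute permission."""
    missing = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.startswith(prefix) and os.path.isfile(path) and not os.access(path, os.X_OK):
            missing.append(path)
    return missing


def add_execute_permission(path):
    """Roughly what 'chmod +x' does: set the execute bits for user, group and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Demo on a throwaway directory (a stand-in for the bowtie subdirectory).
with tempfile.TemporaryDirectory() as tmp:
    tool = os.path.join(tmp, "bowtie-build")
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(tool, 0o644)  # readable but not executable, as in the failing setup
    before = non_executable_files(tmp)
    for path in before:
        add_execute_permission(path)
    after = non_executable_files(tmp)

print(len(before), len(after))
```

Pointing `non_executable_files` at the real bowtie subdirectory would show whether the `chmod` above is needed.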

--
Phillip

**boetsie** · 06-14-2011, 02:02 PM

Hi Phillip,

Your results look OK; <70 contigs with only one paired-end library of 200bp is very good. I think there is not much more to gain from this library. The remaining contigs are probably repeats (especially the small contigs) or contigs/scaffolds that could not be combined with each other because the library insert size is too small.

For example with E.coli we went from 127 to 89 scaffolds with a paired-end 500, and then to 9 scaffolds with a mate pair 2kb.

I'm aware of the permissions problem. I thought I had fixed it, but apparently I had not. The next release will hopefully not contain this error. Thanks for mentioning it!

regards,
Boetsie

**pmiguel** · 06-15-2011, 03:04 AM

Hi Boetsie,
Actually the new TruSeq DNA library protocol recommends fragmenting DNA to a mean length of 300-400 bases for genomic DNA. Since our resulting sequence was at or above specifications for the instrument, I think the larger insert sizes are the way to go by default.
Thanks for the info about the effect of mate end (ME) reads. I did not have any for this bacterium. We do have some for a fungal genome we assembled. But they are 454 MEs. We are giving those a shot.

--
Phillip

**seb567** · 06-15-2011, 07:43 AM

Originally posted by pmiguel View Post

Hi Seb567,
We did try Ray. Maybe we did not configure the Ray assembly correctly, but our Abyss results looked much better. For instance the following command:
/programs/Ray-1.4.0/code/Ray \
-k \
43 \
-i \
../FastQ/000617_TL3360_both.fastq \
-o \
000617_TL3360

produced ~3400 contigs ranging from 130 bp to 8.6 kb, whereas Abyss produced 137 contigs ranging from 41 to 450165 bp using a similar kmer size (41).

These were 2x100 bp reads from ~350bp fragment PEs -- about 200x coverage. The DNA was from the bacterium Salmonella.

--
Phillip

What is the content of these files:

000617_TL3360.CoverageDistributionAnalysis.txt
000617_TL3360.LibraryStatistics.txt

Thank you.

**pmiguel** · 06-15-2011, 08:02 AM

Originally posted by seb567 View Post

What is the content of these files:
000617_TL3360.CoverageDistributionAnalysis.txt

MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160
Percentage of vertices with coverage 1: 87.6321%
DistributionFile: 000617_TL3360.CoverageDistribution.txt

Originally posted by seb567 View Post

000617_TL3360.LibraryStatistics.txt

File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302

Total: 13001302

NumberOfPairedLibraries: 1

LibraryNumber: 0
InputFormat: Interleaved,Paired
DetectionType: Automatic
File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302
AverageOuterDistance: 385
StandardDeviation: 628
DetectionFailure: Yes

--
Phillip

**seb567** · 06-15-2011, 08:31 AM

Originally posted by pmiguel View Post

MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160
Percentage of vertices with coverage 1: 87.6321%
DistributionFile: 000617_TL3360.CoverageDistribution.txt

File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302

Total: 13001302

NumberOfPairedLibraries: 1

LibraryNumber: 0
InputFormat: Interleaved,Paired
DetectionType: Automatic
File: ../FastQ/000617_TL3360_both.fastq
NumberOfSequences: 13001302
AverageOuterDistance: 385
StandardDeviation: 628
DetectionFailure: Yes

--
Phillip

The CoverageDistributionAnalysis.txt file points to a bad detection of the repeat coverage, so nothing will work correctly for sure after that.

MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160 <----

Can you put the content of 000617_TL3360.CoverageDistribution.txt on http://pastebin.com/ and link it here ?

**pmiguel** · 06-15-2011, 09:48 AM

Originally posted by seb567 View Post

The CoverageDistributionAnalysis.txt file points to a bad detection of the repeat coverage, so nothing will work correctly for sure after that.

MinimumCoverage: 46
PeakCoverage: 159
RepeatCoverage: 160 <----

Can you put the content of 000617_TL3360.CoverageDistribution.txt on http://pastebin.com/ and link it here ?


http://pastebin.com/sBQ6k4NY


Thanks
--
Phillip

**seb567** · 06-15-2011, 10:58 AM

Originally posted by pmiguel View Post

http://pastebin.com/sBQ6k4NY

Thanks
--
Phillip

OK, problem solved.

This is your coverage distribution:

http://i.imgur.com/caicf.png

However, it confuses Ray because it is going up and down near the inflection point:

142 1002
143 2012
144 432
145 1032
146 1098
147 1088
148 1166
149 1454
150 778
151 1122
152 1146
153 -720
154 424
155 192
156 -64
157 552
158 418
159 -406 Peak Coverage
160 164
161 -826
162 -434
163 -190
164 -124
165 26
166 1014
167 -1100
168 -562
169 -1376
170 -1288
171 -336
172 -984
173 -500
174 -1064

I added data smoothing and it fixes the problem.

File= /home/boiseb01/coverage-pmiguel
MinCoverage= 45
PeakCoverage= 158
RepeatCoverage= 290
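To illustrate the idea only: the sketch below is not Ray's routine, which this post does not describe. It assumes the second column listed above is the change in count from one coverage value to the next, and it uses a centered moving average with an assumed window of 5. `moving_average` and `last_rising_coverage` are illustrative names.

```python
def moving_average(values, window=5):
    """Centered moving average; only positions with a full window are returned."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]


def last_rising_coverage(coverages, slopes):
    """Return the last coverage value before the slope first becomes non-positive."""
    last = None
    for coverage, slope in zip(coverages, slopes):
        if slope <= 0:
            break
        last = coverage
    return last


# Data points copied from the listing above (coverage 142 to 174).
START = 142
SLOPES = [1002, 2012, 432, 1032, 1098, 1088, 1166, 1454, 778, 1122, 1146,
          -720, 424, 192, -64, 552, 418, -406, 164, -826, -434, -190, -124,
          26, 1014, -1100, -562, -1376, -1288, -336, -984, -500, -1064]
coverages = list(range(START, START + len(SLOPES)))

WINDOW = 5
half = WINDOW // 2
smoothed = moving_average(SLOPES, WINDOW)

raw_peak = last_rising_coverage(coverages, SLOPES)
smoothed_peak = last_rising_coverage(coverages[half:len(coverages) - half], smoothed)
print(raw_peak, smoothed_peak)
```

On these points the raw values first turn negative at 153, so the last rising coverage is 152. The smoothed values stay positive through 158, which happens to agree with the PeakCoverage reported after the fix.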

* Added a smoothing routine for the detection of points in the coverage distribution. Thanks to pmiguel on SEQanswers for providing raw data points. http://seqanswers.com/forums/showthread.php?p=43979#post43979

https://github.com/sebhtml/ray/commit/6590dd022

https://github.com/sebhtml/ray/tarball/v1.6.1-rc1

seb

**SLB** · 06-16-2011, 12:27 AM

Hi,

I have used SSPACE with abyss output after assembly with 180 and 550 PE libraries. I filtered for contigs > 200 and below is the output from SSPACE. I have a quick question about the output relating to repeats. After scaffolding with the final library I get the following:
Number of repeats = 14553
Total size of repeats = 1494450560
What do these figures relate to? It's funny, because if I add the total size of repeats to the total size of the scaffolded assembly after the final library is added, I get 1494450560 + 1149222136 = 2643672696, which is the estimated size of my genome!

Inserted contig file;
Total number of contigs = 440783
Sum (bp) = 657546051
Max contig size = 39800
Min contig size = 200
Average contig size = 1491
N50 = 3535

After scaffolding lib1: 3kb
Total number of scaffolds = 326357
Sum (bp) = 844894494
Max scaffold size = 102863
Min scaffold size = 200
Average scaffold size = 2588
N50 = 10046

After scaffolding lib2: 5kb
Total number of scaffolds = 266348
Sum (bp) = 993616335
Max scaffold size = 164536
Min scaffold size = 200
Average scaffold size = 3730
N50 = 17281

After scaffolding lib3: 10kb
Total number of scaffolds = 232199
Sum (bp) = 1149222136
Max scaffold size = 303516
Min scaffold size = 200
Average scaffold size = 4949
N50 = 29100

**boetsie** · 06-16-2011, 02:50 AM

It's a complicated calculation, but basically it counts the number of contigs that are linked left, and the number of contigs that are linked right from the contig.

Say that contigA has three contigs that are linked left and two contigs linked right. The repeat is the highest number of links, thus here 3. This contig is thus said to be repeated 3 times in the assembly.
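The contigA example can be written out as a small sketch. This follows only the description in this post, not SSPACE's actual code; `repeat_count` and the neighbour names are hypothetical.

```python
def repeat_count(left_links, right_links):
    """Repeat count as described above: the larger of the two link counts."""
    return max(len(left_links), len(right_links))


# contigA from the example: three contigs linked on the left, two on the right.
contig_a_left = ["contigB", "contigC", "contigD"]
contig_a_right = ["contigE", "contigF"]

contig_a_repeats = repeat_count(contig_a_left, contig_a_right)
print(contig_a_repeats)
```

With three links on the left and two on the right, the larger count is 3, matching the example.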

Have a look at the *.repeat file in the intermediate_results folder. Here, all repeats are listed.

Remember, though, that one copy of each repeated element is also included in the final assembly, so the repeats should be subtracted from the final scaffolds. For example, if contigA is 1300 bp long and is repeated 4 times, the 1300 bp should be subtracted from the final assembly, since the contig is already present within the scaffolds.

To improve your assembly, try to include the PE libraries in SSPACE too. Scaffolding with a combination of paired-end and mate-pair libraries is very powerful.

Boetsie


Is SSPACE good for Abyss assemblies?
