**boetsie** · 11-28-2011, 07:43 AM

Hi Steve,

Just wanted to say that it is important to set the insert size for SSPACE as accurately as possible. There are tools to determine the median/mean insert size and its deviation. One of them is included in the SSPACE premium package. Since you are a BaseClear customer, you can get the SSPACE premium version for free if you do not have it already.
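As a rough illustration of what "median/mean insert size and its deviation" means, here is a minimal sketch. It assumes you already have a list of observed insert sizes, for example from read pairs whose two mates map to the same contig. The function name and the example numbers are hypothetical, and this is not how the SSPACE premium tool does it.

```python
import statistics

def summarize_insert_sizes(sizes):
    """Return median, mean, and sample standard deviation of insert sizes (bp)."""
    if len(sizes) < 2:
        raise ValueError("need at least two insert sizes")
    return {
        "median": statistics.median(sizes),
        "mean": statistics.mean(sizes),
        "stdev": statistics.stdev(sizes),
    }

# Hypothetical insert sizes (bp), for illustration only.
example = [4800, 5100, 4950, 5200, 5000]
summary = summarize_insert_sizes(example)
print(summary)
```

The median is usually the safer summary when a few chimeric or mis-mapped pairs produce extreme values.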

Furthermore, the assembly will not improve with the mate pairs; sometimes it will even get worse. You already have enough coverage with your paired-end sequences. The main reason is that mate-pair sequencing has a coverage bias: some regions of the genome are covered more than others. I would suggest not including the mate pairs in the initial assembly. Use the mate pairs only for scaffolding, together with the paired-end reads used in the initial assembly.

What might improve the assembly is trimming low-quality nucleotides and removing low-quality reads using CLC bio's trimmer.

Once you have obtained the scaffolds, you can fill the gaps (N's) with tools like SOAP's GapCloser from BGI, or IMAGE. We are currently also working on a tool to do this.

Regards,
Marten Boetzer
BaseClear

Originally posted by stevebaeyen

Hello,
we are a Belgian research team studying the +/- 4Mbp genome of a bacterial plant pathogen (and newbies in NGS data analysis). We are getting some unexpected results during de novo assembly of our target genome using a combined paired-end and mate-pair library. No good reference genome is available, so de novo assembly is our only option. We would like to share some of our results for your consideration. Maybe some of you can tell us if this is a normal result, or if we are doing something wrong here…
First off, the data-sets:
1. One Illumina GA, paired-end short read set (50bp reads, 350Mb, 375bp insert), which gives us a theoretical 70x coverage.
2. One Illumina Hiseq, Mate-pair short read set (100bp reads, 500Mb, 5kb insert), which gives us a combined 160x coverage.
When we used the PE-set alone for de novo assembly in CLC-Bio, we get 478 contigs with an N50 of +/-20kbp. When looking at the contigs, we saw repetitive fragments (IS-sequences) were the major cause for the contig break-up. Based on the literature, we thought most of these gaps could be closed if we combined the PE-set with an extra MP-dataset.
However, if we combine both sets in a de novo assembly in CLC-Bio’s Beta-assembler (plugin in v4.8), we get 493 contigs and an N50 of 22kb.
When we try to scaffold the 478 contigs of the PE-only assembly with the MP-set in SSPACE, we can reduce them to 63 scaffolds, but the program has to introduce some 300,000 N’s in the sequence (total 4.2 Mb) to accomplish it. DNAStar also has problems with the Illumina 1.9 format from Hiseq2000... does anybody have experience using Hiseq data with this software?
Does anybody here have a clue what we are doing wrong and how we could improve this, or is there a logical explanation why the MP-set is not giving us a better gap closure?
Thank you for any remarks/suggestions!
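For readers comparing the N50 figures quoted above, the following is a minimal sketch of how N50 is commonly computed from a list of contig lengths. The function name and the example lengths are hypothetical and are not taken from the assemblies in this thread.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Hypothetical contig lengths in bp, for illustration only.
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))
```

Note that a higher N50 with a similar contig count, as in the PE-only versus combined comparison, says little on its own about whether the extra joins are correct.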

**nickloman** · 11-28-2011, 08:06 AM

You are using mate-pair data to scaffold (join) contigs together rather than actually closing gaps, so what you are seeing is not unusual. The Ns represent repeat sequences of known length.

If you want to attempt to close gaps within those scaffolds, one option is to use a local assembly approach like GapCloser (part of SOAPdenovo) which I have had good results with but be aware that you won't be able to close all (or maybe even many) of the gaps this way.

But in fact I'd try your assembly with Velvet or SOAPdenovo rather than CLC-Bio first off and see if it does a better job.

**pmiguel** · 11-28-2011, 09:00 AM

So, the process of replacing those "N's" between scaffolds with actual sequence using PE and MP data is apparently called "gap closing". Since you are using a commercial software package to do your assembly, etc., you might want to ask them if they include a "gap closing" module.

Otherwise, there are liberated programs available for doing gap closing. (At least one is mentioned elsewhere in this thread.)
--
Phillip

**nickloman** · 11-28-2011, 09:14 AM

Originally posted by pmiguel

So, the process of replacing those "N's" between scaffolds with actual sequence using PE and MP data is apparently called "gap closing".

That's what I call it anyway. One point about scaffolding (that perhaps is not well recognised) is that you don't usually end up with fewer gaps, just that the gaps become better characterised, e.g. you now know that contig A joins to contig B with a gap of N bases.
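To see how many gaps a scaffold actually contains and how long each one is estimated to be, one can count the runs of N in the sequence. This is a minimal sketch with a hypothetical function name and a made-up sequence; it assumes gaps are written as runs of `N` (upper or lower case), as in the scaffolds discussed here.

```python
import re

def find_gaps(sequence):
    """Return (start, length) for every run of N/n in a scaffold sequence."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", sequence)]

# Made-up scaffold with two gaps, for illustration only.
scaffold = "ACGTNNNNNACGTACGTNNACGT"
gaps = find_gaps(scaffold)
print(len(gaps), "gaps,", sum(length for _, length in gaps), "N's in total")
```

Running the same count before and after a gap-closing step gives a simple measure of how many gaps were really filled.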

**stevebaeyen** · 12-14-2011, 06:47 AM

IMAGE2 gap closing

Originally posted by boetsie

Hi Steve,
Once you have obtained the scaffolds, you can fill the gaps (N's) with tools like SOAP's GapCloser from BGI, or IMAGE. We are currently also working on a tool to do this.
BaseClear

Hi Boetsie,
we obtained very nice scaffolds using your SSPACE Premium v2 software (up to 937kb and N50=275kb). I tried using IMAGE2 but there is no 'readme' or 'install' file and I can't find any information that helps me to run the software on the example provided with the program (program runs but does not close the gaps). I tried to contact Jason Tsai but no reply so far.
This is what i did:
I downloaded the Dec., 2 version (v2.3) from Sourceforge, copied the precompiled binaries to /usr/local/bin and made them executable on a Linux Ubuntu 11.10 64-bit distro. I looked at the scripts run.sh and saw some variables that have to be declared (such as paths to velvet, ssaha, etc.) but i still do not get the gaps closed in iteration 10 (see output in attachment imagetest.txt).

# software path
# this is the path where the IMAGE path is
# Please change it accordingly
VELPATH=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
SSAHADIR=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
WALKPATH=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
Then i did:
cd /home/sbaeyen/Bio/IMAGE/IMAGE_version2/example
sbaeyen@PXLSEQ:~/Bio/IMAGE/IMAGE_version2/example$ home/sbaeyen/Bio/IMAGE/IMAGE_version2/image.pl -prefix 76bp -iteration 1 -all_iteration 10 -dir_prefix iteration > imagetest.txt
When I run the 'image_run_summary.pl' script , I get:
sbaeyen@PXLSEQ:~/Bio/IMAGE/IMAGE_version2/example$ perl /home/sbaeyen/Bio/IMAGE/IMAGE_version2/image_run_summary.pl iteration
The prefix is : iteration
iteration Starting_gaps Gap_closed Gap_extend_oneside Gap_extend_bothside
1 5 0 0 0
2 5 0 0 0
3 5 0 0 0
4 5 0 0 0
5 5 0 0 0
6 5 0 0 0
7 5 0 0 0
8 5 0 0 0
9 5 0 0 0
10 5 0 0 0
Do you have any clue what i need to adapt to get this program running/what i did wrong?
Best regards and thanks (again) for any advice!
Steve
ps if you want i can send you the program output imagetest.txt
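One thing that may be worth checking, offered as a possibility and not as a confirmed diagnosis: in the VELPATH, SSAHADIR and WALKPATH lines quoted above, the value starts with `~/home/sbaeyen/...`. A shell expands a leading `~` to the home directory, so if the home directory is `/home/sbaeyen` the result would be `/home/sbaeyen/home/sbaeyen/...`, which probably does not exist. The sketch below uses a hypothetical helper to show the expanded value and whether it is a real directory.

```python
import os

def expand_and_check(path):
    """Expand a leading '~' as a shell would and report whether
    the resulting path is an existing directory."""
    expanded = os.path.expanduser(path)
    return expanded, os.path.isdir(expanded)

# Values copied from the run.sh lines quoted in the post.
settings = {
    "VELPATH": "~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/",
    "SSAHADIR": "~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/",
    "WALKPATH": "~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/",
}
for name, value in settings.items():
    expanded, ok = expand_and_check(value)
    print(f"{name}: {expanded} -> {'found' if ok else 'MISSING'}")
```

If the directories are reported as missing, writing the paths either as `~/Bio/IMAGE/IMAGE_version2/` or as `/home/sbaeyen/Bio/IMAGE/IMAGE_version2/` would be the thing to try.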

**boetsie** · 12-14-2011, 07:19 AM

Hi Steve,

I've tried to run IMAGE too, but did not succeed. The input is very complex and I even had to change the code to get it running, and even then it did not close any gaps. I asked one of the authors but did not get a reply. I would go for GapCloser from SOAP, which is very good but does not include the remaining gaps and seems to join repeated areas. We have finished our tool but are working on a publication; it will be released after that.

Regards,
Boetsie

**stevebaeyen** · 12-15-2011, 05:14 AM

Hi Boetsie,
thanks for the advice to use SOAP's GapCloser! Using the PE reads, I was able to close 161 of 400 gaps (of N's) in the scaffolds. Do you think the performance of GapFiller would be even better?
Regards,
Steve

**ragowthaman** · 12-15-2011, 03:09 PM

stevebaeyen: I recently started to use IMAGE2. It seems to work well for me. At least it finished the example and closed gaps. But when it comes to my own genome, it did extend the ends but did not close very many gaps. Maybe the problem is with the data, not IMAGE2...

Did you make sure velveth, velvetg, smalt, etc. are in your PATH?
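A quick way to run that check is sketched below. The tool list is taken from the names mentioned in this post; the exact set of external programs IMAGE2 needs is an assumption here, and `missing_tools` is a hypothetical helper name.

```python
import shutil

def missing_tools(tools):
    """Return the names from `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Names mentioned in the thread; adjust to whatever your IMAGE2 version calls.
required = ["velveth", "velvetg", "smalt"]
print("missing:", missing_tools(required) or "none")
```

Anything listed as missing either needs to be installed or have its directory added to PATH before running image.pl.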

**boetsie** · 01-19-2012, 03:41 AM

Has anyone ever succeeded in running IMAGE on their own data?

I want to run it with my scaffolds, but I'm having trouble making the input files required by IMAGE. Does anyone have a script to automatically generate these files based on the original scaffolds?

Regards,
Boetsie

**Stegger** · 01-19-2012, 04:13 AM

Hi,
I had similar NGS data on a 3 Mbp bacterial pathogen, PE and MP Illumina data, and at least with an older version of the CLC GW assembler I also got much worse results with the combined assembly, even though I thought adding the MP would significantly reduce the number of contigs. I have not tried combining these with the new beta assembler CLC has, although it works better on my PE Illumina data alone. Have you tried that?
The solution for us was to use Velvet on both datasets and that brought our number of contigs down from approx. 70 to something like 15. These were verified by optical mapping, and we only saw one major error in these Velvet contigs... perhaps it is worth a try?

**stevebaeyen** · 01-19-2012, 05:08 AM

Originally posted by Stegger

Hi,
I had similar NGS data on a 3 Mbp bacterial pathogen, PE and MP Illumina data, and at least with an older version of the CLC GW assembler I also got much worse results with the combined assembly, even though I thought adding the MP would significantly reduce the number of contigs. I have not tried combining these with the new beta assembler CLC has, although it works better on my PE Illumina data alone. Have you tried that?
The solution for us was to use Velvet on both datasets and that brought our number of contigs down from approx. 70 to something like 15. These were verified by optical mapping, and we only saw one major error in these Velvet contigs... perhaps it is worth a try?

Hi, I tried to de novo assemble the PE+MP datasets with the new CLC scaffolder but didn't get a huge improvement compared to the PE dataset alone. A successful scaffolding with a +/-70% reduction was performed with SSPACE Premium v2, and gaps were closed with SOAP GapCloser. Thanks for the Velvet tip, I'll give it a try! Do you have a good reference concerning optical mapping?

**Stegger** · 01-19-2012, 05:39 AM

My pleasure!
and yes I had a very good reference..

**stevebaeyen** · 01-19-2012, 06:00 AM

Originally posted by Stegger

My pleasure!
and yes I had a very good reference..

And can you give me a link to a review article about optical mapping?

**hylei** · 07-09-2012, 07:07 AM

How to close the CLC-bio contigs according to the reference genome sequence?

Hi, I used the CLC-bio de novo assembly to analyze Miseq 150bp PE data, and I have 150 contigs. I also have the 170kb reference genome sequence. I tried to use IMAGE to close the gaps, and it did not work for me. Can anyone suggest how to close the gaps? Thank you very much.

scaffolding GAII paired-end library with Hiseq mate-pairs