Scaffolding problem - SEQanswers

christinawu2008 replied

06-28-2011, 07:25 PM
bowtie-build error

Originally posted by boetsie:

No problem, good luck with your further analysis.

Hi boetsie,

SSPACE looks like a very useful tool for scaffolding, but when I tried to use it the run failed at the bowtie-build step. I only have a contig file of named super_contig sequences, with no other information, and there are lots of 'N' gaps in them. Do I need to modify something to get bowtie-build to work? If not, what is the problem?

My reads are 100 bp paired-end, so the library file looks like
lib1 ***1.fastq ***2.fastq 200 0.7 0
or should I replace 200 with 400?
boetsie replied

03-08-2011, 06:06 AM
No problem, good luck with your further analysis.
Autotroph replied

03-08-2011, 05:59 AM
Hi boetsie,

Thanks a lot for the patient explanation.
boetsie replied

03-08-2011, 05:50 AM
Hi Autotroph,

Sorry, but I think it is simply not possible to merge them with SSPACE in the way you are attempting. SSPACE only looks at the ends of the contigs to see whether they overlap, whereas you are trying to replace the "N" characters inside a sequence with DNA characters by merging.

SSPACE does this;
CATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATC
.............................GCTACGATCGATCAGTAGTAGATAGATAGATGATAG

Whereas you are trying to find a certain overlap and use it to determine the rest of the sequence:

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG

TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG.......

As I said, I think what you want to do is not possible with SSPACE. Maybe you can first run a gap closure on the scaffolds (e.g. with SOAP's gapclosure method) so that the N's are removed from your data.
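To make the end-overlap case illustrated above concrete, here is a minimal Python sketch. It is only an illustration of joining two sequences on a suffix-to-prefix overlap, not SSPACE's actual implementation; the function names `end_overlap` and `merge_on_overlap` are made up for this example.

```python
def end_overlap(left, right, min_len=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_on_overlap(left, right, min_len=1):
    """Join two sequences on their end overlap; return None if there is none."""
    k = end_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]


# The two sequences from the illustration above.
contig_a = "CATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATC"
contig_b = "GCTACGATCGATCAGTAGTAGATAGATAGATGATAG"

overlap = end_overlap(contig_a, contig_b)
merged = merge_on_overlap(contig_a, contig_b)
print(overlap)
print(merged)
```

An N run in the middle of one sequence is a different situation: there is no sequence end to match against, so a check of this kind does not address it.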

Boetsie
Autotroph replied

03-08-2011, 05:26 AM
The point of giving an insert size of 100 (50+50) is to not have any gaps in the final scaffold. I understood that the two reads could even overlap if an insert size of less than 100 is given for 2*50 bp reads.

The expected sequence (without any gaps) would be:

"AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGCTGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG"

I even tried with 200 as insert size, but it fails to merge the contigs "correctly".

output given below :

>scaffold1.1|size269
AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCGnCGATCGACGATCTGATCGGCTGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG

Does it mean that the two reads of PE must have a gap between them?

Why is "TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCG" not replacing the N's, even though it overlaps and there is a PE read pair connecting the two 'contigs'?
boetsie replied

03-08-2011, 04:42 AM
Originally posted by Autotroph:

Could you please look at below example and let me know why SSPACE does not merge the 2 "contigs"?

Hi Autotroph,

I've had a look at it, and i think i know why it did not merge. You should increase the insert size in your library file. SSPACE includes the read lengths within the determination of the gap/overlap. With 100bp insert size, it did not satisfy the minimum allowed distance.

Both of your reads are 50 bp long, so increasing the insert size in your library by 100 (2*50 bp for your reads) should do it, thus:

lib1 read1.fa read2.fa 200 0.7 0

If you need a more detailed description, please let me know

Kind regards,
Boetsie
Ashu replied

03-08-2011, 02:39 AM
Hi Boetsie,
Thank you for the information.
I have a mate-pair library, with the insert distance estimated on a Bioanalyzer.
My library file looks as follows:

MP1 /G1/2_5kb/s_a_sequence_1.fastq /G1/2_5kb/s_a_sequence_2.fastq 2500 0.7 1
MP1 /G1/2_5kb/s_b_sequence_1.fastq /G1/2_5kb/s_b_sequence_2.fastq 2500 0.7 1
MP1 /G2/2_5kb/s_a_sequence_1.fastq /G2/2_5kb/s_a_sequence_2.fastq 2500 0.7 1
MP1 /G2/2_5kb/s_b_sequence_1.fastq /G2/2_5kb/s_b_sequence_2.fastq 2500 0.7 1
MP1 /G2/2_5kb/s_c_sequence_1.fastq /G2/2_5kb/s_c_sequence_2.fastq 2500 0.7 1
MP1 /G2/2_5kb/s_d_sequence_1.fastq /G2/2_5kb/s_d_sequence_2.fastq 2500 0.7 1

I will try it in paired-end form (0), but I can't see why it would turn out to be paired-end rather than mate pair. In the pairing issues file I also see a lot of distance problems. Is there a way to plot these in a graph?
Thank you again for your kind reply,
regards,
Ashu
Autotroph replied

03-08-2011, 02:10 AM
Unfortunately, Minimus can be used to merge contigs only, not scaffolds. Bambus is able to merge scaffolds but does not allow N's in the input.

It might be possible for me to use Minimus and SSPACE in some combination to merge the scaffolds.

Could you please look at below example and let me know why SSPACE does not merge the 2 "contigs"?

--------------------_________________--------------------------
read1 read2(rev-comped) (common anchor sequence)

Contigs.fa:

>contig1
AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG
>contig2
TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG

read1.fa

>read1
AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGC

read2.fa (the first 50 bases of contig2, reverse-complemented)

>read2
CGCTAGCTAGTAGCTAGTAGTAGCTAGCTAGCTCGTAGCTAGCTGACACA
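For anyone reproducing this test case, the read2 sequence can be derived from contig2 with a few lines of Python. This is just a sketch; `reverse_complement` is a helper written for this example, not part of SSPACE.

```python
# Translation table for DNA bases; N is left as N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# contig2 exactly as listed in Contigs.fa above.
contig2 = (
    "TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCG"
    "CATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG"
)

# read2 is the reverse complement of the first 50 bases of contig2.
read2 = reverse_complement(contig2[:50])
print(read2)
```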

lib file:

lib1 read1.fa read2.fa 100 0.7 0

command:

perl SSPACE_v1-1.pl -l lib -s contigs.fa -k 1 -a 0.7 -x 1 -o 1 -b merger

This gives me 2 scaffolds instead of the 1 scaffold that I am expecting. When the length of the anchor sequence is reduced, it gives a single scaffold with an "n" placed between the 2 scaffolds.

Surprisingly, if the same information is given in the form of a set of 2 mate pairs, the 2 scaffolds are merged. My guess would be that SSPACE does not treat the initial set of N's in the same way as the N's it adds in the intermediate steps.

Last edited by Autotroph; 03-08-2011, 03:03 AM. Reason: additional information
boetsie replied

03-08-2011, 01:16 AM
Originally posted by Autotroph:

Thanks for the clarification Boetsie,

Bowtie can handle only reads that are a maximum of 1024 BP long. What does SSPACE do for reads that are longer than that?

Unfortunately, SSPACE cannot handle sequences longer than 1024 bp. They are simply not used for mapping.

I am interested in merging scaffolds, that is merging 2 sequences that look like below(SSPACE does not use reads with N's in the paired end files, am i correct?)

Indeed SSPACE does not allow reads with N's in the paired-end files.
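The two restrictions above (reads over 1024 bp are not mapped, and reads containing N's are not allowed) can be checked before running anything. The following Python sketch is a stand-alone pre-filter, not code taken from SSPACE or Bowtie; the names `usable_for_mapping` and `split_reads` are hypothetical.

```python
MAX_READ_LENGTH = 1024  # read length limit mentioned in this thread


def usable_for_mapping(read):
    """True if the read is short enough and contains no N."""
    seq = read.upper()
    return len(seq) <= MAX_READ_LENGTH and "N" not in seq


def split_reads(reads):
    """Separate reads into those that pass the checks and those that do not."""
    kept = [r for r in reads if usable_for_mapping(r)]
    skipped = [r for r in reads if not usable_for_mapping(r)]
    return kept, skipped


kept, skipped = split_reads(["ACGTACGT", "ACGNNCGT", "A" * 1025])
print(len(kept), len(skipped))
```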

I think you should consider another program for this, since you mention that you want to merge scaffolds rather than extend them. If you want to merge 2 scaffolds, you could try something like an alignment program. Maybe you can do something like what Ken Kraaijeveld describes (http://www.kenkraaijeveld.nl/genomics/bioinformatics/); see the "combining contigs" section.

Boetsie
boetsie replied

03-08-2011, 01:07 AM
Originally posted by Ashu:

HI Boetsie,
I can't find any improvement before and after scaffolding ... Am I doing something wrong ??? Thanks

Hi Ashu,

I'm pretty sure the orientation in your library file is the wrong way round. Are you using paired-end (--> <-- direction) or mate pair (<-- --> direction) reads? If you use paired-end, your library should look something like this:

libname file1.fasta file2.fasta 700 0.25 0

The last column contains a 0 for paired-end reads. For mate pairs, the last column should contain a 1:

libname file1.fasta file2.fasta 700 0.25 1

I think this should do it.
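As a quick sanity check on a library file, the six columns can be parsed and the allowed distance range printed. This Python sketch assumes the column meanings used in this thread (name, two read files, insert size, error fraction, orientation flag) and assumes the allowed range is insert size ± insert size × error, which is consistent with the "2500 +/-1750" reported in Ashu's log for a "2500 0.7" library. The `Library` class and `parse_library_line` are hypothetical names, not part of SSPACE.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Library:
    name: str
    file1: str
    file2: str
    insert_size: int
    error: Decimal      # allowed deviation as a fraction of the insert size
    orientation: int    # 0 = paired-end (--> <--), 1 = mate pair (<-- -->)

    def distance_bounds(self):
        """Lower and upper allowed distance (assumed: insert +/- insert * error)."""
        deviation = self.insert_size * self.error
        return (self.insert_size - deviation, self.insert_size + deviation)

    def kind(self):
        return "mate pair" if self.orientation == 1 else "paired-end"


def parse_library_line(line):
    """Parse one whitespace-separated library line into a Library record."""
    name, file1, file2, insert, error, orientation = line.split()
    return Library(name, file1, file2, int(insert), Decimal(error), int(orientation))


lib = parse_library_line("libname file1.fasta file2.fasta 700 0.25 0")
print(lib.kind(), lib.distance_bounds())
```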

Boetsie
Autotroph replied

03-07-2011, 09:38 PM
longer reads

Thanks for the clarification Boetsie,

Bowtie can handle only reads that are a maximum of 1024 BP long. What does SSPACE do for reads that are longer than that?

I am interested in merging scaffolds, that is merging 2 sequences that look like below(SSPACE does not use reads with N's in the paired end files, am i correct?):

AGCTAGCTAGCTNNNNNNNNNCGATCGATGCNNNNNNNCGATCGATCGATCGNNNNCAGCTAGT

ANNNNNTAGCTACGATCGATCGNNNNNNNNNGATGCACGTACGATNNCGATNNNNNNNNNNNCAGCTAGT
Ashu replied

03-07-2011, 11:46 AM
SSPACE: no improvement in N50 or contig size

HI Boetsie,
I can't find any improvement before and after scaffolding ... Am I doing something wrong ??? Thanks

-x = 0
-k = 5
-a = 0.7
-n = 15
-p = 0

==================================

Number of single reads found on contigs = 84724494
Number of pairs found with pairing contigs / total pairs = 47882393 / 48019708
------------------------------------------------------------

READ PAIRS STATS:
------------------------------------------------------------
At least one sequence/pair missing from contigs: 137314
Assembled pairs: 47882393 (95764786 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 2500 +/-1750): 22
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 11
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 81
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 26534237
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 21348042
---
Total satisfied: 26534259 unsatisfied: 21348134

------------------------------------------------------------

################################################################################

SUMMARY:
------------------------------------------------------------
Inserted contig file;
Total number of contigs = 1060008
Sum (bp) = 2114313317
Max contig size = 56175
Min contig size = 200
Average contig size = 1988
N50 = 3918

After scaffolding MP1:
Total number of scaffolds = 1060008
Sum (bp) = 2114313317
Max scaffold size = 56175
Min scaffold size = 200
Average scaffold size = 1988
N50 = 3918
Regards
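For reference, N50 can be recomputed from a list of sequence lengths with a short Python function. This is a generic sketch of the usual N50 definition, not SSPACE's own code, and the toy lengths are invented for illustration. In the summary above the number of sequences (1060008) and the total size are identical before and after scaffolding, so no contigs were joined and the N50 cannot change.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one length")


# Toy example: four contigs, then the same bases joined into two scaffolds
# (gap sizes ignored for simplicity).
contigs = [100, 200, 300, 400]
scaffolds = [300, 700]
print(n50(contigs), n50(scaffolds))
```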
boetsie replied

02-17-2011, 12:50 AM
Hi,

You say;

The problem with using SSPACE is that it does not allow N's in the input contig file.

while the SSPACE manual says;

Contigs having a non-ACGT character like “.” or “N” are not discarded. They are used for extension, mapping and building scaffolds. However, contigs having such character at either end of the sequence, could fail for proper contig extension.

So they can be used for extension. Only when the N's are at the end of a sequence is SSPACE unable to map reads there.
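To see which contigs fall into the problematic case described in the manual excerpt, one could flag sequences that start or end with a non-ACGT character. A small Python sketch, with `has_ambiguous_end` as a made-up helper name:

```python
def has_ambiguous_end(contig):
    """True if the first or last character of the contig is not A, C, G or T."""
    seq = contig.upper()
    if not seq:
        return False
    return seq[0] not in "ACGT" or seq[-1] not in "ACGT"


examples = {
    "internal_gap": "ACGTNNNNACGT",   # N's inside only
    "leading_gap": "NNACGTACGT",      # N's at the start
    "trailing_dot": "ACGTACGT.",      # non-ACGT character at the end
}
flagged = sorted(name for name, seq in examples.items() if has_ambiguous_end(seq))
print(flagged)
```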

I don't know about Velvet... I know SSAKE (which has basically the same procedure as SSPACE) also can use contigs as 'seeds' and extends them with additional reads. Difference is that SSPACE first maps the reads to the pre-assembled contigs and only uses the unmapped reads for contig/scaffold extension. SSAKE does not include mapping.

Kind regards,
Boetsie
Autotroph replied

02-16-2011, 09:51 PM
Thanks.

Ya i guess i will be extending the previous scaffolds.

The problem with using SSPACE is that it does not allow N's in the input contig file.

The scaffolds I have have varying insert sizes. Should I break each of them into paired-end reads and use them as separate libraries in SSPACE?

Is Velvet unable to handle long reads of more than 20 kb?
boetsie replied

02-16-2011, 10:19 AM
Hi,

do you want to scaffold the previous scaffold, or do you want to extend the previous scaffolds?

Anyway, maybe you can try out SSPACE for this purpose, see this thread;

SSPACE: a new stand-alone scaffolding tool for small and large genomes - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8350


Kind regards,
Boetsie
