Benchmark (or experience) between SOAPdenovo, Velvet, Abyss, and ALLPATHS2 - SEQanswers


tonybolger

Senior Member

Join Date: Feb 2010

Posts: 156
#16

03-03-2011, 03:50 AM

Originally posted by Gators View Post

Quick question in the same vein as this thread...

I have some deep sequencing results from a virus-infected sample. We know the viral sequence - kinda. We know that there are differences between our reference sequence and what is actually in the cells. If I allow for a couple of mismatches in the alignment I do with bowtie, I seem to have more or less complete coverage of the viral genome in our reads. I'd like to assemble the reads to get a "consensus" sequence of the virus. Any recommendations for what program to use for this small-scale assembly? Reads are about 25 bp, and the total viral genome should be <10 kb.

I assume a reference-based assembly would be ok. You just need to call the consensus on the alignment. You could try samtools, or the early steps of any SNP pipeline.
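
To make "calling the consensus on the alignment" concrete, here is a minimal Python sketch of a per-position majority vote. It is only an illustration, not what samtools does internally: the function name `call_consensus`, the `(start, read)` input format and the `min_depth` parameter are assumptions made for this example, and it handles ungapped alignments only (no indels, no base qualities).

```python
from collections import Counter


def call_consensus(reference, alignments, min_depth=1):
    """Majority-vote consensus over ungapped alignments.

    alignments: iterable of (start, read_sequence) with a 0-based start
    position on the reference (hypothetical input format).
    Positions covered by fewer than min_depth reads keep the reference
    base, written in lowercase to flag them as unsupported.
    """
    # One counter of observed bases per reference position.
    columns = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1

    consensus = []
    for ref_base, counts in zip(reference, columns):
        depth = sum(counts.values())
        if depth < min_depth:
            consensus.append(ref_base.lower())
        else:
            # Most frequent base in this column.
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    ref = "ACGTACGT"
    reads = [(0, "ACCT"), (1, "CCTA"), (2, "CTAC"), (4, "ACGT")]
    print(call_consensus(ref, reads))
```

In the toy input every read carries a C where the reference has a G at position 2, so the consensus differs from the reference there. With real 25 bp reads you would take the start positions from the bowtie output rather than typing them in.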
Comment
lcollado

Member

Join Date: Jun 2009

Posts: 65
#17

03-09-2011, 07:26 AM

You could also try using the Columbus module from Velvet.

L. Collado Torres, Ph.D. student in Biostatistics.
Comment
themwg

Junior Member

Join Date: Jan 2011

Posts: 6
#18

05-03-2011, 12:28 PM

Originally posted by tonybolger View Post

We've noticed a tendency for CLC denovo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into single contigs. These don't tend to be in coding regions, so you won't find them with genes or RNA.

Could you elaborate on what you mean by frayed ropes turning into single contigs?

We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP. CLC is also faster and able to handle significantly more data.

We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100 bp) and one mate-pair lane (~50 million reads at 36 bp, with a library size of 3 kb). Approximately 75% of paired reads end up mapping. With a 200 bp minimum contig length we get an N50 of ~4 kb.

To compare with SOAP on our limited machine (44 GB RAM), we can only process ~50 million reads. That subset gives an N50 of 1 kb with CLC, while SOAP gives an N50 of 300 bp.

We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (in the example above, SSPACE scaffolding gives N50 = 88 kb). Still, CLC users are rare (because it is proprietary), and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.
Comment
tonybolger

Senior Member

Join Date: Feb 2010

Posts: 156
#19

05-04-2011, 12:51 AM

Originally posted by themwg View Post

Could you elaborate on what you mean by frayed ropes turning into single contigs?

This phrase 'frayed rope' refers to the shape of part of the assembly graph. If you have non-tandem repeats, you get a graph something like:

Code:

A--->        ---->E
     C--->D
B--->        ---->F

where the 'correct' paths are A->C->D->E and B->C->D->F, with C->D being a repeat.

It appears that CLC is overly aggressive for my taste: it collapses the A->C and B->C paths into a forced consensus, even when there is strong support for the different paths. The same goes for D->E and D->F. Unfortunately, due to the lack of tuning options, this isn't easy to prevent. Check for Ns in the assembly - this might be an indicator.

Faced with this situation, other assemblers usually produce 5 contigs (A, B, C->D, E and F), whereas CLC will produce 1. This has already caused us to closely investigate differences in gene family size (versus a related organism), which turned out to be merely related genes 'merged' in the CLC assembly.
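
As a rough illustration of where the "5 contigs" come from, the Python sketch below cuts a small directed graph into maximal non-branching paths, which is one common way assemblers turn a graph into contigs. The segment-level graph and the function name `non_branching_paths` are assumptions made for this example. It does not reproduce the internals of CLC, SOAPdenovo or any other assembler.

```python
from collections import defaultdict


def non_branching_paths(edges):
    """Return the maximal non-branching paths of a directed graph.

    An edge u->v is merged into a single contig only when u has exactly
    one outgoing edge and v has exactly one incoming edge. Isolated
    cycles are not reported by this simplified version.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def joins(u, v):
        # True when the edge u->v is unambiguous on both sides.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    paths = []
    for node in sorted(nodes):
        # Skip nodes that get absorbed into their predecessor's path.
        if len(pred[node]) == 1 and joins(pred[node][0], node):
            continue
        path = [node]
        while len(succ[path[-1]]) == 1 and joins(path[-1], succ[path[-1]][0]):
            path.append(succ[path[-1]][0])
        paths.append(path)
    return paths


# The 'frayed rope': A and B both lead into the repeat C->D,
# which then fans out into E and F.
frayed_rope = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]

if __name__ == "__main__":
    for contig in non_branching_paths(frayed_rope):
        print("->".join(contig))
```

Only the repeat C->D can be merged safely. Joining A or B to it, or joining it to E or F, would mean picking one side of a branch, and that is the choice a conservative assembler declines to make.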

Originally posted by themwg View Post

We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP. CLC is also faster and able to handle significantly more data.

Agreed on all points - the problem is one of correctness however.

Originally posted by themwg View Post

We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100 bp) and one mate-pair lane (~50 million reads at 36 bp, with a library size of 3 kb). Approximately 75% of paired reads end up mapping. With a 200 bp minimum contig length we get an N50 of ~4 kb.

To compare with SOAP on our limited machine (44 GB RAM), we can only process ~50 million reads. That subset gives an N50 of 1 kb with CLC, while SOAP gives an N50 of 300 bp.

This would tally with my experience.

For SOAP assemblies, I would strongly recommend pre-filtering the reads by quality - it considerably reduces the memory footprint. Both assemblers may well give a better N50 with filtering. Still, I would expect CLC to beat SOAP by a factor of 5-10 in contig N50.
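
A quality pre-filter can be as simple as dropping reads whose mean Phred score is too low. The sketch below is one possible version in Python. The function names, the threshold of 20 and the Phred+33 offset are assumptions for the example (older Illumina pipelines wrote Phred+64, so check your files), and it is not the filter used for the numbers quoted in this thread.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of one FASTQ quality string."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_fastq(lines, min_mean_q=20, offset=33):
    """Yield (header, sequence, plus, quality) for reads that pass.

    lines: iterable of FASTQ lines, four lines per record.
    A read passes when its mean Phred score is at least min_mean_q.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # Consume the same iterator four times to group lines into records.
    for header, seq, plus, qual in zip(*[stripped] * 4):
        if qual and mean_phred(qual, offset) >= min_mean_q:
            yield header, seq, plus, qual


if __name__ == "__main__":
    example = [
        "@read1", "ACGT", "+", "IIII",   # Phred 40 at every base
        "@read2", "ACGT", "+", "!!!!",   # Phred 0 at every base
    ]
    for record in filter_fastq(example):
        print("\n".join(record))
```

For paired-end data, both mates have to be filtered together so the two files stay in sync. This sketch treats every read independently and does not do that.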

SOAP contig N50 is somewhat hampered by the fact that it doesn't use pairing information at all until the scaffolding stage. It is also broken in other interesting ways, but there doesn't seem to be a perfect beast for the job. You might also want to give the new CLC v4 beta a spin - it doesn't work on very big assemblies, but 200 million reads may be ok.

Originally posted by themwg View Post

We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (in the example above, SSPACE scaffolding gives N50 = 88 kb).

Originally posted by themwg View Post

Still, CLC users are rare (because it is proprietary), and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.

You can control the k-mer size with CLC, with -w, up to a max of 31 (at least in the version I'm using - 3.20). Unfortunately, it's about the only thing you can control.
Comment
jiltysequence

Banned

Join Date: Jun 2011

Posts: 5
#20

06-17-2011, 10:27 AM

Originally posted by seb567 View Post

Yes, I think it is very clever to store genome variations as they are encountered.

I've been hosting genome variations on a secure cloud server (have you heard of http://www.rackspace.com?). It would be interesting if some of us were able to collaborate and create some kind of archive. This would be a good step toward making information, from basic to advanced, available to interested people of all shapes and sizes. What do you guys think?

Last edited by jiltysequence; 06-23-2011, 10:16 AM.
Comment
sagarutturkar

Member

Join Date: Sep 2010

Posts: 61
#21

02-20-2012, 02:49 PM

Running Abyss

Hi,

Several people who posted in this thread were able to run ABySS successfully. I am a novice and have some questions about running it. Please answer:

Question 1:
I want to use ABySS for paired-read assembly, but I have the paired reads (forward and reverse) in a single file. This is the file generated after quality trimming.

The file structure is
>001_forward
ATGC.......
>001_reverse
ATGC....
>002_forward
ATGC....
>002_reverse
ATGC....

How do I run ABySS on such a file? What command do I need? Any suggestions?

Question 2:

I have several paired-end files for a single genome, e.g. the reads for genome X are
001_R1.fastq 001_R2.fastq
002_R1.fastq 002_R2.fastq
003_R1.fastq 003_R2.fastq

Do I need to treat each pair as a separate library, or would it work fine to list them all together, like this?
abyss-pe name=ecoli k=64 in='001_R1.fastq 001_R2.fastq 002_R1.fastq 002_R2.fastq'

Question 3:
Does ABySS have automated quality trimming built in, or is it necessary to use quality-trimmed reads? I read somewhere that it has a -q flag.

Thanks
Comment
westerman

Rick Westerman

Join Date: Jun 2008

Posts: 1103
#22

02-21-2012, 07:15 AM

#1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.

#2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"

#3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
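
For #1, the reply suggests perl or awk. The same split can be sketched in Python, as below. It assumes the headers end in `_forward` and `_reverse` exactly as in the question. The function name and the output file names in the usage note are made up for this example. It does not check that every forward read still has its mate, which matters after quality trimming.

```python
def split_interleaved_fasta(lines):
    """Split FASTA lines into (forward, reverse) lists of lines.

    Records are routed by header suffix: '_forward' or '_reverse'.
    Sequence lines follow whichever header came last, so multi-line
    sequences are handled too.
    """
    forward, reverse = [], []
    target = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if line.endswith("_forward"):
                target = forward
            elif line.endswith("_reverse"):
                target = reverse
            else:
                raise ValueError(f"unexpected header: {line}")
        elif target is None:
            raise ValueError("sequence line before any header")
        target.append(line)
    return forward, reverse


if __name__ == "__main__":
    example = [
        ">001_forward", "ATGC",
        ">001_reverse", "GCAT",
        ">002_forward", "AAAA",
        ">002_reverse", "TTTT",
    ]
    fwd, rev = split_interleaved_fasta(example)
    print("\n".join(fwd))
    print("\n".join(rev))
```

Writing the two lists to, say, reads_1.fa and reads_2.fa (hypothetical names) gives a pair of files that can be passed to abyss-pe in the same way as the 001_R1.fastq / 001_R2.fastq pairs above.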
Comment
sagarutturkar

Member

Join Date: Sep 2010

Posts: 61
#23

02-22-2012, 09:08 AM

Originally posted by westerman View Post

#1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.

#2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"

#3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?

Thank you very much. That was helpful.
Comment
Nomijill

Member

Join Date: Sep 2009

Posts: 25
#24

02-22-2012, 03:25 PM

Originally posted by eslondon View Post

I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.

We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.

-ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The Contig N50 was a bit poor, whatever we tried. It took a fair while, and we never got it to parallelize.

-SOAPdenovo: very fast because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The Contig N50 was not great, but better than ABYSS (around 600bp)

-CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)

In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.

Finally we used the SOAPdenovo GapCloser to close gaps in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!

All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.

Now we are going to throw more data at it, hoping for a much better assembly

best regards

Elia

Update on the CLC bio de novo assembler: it now has scaffolding and the ability to control bubble size, and it is faster than ever. I assemble 10 million paired-end MiSeq reads in 15 minutes on my 8 GB laptop. This is in the new version 5.0. The small memory footprint makes it possible to assemble on machines that would otherwise be too small. It is commercial, but it is free for two weeks, and the Genomics Workbench is very easy to use on Mac, Windows or Linux.
Comment
Aman Mahajan

Member

Join Date: Jan 2012

Posts: 22
#25

04-01-2012, 07:47 AM

<-- Information for assembly Scaffold 'output.scafSeq'.(cut_off_length < 100bp) -->

Size_includeN 14238304
Size_withoutN 14238304
Scaffold_Num 69976
Mean_Size 203
Median_Size 154
Longest_Seq 5423
Shortest_Seq 100
Singleton_Num 69976
Average_length_of_break(N)_in_scaffold 0

Known_genome_size NaN
Total_scaffold_length_as_percentage_of_known_genome_size NaN

scaffolds>100 69864 99.84%
scaffolds>500 2964 4.24%
scaffolds>1K 324 0.46%
scaffolds>10K 0 0.00%
scaffolds>100K 0 0.00%
scaffolds>1M 0 0.00%

Nucleotide_A 3733290 26.22%
Nucleotide_C 3403704 23.91%
Nucleotide_G 3387000 23.79%
Nucleotide_T 3714310 26.09%
GapContent_N 0 0.00%
Non_ACGTN 0 0.00%
GC_Content 47.69% (G+C)/(A+C+G+T)

N10 611 1677
N20 420 4532
N30 315 8483
N40 250 13577
N50 206 19868
N60 174 27405
N70 151 36212
N80 134 46255
N90 120 57488

Can anyone explain what Size_includeN means, and why Size_withoutN is the same number?

And what is the N50 value of this result?
Comment
nangillala

Junior Member

Join Date: Aug 2011

Posts: 9
#26

04-02-2012, 03:20 AM

Hi,
first of all: Which program gave you this output?

Originally posted by Aman Mahajan View Post

Size_includeN 14238304
Size_withoutN 14238304

Nucleotide_A 3733290 26.22%
Nucleotide_C 3403704 23.91%
Nucleotide_G 3387000 23.79%
Nucleotide_T 3714310 26.09%
GapContent_N 0 0.00%

I'm just guessing here, but in your example the nucleotides A,C,G and T add up to the total size and it seems like you have no Ns in it, thus the number including Ns is the same.

Originally posted by Aman Mahajan View Post

N10 611 1677
N20 420 4532
N30 315 8483
N40 250 13577
N50 206 19868
N60 174 27405
N70 151 36212
N80 134 46255
N90 120 57488

I _guess_ that your N50 is 206 here, because the N60 should be smaller, and so on. I don't know what the second number is. Maybe the number of contigs above this threshold, or something like that?
Shouldn't this be in the documentation of the program you are using to generate this output?

Hope this is of any help.
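
For reference, the usual definition of N50 can be sketched in a few lines of Python: sort the sequence lengths from longest to shortest and walk down the list until the running total reaches half of the assembly size. The function name `nx` is made up for this example. Returning the number of sequences alongside the length is only a guess at what the second column in the output above might be, so it should be checked against the program's documentation.

```python
def nx(lengths, fraction=0.5):
    """Return (length, count) for the Nx statistic.

    length: the shortest sequence in the smallest set of longest
            sequences that covers at least `fraction` of the total.
    count:  how many sequences are in that set.
    fraction=0.5 gives N50, fraction=0.9 gives N90, and so on.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    toy = [100, 200, 300, 400, 500]   # total size 1500
    print(nx(toy))                    # N50
    print(nx(toy, fraction=0.9))      # N90
```

In the toy example the two longest sequences (500 + 400) already cover more than half of the 1500 total, so the N50 is 400. This also shows why N60, N70 and so on can only stay equal or get smaller.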
Comment
Aman Mahajan

Member

Join Date: Jan 2012

Posts: 22
#27

04-02-2012, 09:10 AM

Software used: SOAPdenovo-Trans
Comment
