Benchmark (or experience) between SOAPdenovo, Velvet, Abyss, and ALLPATHS2
You could also try using the Columbus module from Velvet.
L. Collado Torres, Ph.D. student in Biostatistics.
Comment
-
Originally posted by tonybolger:
We've noticed a tendency for CLC denovo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into a single contig. These don't tend to be in coding regions, so you won't find them with genes or RNA.
We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP. CLC is also faster and able to handle significantly more data.
We have an insect genome of ~200 Mb for which we use one Illumina paired-end lane (~200 million reads at 100bp) and one mate-pair lane (~50 million reads at 36bp, with a library size of 3kb). Approx 75% of paired reads end up mapping. With 200bp min contigs we get an N50 of ~4kb.
To compare to SOAP using our limited machine (44GB RAM) we can only process ~50 million reads. That subset with CLC gives an N50 of 1kb while SOAP gives an N50 of 300bp.
We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (from the above example, post-SSPACE scaffolding gives N50=88kb). Still, CLC users are rare (due to it being proprietary) and the inability to control kmer size makes me wary. So I'd appreciate any further light on the subject.
Comment
-
Originally posted by themwg:
Could you elaborate on what you mean by frayed ropes turning into single contigs?
Code:
A--->        ---->E
     C--->D
B--->        ---->F
where the 'correct' paths are A->C->D->E and B->C->D->F, with C->D being a repeat.
It appears that CLC tends to be overly aggressive for my taste, and collapses the A->C and B->C paths into a forced consensus, even in the presence of strong support for the different paths. Likewise for D->E and D->F. Unfortunately, due to the lack of tuning options, this isn't easy to prevent. Check for Ns in the assembly - this might be an indicator.
Faced with this situation, other assemblers usually produce 5 contigs, whereas CLC will produce 1. This has already caused us to closely investigate family number differences of related genes (vs a related organism) which turned out to be merely 'merged' in the CLC assembly.
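To make the "5 contigs versus 1" point concrete, here is a minimal Python sketch of how a conservative assembler might cut contigs at branch points. It is only an illustration: the node names follow the diagram above (with the repeat C->D treated as one node, "CD"), the function name is made up, and this is not how CLC or any particular assembler is actually implemented.

```python
from collections import defaultdict

def compact_unbranched_paths(edges):
    """Merge nodes joined by unambiguous edges; stop at every branch point.

    Isolated cycles with no entry point are ignored in this sketch.
    """
    succ = defaultdict(list)
    pred = defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def mergeable(u, v):
        # u->v is unambiguous only if u has one way out and v has one way in.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    contigs = []
    for node in sorted(nodes):
        # Skip nodes that will be absorbed into a path started elsewhere.
        if len(pred[node]) == 1 and mergeable(pred[node][0], node):
            continue
        path = [node]
        while len(succ[path[-1]]) == 1 and mergeable(path[-1], succ[path[-1]][0]):
            path.append(succ[path[-1]][0])
        contigs.append(path)
    return contigs

# The 'frayed rope': two flanks enter the repeat CD, two flanks leave it.
frayed_rope = [("A", "CD"), ("B", "CD"), ("CD", "E"), ("CD", "F")]
print(compact_unbranched_paths(frayed_rope))
```

Because the repeat has two ways in and two ways out, nothing can be merged safely, so the sketch reports the two left flanks, the repeat and the two right flanks as separate contigs. Collapsing them into one contig means choosing (or averaging) a path the data does not uniquely support.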
Originally posted by themwg:
We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP. CLC is also faster and able to handle significantly more data.
Originally posted by themwg:
We have an insect genome of ~200 Mb for which we use one Illumina paired-end lane (~200 million reads at 100bp) and one mate-pair lane (~50 million reads at 36bp, with a library size of 3kb). Approx 75% of paired reads end up mapping. With 200bp min contigs we get an N50 of ~4kb.
To compare to SOAP using our limited machine (44GB RAM) we can only process ~50 million reads. That subset with CLC gives an N50 of 1kb while SOAP gives an N50 of 300bp.
For SOAP assemblies, I would strongly recommend pre-filtering the reads by quality - it considerably reduces the memory footprint. Both assemblers may well give a better N50 with filtering. Still, I would expect CLC to beat SOAP by a factor of 5-10 in contig N50.
SOAP contig N50 is somewhat hampered by the fact that it doesn't use pairing information at all until the scaffolding stage. It is also broken in other interesting ways, but there doesn't seem to be a perfect beast for the job. You might also want to give the new CLC v4 beta a spin - it doesn't work on very big assemblies, but 200 million reads may be OK.
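As a rough idea of what pre-filtering reads by quality can look like, here is a small Python sketch that drops reads whose mean base quality falls below a cutoff. The Phred+33 encoding, the cutoff of 20 and the function names are assumptions for illustration only; the thread does not say which filter or threshold was actually used.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def mean_phred(qual, offset=33):
    """Mean quality score, assuming Phred+33 encoding (an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_by_quality(records, min_mean_q=20.0):
    """Keep only records whose mean quality reaches the (hypothetical) cutoff."""
    for header, seq, qual in records:
        if mean_phred(qual) >= min_mean_q:
            yield header, seq, qual

example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
for record in filter_by_quality(read_fastq(example)):
    print(record[0])
```

For paired data both mates would need to be kept or dropped together so the two files stay in sync; that step is left out here.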
Originally posted by themwg:
We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (from the above example, post-SSPACE scaffolding gives N50=88kb).
Originally posted by themwg:
Still, CLC users are rare (due to it being proprietary) and the inability to control kmer size makes me wary. So I'd appreciate any further light on the subject.
Comment
-
Originally posted by seb567:
Yes, I think it is very clever to store genome variations as they are encountered.
Last edited by jiltysequence; 06-23-2011, 10:16 AM.
Comment
-
Running Abyss
Hi,
Several people who posted in this thread were able to run ABySS successfully. I am a novice and have some questions about running ABySS. Please answer:
Question 1:
I want to use ABySS for paired-read assembly, but my paired reads (forward and reverse) are in a single file. This is the file generated after quality trimming.
The file structure is:
>001_forward
ATGC.......
>001_reverse
ATGC....
>002_forward
ATGC....
>002_reverse
ATGC....
How do I run ABySS on such a file? I need the command for this. Any suggestions?
Question 2:
I have paired-end files for a single genome, e.g. the reads for genome X are:
001_R1.fastq 001_R2.fastq
002_R1.fastq 002_R2.fastq
003_R1.fastq 003_R2.fastq
Do I need to treat each pair as a separate library? Or would the following work fine:
abyss-pe name=ecoli k=64 in='001_R1.fastq 001_R2.fastq 002_R1.fastq 002_R2.fastq'
Question 3:
Does ABySS have automated quality trimming built in, or is it necessary to use quality-trimmed reads? I read somewhere that it has a -q flag.
Thanks
Comment
-
#1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.
#2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"
#3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
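For the file-splitting route in #1, a Python sketch along these lines would also do the job (perl or awk work just as well). It assumes the header suffixes _forward and _reverse shown in the question; the function name and output file names are made up for illustration.

```python
import io

def split_interleaved_fasta(lines, fwd_suffix="_forward", rev_suffix="_reverse"):
    """Split FASTA lines into forward and reverse records by header suffix.

    Multi-line sequences stay with the header that precedes them.
    """
    forward, reverse = [], []
    target = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name.endswith(fwd_suffix):
                target = forward
            elif name.endswith(rev_suffix):
                target = reverse
            else:
                raise ValueError("unexpected read name: " + name)
        elif target is None:
            raise ValueError("sequence line found before any header")
        target.append(line)
    return forward, reverse

example = io.StringIO(">001_forward\nATGC\n>001_reverse\nGCAT\n>002_forward\nAAAA\n>002_reverse\nTTTT\n")
fwd, rev = split_interleaved_fasta(example)

# Hypothetical output names; write one file per read direction.
with open("reads_1.fa", "w") as out1, open("reads_2.fa", "w") as out2:
    out1.write("\n".join(fwd) + "\n")
    out2.write("\n".join(rev) + "\n")
```

The two files could then be given to abyss-pe in the same way as the R1/R2 pairs in question 2. It is worth checking the ABySS documentation for how it expects the two reads of a pair to be named, since the _forward/_reverse suffixes may need to be changed.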
Comment
-
Originally posted by eslondon:
I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.
We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.
-ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The contig N50 was a bit poor, whatever we tried. It took a fair while, and we never got it to parallelize.
-SOAPdenovo: very fast, because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The contig N50 was not great, but better than ABYSS (around 600bp).
-CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2kb).
In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.
Finally we used the SOAPdenovo GapCloser to close gaps in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!
All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.
Now we are going to throw more data at it, hoping for a much better assembly
best regards
Elia
Comment
-
<-- Information for assembly Scaffold 'output.scafSeq'.(cut_off_length < 100bp) -->
Size_includeN 14238304
Size_withoutN 14238304
Scaffold_Num 69976
Mean_Size 203
Median_Size 154
Longest_Seq 5423
Shortest_Seq 100
Singleton_Num 69976
Average_length_of_break(N)_in_scaffold 0
Known_genome_size NaN
Total_scaffold_length_as_percentage_of_known_genome_size NaN
scaffolds>100 69864 99.84%
scaffolds>500 2964 4.24%
scaffolds>1K 324 0.46%
scaffolds>10K 0 0.00%
scaffolds>100K 0 0.00%
scaffolds>1M 0 0.00%
Nucleotide_A 3733290 26.22%
Nucleotide_C 3403704 23.91%
Nucleotide_G 3387000 23.79%
Nucleotide_T 3714310 26.09%
GapContent_N 0 0.00%
Non_ACGTN 0 0.00%
GC_Content 47.69% (G+C)/(A+C+G+T)
N10 611 1677
N20 420 4532
N30 315 8483
N40 250 13577
N50 206 19868
N60 174 27405
N70 151 36212
N80 134 46255
N90 120 57488
Can anyone explain what Size_includeN means, and why Size_withoutN is the same number?
And what is the N50 value of this result?
Comment
-
Hi,
first of all: Which program gave you this output?
Originally posted by Aman Mahajan:
Size_includeN 14238304
Size_withoutN 14238304
Nucleotide_A 3733290 26.22%
Nucleotide_C 3403704 23.91%
Nucleotide_G 3387000 23.79%
Nucleotide_T 3714310 26.09%
GapContent_N 0 0.00%
Originally posted by Aman Mahajan:
N10 611 1677
N20 420 4532
N30 315 8483
N40 250 13577
N50 206 19868
N60 174 27405
N70 151 36212
N80 134 46255
N90 120 57488
Shouldn't this be in the documentation of the program you are using to generate this output?
Hope this helps.
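For anyone reading the quoted numbers: GapContent_N is 0, so the assembly contains no N bases, which is why Size_includeN and Size_withoutN are identical (the four nucleotide counts add up to exactly 14238304). N50 is conventionally the length L such that sequences of length L or longer contain at least half of the total bases. In tables of this kind the two numbers on each Nxx row are usually read as that length and the number of sequences needed to reach it, so the N50 row would mean 206 bp reached with 19868 scaffolds. That reading is an assumption based on the usual layout, not something stated in the thread. A minimal Python sketch of the calculation, with a made-up function name and toy lengths:

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach half the total)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")

# Toy example: total is 30, and the two longest sequences (10 + 8) pass 15.
print(n50([2, 4, 6, 8, 10]))

# Nucleotide counts copied from the quoted table; with no Ns they sum to the full size.
base_counts = {"A": 3733290, "C": 3403704, "G": 3387000, "T": 3714310}
print(sum(base_counts.values()) == 14238304)
```

An N50 of about 200 bp with no scaffold over 10 kb suggests that very little scaffolding or contig extension took place in that run.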
Comment