What is the best reference assembler to use with ion torrent data? I can only seem to find information on Ion Torrent de novo assemblies, which is not what I'm looking for. Thanks in advance!
-
Hi - I'm working with a small bacterial genome ~2.0 Mbp but de novo (new species). Got data from an Ion 318 chip, about 480 Mbp. Ran it through Newbler 2.3 - 3,000+ contigs. Set up default MIRA assembly *six* days ago and it's still going. :-( I wouldn't use DNA* - ridiculously expensive for what it does. Roche RefMapper is OK for some of our other known bacterial genomes.
-
Originally posted by hengnck View Post: Hi - I'm working with a small bacterial genome ~2.0 Mbp but de novo (new species). Got data from an Ion 318 chip, about 480 Mbp. [...]
Are you using all 480 Mbp of data in the assembly, or are you downsampling? I ask because many software packages (like those you mention) will grossly underperform with excessive coverage and are reported to work best in the 30X to 50X range (and if this is DNA from pure culture, you're at ~240X). Are you a Torrent Suite user, and if so, are you using the MIRA plugin? The newest version (v2.2) allows you to specify the amount of coverage to use (best results are typically seen at ~50X).
Some have commented that they use Newbler at around 30X coverage for de novo assembly.
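The coverage arithmetic above is easy to sanity-check. A quick sketch using the approximate numbers quoted in this thread (480 Mbp of reads, a ~2.0 Mbp genome, a ~50X target):

```python
# Rough coverage math with the numbers from this thread (all approximate).
total_bases = 480e6   # ~480 Mbp of Ion 318 data
genome_size = 2.0e6   # ~2.0 Mbp bacterial genome
target_cov  = 50      # the ~50X sweet spot mentioned above

coverage  = total_bases / genome_size   # depth if all data is used
keep_frac = target_cov / coverage       # fraction of reads to keep
print(round(coverage), round(keep_frac, 3))  # -> 240 0.208
```

So hitting ~50X means keeping only about a fifth of the reads.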
-
I concur with IT's comments about excessive coverage. You really need to scale back your input to ~30X. Also, why Newbler 2.3? That is a very old version. Get version 2.6; they have made several improvements to the assembler.
-
Downsampling ion data
Hi All,
Thanks for your comments - you can all probably see that I'm more comfortable in the Sanger era. Unfortunately in my Faculty, I'm the "bioinformatics team".
My default SOP is to use all the data - the more the merrier - but I can see now that I have far more data than required. How do I downsample 480 Mbp of essentially random reads down to 100-150 Mbp?
All advice is greatly appreciated.
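For what it's worth, one standard way to downsample is simply to keep each read at random with a fixed probability (going from ~240X to ~50X means keeping roughly one read in five). A minimal sketch, assuming a plain 4-line-per-record FASTQ; the function name and demo data are made up for illustration:

```python
# Random downsampling of FASTQ records: keep each read with
# probability keep_frac. Assumes unwrapped 4-line-per-record FASTQ.
import random

def downsample_fastq(lines, keep_frac, seed=42):
    """Yield a random ~keep_frac subset of 4-line FASTQ records."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    it = iter(lines)
    for record in zip(it, it, it, it):  # header, sequence, '+', quality
        if rng.random() < keep_frac:
            yield from record

# Tiny demo on synthetic records: keep roughly 1 read in 5
records = []
for i in range(1000):
    records += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]
kept = list(downsample_fastq(records, keep_frac=0.2))
n_kept = len(kept) // 4
```

Because the reads are essentially random across the genome, a random subset like this preserves even coverage, whereas length filtering alone biases toward long reads.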
-
What size do you want the reads? You could cut out all the smaller reads (maybe 50 bp or less?). To do this you would need to write a short script, in Perl or Python, that gets the length of each read and writes the reads that "qualify" to an output file.
-
Re: Downsampling
Originally posted by jdilts View Post: What size do you want the reads? You could cut out all the smaller reads (maybe 50 bp or less?). [...]
-
Not too complicated
The script isn't too complicated. It would be something similar to this (it assumes the reads sit two lines per record: name, then sequence). I hope this can be of some assistance.

Code:
#!/usr/bin/perl
use strict;
use warnings;

my $infile  = "readFILE";
my $outfile = "quality_readsFILE";

# Open the file with the reads
open(IN, '<', $infile) || die $!;
my @reads = <IN>;    # store each line of the file in an array
close(IN);           # don't need the file anymore, close it

open(OUT, '>', $outfile) || die $!;    # open the outgoing file
my $j = 0;           # line counter
my $read_name;

# Iterate through the array: even lines hold names, odd lines hold sequences
foreach my $i (@reads) {
    chomp $i;
    if ($j % 2 == 1) {
        # keep only reads of 75 bp or longer
        print OUT "$read_name\n$i\n" if length($i) >= 75;
    }
    else {
        $read_name = $i;    # store the read name
    }
    $j++;
}
close(OUT);
-
Originally posted by jdilts View Post: The script isn't too complicated. It would be something similar to this. [...]
-
Length limiting is a great idea, jdilts.
One quick note on that quick Perl script...
The concept of length checking is a good one, but the script treats every other line as a read, so whether it works depends on the format the sequencer's reads come in.
e.g. FASTQ uses (at least) four lines per record: name, sequence, a '+' separator line (optionally repeating the name), and quality. That also assumes the sequence sits on a single line; a read's sequence may span more than one line as well.
If you need to parse a specific file type (as opposed to one with one read per line), I recommend either writing a new function/method or using a prewritten library that does it. I know BioPerl and Biopython both have packages that read many file types, FASTQ being just one of them.
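As a sketch of the record-aware filtering described here (standard library only rather than BioPerl/Biopython, assuming an unwrapped 4-line FASTQ; the names and the 75 bp cutoff are illustrative):

```python
# Length-filter FASTQ by whole records, not by individual lines.
# Assumes unwrapped 4-line-per-record FASTQ; 75 bp cutoff is illustrative.

def filter_fastq(lines, min_len=75):
    """Yield the four lines of each record whose sequence is >= min_len bp."""
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        if len(seq.rstrip("\n")) >= min_len:
            yield from (header, seq, plus, qual)

# Two in-memory records: a 5 bp read (dropped) and an 80 bp read (kept)
records = [
    "@read1\n", "ACGTA\n", "+\n", "IIIII\n",
    "@read2\n", "A" * 80 + "\n", "+\n", "I" * 80 + "\n",
]
kept = list(filter_fastq(records))
```

With Biopython the same idea is `SeqIO.parse(handle, "fastq")` plus a length check on each record, which also handles the formats where this naive line pairing breaks down.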
-Benjamin
Jackson Laboratory for Genomic Medicine
-
Originally posted by RonanC View Post: I just noticed that CLC bio are offering a free 6-month trial of their CLC Genomics Workbench to users with a benchtop NGS instrument (i.e. 454 GS Jr, MiSeq or Ion Torrent PGM). Anybody have any experience with the CLC software?
POSTEDIT: The original post mentioned "reference assembly" - perhaps I've crossed some wires; I was thinking "read mapping." For de novo assembly, CLC bio is also very fast and accurate. We routinely get down to sub-100 contigs with a single Ion 318 or MiSeq PE run for a 5 Mb genome (N50 is on average around 190 kb). Last edited by jonathanjacobs; 07-09-2012, 08:19 AM.