Seqanswers Leaderboard Ad

**sklages** · 05-21-2012, 12:05 AM

If you have PE data, the assembler itself should be able to assemble the "short" contigs indepedently (if they are correctly recognized as forward or reverse).
If not you don't have PE data, you probably have a problem ..

But you may want to try MIRA3 assembler for assemblings reads (or contigs).

MIRA

https://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page

Download MIRA for free. MIRA - Sequence assembler and sequence mapping for whole genome shotgun and EST / RNASeq sequencing data. Can use Sanger, 454, Illumina and IonTorrent data.

hth, Sven

**pag** · 05-21-2012, 06:20 AM

I don't have a paired end library for the gaps that I'm trying to close, but I'm Sanger sequencing from each end of the gap, so in that sense have paired ends.

I was able to do a bit of perl and shell scripting to allow the "miniassembly" process to work with my dataset and that appears to align much more quickly than does manually doing a full alignment for each of the pre-assembled contigs and new sanger sequences. However, this appears to "wipe out" my other assemblies involving those pre-assembled contigs, so if I continue to use phred/phrap/consed I will have to figure out how to manually duplicate the generated sanger+netgen contigs to use as a reference

I will look into Mira3 also though. Thanks!

**pag** · 05-21-2012, 09:13 AM

Trying to use MIRA now without luck. I don't have access to the raw sff 454 data and I consequently get the following error

Fatal error (may be due to problems of the input data or parameters):

"Read contig00001 is 284651 bp long and thus longer than MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."

My command line was$ mira --project=mira_hyb --job=denovo,genome,accurate,sanger,454 SANGER_SETTINGS -LR:ft=phd 454_SETTINGS -LR:ft=fasta

files are
mira_hyb_in.454.fasta
mira_hyb_in.phd.1 <-- cat'd all the phd files generated by p/p/c from Sanger reads into a single file
mira_hyb_in.454.fasta.qual
mira_hyb_traceinfo_in.454.xml

**sklages** · 05-21-2012, 09:39 AM

hmm, ok, i think I get it now: you have pre-assembled contigs a few gap-closing sanger reads (probably PCR amplified), so you know your contig order.

Then what's wrong with "MiniAssembly" provided by consed? That's just for that purpose ..

And, yes, MIRA can only handle up to ~29K "reads" .. as it is a shotgun assembler as does certain assumptions about input data.

**pag** · 05-21-2012, 10:30 AM

I believe I've figured out the contig order with the exception of one single unbridged (with PCR) gap. Since it's a circular chromosome, that gap is semi-irrelevant to the order. But consed has no knowledge of the contig order itself (that was accomplished by mapping vs a reference genome of another species and confirmed via PCR).

I have several of the contigs that are predicted to have multiple copies (based on relative read coverage), including on that is predicted to have 6 copies and which I see approximately 4 copies of thusfar. However getting consed to map/assemble everything is trying - I still have 21 contigs of an initial 36 by consed, whereas I've manually reduced the number of contigs to 9. Basically, consed is breaking things up when it enters into a repeat region as it doesn't know what the matching pairs of contigs are bridged by a repeat contig. It also doesn't look intelligently for "SNPs" of one "copy" vs the other.

MiniAssembly may indeed be my answer: I had been using old phrap scripts from before consed had the present form of miniassembly, but after updating and augmenting the old scripts, I got that to work last night. But when you miniassemble contig e.g. 30 with contig 1 with default settings, and then try to assemble 30 with 2, it first breaks the 30-1 association before forming the 30-2 association, so it does not appear to be increasing the length of contigs while shrinking the number of contigs.

**sklages** · 05-21-2012, 10:52 AM

Would you loose a lot if you'd try to assemble the reads with e.g. MIRA which does a pretty good job
on repetitive sequences? Feed MIRA with your NGS data and the sanger reads and see what you
get (assuming you don't have 500mio illumina reads ;-)

Concerning the mini assembly, maybe you should simply addnewReads and then join your contigs
manually (I think consed has a new option which may do this automatically). Seems you don't
have a massive amount of sanger data.

**pag** · 05-22-2012, 02:41 AM

As I said, I don't have the raw NGS data. I've been experimenting with splitting the assembled contigs into parts small enough for MIRA and then having it re-assemble. But my bash scripting skills appear to have left me temporarily.

In any case, there was a new version of consed released either yesterday or today...I'll install that then try the miniassembler

**pag** · 05-22-2012, 10:18 AM

I chopped down the large contigs (into segments that overlap the next by 75%) so mira would accept them. However I get
Loading project from PHD file.
Counting sequence data from PHD file:
tcmalloc: large alloc 1073741824 bytes == 0x61270000 @
tcmalloc: large alloc 2147483648 bytes == 0xc1570000 @
tcmalloc: large alloc 4294967296 bytes == (nil) @
...snip...
Dynamic allocs: 0
Align allocs: 0
Out of memory detected, exception message is: std::bad_alloc

command line: nice mira --job=denovo,genome,accurate,sanger -GE:not=4 -AS:sep=no SANGER_SETTINGS -LR:ft=phd

the initial 454 data was probably 70MBases (approx 300K reads with 90% overlap if I'm doing the math right)

**pag** · 05-22-2012, 01:38 PM

consed doesn't want to play nice
if I do "add reads" and accept the default filter of *.fof (and self-aligning my duplicated contigs against themselves ahead of time), it gives me the following error
The list of reads must be simple filenames--not including the pathname or directory as with ../chromat_dir/phrap.out

if I instead attempt to import scf files for the individual reads, I get
Fatal: Bad boy!
You are trying to add 2 copies of read in file /data/genome/type/name/chromat_dir/dups/chromat_dir/contig00129.scf

and I can't miniassemble anything that's not in the alignment

I can't figure out the language of the first error - is it saying that the fof has to have a file listing of files that are in the directory already? and not given as paths, but as just the basename?

**sklages** · 05-22-2012, 10:06 PM

consed wants you to just provide the names of the reads, without paths.
It will always look in ../chromat_dir/, ../phd_dir and if necessary in ../phdball_dir.

hmm, why don't you replicate your several copies of your contig just name them differently?
E.g. Contig1a, Contig1b etc. Then create an ACE file from your modified
fasta sequence and add the sanger gap-closing reads via 'addnewReads'.

Concerning MIRA, no need to work with phd. Split your initial fasta contigs in
"paired reads" and then assemble them using fasta/qual as input.

What system are you working on?

**pag** · 05-31-2012, 10:20 AM

Sorry for being so long in replying here.

I believe there was just a retain duplicates type flag I needed in my phred/phrap script used by consed to make it work. In any case, I think what I eventually did was manually duplicate the files, edit the >tag in the file as well as the file name (or just duplicate the phd file and rename it?) and re-ran my command-line reassembly from edit_dir.

Then I
1) did miniassemblies to integrate the duplicated contigs into their appropriate supercontigs.
2) went to the edit_dir directory to find the .contig file that was generated from my latest assembly (possibly as .screen.contig) and the .contig.qual file and copied these to a new directory
3) did a sed search and replace to "fix" the names of the contigs to have names other than "Contig2" etc using the following bash script (invoked by ". myscript")

Code:

#!/bin/bash
#put longest names first as Contig22 contains Contig2
list1="Contig10 Contig11 Contig15 Contig23 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9"
list2="newname01 newname02 newname03 newname04 newname05 newname06 newname07 newname08 newname09 newname10"
frm=( $list1 )
to2=( $list2 )

count=${#frm[@]}
c2=${#to2[@]}
cp mydir.contigs mycnt
cp mydir.contigs.qual myqual

echo $count $c2
for i in `seq 1 $count`
do
    nfrm=${frm[$i-1]}
    nto=${to2[$i-1]}
    sed "s/$nfrm/$nto/g" mycnt >mydir.fasta
    cp mydir.fasta mycnt
#otherwise each change would be wiped out by the subsequent change
    sed "s/$nfrm/$nto/g" myqual >mydir.fasta.qual
    cp mydir.fasta.qual myqual
done

4) split_fasta.pl mydir.fasta.qual <--script not part of standard p/p/c package to my knowledge
5) for i in *.fa; do mv "$i" "`basename $i .fa`.fasta.qual"; done
6) split_fasta.pl mydir.fasta
7) for i in *.fa; do mv "$i" "`basename $i .fa`.fasta"; done
8) for i in *.fasta; do mktrace "$i" "`basename $i .fasta`.scf"; done <--- Makes both scf and phd.1 files, but annoyingly makes all peaks uniform height and 100% clean regardless of quality data such that you can't glance at the trace and "know" which regions are high quality vs moderate vs low
9) copy phds generated to ../phd_dir <---this retains the qual info. the scf loses the qual data so can be set aside. Also assembly from phds is quicker than from scfs.
10) rerun phred/phrap assembly script and see if any supercontigs generated in the last round now co-assemble into a new supercontig. do manual miniassemblies as needed.

I'm using the latest Ubuntu version

**pag** · 05-31-2012, 10:26 AM

If you are converting from an ace screen file, you may find the following modifications to ace2fasta.perl handy (diff output: < indicates new lines. > indicates old)

Code:

36d35
< printf ("format %d\n",$nFormatOfAceFile);
54c53
< #BQ
---
> 
58c57
<     open( filQuals, ">$szContigsFile.qual") || die "couldn't open qual file for writing";
---
> 
65c64
< 		printf("Found contig %s\n",$szContigName);
---
> 
85,115c84
<             } 
< 
< 	} elsif ( length( $_ ) >= 2 ) {
< 	   if ( substr( $_, 0, 2 ) eq "BQ" ) {
<                 # found a BQ=quality line - this corresponds with most recent CO line
< #                @aWords = split;
< #                $szContigName = $aWords[1];
<                 printf("Found qual %s\n",$szContigName);
< 
<                 print( filQuals ">$szContigName\n" );
< 
<                 while( <filAce>) {
<                     last if ( length( $_ ) == 1 );
< 
<                     # do an efficient check in case someone
<                     # has edited the ace file and put 
<                     # some whitespace on the line
< #                    if ( (substr( $_, 0, 1 ) eq " ") ||
< #                        (substr( $_, 0, 1 ) eq "\t" )) {
< #                        last if ( $_ =~ /^\s*$/ );
< #                    }
<                     # remove any pads
< #                    s/\*//g;
< 
< #leave pads in place - each is number<space>number
<                     print( filQuals $_ );
<                 }
< 
<                 # end of that contig
<                 print( filQuals "\n" );
< 	    }
---
>             }
120d88
<     close( filQuals );

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

phrap/consed - multiple copies during assembly

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News