Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • phrap/consed - multiple copies during assembly

    Is there an established command line parameter (or pre-written script) to specify that certain reads should be/could be present in multiple copies in an assembly? For instance, I have reason to suspect that the genome may have up to 8 copies of one of the short contigs from a previous round of assembly, but even if I copy the contig sequence file a couple times, it will only align with itself and the highest scoring bridging sequence from a gap closure PCR sequencing reaction. Sometimes the other bridges to other contigs co-assemble, with significant mismatch error when it doesn't match one of the flanking contigs, other times the lower quality scored sequencing reaction doesn't assemble with anything.

    to give an example. I have a circular genome with order of sequenced contigs (assembled from 454-type reads)
    A-B-C-D-E-C-D-F-G where the -s represent unsequenced intervening DNA (where I have PCR products for sanger-type gap-closure sequencing)
    The assembly (with the intervening DNA sequenced) will give an assembly of something like A-B-C-D-F-G using "default" options. I'd rather see an assembly something like
    A-B-C + C-D-E + C-D-F + D-F-G (+ represents separate assembly contigs) or even A-B-C-D-F-G + D-E-C

    something with the -preassemble command line? I don't think it's -retain_duplicates? My method so far has been to manually create a separate chromat_dir for each pre-assembled contig or supercontig, place all my putative repeated contigs in all subdirectories (e.g. A, B, C, and D in one directory, C, D, F, G in another), then run the phrap assembly program on each directory and see (manually in consed) if I have multiple subdirectories that have a "duplicate" contig assembled in a different location than the other directories. This is both computationally time-consuming as well as time-consuming on the review end of things. Note also that the phred_crossmatch scripts don't seem to run properly on symbolically linked files, so I end up with hundreds to thousands of unneeded sequencing files on my computer (ie I'd rather use ln -s to point to the relevant copies in the main edit_dir/chromat_dir rather than cp).

    I'm running 64-bit ubuntu on an intel i5. I'd be willing to use another free (for academic use) package on linux or PC (windows 7 or XP).
    Last edited by pag; 05-18-2012, 09:00 AM.

  • #2
    If you have PE data, the assembler itself should be able to assemble the "short" contigs indepedently (if they are correctly recognized as forward or reverse).
    If not you don't have PE data, you probably have a problem ..

    But you may want to try MIRA3 assembler for assemblings reads (or contigs).
    Download MIRA for free. MIRA - Sequence assembler and sequence mapping for whole genome shotgun and EST / RNASeq sequencing data. Can use Sanger, 454, Illumina and IonTorrent data.

    hth, Sven


    • #3
      I don't have a paired end library for the gaps that I'm trying to close, but I'm Sanger sequencing from each end of the gap, so in that sense have paired ends.

      I was able to do a bit of perl and shell scripting to allow the "miniassembly" process to work with my dataset and that appears to align much more quickly than does manually doing a full alignment for each of the pre-assembled contigs and new sanger sequences. However, this appears to "wipe out" my other assemblies involving those pre-assembled contigs, so if I continue to use phred/phrap/consed I will have to figure out how to manually duplicate the generated sanger+netgen contigs to use as a reference

      I will look into Mira3 also though. Thanks!


      • #4
        Trying to use MIRA now without luck. I don't have access to the raw sff 454 data and I consequently get the following error

        Fatal error (may be due to problems of the input data or parameters):

        "Read contig00001 is 284651 bp long and thus longer than MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."

        My command line was$ mira --project=mira_hyb --job=denovo,genome,accurate,sanger,454 SANGER_SETTINGS -LR:ft=phd 454_SETTINGS -LR:ft=fasta

        files are
        mira_hyb_in.454.fasta <-- cat'd all the phd files generated by p/p/c from Sanger reads into a single file


        • #5
          hmm, ok, i think I get it now: you have pre-assembled contigs a few gap-closing sanger reads (probably PCR amplified), so you know your contig order.

          Then what's wrong with "MiniAssembly" provided by consed? That's just for that purpose ..

          And, yes, MIRA can only handle up to ~29K "reads" .. as it is a shotgun assembler as does certain assumptions about input data.


          • #6
            I believe I've figured out the contig order with the exception of one single unbridged (with PCR) gap. Since it's a circular chromosome, that gap is semi-irrelevant to the order. But consed has no knowledge of the contig order itself (that was accomplished by mapping vs a reference genome of another species and confirmed via PCR).

            I have several of the contigs that are predicted to have multiple copies (based on relative read coverage), including on that is predicted to have 6 copies and which I see approximately 4 copies of thusfar. However getting consed to map/assemble everything is trying - I still have 21 contigs of an initial 36 by consed, whereas I've manually reduced the number of contigs to 9. Basically, consed is breaking things up when it enters into a repeat region as it doesn't know what the matching pairs of contigs are bridged by a repeat contig. It also doesn't look intelligently for "SNPs" of one "copy" vs the other.

            MiniAssembly may indeed be my answer: I had been using old phrap scripts from before consed had the present form of miniassembly, but after updating and augmenting the old scripts, I got that to work last night. But when you miniassemble contig e.g. 30 with contig 1 with default settings, and then try to assemble 30 with 2, it first breaks the 30-1 association before forming the 30-2 association, so it does not appear to be increasing the length of contigs while shrinking the number of contigs.
            Last edited by pag; 05-21-2012, 10:33 AM.


            • #7
              Would you loose a lot if you'd try to assemble the reads with e.g. MIRA which does a pretty good job
              on repetitive sequences? Feed MIRA with your NGS data and the sanger reads and see what you
              get (assuming you don't have 500mio illumina reads ;-)

              Concerning the mini assembly, maybe you should simply addnewReads and then join your contigs
              manually (I think consed has a new option which may do this automatically). Seems you don't
              have a massive amount of sanger data.


              • #8
                As I said, I don't have the raw NGS data. I've been experimenting with splitting the assembled contigs into parts small enough for MIRA and then having it re-assemble. But my bash scripting skills appear to have left me temporarily.

                In any case, there was a new version of consed released either yesterday or today...I'll install that then try the miniassembler


                • #9
                  I chopped down the large contigs (into segments that overlap the next by 75%) so mira would accept them. However I get
                  Loading project from PHD file.
                  Counting sequence data from PHD file:
                  tcmalloc: large alloc 1073741824 bytes == 0x61270000 @
                  tcmalloc: large alloc 2147483648 bytes == 0xc1570000 @
                  tcmalloc: large alloc 4294967296 bytes == (nil) @
                  Dynamic allocs: 0
                  Align allocs: 0
                  Out of memory detected, exception message is: std::bad_alloc

                  command line: nice mira --job=denovo,genome,accurate,sanger -GE:not=4 -AS:sep=no SANGER_SETTINGS -LR:ft=phd

                  the initial 454 data was probably 70MBases (approx 300K reads with 90% overlap if I'm doing the math right)


                  • #10
                    consed doesn't want to play nice
                    if I do "add reads" and accept the default filter of *.fof (and self-aligning my duplicated contigs against themselves ahead of time), it gives me the following error
                    The list of reads must be simple filenames--not including the pathname or directory as with ../chromat_dir/phrap.out

                    if I instead attempt to import scf files for the individual reads, I get
                    Fatal: Bad boy!
                    You are trying to add 2 copies of read in file /data/genome/type/name/chromat_dir/dups/chromat_dir/contig00129.scf

                    and I can't miniassemble anything that's not in the alignment

                    I can't figure out the language of the first error - is it saying that the fof has to have a file listing of files that are in the directory already? and not given as paths, but as just the basename?
                    Last edited by pag; 05-31-2012, 11:20 AM.


                    • #11
                      consed wants you to just provide the names of the reads, without paths.
                      It will always look in ../chromat_dir/, ../phd_dir and if necessary in ../phdball_dir.

                      hmm, why don't you replicate your several copies of your contig just name them differently?
                      E.g. Contig1a, Contig1b etc. Then create an ACE file from your modified
                      fasta sequence and add the sanger gap-closing reads via 'addnewReads'.

                      Concerning MIRA, no need to work with phd. Split your initial fasta contigs in
                      "paired reads" and then assemble them using fasta/qual as input.

                      What system are you working on?


                      • #12
                        Sorry for being so long in replying here.

                        I believe there was just a retain duplicates type flag I needed in my phred/phrap script used by consed to make it work. In any case, I think what I eventually did was manually duplicate the files, edit the >tag in the file as well as the file name (or just duplicate the phd file and rename it?) and re-ran my command-line reassembly from edit_dir.

                        Then I
                        1) did miniassemblies to integrate the duplicated contigs into their appropriate supercontigs.
                        2) went to the edit_dir directory to find the .contig file that was generated from my latest assembly (possibly as .screen.contig) and the .contig.qual file and copied these to a new directory
                        3) did a sed search and replace to "fix" the names of the contigs to have names other than "Contig2" etc using the following bash script (invoked by ". myscript")
                        #put longest names first as Contig22 contains Contig2
                        list1="Contig10 Contig11 Contig15 Contig23 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9"
                        list2="newname01 newname02 newname03 newname04 newname05 newname06 newname07 newname08 newname09 newname10"
                        frm=( $list1 )
                        to2=( $list2 )
                        cp mydir.contigs mycnt
                        cp mydir.contigs.qual myqual
                        echo $count $c2
                        for i in `seq 1 $count`
                            sed "s/$nfrm/$nto/g" mycnt >mydir.fasta
                            cp mydir.fasta mycnt
                        #otherwise each change would be wiped out by the subsequent change
                            sed "s/$nfrm/$nto/g" myqual >mydir.fasta.qual
                            cp mydir.fasta.qual myqual
                        4) mydir.fasta.qual <--script not part of standard p/p/c package to my knowledge
                        5) for i in *.fa; do mv "$i" "`basename $i .fa`.fasta.qual"; done
                        6) mydir.fasta
                        7) for i in *.fa; do mv "$i" "`basename $i .fa`.fasta"; done
                        8) for i in *.fasta; do mktrace "$i" "`basename $i .fasta`.scf"; done <--- Makes both scf and phd.1 files, but annoyingly makes all peaks uniform height and 100% clean regardless of quality data such that you can't glance at the trace and "know" which regions are high quality vs moderate vs low
                        9) copy phds generated to ../phd_dir <---this retains the qual info. the scf loses the qual data so can be set aside. Also assembly from phds is quicker than from scfs.
                        10) rerun phred/phrap assembly script and see if any supercontigs generated in the last round now co-assemble into a new supercontig. do manual miniassemblies as needed.

                        I'm using the latest Ubuntu version


                        • #13
                          If you are converting from an ace screen file, you may find the following modifications to ace2fasta.perl handy (diff output: < indicates new lines. > indicates old)
                          < printf ("format %d\n",$nFormatOfAceFile);
                          < #BQ
                          <     open( filQuals, ">$szContigsFile.qual") || die "couldn't open qual file for writing";
                          < 		printf("Found contig %s\n",$szContigName);
                          <             } 
                          < 	} elsif ( length( $_ ) >= 2 ) {
                          < 	   if ( substr( $_, 0, 2 ) eq "BQ" ) {
                          <                 # found a BQ=quality line - this corresponds with most recent CO line
                          < #                @aWords = split;
                          < #                $szContigName = $aWords[1];
                          <                 printf("Found qual %s\n",$szContigName);
                          <                 print( filQuals ">$szContigName\n" );
                          <                 while( <filAce>) {
                          <                     last if ( length( $_ ) == 1 );
                          <                     # do an efficient check in case someone
                          <                     # has edited the ace file and put 
                          <                     # some whitespace on the line
                          < #                    if ( (substr( $_, 0, 1 ) eq " ") ||
                          < #                        (substr( $_, 0, 1 ) eq "\t" )) {
                          < #                        last if ( $_ =~ /^\s*$/ );
                          < #                    }
                          <                     # remove any pads
                          < #                    s/\*//g;
                          < #leave pads in place - each is number<space>number
                          <                     print( filQuals $_ );
                          <                 }
                          <                 # end of that contig
                          <                 print( filQuals "\n" );
                          < 	    }
                          >             }
                          <     close( filQuals );


                          Latest Articles


                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin

                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin

                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM





                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          Last Post seqadmin