Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • pag
    Member
    • May 2012
    • 72

    phrap/consed - multiple copies during assembly

    Is there an established command line parameter (or pre-written script) to specify that certain reads should be/could be present in multiple copies in an assembly? For instance, I have reason to suspect that the genome may have up to 8 copies of one of the short contigs from a previous round of assembly, but even if I copy the contig sequence file a couple times, it will only align with itself and the highest scoring bridging sequence from a gap closure PCR sequencing reaction. Sometimes the other bridges to other contigs co-assemble, with significant mismatch error when it doesn't match one of the flanking contigs, other times the lower quality scored sequencing reaction doesn't assemble with anything.

    to give an example. I have a circular genome with order of sequenced contigs (assembled from 454-type reads)
    A-B-C-D-E-C-D-F-G where the -s represent unsequenced intervening DNA (where I have PCR products for sanger-type gap-closure sequencing)
    The assembly (with the intervening DNA sequenced) will give an assembly of something like A-B-C-D-F-G using "default" options. I'd rather see an assembly something like
    A-B-C + C-D-E + C-D-F + D-F-G (+ represents separate assembly contigs) or even A-B-C-D-F-G + D-E-C

    something with the -preassemble command line? I don't think it's -retain_duplicates? My method so far has been to manually create a separate chromat_dir for each pre-assembled contig or supercontig, place all my putative repeated contigs in all subdirectories (e.g. A, B, C, and D in one directory, C, D, F, G in another), then run the phrap assembly program on each directory and see (manually in consed) if I have multiple subdirectories that have a "duplicate" contig assembled in a different location than the other directories. This is both computationally time-consuming as well as time-consuming on the review end of things. Note also that the phred_crossmatch scripts don't seem to run properly on symbolically linked files, so I end up with hundreds to thousands of unneeded sequencing files on my computer (ie I'd rather use ln -s to point to the relevant copies in the main edit_dir/chromat_dir rather than cp).

    I'm running 64-bit ubuntu on an intel i5. I'd be willing to use another free (for academic use) package on linux or PC (windows 7 or XP).
    Last edited by pag; 05-18-2012, 09:00 AM.
  • sklages
    Senior Member
    • May 2008
    • 628

    #2
    If you have PE data, the assembler itself should be able to assemble the "short" contigs indepedently (if they are correctly recognized as forward or reverse).
    If not you don't have PE data, you probably have a problem ..

    But you may want to try MIRA3 assembler for assemblings reads (or contigs).
    Download MIRA for free. MIRA V5 is available only on GitHub! The V4 version released here on SourceForge stay up as some automated release fetching packages rely on V4.


    hth, Sven

    Comment

    • pag
      Member
      • May 2012
      • 72

      #3
      I don't have a paired end library for the gaps that I'm trying to close, but I'm Sanger sequencing from each end of the gap, so in that sense have paired ends.

      I was able to do a bit of perl and shell scripting to allow the "miniassembly" process to work with my dataset and that appears to align much more quickly than does manually doing a full alignment for each of the pre-assembled contigs and new sanger sequences. However, this appears to "wipe out" my other assemblies involving those pre-assembled contigs, so if I continue to use phred/phrap/consed I will have to figure out how to manually duplicate the generated sanger+netgen contigs to use as a reference

      I will look into Mira3 also though. Thanks!

      Comment

      • pag
        Member
        • May 2012
        • 72

        #4
        Trying to use MIRA now without luck. I don't have access to the raw sff 454 data and I consequently get the following error

        Fatal error (may be due to problems of the input data or parameters):

        "Read contig00001 is 284651 bp long and thus longer than MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."

        My command line was$ mira --project=mira_hyb --job=denovo,genome,accurate,sanger,454 SANGER_SETTINGS -LR:ft=phd 454_SETTINGS -LR:ft=fasta

        files are
        mira_hyb_in.454.fasta
        mira_hyb_in.phd.1 <-- cat'd all the phd files generated by p/p/c from Sanger reads into a single file
        mira_hyb_in.454.fasta.qual
        mira_hyb_traceinfo_in.454.xml

        Comment

        • sklages
          Senior Member
          • May 2008
          • 628

          #5
          hmm, ok, i think I get it now: you have pre-assembled contigs a few gap-closing sanger reads (probably PCR amplified), so you know your contig order.

          Then what's wrong with "MiniAssembly" provided by consed? That's just for that purpose ..

          And, yes, MIRA can only handle up to ~29K "reads" .. as it is a shotgun assembler as does certain assumptions about input data.

          Comment

          • pag
            Member
            • May 2012
            • 72

            #6
            I believe I've figured out the contig order with the exception of one single unbridged (with PCR) gap. Since it's a circular chromosome, that gap is semi-irrelevant to the order. But consed has no knowledge of the contig order itself (that was accomplished by mapping vs a reference genome of another species and confirmed via PCR).

            I have several of the contigs that are predicted to have multiple copies (based on relative read coverage), including on that is predicted to have 6 copies and which I see approximately 4 copies of thusfar. However getting consed to map/assemble everything is trying - I still have 21 contigs of an initial 36 by consed, whereas I've manually reduced the number of contigs to 9. Basically, consed is breaking things up when it enters into a repeat region as it doesn't know what the matching pairs of contigs are bridged by a repeat contig. It also doesn't look intelligently for "SNPs" of one "copy" vs the other.

            MiniAssembly may indeed be my answer: I had been using old phrap scripts from before consed had the present form of miniassembly, but after updating and augmenting the old scripts, I got that to work last night. But when you miniassemble contig e.g. 30 with contig 1 with default settings, and then try to assemble 30 with 2, it first breaks the 30-1 association before forming the 30-2 association, so it does not appear to be increasing the length of contigs while shrinking the number of contigs.
            Last edited by pag; 05-21-2012, 10:33 AM.

            Comment

            • sklages
              Senior Member
              • May 2008
              • 628

              #7
              Would you loose a lot if you'd try to assemble the reads with e.g. MIRA which does a pretty good job
              on repetitive sequences? Feed MIRA with your NGS data and the sanger reads and see what you
              get (assuming you don't have 500mio illumina reads ;-)

              Concerning the mini assembly, maybe you should simply addnewReads and then join your contigs
              manually (I think consed has a new option which may do this automatically). Seems you don't
              have a massive amount of sanger data.

              Comment

              • pag
                Member
                • May 2012
                • 72

                #8
                As I said, I don't have the raw NGS data. I've been experimenting with splitting the assembled contigs into parts small enough for MIRA and then having it re-assemble. But my bash scripting skills appear to have left me temporarily.

                In any case, there was a new version of consed released either yesterday or today...I'll install that then try the miniassembler

                Comment

                • pag
                  Member
                  • May 2012
                  • 72

                  #9
                  I chopped down the large contigs (into segments that overlap the next by 75%) so mira would accept them. However I get
                  Loading project from PHD file.
                  Counting sequence data from PHD file:
                  tcmalloc: large alloc 1073741824 bytes == 0x61270000 @
                  tcmalloc: large alloc 2147483648 bytes == 0xc1570000 @
                  tcmalloc: large alloc 4294967296 bytes == (nil) @
                  ...snip...
                  Dynamic allocs: 0
                  Align allocs: 0
                  Out of memory detected, exception message is: std::bad_alloc

                  command line: nice mira --job=denovo,genome,accurate,sanger -GE:not=4 -AS:sep=no SANGER_SETTINGS -LR:ft=phd

                  the initial 454 data was probably 70MBases (approx 300K reads with 90% overlap if I'm doing the math right)

                  Comment

                  • pag
                    Member
                    • May 2012
                    • 72

                    #10
                    consed doesn't want to play nice
                    if I do "add reads" and accept the default filter of *.fof (and self-aligning my duplicated contigs against themselves ahead of time), it gives me the following error
                    The list of reads must be simple filenames--not including the pathname or directory as with ../chromat_dir/phrap.out

                    if I instead attempt to import scf files for the individual reads, I get
                    Fatal: Bad boy!
                    You are trying to add 2 copies of read in file /data/genome/type/name/chromat_dir/dups/chromat_dir/contig00129.scf

                    and I can't miniassemble anything that's not in the alignment

                    I can't figure out the language of the first error - is it saying that the fof has to have a file listing of files that are in the directory already? and not given as paths, but as just the basename?
                    Last edited by pag; 05-31-2012, 11:20 AM.

                    Comment

                    • sklages
                      Senior Member
                      • May 2008
                      • 628

                      #11
                      consed wants you to just provide the names of the reads, without paths.
                      It will always look in ../chromat_dir/, ../phd_dir and if necessary in ../phdball_dir.

                      hmm, why don't you replicate your several copies of your contig just name them differently?
                      E.g. Contig1a, Contig1b etc. Then create an ACE file from your modified
                      fasta sequence and add the sanger gap-closing reads via 'addnewReads'.

                      Concerning MIRA, no need to work with phd. Split your initial fasta contigs in
                      "paired reads" and then assemble them using fasta/qual as input.

                      What system are you working on?

                      Comment

                      • pag
                        Member
                        • May 2012
                        • 72

                        #12
                        Sorry for being so long in replying here.

                        I believe there was just a retain duplicates type flag I needed in my phred/phrap script used by consed to make it work. In any case, I think what I eventually did was manually duplicate the files, edit the >tag in the file as well as the file name (or just duplicate the phd file and rename it?) and re-ran my command-line reassembly from edit_dir.

                        Then I
                        1) did miniassemblies to integrate the duplicated contigs into their appropriate supercontigs.
                        2) went to the edit_dir directory to find the .contig file that was generated from my latest assembly (possibly as .screen.contig) and the .contig.qual file and copied these to a new directory
                        3) did a sed search and replace to "fix" the names of the contigs to have names other than "Contig2" etc using the following bash script (invoked by ". myscript")
                        Code:
                        #!/bin/bash
                        #put longest names first as Contig22 contains Contig2
                        list1="Contig10 Contig11 Contig15 Contig23 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9"
                        list2="newname01 newname02 newname03 newname04 newname05 newname06 newname07 newname08 newname09 newname10"
                        frm=( $list1 )
                        to2=( $list2 )
                        
                        count=${#frm[@]}
                        c2=${#to2[@]}
                        cp mydir.contigs mycnt
                        cp mydir.contigs.qual myqual
                        
                        echo $count $c2
                        for i in `seq 1 $count`
                        do
                            nfrm=${frm[$i-1]}
                            nto=${to2[$i-1]}
                            sed "s/$nfrm/$nto/g" mycnt >mydir.fasta
                            cp mydir.fasta mycnt
                        #otherwise each change would be wiped out by the subsequent change
                            sed "s/$nfrm/$nto/g" myqual >mydir.fasta.qual
                            cp mydir.fasta.qual myqual
                        done
                        4) split_fasta.pl mydir.fasta.qual <--script not part of standard p/p/c package to my knowledge
                        5) for i in *.fa; do mv "$i" "`basename $i .fa`.fasta.qual"; done
                        6) split_fasta.pl mydir.fasta
                        7) for i in *.fa; do mv "$i" "`basename $i .fa`.fasta"; done
                        8) for i in *.fasta; do mktrace "$i" "`basename $i .fasta`.scf"; done <--- Makes both scf and phd.1 files, but annoyingly makes all peaks uniform height and 100% clean regardless of quality data such that you can't glance at the trace and "know" which regions are high quality vs moderate vs low
                        9) copy phds generated to ../phd_dir <---this retains the qual info. the scf loses the qual data so can be set aside. Also assembly from phds is quicker than from scfs.
                        10) rerun phred/phrap assembly script and see if any supercontigs generated in the last round now co-assemble into a new supercontig. do manual miniassemblies as needed.

                        I'm using the latest Ubuntu version

                        Comment

                        • pag
                          Member
                          • May 2012
                          • 72

                          #13
                          If you are converting from an ace screen file, you may find the following modifications to ace2fasta.perl handy (diff output: < indicates new lines. > indicates old)
                          Code:
                          36d35
                          < printf ("format %d\n",$nFormatOfAceFile);
                          54c53
                          < #BQ
                          ---
                          > 
                          58c57
                          <     open( filQuals, ">$szContigsFile.qual") || die "couldn't open qual file for writing";
                          ---
                          > 
                          65c64
                          < 		printf("Found contig %s\n",$szContigName);
                          ---
                          > 
                          85,115c84
                          <             } 
                          < 
                          < 	} elsif ( length( $_ ) >= 2 ) {
                          < 	   if ( substr( $_, 0, 2 ) eq "BQ" ) {
                          <                 # found a BQ=quality line - this corresponds with most recent CO line
                          < #                @aWords = split;
                          < #                $szContigName = $aWords[1];
                          <                 printf("Found qual %s\n",$szContigName);
                          < 
                          <                 print( filQuals ">$szContigName\n" );
                          < 
                          <                 while( <filAce>) {
                          <                     last if ( length( $_ ) == 1 );
                          < 
                          <                     # do an efficient check in case someone
                          <                     # has edited the ace file and put 
                          <                     # some whitespace on the line
                          < #                    if ( (substr( $_, 0, 1 ) eq " ") ||
                          < #                        (substr( $_, 0, 1 ) eq "\t" )) {
                          < #                        last if ( $_ =~ /^\s*$/ );
                          < #                    }
                          <                     # remove any pads
                          < #                    s/\*//g;
                          < 
                          < #leave pads in place - each is number<space>number
                          <                     print( filQuals $_ );
                          <                 }
                          < 
                          <                 # end of that contig
                          <                 print( filQuals "\n" );
                          < 	    }
                          ---
                          >             }
                          120d88
                          <     close( filQuals );

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Pathogen Surveillance with Advanced Genomic Tools
                            by seqadmin




                            The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                            03-24-2025, 11:48 AM
                          • seqadmin
                            New Genomics Tools and Methods Shared at AGBT 2025
                            by seqadmin


                            This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                            The Headliner
                            The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                            03-03-2025, 01:39 PM
                          • seqadmin
                            Investigating the Gut Microbiome Through Diet and Spatial Biology
                            by seqadmin




                            The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                            02-24-2025, 06:31 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 03-20-2025, 05:03 AM
                          0 responses
                          41 views
                          0 reactions
                          Last Post seqadmin  
                          Started by seqadmin, 03-19-2025, 07:27 AM
                          0 responses
                          46 views
                          0 reactions
                          Last Post seqadmin  
                          Started by seqadmin, 03-18-2025, 12:50 PM
                          0 responses
                          36 views
                          0 reactions
                          Last Post seqadmin  
                          Started by seqadmin, 03-03-2025, 01:15 PM
                          0 responses
                          191 views
                          0 reactions
                          Last Post seqadmin  
                          Working...