  • consed issues with newbler generated hybrid assembly

    I'm working in Consed with a hybrid Sanger/454 assembly that I generated using gsAssembler. I'm pretty familiar with consed and would like to use it to join contigs and analyze SNPs as I have done with Sanger-only assemblies. However, I'm running into some problems:

    (1) When I try to view a read trace, consed calls "sff2scf" on Sanger reads as if they were 454 instead of reading the pre-existing scf file. This results in an error as there is no sff file for Sanger reads. This is causing problems when, for example, I want to extend or change the consensus sequence, since Consed requires this be done from the trace window.

    The chromat_file is renamed following the Newbler convention of adding suffixes to reads based on the location of their mate pairs. For example, for a Sanger read named "ABCD.g1" the relevant lines in the ace file look like this:

    DS CHROMAT_FILE: ABCD.g1.548-1.fm12429.pr12429 PHD_FILE: ABCD.g1.548-1.fm12429.pr12429.phd.1 TIME: Thu Jul 27 12:33:48 2000 CHEM: unknown DYE: unknown TEMPLATE: ABCD DIRECTION: rev.

    Perhaps changing the chromat_file path in the ace file would help, but not if consed always calls "sff2scf".

    (2) Lengthy "unaligned" regions are present at the start and end of contigs. To my eye at least some of these regions look quite well aligned and frequently contain sequence overlapping with other contigs, which is necessary to manually join them using the "Compare Contigs" command. Why are these considered "unaligned" by newbler? And how can I use them to join contigs, since consed won't allow unaligned regions to be used in the "compare contigs" window?

    Anyone have some insight into these issues, or tools for hybrid Sanger/454 assemblies in general?

  • #2
    Originally posted by greigite
    I'm working in Consed with a hybrid Sanger/454 assembly that I generated using gsAssembler. I'm pretty familiar with consed and would like to use it to join contigs and analyze SNPs as I have done with Sanger-only assemblies. However, I'm running into some problems:
    Welcome aboard

    (1) When I try to view a read trace, consed calls "sff2scf" on Sanger reads as if they were 454 instead of reading the pre-existing scf file. This results in an error as there is no sff file for Sanger reads. This is causing problems when, for example, I want to extend or change the consensus sequence, since Consed requires this be done from the trace window.
    In this case you should use your own script to catch the type of read requested by consed and react accordingly.
    Or you can just write a simple script which puts your Sanger chromat in /tmp.
    Consed resources which you should consider:

    Code:
    consed.alwaysRunProgramToGetChromats
    consed.programToRunToGetChromats
    consed.uncompressedChromatDirectory
    consed.programToRunToGetChromatsOf454Reads
    The chromat_file is renamed following the Newbler convention of adding suffixes to reads based on the location of their mate pairs. For example, for a Sanger read named "ABCD.g1" the relevant lines in the ace file look like this:

    DS CHROMAT_FILE: ABCD.g1.548-1.fm12429.pr12429 PHD_FILE: ABCD.g1.548-1.fm12429.pr12429.phd.1 TIME: Thu Jul 27 12:33:48 2000 CHEM: unknown DYE: unknown TEMPLATE: ABCD DIRECTION: rev.

    Perhaps changing the chromat_file path in the ace file would help, but not if consed always calls "sff2scf".
    That is a typical newbler "problem". I never understood how newbler could break reads and put parts of these reads in different locations.
    That's a severe problem, especially for Sanger reads, as you lose all your read pair info.

    Maybe you can change or copy your chromat file to ABCD.g1.548-1.fm12429.pr12429 to enable consed to open the file.
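
The copy suggested above could be scripted roughly as follows. This is only a sketch: the function name `stage_sanger_chromat`, the suffix pattern (inferred from the single example name in this thread) and the idea of treating "no original file found" as "not a Sanger read" are all assumptions, not something consed or newbler prescribes.

```python
import re
import shutil
from pathlib import Path

# Assumed shape of a Newbler-renamed read: <original>.<start>-<end> followed by
# optional tags such as .fm12429 or .pr12429 (inferred from the example above).
NEWBLER_SUFFIX = re.compile(r"\.\d+-\d+(?:\.[a-z]+\d+)*$")


def stage_sanger_chromat(renamed, chromat_dir, out_dir="/tmp"):
    """Copy the original chromat into out_dir under the Newbler-renamed name.

    Returns the path of the copy, or None when no original chromat with the
    stripped name exists in chromat_dir (for example for a 454 read).
    """
    original = NEWBLER_SUFFIX.sub("", renamed)  # "ABCD.g1.548-1.fm1.pr1" -> "ABCD.g1"
    source = Path(chromat_dir) / original
    if not source.is_file():
        return None
    target = Path(out_dir) / renamed
    shutil.copyfile(source, target)
    return target
```

Called as `stage_sanger_chromat("ABCD.g1.548-1.fm12429.pr12429", "/path/to/chromat_dir")`, it would leave a copy of `ABCD.g1` in /tmp under the renamed name; the source directory here is a placeholder.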

    (2) Lengthy "unaligned" regions are present at the start and end of contigs. To my eye at least some of these regions look quite well aligned and frequently contain sequence overlapping with other contigs, which is necessary to manually join them using the "Compare Contigs" command. Why are these considered "unaligned" by newbler? And how can I use them to join contigs, since consed won't allow unaligned regions to be used in the "compare contigs" window?
    Just another newbler thing ... you could manually extend the consensus ... but that's pretty annoying.

    All these problems may vanish if the software suite gets updated sometime this year (probably by the end of the year ;-) )

    ... or not.

    Anyone have some insight into these issues, or tools for hybrid Sanger/454 assemblies in general?
    We are using newbler for a quick overview for denovo assembly/mapping or as a reference assembly.

    For routine work we go for MIRA3 (ESTs, metagenome, wgs) or celera assembler (large wgs projects). Both assemblers work very well with 454/sanger hybrid data.

    For finishing we use either Consed (large projects) or Staden's Gap4, depending on what to do with the data ..

    hth,
    Sven


    • #3
      Originally posted by greigite
      (2) Lengthy "unaligned" regions are present at the start and end of contigs. To my eye at least some of these regions look quite well aligned and frequently contain sequence overlapping with other contigs, which is necessary to manually join them using the "Compare Contigs" command. Why are these considered "unaligned" by newbler? And how can I use them to join contigs, since consed won't allow unaligned regions to be used in the "compare contigs" window?
      Originally posted by sklages
      Just another newbler thing ... you could manually extend the consensus ... but that's pretty annoying.

      All these problems may vanish if the software suite gets updated sometime this year (probably by the end of the year ;-) )

      ... or not.
      I've also seen this on Newbler 2.00.01 de novo assemblies of just 454 data (high coverage), but don't have an automated way of dealing with it yet.


      • #4
        Thanks very much for this information, sklages. I've been playing around with getting this to work but there is still an issue with displaying sanger chromats.

        I wrote a small perl script to identify 454 vs Sanger chromats and redirect the chromat files to /tmp where Consed can find them (I can post this if there is interest). You have to set some consed parameters to find the script (chromat_redirect.pl):
        Code:
        consed.programToRunToGetChromats=chromat_redirect.pl
        consed.alwaysRunProgramToGetChromats=last
        consed.uncompressedChromatDirectory=/tmp
        The script works fine to display 454 chromats and to copy Sanger chromats to /tmp, but there is a new problem relating to discrepancies between the phd files created by newbler and the original Sanger chromat files:

        Code:
        ace file: 454Contigs.ace.1
        Version 19.0 (090206)
        Sorry--the chromatogram file /tmp/ABCD12783.b1.482-291.fm24208.to24208 has 10349 trace array points while the phd file ABCD12783.b1.482-291.fm24208.to24208.phd.1
        was made from a chromatogram with 15592.  This means that someone 
        overwrote the original chromatogram file.   Check the file dates on the 
        chromatogram file and the phd file.  To correct this, I would suggest deleting the phd file and running the phredPhrap script again.  To prevent this from happening again, find out who/why the chromatogram was switched.  Sorry.
        The source of the problem is this line in the phd file:
        Code:
        TRACE_ARRAY_MAX_INDEX
        which is written in by newbler, and differs from the info in the original chromat file.
        Possibly this is happening because I used phred to trim the Sanger reads before putting them into newbler, so now the original chromats are a different length (though if this is the case, it doesn't make sense that there are fewer trace points in the original chromat than in the phd created by newbler). I may try manually editing the trace point info in the newbler phd files to see if that helps. If not I think I may give up on fixing this issue.
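
Before editing anything by hand, it may help to list the value each newbler-written phd file actually carries. The sketch below assumes the usual `KEY: value` layout of the PHD comment block; the function name and the sample text are made up for illustration, and the number in the sample is not taken from a real file.

```python
import re


def trace_array_max_index(phd_text):
    """Return the TRACE_ARRAY_MAX_INDEX value found in PHD file text, or None."""
    match = re.search(r"^TRACE_ARRAY_MAX_INDEX:\s*(\d+)\s*$", phd_text, re.MULTILINE)
    return int(match.group(1)) if match else None


# Illustrative fragment only; a real phd file has more comment lines and a DNA block.
sample = """BEGIN_COMMENT
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 15591
END_COMMENT
"""
print(trace_array_max_index(sample))
```

Comparing this value against the number of trace points consed reports for the chromat in /tmp would show which reads are affected.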
        That is a typical newbler "problem". I never understood how newbler could break reads and put parts of these reads in different locations.
        That's a severe problem, especially for Sanger reads, as you lose all your read pair info.
        I actually don't think it loses the pair info when it breaks the read; at least it still indicates the location of the mate pair in the read name using the suffix ".pr". The large "unaligned" regions are annoying though. Hopefully MIRA will work better once I figure out how to use it!


        • #5
          Originally posted by greigite
          Code:
          TRACE_ARRAY_MAX_INDEX
          which is written in by newbler, and differs from the info in the original chromat file.
          Possibly this is happening because I used phred to trim the Sanger reads before putting them into newbler, so now the original chromats are a different length (though if this is the case, it doesn't make sense that there are fewer trace points in the original chromat than in the phd created by newbler). I may try manually editing the trace point info in the newbler phd files to see if that helps. If not I think I may give up on fixing this issue.
          .... I am not really sure if it is worth fixing it ... you will probably run into the next problem ...

          I actually don't think it loses the pair info when it breaks the read; at least it still indicates the location of the mate pair in the read name using the suffix ".pr". The large "unaligned" regions are annoying though. Hopefully MIRA will work better once I figure out how to use it!
          One part of a read where it should be, the other part unaligned at this position but aligned to another contig .. so my mate pair only consists of parts of the reads? Not very convincing

          Depending on the kind of project you're trying to assemble, MIRA does a very good job.

          It writes two alignment formats for the most popular "finishing packages":

          a CAF file, which is easily converted to Staden's Gap4 or Gap5 format, and
          an ACE file, which probably needs some fixing (as MIRA doesn't create any phd.ball files you probably need to fix the TIME stamps in the ACE file).

          cheers,
          Sven


          • #6
            reason for unaligned regions

            Originally posted by maubp
            I've also seen this on Newbler 2.00.01 de novo assemblies of just 454 data (high coverage), but don't have an automated way of dealing with it yet.
            I think this issue with extensive "unaligned" regions appearing in consed happens due to Newbler placing the same read in multiple locations. The read is renamed to indicate the portion found in a particular contig, and only that part is used to construct a consensus sequence, but the entire read is shown in the alignment. In consed this appears as greyed out sequence to the left of the contig start or the right of the contig end.
            For example, the read name FY3Z7SM02JNDKE.456-510.fm1369 in contig 1370 indicates that positions 456-510 of the read are considered part of contig 1370 and the rest of the read belongs in contig 1369. However, looking at contig 1370, this read extends to position -454 before the start of the contig.
            Presumably these overlaps are corrected in the scaffolds output by Newbler, but it makes manual contig joins by consed cumbersome for 454 reads (you have to change the consensus via the chromatogram) and impossible for Sanger reads (chromatogram can't be displayed).
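
The naming scheme described above can be pulled apart with a short parser. This is a sketch based only on the read names quoted in this thread; the pattern, the function name `parse_newbler_name` and the returned dictionary layout are assumptions, and the tags (`fm`, `pr`, `to`) are reported as found without interpreting them.

```python
import re

# <read>.<start>-<end> followed by zero or more tags such as .fm1369 or .pr12429.
NEWBLER_NAME = re.compile(
    r"^(?P<read>.+?)\.(?P<start>\d+)-(?P<end>\d+)(?P<tags>(?:\.[a-z]+\d+)*)$"
)


def parse_newbler_name(name):
    """Split a Newbler-renamed read into original name, range and tags.

    Returns None for names that carry no <start>-<end> suffix.
    """
    match = NEWBLER_NAME.match(name)
    if match is None:
        return None
    tags = {
        key: int(number)
        for key, number in re.findall(r"\.([a-z]+)(\d+)", match.group("tags"))
    }
    return {
        "read": match.group("read"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "tags": tags,
    }


print(parse_newbler_name("FY3Z7SM02JNDKE.456-510.fm1369"))
```

The same function handles the Sanger-style names from earlier in the thread, where the original read name itself contains a dot (for example `ABCD.g1`) and the range runs backwards (`548-1`).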


            • #7
              I have recently resolved this problem with the help of Jim Knight (the creator of Newbler).

              The solution to getting your Sanger read traces to pop up in consed is quite simple. The first step is to use a version of Newbler later than 2.3 (such as the 4/19/2010 release).

              Then, you must add the path to the chromat for each read into the headers of the fasta file.

              For example, if your read name is ABCDEFG and your chromats are located in /user/bin/chromats/, then your fasta header must look like this:

              >ABCDEFG scf=/user/bin/chromats/ABCDEFG.scf
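
For more than a handful of reads, the headers can be rewritten in one pass. The sketch below assumes every chromat is named `<read name>.scf` and sits in a single directory, as in the example above; the function name `add_scf_paths` is made up for illustration.

```python
from pathlib import PurePosixPath


def add_scf_paths(fasta_lines, chromat_dir):
    """Yield FASTA lines with ' scf=<path>' appended to headers that lack it."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">") and " scf=" not in line:
            fields = line[1:].split()
            if fields:  # skip malformed headers that have no read name
                scf = PurePosixPath(chromat_dir) / f"{fields[0]}.scf"
                line = f"{line} scf={scf}"
        yield line


for out_line in add_scf_paths([">ABCDEFG\n", "ACGTACGT\n"], "/user/bin/chromats/"):
    print(out_line)
```

Sequence lines pass through untouched, and headers that already carry an `scf=` entry are left as they are, so the function can be run twice on the same file without doubling the path.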


              • #8
                Originally posted by Broadie
                [...]
                >ABCDEFG scf=/user/bin/chromats/ABCDEFG.scf
                We do store our chromatograms in tarballs ... the solution is obviously just a quick hack ;-)

                cheers,
                Sven


                • #9
                  duct tape has many uses
