How to align contigs?

  • How to align contigs?

    I'm probably asking a basic question, but I've searched for hours and can't seem to find a straight answer.

    We recently sequenced the entire genome (~5 Mb) of a Salmonella strain using a brand-new 454 sequencer. Ours was one of the first samples run. Since this is new, no one here really knows what to do with the data.

    I ran the SFF reads through gsAssembler (i.e., Newbler) and now have contigs. Several strains of Salmonella have been sequenced and fully annotated. Thus, I believe it would be easiest to compare the contigs to a reference strain to figure out what gaps need to be filled. I used GS Reference Mapper to do this, but the data that comes out of Mapper is significantly less than what comes out of Assembler. Thus, I think Mapper might be chopping up the contigs to make them fit better.

    Is there a program that can take the contigs produced by Assembler (which are .ace files) and compare them to a reference sequence that I have in .fasta format? I have access to Consed, but can't seem to load a .fasta file into Consed to use as a reference.

    Thanks for the help!

  • #2
    Have you tried Mauve Genome Aligner? It's available at http://gel.ahabs.wisc.edu/mauve/.

    • #3
      BLAT is pretty good for comparing two sets of bacterial contigs.

      BLAT is faster than BLAST, and by default it writes a tab-delimited table (PSL format) that opens in Excel. This makes the results easy to view in Excel or to parse for follow-up review.

      BLAT is free for academic use and can be downloaded from the web.
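As a rough illustration of the "parse for follow-up" step, here is a minimal Python sketch that reads BLAT's default PSL output and keeps the best hit per contig. The column names follow the standard 21-field PSL layout; the function names `parse_psl` and `best_hit_per_query`, and the choice of "most matching bases" as the ranking rule, are assumptions made for this example rather than anything BLAT prescribes.

```python
from typing import Dict, Iterable, Iterator

# Column names of the 21-field PSL format that BLAT writes by default.
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName", "blockSizes", "qStarts", "tStarts"}


def parse_psl(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per alignment row, skipping the PSL header block."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Data rows have exactly 21 fields and start with an integer match count;
        # header and separator lines fail one of these checks.
        if len(parts) != len(PSL_FIELDS) or not parts[0].isdigit():
            continue
        row: Dict[str, object] = dict(zip(PSL_FIELDS, parts))
        for key in PSL_FIELDS:
            if key not in TEXT_FIELDS:
                row[key] = int(row[key])
        yield row


def best_hit_per_query(rows: Iterable[Dict[str, object]]) -> Dict[str, Dict[str, object]]:
    """Keep, for each query contig, the row with the most matching bases (assumed rule)."""
    best: Dict[str, Dict[str, object]] = {}
    for row in rows:
        current = best.get(row["qName"])
        if current is None or row["matches"] > current["matches"]:
            best[row["qName"]] = row
    return best
```

In practice you would pass an open file handle, for example `best_hit_per_query(parse_psl(open("contigs_vs_ref.psl")))`, where the file name is hypothetical.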

      • #4
        Originally posted by azmicro
        I'm probably asking a basic question, but I've searched for hours and can't seem to find a straight answer.

        We recently sequenced the entire genome (~5 Mb) of a Salmonella strain using a brand-new 454 sequencer. Ours was one of the first samples run. Since this is new, no one here really knows what to do with the data.

        I ran the SFF reads through gsAssembler (i.e., Newbler) and now have contigs. Several strains of Salmonella have been sequenced and fully annotated. Thus, I believe it would be easiest to compare the contigs to a reference strain to figure out what gaps need to be filled. I used GS Reference Mapper to do this, but the data that comes out of Mapper is significantly less than what comes out of Assembler. Thus, I think Mapper might be chopping up the contigs to make them fit better.

        Is there a program that can take the contigs produced by Assembler (which are .ace files) and compare them to a reference sequence that I have in .fasta format? I have access to Consed, but can't seem to load a .fasta file into Consed to use as a reference.

        Thanks for the help!
        I am working on something very similar. How large are your contigs? And have you made any headway that you can share?
        --
        bioinfosm

        • #5
          Actually, you can use the FASTA file of your contigs instead of the .ace file. Several programs can map your target contigs to the reference genome. OSLay is a nice one (http://www-ab.informatik.uni-tuebing...y/welcome.html). PGA4genomics can also be used to assemble your contigs following one or more reference genomes (http://nar.oxfordjournals.org/cgi/content/full/gkn168v1).
          You can also use MUMmer to lay out the contigs.
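To make the idea of "laying out" contigs concrete, here is a small Python sketch of what such tools do conceptually, not how OSLay, PGA4genomics, or MUMmer implement it. It assumes you have already reduced the alignments to one hit per contig with 0-based, end-exclusive reference coordinates; the `Hit` type and `layout_and_gaps` function are hypothetical names for this example, and the reference is treated as linear even though a bacterial chromosome is circular.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One contig placed on the reference (hypothetical, simplified record)."""
    contig: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # 0-based, exclusive
    strand: str     # "+" or "-"


def layout_and_gaps(hits: Iterable[Hit], ref_length: int) -> Tuple[List[Hit], List[Tuple[int, int]]]:
    """Order contigs along the reference and list the uncovered intervals."""
    ordered = sorted(hits, key=lambda h: (h.ref_start, h.ref_end))
    gaps: List[Tuple[int, int]] = []
    covered_to = 0  # everything before this reference position is covered
    for hit in ordered:
        if hit.ref_start > covered_to:
            gaps.append((covered_to, hit.ref_start))
        covered_to = max(covered_to, hit.ref_end)
    # Assumption: linear reference, so any uncovered tail is reported as one gap.
    if covered_to < ref_length:
        gaps.append((covered_to, ref_length))
    return ordered, gaps
```

The gap list is the part that answers "what needs to be filled": each interval is a stretch of the reference with no contig placed on it, which may be a true assembly gap or a region your strain simply lacks.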

          • #6
            Originally posted by azmicro
            Thus, I believe it would be easiest to compare the contigs to a reference strain to figure out what gaps need to be filled. I used GS Reference Mapper to do this, but the data that comes out of Mapper is significantly less than what comes out of Assembler. Thus, I think Mapper might be chopping up the contigs to make them fit better.
            Just to clarify: are you sure you compared the contigs, and not the original reads, to the reference strains?

            But more to the point - if your strain is divergent enough from your reference strains, then it doesn't seem surprising to me that you'd get less coverage by mapping from one strain to another, than by assembling your new strain de novo ... i.e. your mapping is failing wherever there's enough divergence, whereas if you have good reads, your assembly will cover divergent regions as well as homologous regions.

            • #7
              OSLay is brilliant for the purpose I wanted. Thanks much!
              --
              bioinfosm

              • #8
                I figured out what was wrong! I used Mauve to compare the 454ContigsAll.fna file that came out of Assembler to a reference .fasta genome I downloaded from GenBank. Mauve provides a really nice visualization of where the contigs match up. Through Mauve I found contigs that did not match the reference sequence. When I BLASTed these contigs, I discovered they matched up to a Salmonella plasmid. For the sequencing I just did a genomic prep and didn't even think to separate out the plasmid DNA. Thus, Assembler's output included contigs that matched up to a plasmid whereas Mapper only included contigs that matched the reference sequence. Hence, the discrepancy between the amount of data output. This definitely makes my life easier!
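For anyone repeating this, the step of pulling out the contigs that did not match the reference so they can be BLASTed separately could be scripted roughly as follows. This is only a sketch: it assumes you already have the set of contig IDs that did align (from Mauve or any other aligner), and the function names are made up for this example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs; the ID is the first word after '>'."""
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def unmatched_contigs(fasta_lines: Iterable[str], matched_ids: Set[str]) -> Dict[str, str]:
    """Return the contigs whose IDs are absent from the set of aligned contigs."""
    return {name: seq for name, seq in read_fasta(fasta_lines) if name not in matched_ids}


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format records as FASTA text, wrapping sequence lines at `width` bases."""
    out: List[str] = []
    for name, seq in records.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

The text returned by `to_fasta` can be saved to a file and submitted to BLAST, which is how plasmid-derived contigs like the ones described here would show up.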

                And in response to jnfass: Mapper compares the reads to a reference sequence and assembles contigs based on that comparison. Mapper then gives you much longer and thus far fewer contigs than Assembler.

                • #9
                  Glad you found your solution, azmicro ..
                  but I'd have to quibble that the number and length of contigs you'll get, and whether you get better (de novo) assemblies or (mapped) assemblies, will definitely depend on how divergent your reference and sequenced species are ... yours must be pretty close (being different strains, but not different species? maybe?)

                  • #10
                    I have a multi-chromosome reference sequence, and I want to map my 454-generated contigs (not reads) from a closely related species against it. The contigs are in one large multi-record FASTA file, and the chromosomes are in one large GenBank (.gbk) file, i.e., a single file with 15 sets of features plus sequence, ordered 1 through 15. I've tried Mauve Contig Mover, but while it did what looks like a great mapping job, and nicely displays the contig and chromosome boundary information (and annotations of the reference sequence) in the final alignment graphic, none of the output files I see allow me to easily map contigs on a per-chromosome basis (e.g., "this set of contigs maps to chromosome 12 in this order and orientation..."). The .tab file in the output gives ordered contig coordinates on a single giant pseudochromosome, which is all but useless to me without an indication of how these relate to the chromosome boundaries of the reference sequence. The output also includes a contig directory, which is empty.

                    Ultimately what I'm aiming for are synteny maps of each chromosome in my reference genome. I realize Mauve was developed mainly on prokaryotic (single chromosome) genomes, but am I missing something here? Is there an easy way to do what I want with Mauve, that I'm not seeing, short of running each chromosome separately as a reference sequence? If not, should I be trying a different contig mapper?
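One possible workaround, sketched here in Python under a strong assumption: that the pseudochromosome is simply the reference chromosomes concatenated in file order with nothing inserted between them. That assumption is not confirmed for Mauve Contig Mover's output and should be checked against known chromosome lengths before trusting the result. The function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_offsets(chrom_lengths: Sequence[int]) -> List[int]:
    """Return cumulative start offsets; offsets[i] is where chromosome i begins."""
    # ASSUMPTION: chromosomes are joined end to end, in order, with no spacer.
    return [0] + list(accumulate(chrom_lengths))


def locate(pos: int, names: Sequence[str], offsets: Sequence[int]) -> Tuple[str, int]:
    """Convert a 0-based pseudochromosome position to (chromosome, local position)."""
    if not 0 <= pos < offsets[-1]:
        raise ValueError(f"position {pos} is outside the concatenated reference")
    index = bisect_right(offsets, pos) - 1
    return names[index], pos - offsets[index]


def locate_interval(start: int, end: int, names: Sequence[str],
                    offsets: Sequence[int]) -> Tuple[Tuple[str, int], Tuple[str, int]]:
    """Locate both ends of a 0-based, end-exclusive interval.

    If the two chromosome names differ, the contig spans a chromosome boundary
    and needs a closer look.
    """
    return locate(start, names, offsets), locate(end - 1, names, offsets)
```

Grouping contigs by the chromosome name returned for their start coordinate, then sorting by local position, would give a per-chromosome order that could feed a synteny map.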

                    • #11
                      Since this is such an old thread (that I happened to be subscribed to), may I suggest starting a new one with your question...
