Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • antifolate
    Member
    • Aug 2015
    • 52

    Extracting Unmapped Contigs

    Hello,

    I've assembled some contigs from a sequenced strain and aligned them back to the reference with Mauve. I'm interested in studying those contigs that did not align. How can I extract those contigs?

    Mauve can export a file with the positions of all those contigs in the input file so I can probably write a script to get the sequences, but I thought there's probably an easier way. Also, if you know a better software than Mauve, please tell me. My purpose is comparing several de-novo-assembled strains with each other and with the reference genome (~20MB) to see sequence variations.

    Thanks!
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    Well, Mauve is usually used for genome-to-genome alignments. Although it can be used to map a draft genome with many contigs to a more finished genome. Depending on the size of your contigs and the reference perhaps a mapper such as BBMap might work better -- you would at least get a BAM file that you could work with.

    But since you are already using Mauve, well, it seems simple to me to write a short script via 'cut', 'grep', 'paste', 'diff' etc. to pull out the non-mapping contigs.

    I know my response isn't much help. Sorry about that. Just random thoughts at the end of a long day.

    Comment

    • antifolate
      Member
      • Aug 2015
      • 52

      #3
      I don't know why I went and installed Mauve while I already had the BB package. But now that I tried BBMap, I don't see how a BAM file is better (I'm not even sure this output is BAM). Is there an option in BBMap to retain the contigs that don't get mapped?

      I'll be doing this for several assemblies so it'd be easier and faster if there was a simple flag for it, rather than having to massage the data every time with unix tools.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        BBMap can separate mapped and unmapped sequences, and keep them in fasta format, like this:

        mapPacBio.sh in=contigs.fasta ref=reference.fasta outm=mapped.fasta outu=unmapped.fasta

        That has a sequence length limit of 6000bp; longer sequences will be broken into fragments (bbmap.sh has a length limit of 600bp while mapPacBio is 6000bp). On the other hand, BBDuk is alignment-free and can do the same thing using kmer matching, without breaking up the input contigs:

        bbduk.sh in=contigs.fa ref=reference.fa outm=matching.fa outu=unmatching.fa k=31 mcf=0.7

        "mcf" means "min covered fraction", so this will make input sequences in which at least 70% of their bases are covered by reference kmers go to outm, and the others will go to outu.

        Another way of doing it is to shred the input (or reference) contigs and map them to the other, get a bed file of the resulting sam, and extract the covered regions. There are a lot of ways, I guess, depending on your specific goal.
        Last edited by Brian Bushnell; 10-13-2015, 02:09 PM.

        Comment

        • antifolate
          Member
          • Aug 2015
          • 52

          #5
          Thanks Brian! About BBMap, does that mean that I shouldn't be using it (or even mapPacBio) to map contigs with n50=91k into the reference (s. pombe)?

          Also could you please please go into the details of your last method?

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            Originally posted by antifolate View Post
            Thanks Brian! About BBMap, does that mean that I shouldn't be using it (or even mapPacBio) to map contigs with n50=91k into the reference (s. pombe)?
            It really depends on your purpose, but for anything that requires the contigs to remain intact, it's not ideal; Seal (or an aligner for unlimited-length reads) would be better.

            Also could you please please go into the details of your last method?
            For example -

            bbduk.sh in=contigs.fa ref=reference.fa out=masked_contigs.fa kmask=N k=31

            That would produce a file like the contigs.fa, but with all bases covered by reference kmers masked (converted to N). This is handy because it still works even if both fastas contain a complete, single-contig genome, without shredding anything or changing any coordinates. Filtering methods would not really work well for single-contig genomes.

            Comment

            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              #7
              @antifolate: You should post this question on Mauve list. Aaron will generally address mauve questions in a day or two. It may be possible to extract the info you need from one of the mauve files.

              Comment

              • antifolate
                Member
                • Aug 2015
                • 52

                #8
                @Brian Bushnell Is there a way to make bbduk.sh not count mismatches resulting from Ns towards the 0.7?

                Also, what is k?

                @GenoMax Can you link me to the Mauve list? I can't find it.

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  Originally posted by antifolate View Post
                  @Brian Bushnell Is there a way to make bbduk.sh not count mismatches resulting from Ns towards the 0.7?
                  You cannot do that with the "mcf" (min covered fraction) flag, but you can with the "mkf" (min kmer fraction) flag.

                  mcf=0.7: reference kmers must cover 70% of the sequence's bases.

                  mkf=0.7: sequence must share 70% of its valid kmers with the reference.

                  The distinction is subtle, but mkf takes into account Ns - sequence with Ns do not have valid kmers, so it is ignored with respect to the 70%. mcf does not take Ns into account. Not currently, at least. I might change that in the future.

                  Also, what is k?
                  "k" is the kmer length. 31 is BBDuk's maximum; you can make it shorter if you want, but specificity starts declining rapidly once you drop below 20.

                  Comment

                  • antifolate
                    Member
                    • Aug 2015
                    • 52

                    #10
                    God bless Brian Bushnell! Thanks for the quick reply!

                    Comment

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Originally posted by antifolate View Post
                      @GenoMax Can you link me to the Mauve list? I can't find it.
                      mauve-users at lists.sourceforge.net

                      You may need to become a member first before you can post.

                      Comment

                      Latest Articles

                      Collapse

                      • GATTACAT
                        Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                        by GATTACAT
                        Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                        07-01-2026, 11:43 AM
                      • SEQadmin2
                        Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                        by SEQadmin2


                        I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                        Here are nine questions we think about, in roughly the order they matter, before...
                        06-18-2026, 07:11 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, 07-02-2026, 11:08 AM
                      0 responses
                      17 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-30-2026, 05:37 AM
                      0 responses
                      18 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-26-2026, 11:10 AM
                      0 responses
                      21 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-17-2026, 06:09 AM
                      0 responses
                      54 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...