Unconfigured Ad

**westerman** · 10-13-2015, 12:28 PM

Well, Mauve is usually used for genome-to-genome alignments. Although it can be used to map a draft genome with many contigs to a more finished genome. Depending on the size of your contigs and the reference perhaps a mapper such as BBMap might work better -- you would at least get a BAM file that you could work with.

But since you are already using Mauve, well, it seems simple to me to write a short script via 'cut', 'grep', 'paste', 'diff' etc. to pull out the non-mapping contigs.

I know my response isn't much help. Sorry about that. Just random thoughts at the end of a long day.

**antifolate** · 10-13-2015, 01:54 PM

I don't know why I went and installed Mauve while I already had the BB package. But now that I tried BBMap, I don't see how a BAM file is better (I'm not even sure this output is BAM). Is there an option in BBMap to retain the contigs that don't get mapped?

I'll be doing this for several assemblies so it'd be easier and faster if there was a simple flag for it, rather than having to massage the data every time with unix tools.

**Brian Bushnell** · 10-13-2015, 02:07 PM

BBMap can separate mapped and unmapped sequences, and keep them in fasta format, like this:

mapPacBio.sh in=contigs.fasta ref=reference.fasta outm=mapped.fasta outu=unmapped.fasta

That has a sequence length limit of 6000bp; longer sequences will be broken into fragments (bbmap.sh has a length limit of 600bp while mapPacBio is 6000bp). On the other hand, BBDuk is alignment-free and can do the same thing using kmer matching, without breaking up the input contigs:

bbduk.sh in=contigs.fa ref=reference.fa outm=matching.fa outu=unmatching.fa k=31 mcf=0.7

"mcf" means "min covered fraction", so this will make input sequences in which at least 70% of their bases are covered by reference kmers go to outm, and the others will go to outu.

Another way of doing it is to shred the input (or reference) contigs and map them to the other, get a bed file of the resulting sam, and extract the covered regions. There are a lot of ways, I guess, depending on your specific goal.

**antifolate** · 10-13-2015, 02:52 PM

Thanks Brian! About BBMap, does that mean that I shouldn't be using it (or even mapPacBio) to map contigs with n50=91k into the reference (s. pombe)?

Also could you please please go into the details of your last method?

**Brian Bushnell** · 10-13-2015, 05:09 PM

Originally posted by antifolate View Post

Thanks Brian! About BBMap, does that mean that I shouldn't be using it (or even mapPacBio) to map contigs with n50=91k into the reference (s. pombe)?

It really depends on your purpose, but for anything that requires the contigs to remain intact, it's not ideal; Seal (or an aligner for unlimited-length reads) would be better.

Also could you please please go into the details of your last method?

For example -

bbduk.sh in=contigs.fa ref=reference.fa out=masked_contigs.fa kmask=N k=31

That would produce a file like the contigs.fa, but with all bases covered by reference kmers masked (converted to N). This is handy because it still works even if both fastas contain a complete, single-contig genome, without shredding anything or changing any coordinates. Filtering methods would not really work well for single-contig genomes.

**GenoMax** · 10-14-2015, 03:28 AM

@antifolate: You should post this question on Mauve list. Aaron will generally address mauve questions in a day or two. It may be possible to extract the info you need from one of the mauve files.

**antifolate** · 10-27-2015, 12:59 PM

@Brian Bushnell Is there a way to make bbduk.sh not count mismatches resulting from Ns towards the 0.7?

Also, what is k?

@GenoMax Can you link me to the Mauve list? I can't find it.

**Brian Bushnell** · 10-27-2015, 01:08 PM

Originally posted by antifolate View Post

@Brian Bushnell Is there a way to make bbduk.sh not count mismatches resulting from Ns towards the 0.7?

You cannot do that with the "mcf" (min covered fraction) flag, but you can with the "mkf" (min kmer fraction) flag.

mcf=0.7: reference kmers must cover 70% of the sequence's bases.

mkf=0.7: sequence must share 70% of its valid kmers with the reference.

The distinction is subtle, but mkf takes into account Ns - sequence with Ns do not have valid kmers, so it is ignored with respect to the 70%. mcf does not take Ns into account. Not currently, at least. I might change that in the future.

Also, what is k?

"k" is the kmer length. 31 is BBDuk's maximum; you can make it shorter if you want, but specificity starts declining rapidly once you drop below 20.

**antifolate** · 10-27-2015, 01:48 PM

God bless Brian Bushnell! Thanks for the quick reply!

**GenoMax** · 10-27-2015, 02:03 PM

Originally posted by antifolate View Post

@GenoMax Can you link me to the Mauve list? I can't find it.

mauve-users at lists.sourceforge.net

You may need to become a member first before you can post.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 18 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Extracting Unmapped Contigs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News