Yes .. BBMap can do that!

AndrewP replied

04-03-2020, 08:40 AM
I'm trying to use BBmap to find all perfect hits or hits with an indel length 1.

Code:

bbmapskinner.sh in=kmer.fasta out=result.sam ambiguous=all strictmaxindel=1

I'm running a control experiment where I have a subsequence to a larger sequence in the reference. I can find the subsequence in the reference if it's an exact match. However, if I add an indel in either the subsequence or reference, BBmap is unable to map the reference. I thought by setting strictmaxindel to 1, it should be able to report an alignment with a single indel. I've tried setting strictmaxindel to 10 and it still doesn't find the alignment.

Is there something that I am doing wrong?
Leave a comment:

silask replied

03-11-2020, 04:53 AM

BBsketch alltoall is incomplete

Can I ask a question about bbsketch?

I want to compare the ANI between many genomes (1000+) to each other.
I did

Code:

bbsketch.sh perfile genome_folder/*.fasta out=sketch.gz k=31,24 threads=16

comparesketch.sh alltoall sketch.gz k=31,24 prealloc=0.75 format=3 threads=16 out=table.tsv

The log files seem correct:

Code:

Set threads to 16
Loading sketches.
Loaded 1157 sketches in 59.541 seconds.
Total Time:     59.784 seconds.

Code:

Set threads to 16
Loading sketches.
Executing kmer.KmerTableSet [ways=31, tabletype=10, prealloc=0.75]

Initial size set to 45218398
Initial:
Ways=31, initialSize=45218398, prefilter=f, prealloc=0.75
Memory: max=91268m, total=91268m, free=90848m, used=420m

3.713 seconds.
Indexed 2880884 unique and 10513099 total hashcodes.
Loaded 1157 sketches in 8.457 seconds.

Ran 1225005 comparisons in 9.344 seconds.
Total Time:     17.801 seconds.

In the final output, I'm missing some genome comparisons (which I get with mash). If I run bbsketch on a subset I get the expected comparisons.

- Genomes are highly similar.

#Query Ref ANI QSize RefSize QBases RBases QTaxID RTaxID KID WKID SSU
genome1.fasta genome2.fasta 94.223 1984118 1796930 1987598 1797650 -1 -1 24.952 27.523 .

- It is not simply due to the naming: I neither find "genome1 vs genome2" nor "genome 2 vs genome1"

Any idea?

Leave a comment:

roselaw27 replied

03-10-2020, 05:23 PM
Thank you for your respond. Unfortunately I tried using -F and filterbytile.sh still showed the same error message. The data were from HiSeq 2500.

Thank you again. I think I am going to try to find other ways to solve the problem.

Rose
Leave a comment:
GenoMax replied

03-10-2020, 04:13 AM
@roselaw27: You can use `-F` option with fastq-dump to try and recreate fastq headers in original illumina format.
Leave a comment:
roselaw27 replied

03-10-2020, 01:05 AM
Trouble parsing header

Dear BBMap team:

I tried to use filterbytile.sh to remove the reads with low quality, but I encountered an error message saying that there was a trouble parsing the header. I've read the description of the script and Brian Bushnell said that was possible when the reads were renamed (such as in SRA) and to contact him if such error happened.

I downloaded the sequencing data (SRA) from ncbi and used fastq-dump to get the fastq files. I wonder if there is a solution to this?

Thank you very much!
Rose
Leave a comment:
GenoMax replied

08-22-2019, 01:34 AM
@darthsequencer: You can use kmercountexact.sh from BBMap suite.
Leave a comment:
darthsequencer replied

08-21-2019, 03:43 PM
Is there a command to output the kmers of each sequence in a multifasta file?
Leave a comment:
bartn replied

08-05-2019, 11:07 PM
another program that is often used for these cases is DIAMOND

https://github.com/bbuchfink/diamond

But like GenoMax said you have to do some post processing of the output to get to the format you want..
Leave a comment:
GenoMax replied

08-05-2019, 10:28 AM
BBMap is not designed to work with protein data. You will have to find a different tool.

Take a look at blat from UCSC's Jim Kent. That should produce the stats in parse-able tab-delimited columner format. You will need to do post-processing to get the table you are looking for.
Leave a comment:
Illusive Man replied

08-05-2019, 09:42 AM
Dear BBMap users or Brian,

I have a database of about 800 proteins and all I would like to do is create an abundance table of "hits" to these 800 proteins for each one of my 30 samples. Samples as columns, genes as rows, cells as counts. Can BBMap help me create this abundance table?

Thanks
Leave a comment:
bartn replied

07-10-2019, 05:09 AM
That's indeed unfortunate. I might try sending an email. A previous bug report was solved very quickly!

One of the reasons I like about BBMap is indeed the wide range of statistics it can generate. Just noticed that final summary percentages like I showed, are to me the most useful thing to look at on first check was missing by default.

And thanks a lot for your answers and your time GenoMax!
Leave a comment:
GenoMax replied

07-10-2019, 03:40 AM
Unfortunately Brian would be the only person who can provide an authoritative answer and he no longer participates on this forum. You may want to try emailing him with your question directly.

BBMap provides a wide range of coverage/histogram/stats outputs. Default output (while voluminous does not contain coverage stats) but mainly contains % alignments which most users want to see by default. In-line help (run bbmap.sh without any options/inputs) details different options available for coverage/histogram/stats outputs.
Leave a comment:
bartn replied

07-09-2019, 11:13 PM
From the description of the option I understood it can be used for RNA as well:

metagenome=f Assign scaffolds a random exponential coverage level, to simulate a metagenomic or RNA coverage distribution

Which makes sense in my view since I just give multiple transcripts instead of genomes (albeit short compared to a genome)

So why are so many reads not mapping back even thought almost all of it with an even distributed coverage does map.

What I meant with the summary part, if I run just bbmap in its simplest form:

Code:

bbmap.sh ref=transcripts.fasta in1=read_1.fq in2=read_2.fq

the stderr does not contain this part:

Code:

Reads: 120000 Mapped reads: 111248 Mapped bases: 8357095 Ref scaffolds: 1 Ref bases: 592529 Percent mapped: 92.707 Percent proper pairs: 56.370 Average coverage: 14.104 Standard deviation: 46.411 Percent scaffolds with any coverage: 100.00 Percent of reference bases covered: 41.67

But it does with adding a covstats=cov.stats for example.
Leave a comment:
GenoMax replied

07-09-2019, 09:52 AM
Metagenome option is supposed to take in multiple (genome) sequences to generate an artificial metagenomic dataset.

BBMap writes stats to STDERR by default. So you can capture that.
Leave a comment:
bartn replied

07-09-2019, 07:31 AM
Originally posted by GenoMax View Post

Since you asked for 30K reads you got 30K paired-end reads.

I would sugegst trying without the metagenome=t option for a single small input fasta. If you do have the entire genome then use that in your randomread generation step along with the plasmid.

Yes, that does seem to increase the mapping percentages. Any idea why it is not working with meta option? Changes read length and insert size does not seem to do much..
The plasmid is about 0.5MB with 562 transcripts. Median transcript size is 830bp, so that could be a bit short, (but then I would expect the same with the meta option off)

I will also look for another suitable (small) genome. I don't want to add the complete genome because it should finish quick, since it's for testing purposes of tools and pipelines.

(btw, one other small thing. Adding any statistic file in the arguments will give the small summary in the stdout (like I posted before). Otherwise it will not be printed. Not sure if that is expected behavior)
Leave a comment:

Previous 1 2 3 4 11 17 template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News