Seqanswers Leaderboard Ad

**Brian Bushnell** · 05-30-2017, 09:02 AM

Originally posted by skatrinli View Post

Hey Brian,

Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
$ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
it comes this error:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

What did I do wrong?

You can't have spaces in the filenames without specific countermeasures like quotes. For example:

Code:

stats.sh in=foo bar.fa
Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
        at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
        at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

That doesn't work...

Code:

stats.sh in="foo bar.fa"
Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
        at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
        at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

That doesn't work either.

Code:

stats.sh in="foo\ bar.fa"
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     0.0000

Main genome scaffold total:             0
Main genome contig total:               0
Main genome scaffold sequence total:    0.000 MB
Main genome contig sequence total:      0.000 MB        NaN% gap
Main genome scaffold N/L50:             0/0
Main genome contig N/L50:               0/0
Main genome scaffold N/L90:             0/0
Main genome contig N/L90:               0/0
Max scaffold length:                    0
Max contig length:                      0
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------

That does work (though I ran it on an empty file).

The exact way to deal with spaces is system-specific. In the normal Windows shell you can just use quotes; in Linux bash it looks like you need quotes and an escape character (backslash); in Windows under a Linux emulator I'm not entirely sure. The easiest thing to do is to put files in a path that does not have any spaces (so, not in My Documents, but in C:\data\ or something like that.)

**TomHarrop** · 05-31-2017, 06:12 PM

Hi Brian,

I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

Thanks,

Tom

**Brian Bushnell** · 06-01-2017, 04:13 PM

Originally posted by TomHarrop View Post

Hi Brian,

I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

Thanks,

Tom

Sorry, the max lengths are currently constants. I'll add support for changing them. Generally I didn't find them all that useful for variable-length reads since I kind of designed them to find position-related anomalies, but it's fairly easy to change.

**jazz710** · 08-12-2017, 12:29 AM

Re: reformat.sh

Is there now, or could there be in the future, be a way to specify multiple sets of paired read inputs (ie. different libraries) which could be randomly sampled and output to a single FASTQ file?

Ie) in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

Best and Thanks,
Bob

**SNPsaurus** · 08-12-2017, 07:47 AM

Originally posted by jazz710 View Post

Re: reformat.sh

Is there now, or could there be in the future, be a way to specify multiple sets of paired read inputs (ie. different libraries) which could be randomly sampled and output to a single FASTQ file?

Ie) in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

Best and Thanks,
Bob

I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least. If the subset size is a significant fraction of most of the files it would still be slow, but if you are just collecting a small fraction of reads it is fast.

**Brian Bushnell** · 08-14-2017, 09:58 AM

Originally posted by SNPsaurus View Post

I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.

True, this works well. Alternatively, for interleaved files, you can do this:

Code:

cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1

...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.

To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.

**TomHarrop** · 09-11-2017, 02:21 PM

Can BBmap remove reads containing homopolymers?

Hi BBmap-ers,

Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/ polyX filtering option with BBDuk or reformat.sh?

Thanks!

Tom

**Brian Bushnell** · 09-11-2017, 02:24 PM

There is no homopolymer filter, per se, but you can accomplish that like this:

bbduk.sh in=reads.fq out=clean.fq literal=AAAAAA,CCCCCC,GGGGGG,TTTTTT k=6 mm=f

**TomHarrop** · 09-11-2017, 02:35 PM

Great, thanks!

**gokhulkrishnakilaru** · 09-23-2017, 01:56 PM

Hi Brian-

1. We are currently using bedtools BAM to BED and then extending reads for chip visualization. Does BBMAP suite has an option to extend reads in BAM based on the fragment length estimate from MACS or automatically from BAM?

2. Is there an option/tool to remove chip duplicates based on the read ID in BAM instead of co-ordinates?

**Brian Bushnell** · 09-23-2017, 07:02 PM

1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

-Brian

**dho** · 09-26-2017, 05:26 PM

Comprehensive reporting of mapped reads

Hi Brian,

I am trying to map 100bp Illumina sequences to a collection of very similar, 80bp reference sequences (multiple alleles of a single exon) using bbmap.

When mapping to a single reference sequence from the collection in isolation using settings:

'ambiguous=all',
'maxsites=1000000',
'vslow',
'subfilter=0'

An expected number of sequences (in this case, ~270) map and fully cover the reference sequence.

When using the same parameters and mapping to a collection of sequences containing the same reference sequence only ~120 reads map.

Is there another parameter(s) I need to be setting to map my reads exhaustively against all reference sequences?

Thanks,

dave

**Brian Bushnell** · 09-27-2017, 01:30 PM

Hi Dave,

BBMap has some heuristics that may make it non-ideal for the situation with a large number of near-identical sequences, particularly when the reads don't map glocally well to any of them (because the reads are longer than the reference sequences) and the reference sequences are tiny. You might also try the flag "excludefraction=0" to see if this changes anything. To clarify, there are perfect alignments (with zero mismatches) that are getting missed, correct?

Many of the heuristics related to ignoring extremely common, uninformative reference kmers are disabled in bbmapskimmer.sh, which is designed specifically for a high degree of multimapping. The syntax is the same as BBMap, so please give that a try and let me know if it works better. You'll need to additionally add the flag "minscaf=1" or really short scaffolds get discarded, so something like:

Code:

bbmapskimmer.sh in=reads.fq ref=ref.fa maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1

Please let me know whether that changes the situation.

-Brian

**dho** · 09-27-2017, 05:37 PM

Hi Brian,

No, unfortunately it didn't help. I think you understand the question correctly:

When I map 100bp reads against a single allele of a gene with four exons using:

Code:

maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1 covstats=stdout | grep 'IPD0001613'

I get 100% read support for all four exons.

Code:

Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	288.0902	244	0.5000	100.0000	244	421	388	0.4788	297	35.43
Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	235.8719	281	0.5658	100.0000	281	408	343	0.5554	247	40.92
Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	218.1935	155	0.6323	100.0000	155	227	220	0.5913	228	27.77
Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	199.3457	81	0.5679	100.0000	81	147	111	0.5713	200	35.48

When I use the same parameters but include the allele in a larger reference sequence that contains many other alleles, the number of mapped reads decreases for all four exons and no longer fully covers exon 1:

Code:

Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	242.6475	244	0.5000	100.0000	244	343	327	0.4746	233	43.23
Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	58.2847	281	0.5658	100.0000	281	81	93	0.5509	63	19.81
Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	194.1226	155	0.6323	100.0000	155	201	181	0.5935	204	33.17
Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	38.3333	81	0.5679	97.5309	79	34	17	0.6038	51	17.89

Any other thoughts?

Thanks,

dave

**gokhulkrishnakilaru** · 09-28-2017, 07:42 AM

Originally posted by Brian Bushnell View Post

1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

-Brian

Thanks Brian.

How about BED?

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 48 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News