TomHarrop · 09-11-2017, 02:21 PM

Can BBmap remove reads containing homopolymers?

Hi BBmap-ers,

Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/polyX filtering option in BBDuk or reformat.sh?

Thanks!

Tom
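
For reference, the filter described in this question can be sketched outside BBTools. The following is a minimal standalone Python sketch, not a BBDuk or reformat.sh option; the function names `has_homopolymer` and `filter_reads` and the `max_run` parameter are hypothetical. It drops any read containing a run of more than `max_run` identical bases.

```python
import re


def has_homopolymer(seq, max_run=5):
    """Return True if seq contains more than max_run identical bases in a row."""
    # (.)\1{max_run,} matches one base followed by at least max_run copies of it,
    # i.e. a run of max_run + 1 or more.
    pattern = r"(.)\1{%d,}" % max_run
    return re.search(pattern, seq.upper()) is not None


def filter_reads(records, max_run=5):
    """Yield (header, seq, plus, qual) tuples whose sequence passes the filter."""
    for header, seq, plus, qual in records:
        if not has_homopolymer(seq, max_run):
            yield (header, seq, plus, qual)
```

With the default `max_run=5`, a read with six A's in a row is removed and a read with only five is kept.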

Brian Bushnell · 08-14-2017, 09:58 AM

Originally posted by SNPsaurus View Post

I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.

True, this works well. Alternatively, for interleaved files, you can do this:

Code:

cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1

...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.

To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.
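
Until something like that exists, sampling from several paired libraries into one pair of output files can be sketched in a few lines of Python. This is a rough illustration only, not a Reformat feature; `read_fastq`, `sample_pairs`, and the `rate` and `seed` parameters are hypothetical names, and the sketch assumes strict 4-line fastq records with R1 and R2 in the same order. Each pair is kept or dropped with a single random draw, so mates stay in sync.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield 4-line fastq records from an open text handle as single strings."""
    while True:
        rec = list(islice(handle, 4))
        if len(rec) < 4:
            return
        yield "".join(rec)


def sample_pairs(libraries, out1, out2, rate, seed=0):
    """Sample read pairs at the given rate from (r1, r2) handle pairs.

    Returns the number of pairs written to out1/out2.
    """
    rng = random.Random(seed)
    kept = 0
    for r1, r2 in libraries:
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            # One draw per pair keeps R1 and R2 together.
            if rng.random() < rate:
                out1.write(rec1)
                out2.write(rec2)
                kept += 1
    return kept
```

The handles can come from `gzip.open(path, "rt")` if the inputs are kept gzipped, so nothing needs to be concatenated or decompressed to disk first.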

SNPsaurus · 08-12-2017, 07:47 AM

Originally posted by jazz710 View Post

Re: reformat.sh

Is there now, or could there be in the future, a way to specify multiple sets of paired read inputs (i.e. different libraries) which could be randomly sampled and output to a single FASTQ file?

E.g. in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

Best and Thanks,
Bob

I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least. If the subset size is a significant fraction of most of the files it would still be slow, but if you are just collecting a small fraction of reads it is fast.

jazz710 · 08-12-2017, 12:29 AM

Re: reformat.sh

Is there now, or could there be in the future, a way to specify multiple sets of paired read inputs (i.e. different libraries) which could be randomly sampled and output to a single FASTQ file?

E.g. in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

Best and Thanks,
Bob

Brian Bushnell · 06-01-2017, 04:13 PM

Originally posted by TomHarrop View Post

Hi Brian,

I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

Thanks,

Tom

Sorry, the max lengths are currently constants. I'll add support for changing them. Generally I didn't find them all that useful for variable-length reads since I kind of designed them to find position-related anomalies, but it's fairly easy to change.

TomHarrop · 05-31-2017, 06:12 PM

Hi Brian,

I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

Thanks,

Tom

Brian Bushnell · 05-30-2017, 09:02 AM

Originally posted by skatrinli View Post

Hey Brian,

Thanks for this great tool, but I could not download it properly. I installed the latest version, 37.25, but when I try to test the installation with the command
$ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
I get this error:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

What did I do wrong?

You can't have spaces in the filenames without specific countermeasures like quotes. For example:

Code:

stats.sh in=foo bar.fa
Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
        at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
        at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

That doesn't work...

Code:

stats.sh in="foo bar.fa"
Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
        at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
        at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

That doesn't work either.

Code:

stats.sh in="foo\ bar.fa"
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     0.0000

Main genome scaffold total:             0
Main genome contig total:               0
Main genome scaffold sequence total:    0.000 MB
Main genome contig sequence total:      0.000 MB        NaN% gap
Main genome scaffold N/L50:             0/0
Main genome contig N/L50:               0/0
Main genome scaffold N/L90:             0/0
Main genome contig N/L90:               0/0
Max scaffold length:                    0
Max contig length:                      0
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------

That does work (though I ran it on an empty file).

The exact way to deal with spaces is system-specific. In the normal Windows shell you can just use quotes; in Linux bash it looks like you need quotes and an escape character (backslash); in Windows under a Linux emulator I'm not entirely sure. The easiest thing to do is to put files in a path that does not have any spaces (so, not in My Documents, but in C:\data\ or something like that.)
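
One possible explanation for the bash behaviour above, offered as an assumption that is not confirmed anywhere in this thread, is that the command line gets split into words twice: once by the interactive shell, and once more when the launcher script rebuilds the Java command. The Python sketch below uses the standard `shlex` module to mimic that double split; `parse_once` and `parse_twice` are hypothetical helper names, and `shlex` only approximates bash quoting rules.

```python
import shlex


def parse_once(cmdline):
    """Split a command line into words the way a POSIX shell would, once."""
    return shlex.split(cmdline)


def parse_twice(cmdline):
    """Split, re-join with spaces, and split again (assumed wrapper behaviour)."""
    first = shlex.split(cmdline)
    return shlex.split(" ".join(first))


for cmd in ('stats.sh in=foo bar.fa',
            'stats.sh in="foo bar.fa"',
            r'stats.sh in="foo\ bar.fa"'):
    print(cmd, "->", parse_twice(cmd))
```

Under this assumption, plain quotes survive the first split but are lost before the second, so `bar.fa` becomes a separate argument again, while the backslash inside the quotes survives the first split and protects the space during the second.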

GenoMax · 05-30-2017, 04:26 AM

There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
@Brian no longer tests against Java 1.6 (which is what you may be using) if I recall.

I get the following when I run stats.sh.

Code:

$ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz 
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000

Main genome scaffold total:             1
Main genome contig total:               1
Main genome scaffold sequence total:    0.005 MB
Main genome contig sequence total:      0.005 MB        0.000% gap
Main genome scaffold N/L50:             1/5.386 KB
Main genome contig N/L50:               1/5.386 KB
Main genome scaffold N/L90:             1/5.386 KB
Main genome contig N/L90:               1/5.386 KB
Max scaffold length:                    5.386 KB
Max contig length:                      5.386 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                      1               1           5,386           5,386   100.00%
   5 KB                      1               1           5,386           5,386   100.00%

skatrinli · 05-30-2017, 01:29 AM

Hey Brian,

Thanks for this great tool, but I could not download it properly. I installed the latest version, 37.25, but when I try to test the installation with the command
$ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
I get this error:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

What did I do wrong?

GenoMax · 05-08-2017, 07:02 AM

How did that happen?

Unless you have a defined reason I would suggest not trusting that file.

blsfoxfox · 05-08-2017, 06:58 AM

Originally posted by GenoMax View Post

Code:

reformat.sh in=seq.fq out=seq.fa

will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

Hi,

Thanks for your response!

My problem is that my file is a mix of fasta and fastq records, and I don't know how to identify the fasta sequences and extract them.
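
A mixed file like this can be separated with a short script. The following is a minimal Python sketch under stated assumptions, not a BBTools feature: fastq records are assumed to be exactly 4 lines with equal-length sequence and quality strings, fasta records may span several lines, and `split_mixed` is a hypothetical name. Anything that fits neither pattern is skipped, so the result should be checked against the original record counts.

```python
def split_mixed(lines):
    """Split mixed fasta/fastq text lines into (fastq_records, fasta_records)."""
    lines = [ln.rstrip("\n") for ln in lines]
    fastq, fasta = [], []
    i = 0
    while i < len(lines):
        ln = lines[i]
        # A fastq record: '@' header, sequence, '+' line, quality of equal length.
        if (ln.startswith("@") and i + 3 < len(lines)
                and lines[i + 2].startswith("+")
                and len(lines[i + 1]) == len(lines[i + 3])):
            fastq.append(tuple(lines[i:i + 4]))
            i += 4
        elif ln.startswith(">"):
            # A fasta record: '>' header, then sequence lines until the next header.
            j = i + 1
            seq = []
            while j < len(lines) and lines[j] and lines[j][0] not in ">@":
                seq.append(lines[j])
                j += 1
            fasta.append((ln, "".join(seq)))
            i = j
        else:
            i += 1  # blank or unrecognised line: skip it
    return fastq, fasta
```

Reading fastq records four lines at a time means a quality string that happens to start with `@` or `>` is not mistaken for a header.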

GenoMax · 05-08-2017, 06:47 AM

Code:

reformat.sh in=seq.fq out=seq.fa

will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

blsfoxfox · 05-08-2017, 06:39 AM

Originally posted by Brian Bushnell View Post

The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.

Hi Brian,

Sorry about that, I should have double-checked my input.fastq. After grepping multiple lines after that read, I found some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup of the file. So my problem now is to find a tool that can extract the fasta sequences from the fastq.

Thanks,

blsfoxfox · 05-07-2017, 10:42 PM

Hi Brian,

Thanks for your quick response!

Actually, I am doing that on purpose. I found that there is a format error in my fastq file while using another program, but I don't know where it is. By running reformat.sh with in=in.fastq out=out.fastq, BBMap will report where the error is. In this case, the read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 corrupts the output. However, I still don't know what is wrong, and the only solution I can think of is to use the filterbyname tool to exclude that read from my fastq.
I would really appreciate any suggestions on how I could identify and fix that format error.

Thanks again for your time,

Brian Bushnell · 05-07-2017, 10:33 PM

The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
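
The rule described above (a known extension wins, contents are only inspected when there is no recognised extension, and an explicit override beats both) can be illustrated with a small Python sketch. This is not the BBTools implementation; `guess_format`, the `KNOWN` table, and its short list of extensions are assumptions made for illustration, with the `extin` argument standing in for the flag of the same name.

```python
import os

# Hypothetical extension table; the real tool recognises more formats.
KNOWN = {".fq": "fastq", ".fastq": "fastq", ".fa": "fasta", ".fasta": "fasta"}


def guess_format(filename, first_line, extin=None):
    """Pick a format from an override, then the extension, then the contents."""
    name = filename[:-3] if filename.endswith(".gz") else filename
    ext = extin or os.path.splitext(name)[1].lower()
    if ext in KNOWN:
        return KNOWN[ext]  # a known extension is trusted without looking inside
    if first_line.startswith("@"):
        return "fastq"
    if first_line.startswith(">"):
        return "fasta"
    return "unknown"
```

In this sketch a fasta file named in.fastq is treated as fastq unless `extin=".fasta"` is passed, which mirrors the situation in this thread.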
