  • TomHarrop
    replied
    Can BBmap remove reads containing homopolymers?

    Hi BBmap-ers,

    Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/polyX filtering option in BBDuk or reformat.sh?

    Thanks!

    Tom
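
For illustration only, and not a BBDuk or reformat.sh option: a minimal Python sketch of the filter asked about above, assuming standard 4-line FASTQ records. The function names and the `max_run` parameter are hypothetical. Any repeated character counts as a run, so stretches of N are treated like any other base.

```python
import re
from typing import Iterable, Iterator, List


def has_long_homopolymer(seq: str, max_run: int = 5) -> bool:
    """Return True if seq has more than max_run consecutive identical bases."""
    # (.) captures one base; \1{max_run,} requires at least max_run more copies,
    # so a match is max_run + 1 or more identical characters in a row.
    pattern = r"(.)\1{%d,}" % max_run
    return re.search(pattern, seq.upper()) is not None


def filter_fastq(lines: Iterable[str], max_run: int = 5) -> Iterator[List[str]]:
    """Yield 4-line FASTQ records whose sequence passes the homopolymer check."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of a 4-line FASTQ record.
            if not has_long_homopolymer(record[1], max_run):
                yield record
            record = []
```

Passing an open file handle as `lines` and printing each yielded record joined by newlines would write the surviving reads. Paired files would need both mates checked together, which this sketch does not do.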


  • Brian Bushnell
    replied
    Originally posted by SNPsaurus
    I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.

    True, this works well. Alternatively, for interleaved files, you can do this:

    Code:
    cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1
    ...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.

    To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

    I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.


  • SNPsaurus
    replied
    Originally posted by jazz710
    Re: reformat.sh

    Is there now, or could there be in the future, a way to specify multiple sets of paired read inputs (i.e. different libraries) which could be randomly sampled and output to a single FASTQ file?

    E.g. in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

    Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

    Best and Thanks,
    Bob

    I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least. If the subset size is a significant fraction of most of the files it would still be slow, but if you are just collecting a small fraction of reads it is fast.
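
The subset-then-combine approach can be sketched in Python for twin paired files. This is a rough illustration, not how reformat.sh implements `samplerate`: it assumes 4-line FASTQ records with R1 and R2 in the same order, and the names `read_records` and `sample_pairs` are hypothetical. One random draw decides both mates, so pairs stay together.

```python
import random
from typing import Iterator, List, Sequence, TextIO, Tuple


def read_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield 4-line FASTQ records from an open text handle."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # empty string means end of file
            return
        yield record


def sample_pairs(
    libraries: Sequence[Tuple[TextIO, TextIO]],
    out1: TextIO,
    out2: TextIO,
    rate: float,
    seed: int = 0,
) -> int:
    """Write roughly `rate` of the pairs from each library; return pairs kept."""
    rng = random.Random(seed)
    kept = 0
    for r1, r2 in libraries:
        # zip keeps the mates in step; a single draw covers both of them.
        for rec1, rec2 in zip(read_records(r1), read_records(r2)):
            if rng.random() < rate:
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept
```

Each library is streamed once and only the kept pairs are written, so no concatenated copy of the full inputs is needed.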


  • jazz710
    replied
    Re: reformat.sh

    Is there now, or could there be in the future, a way to specify multiple sets of paired read inputs (i.e. different libraries) which could be randomly sampled and output to a single FASTQ file?

    E.g. in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

    Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

    Best and Thanks,
    Bob


  • Brian Bushnell
    replied
    Originally posted by TomHarrop
    Hi Brian,

    I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

    Thanks,

    Tom

    Sorry, the max lengths are currently constants. I'll add support for changing them. Generally I didn't find them all that useful for variable-length reads since I kind of designed them to find position-related anomalies, but it's fairly easy to change.


  • TomHarrop
    replied
    Hi Brian,

    I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

    Thanks,

    Tom


  • Brian Bushnell
    replied
    Originally posted by skatrinli
    Hey Brian,

    Thanks for this great tool, but I could not properly download it. I installed the latest version, 37.25, but when I try to test the installation with the command
    $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
    I get this error:
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
    at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
    at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

    What did I do wrong?

    You can't have spaces in the filenames without specific countermeasures like quotes. For example:

    Code:
    stats.sh in=foo bar.fa
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
            at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
            at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
    That doesn't work...
    Code:
    stats.sh in="foo bar.fa"
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
            at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
            at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
    That doesn't work either.

    Code:
    stats.sh in="foo\ bar.fa"
    A       C       G       T       N       IUPAC   Other   GC      GC_stdev
    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     0.0000
    
    Main genome scaffold total:             0
    Main genome contig total:               0
    Main genome scaffold sequence total:    0.000 MB
    Main genome contig sequence total:      0.000 MB        NaN% gap
    Main genome scaffold N/L50:             0/0
    Main genome contig N/L50:               0/0
    Main genome scaffold N/L90:             0/0
    Main genome contig N/L90:               0/0
    Max scaffold length:                    0
    Max contig length:                      0
    Number of scaffolds > 50 KB:            0
    % main genome in scaffolds > 50 KB:     0.00%
    
    
    Minimum         Number          Number          Total           Total           Scaffold
    Scaffold        of              of              Scaffold        Contig          Contig
    Length          Scaffolds       Contigs         Length          Length          Coverage
    --------        --------------  --------------  --------------  --------------  --------
    That does work (though I ran it on an empty file).

    The exact way to deal with spaces is system-specific. In the normal Windows shell you can just use quotes; in Linux bash it looks like you need quotes and an escape character (backslash); in Windows under a Linux emulator I'm not entirely sure. The easiest thing to do is to put files in a path that does not have any spaces (so, not in My Documents, but in C:\data\ or something like that.)
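
One way to follow the "path without spaces" advice without moving the data is to point a space-free symlink at the file. The Python sketch below is only an illustration: the function name `link_without_spaces` is hypothetical, and it assumes the operating system and user account allow symlinks to be created.

```python
import os
import re


def link_without_spaces(path: str, link_dir: str) -> str:
    """Create a symlink to `path` in `link_dir`, replacing whitespace with '_'."""
    if re.search(r"\s", link_dir):
        raise ValueError("link_dir itself contains whitespace: %r" % link_dir)
    # Only the file name is rewritten; the target keeps its real location.
    safe_name = re.sub(r"\s+", "_", os.path.basename(path))
    link_path = os.path.join(link_dir, safe_name)
    os.symlink(os.path.abspath(path), link_path)
    return link_path
```

The returned path can then be passed as `in=` without any quoting or escaping.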


  • GenoMax
    replied
    There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
    @Brian no longer tests against Java 1.6 (which is what you may be using) if I recall.

    I get the following when I run stats.sh.

    Code:
    $ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz 
    A       C       G       T       N       IUPAC   Other   GC      GC_stdev
    0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000
    
    Main genome scaffold total:             1
    Main genome contig total:               1
    Main genome scaffold sequence total:    0.005 MB
    Main genome contig sequence total:      0.005 MB        0.000% gap
    Main genome scaffold N/L50:             1/5.386 KB
    Main genome contig N/L50:               1/5.386 KB
    Main genome scaffold N/L90:             1/5.386 KB
    Main genome contig N/L90:               1/5.386 KB
    Max scaffold length:                    5.386 KB
    Max contig length:                      5.386 KB
    Number of scaffolds > 50 KB:            0
    % main genome in scaffolds > 50 KB:     0.00%
    
    
    Minimum         Number          Number          Total           Total           Scaffold
    Scaffold        of              of              Scaffold        Contig          Contig  
    Length          Scaffolds       Contigs         Length          Length          Coverage
    --------        --------------  --------------  --------------  --------------  --------
        All                      1               1           5,386           5,386   100.00%
       5 KB                      1               1           5,386           5,386   100.00%


  • skatrinli
    replied
    Hey Brian,

    Thanks for this great tool, but I could not properly download it. I installed the latest version, 37.25, but when I try to test the installation with the command
    $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
    I get this error:
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
    at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
    at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

    What did I do wrong?


  • GenoMax
    replied
    How did that happen?

    Unless you have a defined reason I would suggest not trusting that file.


  • blsfoxfox
    replied
    Originally posted by GenoMax
    Code:
    reformat.sh in=seq.fq out=seq.fa
    will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

    Hi,

    Thanks for your response!

    My problem is that my file is a mix of fasta and fastq records, and I don't know how to identify the fasta sequences and extract them.
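
A minimal Python sketch of one way to pull the fasta records out of such a mixed file. It is not a BBTools feature, and the function name `split_mixed` is hypothetical. It assumes every fastq record is exactly 4 lines, that no record contains empty lines, and that fasta sequence lines never begin with '>' or '@'. It also loads the whole input into memory, so it suits modest file sizes only.

```python
from typing import Iterable, List, Tuple


def split_mixed(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Separate a mixed stream into (fasta_records, fastq_records)."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    fasta: List[List[str]] = []
    fastq: List[List[str]] = []
    i = 0
    while i < len(rows):
        head = rows[i]
        if head.startswith("@") and i + 2 < len(rows) and rows[i + 2].startswith("+"):
            # Consume the 4-line record as a unit so a quality line that
            # happens to begin with '@' or '>' is never read as a header.
            fastq.append(rows[i:i + 4])
            i += 4
        elif head.startswith(">"):
            # A fasta record runs until the next line that looks like a header.
            j = i + 1
            while j < len(rows) and not rows[j].startswith((">", "@")):
                j += 1
            fasta.append(rows[i:j])
            i = j
        else:
            raise ValueError(
                "unrecognized record start at non-blank line %d: %r" % (i + 1, head[:40])
            )
    return fasta, fastq
```

A `ValueError` here points at the first non-blank line that fits neither pattern, which may help locate a genuinely malformed record.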


  • GenoMax
    replied
    Code:
    reformat.sh in=seq.fq out=seq.fa
    will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.


  • blsfoxfox
    replied
    Originally posted by Brian Bushnell
    The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

    So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.

    Hi Brian,

    Sorry, I should have double-checked my input.fastq. After grepping several lines after that read, I found some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup of it. So my problem now is to find a tool that can extract fasta sequences from a fastq file.

    Thanks,


  • blsfoxfox
    replied
    Hi Brian,

    Thanks for your quick response!

    Actually, I am doing that on purpose. I found that there is a format error in my fastq file while using other software, but I don't know where it is. When I run reformat.sh with in=in.fastq out=out.fastq, BBMap reports where the error is. In this case, read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 causes a corruption of the output. However, I still don't know what's wrong, and the only solution I can think of now is to use the filterbyname tool to exclude that read from my fastq.
    Thus, I would really appreciate any suggestion on how I could identify and fix that format error.

    Thanks again for your time,


  • Brian Bushnell
    replied
    The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

    So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
    Last edited by Brian Bushnell; 05-07-2017, 10:35 PM.
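
The behavior described in this post (trust a known extension, otherwise look at the contents, with an explicit override) can be illustrated with a small Python sketch. This is not BBTools code: `guess_format`, `KNOWN_EXTENSIONS`, and the `ext_override` parameter are hypothetical stand-ins, with `ext_override` playing the role of a flag like "extin".

```python
import os
from typing import Optional

# Hypothetical extension table for this sketch only.
KNOWN_EXTENSIONS = {
    ".fa": "fasta",
    ".fasta": "fasta",
    ".fq": "fastq",
    ".fastq": "fastq",
}


def guess_format(path: str, first_line: str, ext_override: Optional[str] = None) -> str:
    """Pick a format from an override, then the extension, then the first line."""
    ext = ext_override if ext_override is not None else os.path.splitext(path)[1]
    ext = ext.lower()
    if ext in KNOWN_EXTENSIONS:
        # A known extension wins, even if the contents disagree.
        return KNOWN_EXTENSIONS[ext]
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    return "unknown"
```

With this logic, a fasta record inside a file named "in.fastq" is still treated as fastq unless the override is given, which matches the symptom discussed above.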
