Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GenoMax
    replied
    @Manuel: Are these 20000 sequences related (contigs)?

    @Brian: Thanks for the additional explanation.

    Leave a comment:


  • Brian Bushnell
    replied
    Well... the easiest way to use BBSplit is to have it split input into one output file per reference file. Once you concatenate them, you lose that ability. It's technically possible to specify sets of named sequences within a file but that way is too complicated. Using a tiered approach is possible (though a pain):

    for references a, b, c, d, e, f:

    abc=cat a b c
    def=cat d e f

    bbsplit ref=abc,def

    then re-split the abc pile between a, b, and c, and the def pile between d, e, and f. That will basically cut the command line lengths (and open file handles) by a factor of the square root of the number of references.

    But, I will add this feature quickly since it's hard to do otherwise.

    Note that if all you care about are statistics of which reads mapped to which scaffolds, then if all the ref sequences have unique names, you can just run BBMap on the concatenated file with the "scafstats" output flag. But for actually splitting the files easily, the references currently need to be in individual files.

    Leave a comment:


  • GenoMax
    replied
    Brian: Concatenating some files to make the number smaller should make it work now, correct?

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Manuel,

    I will add the possibility to point "ref=" to a directory in the next release; that's a pretty good idea.

    -Brian

    Leave a comment:


  • manuelkleiner
    replied
    Problem with argument length

    Dear Brian,
    I have run into a problem when trying to create an index for bbsplit using a large number of files (~20,000). In the example below the variable $files contains 20,000 comma delimited, file names.

    Code:
    kleiner@breck-Precision-T7610[BBMap] bbsplit.sh -Xmx100g threads=20 ref=$files                                                                                                                         [ 2:47PM]
    zsh: argument list too long: bbsplit.sh
    What I have found out so far is that the linux kernel has a upper limit for the argument length in the shell, which causes this problem. I have tried various workarounds e.g. trying to pipe the file names in from STDIN or creating a bash script. But nothing has worked.
    Any suggestions how to get bbsplit to indext such a large number of files before running the splitting?
    Maybe one solution would be if one could hand the ref= argument a folder containing all fasta files and then bbsplit writes an index for each .fasta file in the folder!?

    Thanks a lot in advance for your help,
    Manuel

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by vingomez View Post
    (1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?
    Hi Vicente,

    I released a new version of BBTools, 34.64. BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:

    bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq

    will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq

    You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.

    Leave a comment:


  • vingomez
    replied
    Hi Brian,


    Originally posted by Brian Bushnell View Post
    Hi Vicente,

    There appears to be something wrong with the index, or possibly fasta files. Would it be possible for you to zip the references together and email them to me so I can try to replicate this?

    -Brian

    Definitely, the index files were corrupted. Just deleted the ref directory, repeat index and map - no problem.

    Thanks again

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Vicente,

    There appears to be something wrong with the index, or possibly fasta files. Would it be possible for you to zip the references together and email them to me so I can try to replicate this?

    -Brian

    Leave a comment:


  • vingomez
    replied
    Hi Brian,


    Sorry for the lack of information. When I said "without success": the program stop and the prompt appear (see below - last 9 lines) . Again only when I used 4 ref files. I used the same commands listed in the previous post (with and without the ambig2=split flag) and used the flags outu1 and outu2 (as you suggested) with the same outcome. Maybe a JAVA conflict?



    HTML Code:
    Loaded Reference:       0.010 seconds.
    Loading index for chunk 1-0, build 1
    Generated Index:        0.039 seconds.
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
            at align2.BBIndex.analyzeIndex(BBIndex.java:115)
            at align2.BBMap.loadIndex(BBMap.java:405)
            at align2.BBMap.main(BBMap.java:32)
            at align2.BBSplitter.main(BBSplitter.java:45)
    sol1:/work/MP/bbmap%
    Thanks
    Vicente
    Last edited by vingomez; 02-26-2015, 01:22 PM.

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Vicente,

    I just tested BBSplit and it worked fine for me:
    Code:
    bbsplit.sh -Xmx1g ref=Clostridium_perfringensATCC_13124.fasta,Desulfotomaculum_gibsoniae_DSM_7213.fasta,Hirschia_baltica_ATCC_49814.fasta,Nocardiopsis_dassonvillei_DSM_43111.fasta
    
    randomreads.sh -Xmx1g reads=1000 out=reads.fq
    
    bbsplit.sh -Xmx1g in=reads.fq basename=o_%.sam
    produced 4 output files:
    Code:
    bushnell@gpint109:/global/scratch2/sd/bushnell/splittest$ ls -l
    total 38144
    -rw-rw-r-- 1 bushnell bushnell  3311027 Feb 26 10:24 Clostridium_perfringensATCC_13124.fasta
    -rw-rw-r-- 1 bushnell bushnell  4952723 Feb 26 10:24 Desulfotomaculum_gibsoniae_DSM_7213.fasta
    -rw-rw-r-- 1 bushnell bushnell  3599251 Feb 26 10:23 Hirschia_baltica_ATCC_49814.fasta
    -rw-rw-r-- 1 bushnell bushnell  6652569 Feb 26 10:23 Nocardiopsis_dassonvillei_DSM_43111.fasta
    -rw-rw-r-- 1 bushnell bushnell    81824 Feb 26 11:43 o_Clostridium_perfringensATCC_13124.sam
    -rw-rw-r-- 1 bushnell bushnell   125594 Feb 26 11:43 o_Desulfotomaculum_gibsoniae_DSM_7213.sam
    -rw-rw-r-- 1 bushnell bushnell    81422 Feb 26 11:43 o_Hirschia_baltica_ATCC_49814.sam
    -rw-rw-r-- 1 bushnell bushnell   181854 Feb 26 11:43 o_Nocardiopsis_dassonvillei_DSM_43111.sam
    -rw-rw-r-- 1 bushnell bushnell   352650 Feb 26 11:42 reads.fq
    drwxrwsr-x 4 bushnell bushnell      512 Feb 26 10:24 ref
    When you say "without success", what exactly happens? And by the way, to catch unmapped reads, you need to set "outu1" and "outu2", not "out1" and "out2".

    As for your other question - you can do quality-trimming before or after, but error-correction should be done after. Generally, I would do quality-trimming before if you use BBSplit's default settings, which require high identity, or after if you reduce the identity threshold to, say, minid=0.75 (which is closer to BBMap's default). But error-correction is best done after, on each file individually, so that homologous areas shared between the genomes don't affect each other.

    Leave a comment:


  • vingomez
    replied
    BBSplit: limit amount of ref

    Hi Brian,

    I tried BBSplit with four ref files without any success. Is there any limit to the number of ref files?


    Code:
    java -Xmx29g -cp /path/to/current align2.BBSplitter in1=clean_ecc_100_r1.fastq.gz in2=clean_ecc_100_r2.fastq.gz ref=/path/to/resources/ref1.fasta,/path/to/resources/ref2.fasta,/path/to/resources/ref3.fasta,/path/to/resources/ref4.fasta basename=out_%.fastq.gz out1=unmapped_r1.fastq.gz out2=unmapped_r2.fastq.gz ambig2=split

    However it work fine with 3 ref files. I tried with interactions of various group of 3 ref files with success.



    P.S. When do you think is better to perform BBSplit after or before ecc/quality trim?


    Thanks again
    Vicente
    Last edited by vingomez; 02-26-2015, 08:29 AM. Reason: Add a question

    Leave a comment:


  • vingomez
    replied
    Thanks for you dedication and rapid response.

    Vicente

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Vicente,

    The output will come out interleaved (alternating read1/read2) if the input was paired. There's currently not a way to make BBSplit write twin output files for the "basename" output streams (though I'll make a note to add that feature). You can de-interleave the output like this:

    reformat.sh in=out_ecoli.fq out1=out_ecoli_r1.fq out2=out_ecoli_r2.fq

    BBMap does not understand the "basename" flag so it doesn't work for splitting, but BBMap's "out", "outu", and "outm" all allow output of twin files using "out1" and "out2", etc.

    Leave a comment:


  • vingomez
    replied
    Hi Brian,


    Originally posted by Brian Bushnell View Post

    For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

    For paired reads in 2 files, you would do this:

    bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

    This will produce 3 output files:
    out_ecoli.fq (ecoli reads)
    out_salmonella.fq (salmonella reads)
    clean1.fq and clean2.fq (unmapped reads)


    For the example provide above (minor edit for clarity):

    (1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?

    (2) Will this flag/command work with BBMap?


    Thanks
    Vicente

    Leave a comment:


  • Brian Bushnell
    replied
    Hi arkilis,

    The standard way to split files with bbsplit is like this:

    (index)
    bbsplit.sh ref=x.fa,y.fa,z.fa,...

    (map)
    bbsplit.sh in=reads.fq basename=out_%.sam

    If you change the name of the output files to ".fq", it will come out in fastq format, which is often more useful, though reformat.sh can convert sam to fastq. Anyway, when you run bbsplit like this, it will produce one output file per reference, containing all of the reads that match that reference better than the others. In this case, it would produce 3 output files - out_x.sam, out_y.sam, and out_z.sam. You could examine the contents of the specific sam files to see exactly which contig/scaffold the read hit, but if you just need the reads split by reference file, that's how you do it.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 11:49 AM
0 responses
15 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-24-2024, 08:47 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
61 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Working...
X