Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Brian,
    New week...new question. I'm doing something where I have a bunch of 15-mer segments of several larger sequences and I'd like to match the mers up to two separate references. The references may be as large as two entire mammalian genomes. The fact that bbmap can conveniently and rapidly generate an index on the fly and map reads super fast in 'perfectmode' makes it look like a good tool for this. I'm just wondering if aligning 15-mers with bbmap is reliable or would I be better of making my own program that essentially just hashes all 15-mers from the two references and runs a simple look-up.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • Originally posted by Brian Bushnell View Post
      Yes, I put in an option for that because some programs require it. The flag is trd=t (or trimreaddescriptions=true), which will truncate all names (both read names and reference names) at the first whitespace symbol.
      This isn't working for me. I've built and index like this:

      Code:
      bbmap.sh ref=hg19e.fa path=./bbmap trimreaddescriptions=t
      however in the output alignments to this index I still see full-length reference names:

      Code:
      uc011kko.2_1158_1257_0_0_1      0       7 dna_sm:chromosome chromosome:GRCh37:7:1:159138663:1 REF  101957711        44      100=    *       0       0       ACCCATGGTTCCACGGGACACTGTCCCGGGTCAAGGCTGCTCAACTGGTTCTGGCAGGGGGGCCCCGGAACCACGGCCTCTTCGTGATCCGCCAAAGTGA        ????????????????????????????????????????????????????????????????????????????????????????????????????        NM:i:0  AM:i:44
      I downloaded the current version last week (Thursday or Friday).
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • You can run BBMap with the "k=15" flag to use 15mers for indexing (15 is the max). It will be reliable except with very-low-complexity kmers - polymonomers (e.g. AAAAAAAAAAAAAAA) and polydimers (e.g. ACACACACACACACA) are not indexed. You can also use BBSplit (sort of a wrapper for BBMap that maps to multiple references simultaneously) to output the reads into different files depending on which reference they hit.

        If you don't care about the mapping location, there's also BBDuk, which simply associates kmers with sequences (ignoring position information). So you can do something like this:

        bbduk.sh ref=human.fa in=reads.fq k=15 mm=f outm=matched.fq outu=unmatched.fq

        It uses more memory than BBMap (around 15 bytes per unique kmer at best) but should be faster, and does not exclude homopolymer kmers. Optionally, you can allow a hamming distance or edit distance in the kmers, though that increases memory usage by a factor of 3*K (for hdist=1). I'm working on a version that does binning like BBSplit but it's not quite done yet.

        Comment


        • Originally posted by sdriscoll View Post
          This isn't working for me. I've built and index like this:

          Code:
          bbmap.sh ref=hg19e.fa path=./bbmap trimreaddescriptions=t
          however in the output alignments to this index I still see full-length reference names:

          Code:
          uc011kko.2_1158_1257_0_0_1      0       7 dna_sm:chromosome chromosome:GRCh37:7:1:159138663:1 REF  101957711        44      100=    *       0       0       ACCCATGGTTCCACGGGACACTGTCCCGGGTCAAGGCTGCTCAACTGGTTCTGGCAGGGGGGCCCCGGAACCACGGCCTCTTCGTGATCCGCCAAAGTGA        ????????????????????????????????????????????????????????????????????????????????????????????????????        NM:i:0  AM:i:44
          I downloaded the current version last week (Thursday or Friday).
          The "trd" flag needs to be set when mapping, not indexing. Please let me know if you used it when mapping - if so, there's a bug; but if it was only when indexing, that behavior is expected.

          Comment


          • Ah, thanks. I did not use it when mapping. Also thanks for the primer on bbsplit and bbduk. Fun tools!
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • bbduk and bbsplit are super close to what I need except I actually want a lower level result (i.e. the number of mers per query sequence that match to mers in each of the two references). I don't wany any assignment decisions to be made...just counting.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              Comment


              • Coincidentally, I'm making a new version of BBDuk designed for RNA-seq which will be able to do just that. The current version only stores one value per kmer so if both references have the same kmer, it will only be considered a match to the first one. The new version will store an arbitrary number of values per kmer, so it will be able to correctly resolve situations where multiple references share a kmer (as in alternatively spliced transcriptomes). I'll build in the ability to note the counts per reference, but it probably won't be ready for a few days.

                Comment


                • Thanks, that sounds great. BBDuk is looking like a really handy tool. For example even taking two kmer FASTA files and doing a "set difference" between them is possible here. I dig it.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • bbmap error

                    Hi,

                    any idea what could cause this error? Thanks:

                    $ ~/bin/bbmap/bbmap.sh ref=plasmids.fna

                    java -Djava.library.path=/OSM/HOME-MEL/all29c/bin/bbmap/jni/ -ea -Xmx33004m -cp /OSM/HOME-MEL/all29c/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=plasmids.fna
                    Exception in thread "main" java.lang.UnsupportedClassVersionError: align2/BBMap : Unsupported major.minor version 51.0
                    at java.lang.ClassLoader.defineClass1(Native Method)
                    at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
                    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
                    at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
                    at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
                    at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
                    at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
                    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
                    at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
                    Could not find the main class: align2.BBMap. Program will exit.

                    Comment


                    • Hi Susan,

                      That means you are using an old version of Java. BBMap requires Java 7 or higher. You can install it yourself if you want by downloading it from here, though the easiest way would simply be to ask your sysadmin to install it, if you have one. Java 7 has been out since 2011, and it's backwards compatible with software made for older versions, so I highly recommend installing it and removing any older versions.

                      Java 8 is out now, too, which will also work fine.

                      Comment


                      • Hi,

                        ok, thanks apparently 7 or 8 not an option on the cluster I use. Stuck with 6 for now. I mostly have to avoid Java programs because of this. Thanks for your efforts anyway.

                        S.

                        Comment


                        • Susan,

                          I will put up a Java 6-compiled version up tomorrow.

                          -Brian

                          Comment


                          • Thanks very much. That's what I call service!

                            S.

                            Comment


                            • Susan,

                              The Java 6-compiled version is up now. It's available here, the second file down (BBMap_33.89_java6.tar.gz). Please let me know if you have any problems with it.

                              Comment


                              • Hi Brian,

                                Since I live off of command line help with all of the tools I use I had to reformat your usage function in the bbmap.sh script. I thought I'd share with you in case you'd like to build off of this one and keep it a little cleaner.

                                Code:
                                usage(){
                                echo "
                                BBMap v33.x
                                Written by Brian Bushnell, from Dec. 2010 - present
                                Last modified October 1, 2014
                                
                                Description:  Fast and accurate short-read aligner for DNA and RNA.
                                
                                To index:   bbmap.sh ref=<reference fasta>
                                To map:     bbmap.sh in=<reads> out=<output sam>
                                To map without an index:
                                    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk
                                
                                in=stdin will accept reads from standard in, and out=stdout will write to 
                                standard out, but file extensions are still needed to specify the format of the 
                                input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 
                                standard in; out=stdout.sam.gz will write gzipped sam.
                                
                                Indexing Parameters (required when building the index):
                                
                                nodisk=f                Set to true to build index in memory and write nothing 
                                                        to disk except output.
                                ref=<file>              Specify the reference sequence.  Only do this ONCE, 
                                                        when building the index (unless using 'nodisk').
                                build=1                 If multiple references are indexed in the same directory, 
                                                        each needs a unique numeric ID (unless using 'nodisk').
                                k=13                    Kmer length, range 8-15.  Longer is faster but uses more 
                                                        memory.  Shorter is more sensitive.
                                path=<.>                Specify the location to write the index, if you don't 
                                                        want it in the current working directory.
                                usemodulo=f             Throw away ~80% of kmers based on remainder modulo a 
                                                        number (reduces RAM by 50% and sensitivity slightly).
                                                        Should be enabled both when building the index AND 
                                                        when mapping.
                                
                                Input Parameters:
                                
                                build=1                 Designate index to use.  Corresponds to the number 
                                                        specified when building the index.
                                in=<file>               Primary reads input; required parameter.
                                in2=<file>              For paired reads in two files.
                                interleaved=auto        True forces paired/interleaved input; false forces 
                                                        single-ended mapping. If not specified, interleaved 
                                                        status will be autodetected from read names.
                                fastareadlen=500        Break up FASTA reads longer than this.  Max is 500.  
                                                        Only works for FASTA input.
                                unpigz=f                Spawn a pigz (parallel gzip) process for faster 
                                                        decompression than using Java.  
                                                        Requires pigz to be installed.
                                
                                Sampling Parameters:
                                
                                reads=-1                Set to a positive number N to only process the first 
                                                        N reads (or pairs), then quit.  -1 means use all reads.
                                skipreads=0             Set to a number N to skip the first N reads (or pairs), 
                                                        then map the rest.
                                touppercase=t           (tuc) Convert lowercase letters in reads to upper case 
                                                        (otherwise they will not match the reference).
                                
                                Mapping Parameters:
                                
                                fast=f                  This flag is a macro which sets other paramters to run 
                                                        faster, at reduced sensitivity.  Bad for RNA-seq.
                                maxindel=16000          Don't look for indels longer than this. Lower is faster.
                                                        Set to >=100k for RNAseq with long introns like mammals.
                                strictmaxindel=f        When enabled, do not allow indels longer than 'maxindel'.
                                                        By default these are not sought, but may be found anyway.
                                minid=0.76              Approximate minimum alignment identity to look for.  Higher 
                                                        is faster and less sensitive.
                                idfilter=0              Independant of minid; sets exact minimum identity allowed 
                                                        for alignments.  Range is 0 to 1.
                                minhits=1               Minimum number of seed hits required for candidate sites.
                                                        Higher is faster.
                                k=13                    Kmer length of index.  Higher is faster (for large genomes),
                                                        less sensitive, and uses more RAM.  Max is 15.
                                local=f                 Set to true to use local, rather than global, alignments.
                                                        This will soft-clip ugly ends of poor alignments.
                                perfectmode=f           Allow only perfect mappings when set to true (very fast).
                                semiperfectmode=f       Allow only perfect and semiperfect (perfect except for N's 
                                                        in reference) mappings.
                                threads=auto            (t) Set to number of threads desired.  By default, uses 
                                                        all cores available.
                                ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 
                                                        multiple top-scoring mapping locations).
                                                            best    (use the first best site)
                                                            toss    (consider unmapped)
                                                            random  (select one top-scoring site randomly)
                                                            all     (retain all top-scoring sites)
                                samestrandpairs=f       (ssp) Specify whether paired reads should map to the same 
                                                        strand or opposite strands.
                                requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 
                                                        orientation.  Set to false for long-mate-pair libraries.
                                killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate insert
                                                        size or orientation, the read with the lower mapping quality 
                                                        is marked unmapped.
                                pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will be 
                                                        sent to 'outu' but not 'outm'.
                                rcompmate=f             Reverse complement second read in each pair prior to 
                                                        mapping.
                                pairlen=32000           Set max allowed distance between paired reads.  
                                                        (insert size)=(pairlen)+(read1 length)+(read2 length)
                                bandwidthratio=0        (bwr) If above zero, restrict alignment band to this 
                                                        fraction of read length.  Faster but less accurate.
                                usejni=f                (jni) Do alignments in C code, which is faster.  Requires 
                                                        compiling the C code; details are in /jni/README.txt.
                                maxsites2=800           Don't analyze (or print) more than this many alignments 
                                                        per read.
                                
                                Quality and Trimming Parameters:
                                
                                qin=auto                Set to 33 or 64 to specify input quality value ASCII
                                                        offset. 33 is Sanger, 64 is old Solexa.
                                qout=auto               Set to 33 or 64 to specify output quality value ASCII 
                                                        offset (only if output format is fastq).
                                qtrim=f                 Quality-trim ends before mapping.  Options are 'f' (false), 
                                                        'l' (left), 'r' (right), and 'lr' (both).
                                untrim=f                Undo trimming after mapping.  Untrimmed bases will be 
                                                        soft-clipped in cigar strings.
                                trimq=7                 Trim regions with average quality below this 
                                                        (phred algorithm).
                                mintl=60                (mintrimlength) Don't trim reads to be shorter than this.
                                fakequality=-1          Set to a positive number 1-50 to generate fake quality 
                                                        strings for fasta input reads.
                                ignorebadquality=f      (ibq) Keep going, rather than crashing, if a read has 
                                                        out-of-range quality values.
                                
                                Output Parameters:
                                
                                outputunmapped=t        Set to false if unmapped reads should not be printed to 
                                                        'out=' target (saves time and disk space).
                                out=<file>              Write all reads to this file (unless outputunmapped=t).
                                outu=<file>             Write only unmapped reads to this file.  Does not include 
                                                        unmapped paired reads with a mapped mate.
                                outm=<file>             Write only mapped reads to this file.  Includes unmapped 
                                                        paired reads with a mapped mate.
                                bamscript=<file>        (bs) Write a shell script to <file> that will turn the sam 
                                                        output into a sorted, indexed bam file.
                                ordered=f               Set to true to output reads in same order as input.  
                                                        Slower and uses more memory.
                                overwrite=f             (ow) Allow process to overwrite existing files.
                                secondary=f             Print secondary alignments.
                                sssr=0.95               (secondarysitescoreratio) Print only secondary alignments 
                                                        with score of at least this fraction of primary.
                                ssao=f                  (secondarysiteasambiguousonly) Only print secondary 
                                                        alignments for ambiguously-mapped reads.
                                maxsites=5              Maximum number of total alignments to print per read.
                                                        Only relevant when secondary=t.
                                quickmatch=f            Generate cigar strings more quickly.  Must be true to 
                                                        generate secondary site cigar strings.
                                trimreaddescriptions=f  (trd) Truncate read names at the first whitespace, 
                                                        assuming that the remaineder is a comment or description.
                                ziplevel=2              (zl) Compression level for zip or gzip output.
                                pigz=f                  Spawn a pigz (parallel gzip) process for faster compression 
                                                        than using Java.  Requires pigz to be installed.
                                machineout=f            Set to true to output statistics in machine-friendly 
                                                        'key=value' format.
                                
                                Sam flags and settings:
                                
                                sam=1.4                 Set to 1.4 to write Sam version 1.4 cigar strings, 
                                                        with = and X, or 1.3 to use M.
                                saa=t                   (secondaryalignmentasterisks) Use asterisks instead of
                                                        bases for sam secondary alignments.
                                cigar=t                 Set to 'f' to skip generation of cigar strings (faster).
                                keepnames=f             Keep original names of paired reads, rather than ensuring 
                                                        both reads have the same name.
                                intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 
                                                        cigar strings for deletions of at least that length.
                                rgid=                   Set readgroup ID.  All other readgroup fields can be set 
                                                        similarly, with the flag rgXX=
                                mdtag=f                 Write MD tags.
                                nhtag=f                 Write NH tags.
                                xmtag=f                 Write XM tags (may only work correctly with ambig=all).
                                xstag=f                 Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags 
                                                        for RNAseq using firststrand, secondstrand, or 
                                                        unstranded libraries.  Needed by Cufflinks.  
                                                        JGI mainly uses 'firststrand'.
                                stoptag=f               Write a tag indicating read stop location, prefixed by YS:i:
                                idtag=f                 Write a tag indicating percent identity, prefixed by YI:f:
                                inserttag=f             Write a tag indicating insert size, prefixed by X8:Z:
                                scoretag=f              Write a tag indicating BBMap's raw score, prefixed by YR:i:
                                noheader=f              Disable generation of header lines.
                                
                                Histogram and statistics output parameters:
                                
                                scafstats=<file>        Statistics on how many reads mapped to which scaffold 
                                                        to this file.
                                bhist=<file>            Base composition histogram by position.
                                qhist=<file>            Quality histogram by position.
                                aqhist=<file>           Histogram of average read quality.
                                bqhist=<file>           Quality histogram designed for box plots.
                                lhist=<file>            Read length histogram.
                                ihist=<file>            Write histogram of insert sizes (for paired reads).
                                ehist=<file>            Errors-per-read histogram.
                                qahist=<file>           Quality accuracy histogram of error rates versus 
                                                        quality score.
                                indelhist=<file>        Indel length histogram.
                                mhist=<file>            Histogram of match, sub, del, and ins rates by 
                                                        read location.
                                gchist=<file>           Read GC content histogram.
                                gcbins=100              Set the number of bins in the GC histogram.
                                idhist=<file>           Histogram of read count versus percent identity.
                                idbins=100              Set the number of bins in the identity histogram.
                                
                                Coverage output parameters (these may reduce speed and use more RAM):
                                
                                covstats=<file>         Per-scaffold coverage info.
                                covhist=<file>          Histogram of # occurrences of each depth level.
                                basecov=<file>          Coverage per base location.
                                bincov=<file>           Print binned coverage per location (one line per X bases).
                                covbinsize=1000         Set the binsize for binned coverage output.
                                nzo=f                   Only print scaffolds with nonzero coverage.
                                twocolumn=f             Change to true to print only ID and Avg_fold instead of 
                                                        all 6 columns to the 'out=' file.
                                uscov=t                 Include secondary alignments when calculating coverage.
                                32bit=f                 Set to true if you need per-base coverage over 64k.
                                strandedcov=f           Track coverage for plus and minus strand independently.
                                startcov=f              Only track start positions of reads.
                                
                                Java Parameters:
                                -Xmx                    This will be passed to Java to set memory usage, 
                                                        overriding the program's automatic memory detection.
                                                        -Xmx20g will specify 20 gigs of RAM, and -Xmx800m 
                                                        will specify 800 megs.  The max is typically 85% of 
                                                        physical memory.
                                
                                This list is not complete.  For more information, please consult 
                                $DIR""docs/readme.txt. 
                                Please contact Brian Bushnell at [email protected] if you encounter 
                                any problems.
                                "   
                                }
                                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                                Salk Institute for Biological Studies, La Jolla, CA, USA */

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X