Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #91
    Yes, "only matching that sequence" is what I mean. So in seal if you use ambig=all then that read (hit) is added to all targets it matched (if there were multiple best matches)?

    Or say...what happens when a read can match one sequence with 20 kmers and a second with 18 kmers?
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #92
      Originally posted by sdriscoll View Post
      So in seal if you use ambig=all then that read (hit) is added to all targets it matched (if there were multiple best matches)?
      That's correct.
      Or say...what happens when a read can match one sequence with 20 kmers and a second with 18 kmers?
      Seal will only consider something ambiguous if it matches multiple sequences with the same number of kmers, which is also to the highest number. However, I could add a flag that changes it to allow any number of kmers.

      Comment


      • #93
        Hey Brian,

        bbduk and seal are certainly great programs and since I have downloaded these tools I've been using them pretty much daily. While most kmer tools ignore the idea of allowing a mismatch (since obviously this complicates hash lookups) I like that you have included this however using it imposes such a severe spike in memory usage that it becomes basically unusable. Case in point - using seal to produce hits to genes in a full transcriptome while allowing a mismatch per kmer. Is there a reason that you put the burden of additional kmers on the reference and not on the read? It seems to me that there would be a performance hit but I would much rather have that than the massive amount of memory needed to produce a kmer index of a full transcriptome allowing even one mismatch. So during matching the read could be kmer'd into all possible single mismatch variants and matched to the normal 0-mismatch kmer index. Or does that bring in additional issues that are not easily resolved?
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #94
          I really kind of like the "mm=t" option because it's similar to allowing a mismatch, but with no speed or memory penalty. It's also possible, though, to generally reduce memory consumption when allowing mismatches using the "speed" or "rskip" flag. "speed=8", for example, would hash only 50% of the kmer space, so it would halve memory consumption; "speed=12" would quarter memory consumption. Of course these reduce sensitivity, but "speed=12 hdist=1" would probably be more sensitive than "speed=0 hdist=0" if there are a lot of mismatches.

          There is no reason for me to not generate read kmers with mismatches in them, other than the speed penalty. But, that speed penalty is hefty - with 31mers, allowing 1 mismatch might make the program 93x slower (particularly if most reads do not match the reference; not so much if they mostly do).

          I will plan to add that capability in, though I'm not convinced that it will be viable on large datasets (raw reads). However, it WOULD be viable for matching a handful of sequences to a very large reference.

          Comment


          • #95
            Yeah that last point is totally fair. I'm also not totally sure that for most applications allowing a mismatch really tells me what I want to know anyways.

            On another subject, are you familiar with the program Sailfish which uses kmer matching and the EM algorithm for gene expression? It is pretty much seal with an EM step that probabilistically estimates abundance.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #96
              I have heard of expectation maximization but not actually studied what it does. I should probably read about it to see if it seems useful.

              I was not really aware of Sailfish until today or yesterday, when the author posted on SeqAnswers that it now has a successor named Salmon. I should probably do a comparative benchmark sometime.
              Last edited by Brian Bushnell; 12-19-2014, 05:33 PM.

              Comment


              • #97
                Yeah they came up with some kind of so-called perfect hash so that every kmer lookup is O(1). The slowest part of their program is probably the EM part (it slows down with increasing size of reference). I think these days if a program intends to put out expression levels for a set of references with ambiguity of mapping between features the EM is the standard. Until something better is figured out...
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #98
                  Originally posted by Brian Bushnell View Post
                  I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things, for example -

                  Adapter trimming:
                  bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=28 mink=12 hdist=1

                  ...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the kmers with "N". "k" specifies the kmer size to use (must be at most the length of the adapters) while "mink" allows it to use shorter kmers at the ends of the read (for example, k=12 for the last 12 bases). "hdist" means "hamming distance"; this allows one mismatch. Instead of "ref=file" you can alternately (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings. The BBTools package currently includes Truseq and Nextera adapters sequences in /bbmap/resources/truseq.fa.gz and /bbmap/resources/nextera.fa.gz.
                  Hi Brian,

                  Got a bit confused with these commands, so I did RNA sequencing using Truseq Stranded mRNA Kit, 200-bp paired-end.

                  I understand that I'm supposed to trim the adapters off

                  *Read 1: I will have to trim (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA)

                  *Read 2: I will have to trim (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT)

                  How should the command be? Could you please advise?

                  Thank you. Merry Christmas!

                  Comment


                  • #99
                    Basically something like this:

                    Code:
                    $ replace_windows_OS_specific_bbduk_command_here -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/truseq.fa.gz ktrim=r [and other options for bbduk to follow]

                    Comment


                    • Originally posted by GenoMax View Post
                      Basically something like this:

                      Code:
                      $ replace_windows_OS_specific_bbduk_command_here -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/truseq.fa.gz ktrim=r [and other options for bbduk to follow]
                      Hi Genomax,

                      Sorry, but I don't understand what is "/path/to/", if my files are all in C: drive Windows, will it be

                      jave -ea -Xmx2g in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2 out1=??? out2=??? ref=C:\bbmap\resources\truseq.fa.gz ktrim=r k=???? mink=???? hdist=????

                      ??? For out1 and out2 do I need to create some kind of new file?
                      ??? I don't know what to put for k, mink and hdist, should k be 33 because the adapter sequences are 33 bases? not sure about mink and hdist.

                      Thank you. Merry Christmas

                      Comment


                      • Originally posted by michaellim View Post
                        Hi Genomax,

                        Sorry, but I don't understand what is "/path/to/", if my files are all in C: drive Windows, will it be

                        Code:
                        $ java -ea -Xmx2g in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2 out1=??? out2=??? ref=C:\bbmap\resources\truseq.fa.gz ktrim=r
                        ??? For out1 and out2 do I need to create some kind of new file?
                        ??? I don't know what to put for k, mink and hdist, should k be 33 because the adapter sequences are 33 bases? not sure about mink and hdist.

                        Thank you. Merry Christmas
                        Assuming your BBMap directory is at c:\ level the following command gives you a general idea of how to structure your own.

                        out1 and out2 represent two new files names that you provide that will contain the adapter trimmed (cleaned) reads from R1 and R2 read files.

                        Code:
                        $ java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2.fastq out1=c:\sequencing\ST131_1_R1_cleaned.fastq out2=c:\sequencing\ST131_1_R2_cleaned.fastq ref=C:\bbmap\resources\truseq.fa.gz ktrim=r
                        If is possible that your reads are already adapter trimmed so no reads may get thrown out/trimmed after running BBDuk tool. Look at the stats produced at the end to see what happens.

                        Look at the contents of bbduk.sh file (open it in wordpad) to understand the meaning of optional parameters. You may or may not want/need to provide additional parameters.
                        Last edited by GenoMax; 12-25-2014, 07:15 AM.

                        Comment


                        • Originally posted by sdriscoll View Post
                          Hey Brian,

                          bbduk and seal are certainly great programs and since I have downloaded these tools I've been using them pretty much daily. While most kmer tools ignore the idea of allowing a mismatch (since obviously this complicates hash lookups) I like that you have included this however using it imposes such a severe spike in memory usage that it becomes basically unusable. Case in point - using seal to produce hits to genes in a full transcriptome while allowing a mismatch per kmer. Is there a reason that you put the burden of additional kmers on the reference and not on the read? It seems to me that there would be a performance hit but I would much rather have that than the massive amount of memory needed to produce a kmer index of a full transcriptome allowing even one mismatch. So during matching the read could be kmer'd into all possible single mismatch variants and matched to the normal 0-mismatch kmer index. Or does that bring in additional issues that are not easily resolved?
                          I just released a new version of BBTools (34.48) which adds this capability. You can use it with the "qhdist" flag (queryhammingdistance). It doesn't really slow down very much as long the vast majority of the reads are expected to have kmer matches; but if the majority of reads don't have kmer matches, there will be a ~3*K-fold slowdown. But even then the speed is still acceptable for most purposes.

                          Here's a sensitivity analysis. I generated 100k 100bp reads from E.coli, and gave them each a random number of SNPs from 0 to 20 (probably 15 on average). Then I tested the percentage that were correctly identified as E.coli by BBDuk:

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f
                          Time: 0.8 seconds
                          Filtered: 33.51%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t
                          Time: 0.8 seconds
                          Filtered: 39.83%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f hdist=1
                          Time: 30 seconds (29 of which were load time)
                          Filtered: 48.62%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f qhdist=1
                          Time: 10 seconds
                          Filtered: 48.62%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t qhdist=1
                          Time: 10 seconds
                          Filtered: 53.51%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t qhdist=1 hdist=1
                          Time: 42 seconds (29 of which were load time)
                          Filtered: 63.06%

                          So "mm=t" is not really as good as hdist=1 if there is a really high error rate, but it's free.

                          P.S. Incidentally, BBMap mapped around 55% on defaults and 66% with high sensitivity (k=12 minratio=0.1).
                          Last edited by Brian Bushnell; 02-23-2015, 08:13 PM.

                          Comment


                          • @Brian: With paired end reads out1m= and out2m= don't seem to work (only outm=). Is that by design?
                            Last edited by GenoMax; 02-27-2015, 11:31 AM. Reason: Corrected

                            Comment


                            • Hi GenoMax,

                              I just tried it, and it seems to work fine for me. Are you getting an error message, or is one of the files simply not being produced? Also, what version are you running?

                              Comment


                              • Originally posted by Brian Bushnell View Post
                                Hi GenoMax,

                                I just tried it, and it seems to work fine for me. Are you getting an error message, or is one of the files simply not being produced? Also, what version are you running?
                                I am using v. 34.33.

                                Code:
                                Exception in thread "main" java.lang.RuntimeException: Unknown parameter out1m=xxxx_AACCAGCT_L001_R1_001_match.fastq.gz
                                	at jgi.BBDukF.<init>(BBDukF.java:400)
                                	at jgi.BBDukF.main(BBDukF.java:62)

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Best Practices for Single-Cell Sequencing Analysis
                                  by seqadmin



                                  While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                                  06-06-2024, 07:15 AM
                                • seqadmin
                                  Latest Developments in Precision Medicine
                                  by seqadmin



                                  Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                  Somatic Genomics
                                  “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                  05-24-2024, 01:16 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 06-21-2024, 07:49 AM
                                0 responses
                                14 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 06-20-2024, 07:23 AM
                                0 responses
                                14 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 06-17-2024, 06:54 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 06-14-2024, 07:24 AM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X