  • KevinLam
    Senior Member
    • Nov 2009
    • 204

    SOLiD Accuracy Enhancement Tool (SAET) multicore and ref genome size questions

    I just downloaded SAET to try it out.

    One of the options jumped out at me:

    refLength – expected length of the assembled sequence, e.g., 4600000 for E.Coli
    4.6Mb genome.


    It is not explained in detail how this value is used, or what happens if you are dealing with a new organism, a mixed transcriptome sample, or an environmental sample. Do we just throw in a value?

    Another question: how do I use more than 4 cores under PBS to run multiple threads?
    http://kevin-gattaca.blogspot.com/
  • Rao
    Member
    • Oct 2008
    • 36

    #2
    Based on genome size... I think it decides the k-mer size that is used for error correction.


    • javijevi
      Member
      • Jan 2010
      • 38

      #3
      refLength parameter

      Maybe it (also) uses the refLength parameter to calculate the estimated frequency cutoff for trusted seeds in the spectrum construction step?

      I think it is a crucial value with a strong influence on the quality of the results, and its use should be clarified further...

      Can any developer help?


      • kevleb
        Member
        • Jun 2009
        • 10

        #4
        Does anybody have an idea whether it is worth running the SAET tool on smallRNA sequencing data?
        If yes, what should the "expected length of the assembled sequence" be?
        For a human sample it could be:
        - 70,000 for miRNAs
        - 8,000,000 for human ncRNAs
        - 110,000,000 for RefSeq human mRNAs
        - 2,900,000,000 for the whole genome


        • westerman
          Rick Westerman
          • Jun 2008
          • 1104

          #5
          An interesting question for which I have no answer. Since I have been wondering this myself let us hope that someone can answer. The entire SAET modification is a black-box mystery to me.


          • KevinLam
            Senior Member
            • Nov 2009
            • 204

            #6
            I think ABI finally bothered to release more information.
            Basically it works like the SoftGenetics condensation tool.

            It finds reads that have a single mismatch against all the similar reads and, based on that, corrects the basecalls.

            I leave it up to you to decide if this is a good thing.
            http://kevin-gattaca.blogspot.com/


            • westerman
              Rick Westerman
              • Jun 2008
              • 1104

              #7
              Since a single mismatch can only be a sequencing error then, in theory, correcting the error is a Good Thing. In practice I find that running SAET can take a long time. My last SAET took 100 CPU hours (about 12 hours wall time) but before that I ran a SAET that was heading off into multiple days of wall time. I eventually broke down the read file into two parts and ran SAET on each part. But that weakens the algorithm.

              I would almost rather have the mapping part of the algorithm deal with single-mismatch errors, either by throwing away the reads or by doing as good a mapping as it can. I might take my current project (a full plate, F3/R3 pairs, so lots of data) and process it both with and without SAET to see if it makes any difference.

              My field specialist recommends running SAET only on small genomes (e.g., bacterial).


              • schmima
                Member
                • Apr 2010
                • 56

                #8
                'expected length of the assembled sequence'

                Maybe I'm wrong, but I understand it like this:
                Assume you would use your reads (!) to assemble a genome/sequence (de novo). Then - how long would the sequence be? => use this as the parameter.
                Ergo - in a nice world - if you sequence...
                ...genomic DNA use the genome size.
                ...mRNA use the transcriptome size (count only one gene variant per locus)
                ... and so on

                Sounds easy enough, but a 'near to correct' estimate seems, to me, hard to get in some cases. Do the reads come only from the type of nucleic acid you were interested in? I guess this is often not the case.
                As an example, take mRNA-seq with amplified material (poly-A or random priming):
                => origin of reads:
                DNase digest not 100% efficient (which is normally the case):
                the expected length of the assembled sequence would increase (genomic contamination).
                => systematic coverage bias:
                3' bias (for poly-A priming):
                the expected length of the assembled sequence would drop (in some cases dramatically).
                => coverage in general:
                Assume you have a total of 1 gigabase of reads from human genomic DNA. Using the human genome size would not be a good idea, as you cannot assemble 2,900,000,000 bases from 1 gigabase of reads. And 1 gigabase is not a good estimate either, as you would not expect a continuous coverage of 1x.
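The coverage argument above can be put into rough numbers. This is only a sketch: it assumes reads land uniformly at random on the genome (a Poisson / Lander-Waterman style model), which real libraries do not do, and the function name is made up for illustration. It says nothing about how SAET itself uses refLength.

```python
import math


def expected_assembled_length(total_read_bases, genome_size):
    """Estimate how much of a genome is covered at least once.

    Assumption: read positions are uniform and independent, so per-base
    depth is roughly Poisson with mean equal to the average coverage.
    """
    coverage = total_read_bases / genome_size      # mean depth
    covered_fraction = 1.0 - math.exp(-coverage)   # P(depth >= 1)
    return coverage, covered_fraction, covered_fraction * genome_size


# 1 gigabase of reads against a 2.9 Gb human genome, as in the example above.
coverage, fraction, covered = expected_assembled_length(
    1_000_000_000, 2_900_000_000
)
print(f"mean coverage:    {coverage:.2f}x")
print(f"fraction covered: {fraction:.1%}")
print(f"covered bases:    {covered / 1e6:.0f} Mb")
```

Under these assumptions the mean coverage is about 0.34x and a bit under a third of the genome is touched by at least one read, so neither 2.9 Gb nor 1 Gb looks like a sensible "expected assembled length" for that data set.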

                Hm - for smallRNAs in your 'human' case, I guess you would have to think about where your reads could come from (this equals the small RNAs only if you are sure that all reads come from small RNAs) and use the size of that sequence.

                However... I don't dare to give an estimate
                and anyway: it's called estimate ^^

                added later on:
                From the bioscope manual (why don't they write this inside the SAET readme?):
                'The expected length of the assembled sequence, for example: 4,600,000
                for E.coli, 4.6 Mb genome, or 30,000,000 for Whole Human Transcriptome.'

                So seems that I was not completely wrong...

                I still don't know what the parameter is used for during the correction...
                I compared a SAET run with refLength = 120'000'000 to one with refLength = 13'000'000.
                Number of changed reads:
                refLength = 120'000'000: 19'703'256
                refLength = 13'000'000: 23'926'281
                total reads: 43'740'114
                => the parameter is definitely used in the calculation...
                Maybe by taking into account the probability of random read overlaps? Or by changing the seed length?

                Does anyone know?
                Last edited by schmima; 07-20-2010, 02:00 AM. Reason: adding something
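For scale, the read counts from the two runs above can be turned into fractions. The numbers are taken straight from the post; the variable names are just for illustration.

```python
total_reads = 43_740_114

# refLength -> number of reads SAET changed, as reported above
changed = {
    120_000_000: 19_703_256,
    13_000_000: 23_926_281,
}

fractions = {ref_length: n / total_reads for ref_length, n in changed.items()}
for ref_length, frac in fractions.items():
    print(f"refLength = {ref_length:>11,}: {frac:.1%} of reads changed")

# Extra reads changed when the smaller refLength is used
extra = changed[13_000_000] - changed[120_000_000]
print(f"difference: {extra:,} reads")
```

So roughly 45% of reads were changed with the larger refLength and roughly 55% with the smaller one, a difference of about 4.2 million reads.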


                • poisson200
                  Member
                  • Feb 2010
                  • 63

                  #9
                  Re-open the SAET debate

                  From the Bioscope 1.3 manual:
                  "The set of such trusted k-mers approximates the set of all k-mers in the genome".....So

                  If I understand SAET correctly, it looks for k-mers that are absent from a population of trusted k-mers and corrects them (because they are not trusted)?
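To make the "trusted k-mer" idea concrete, here is a minimal sketch of a generic k-mer spectrum with a frequency cutoff. It is NOT SAET's actual algorithm, which is not documented in this thread; the function names, the base-space toy reads, and the read length of 50 and k of 20 used at the end are all assumptions for illustration only.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trusted_kmers(reads, k, cutoff):
    """k-mers seen at least `cutoff` times are treated as trusted."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n >= cutoff}


def expected_kmer_coverage(num_reads, read_length, k, ref_length):
    """Average number of times each genomic k-mer should be sampled."""
    return num_reads * (read_length - k + 1) / ref_length


# Toy data: three identical reads and one read with a single mismatch.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
trusted = trusted_kmers(reads, k=4, cutoff=2)
untrusted = set(kmer_counts(reads, 4)) - trusted
print(sorted(trusted))    # k-mers supported by several reads
print(sorted(untrusted))  # k-mers created by the mismatch

# How a refLength guess would shift the expected k-mer multiplicity,
# using the read count from post #8 and an ASSUMED read length and k.
for ref_length in (120_000_000, 13_000_000):
    cov = expected_kmer_coverage(43_740_114, 50, 20, ref_length)
    print(f"refLength = {ref_length:,}: expected k-mer coverage ~ {cov:.1f}")
```

If a tool derived its cutoff from something like this expected coverage, a smaller refLength would imply a higher expected multiplicity and so a stricter notion of "trusted", which would be one possible reason for refLength to change the number of corrected reads. That is speculation, not a statement about what SAET does.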

                  If you are working with reads from the genome or transcriptome of a model species that is close to complete (like human or mouse), why not use that transcriptome or genome to generate the model k-mers?

                  It possibly could be this option, saet.outspecbin=1

                  Has anyone tried this? I am about to, and I am not sure:
                  a) whether it is possible
                  b) whether I should convert the base-space genome to colourspace
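On point b), converting a base-space sequence to SOLiD colour space is straightforward with the standard two-base encoding, where each colour is the XOR of the 2-bit codes of two adjacent bases. A minimal sketch follows; the function name is made up, and it does not handle N or other ambiguity codes. Whether SAET can take such a reference-derived spectrum as input is not confirmed anywhere in this thread.

```python
# 2-bit codes used by the standard SOLiD two-base encoding
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def base_to_colour(seq):
    """Return the leading base followed by one colour per adjacent base pair."""
    seq = seq.upper()
    if not seq:
        return ""
    colours = [str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colours)


print(base_to_colour("ATGGCA"))  # A31031
```

For k-mer counting in colour space you would presumably drop the leading base and work on the colour string only, since the colours themselves do not depend on where the read starts.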


                  • golharam
                    Member
                    • Dec 2009
                    • 55

                    #10
                    Is SAET still available on solidsoftwaretools? I don't seem to have access to it anymore.


                    • schmima
                      Member
                      • Apr 2010
                      • 56

                      #11
                      Yes - but it's within the denovo2 package which you can download at:


                      • jbdorr
                        Junior Member
                        • Aug 2012
                        • 4

                        #12
                        download denovo

                        denovo2 doesn't exist on the SOLiD software tools page (http://solidsoftwaretools.com/gf/project/denovo/frs/).

                        It can be downloaded from my Master's degree page: www.inf.ufpr.br/jbdorr/denovo2.tgz

