  • Short read benchmark data

    I'm interested in running baseline benchmarks on several short-read alignment tools to see how their performance changes as machine specs change (CPU type, core count, clock speed, RAM speed, caches, etc.). There can be all kinds of subtle effects, especially on large 24- or 32-core servers, which have unique memory access problems.

    I'm hoping to run Bowtie, SOAP, and Maq as my baseline tools. I'm not really comparing the three tools against each other; I'm more interested in comparing hardware effects on short read alignment in general. I'll probably focus on Bowtie just because it's so powerful (and much faster!).

    My problem (as a CS guy, not a bio guy) is what test data to use as my baseline. I don't want to make my own synthetic data; I want to run a typical problem that real users of these tools submit.

    Could someone recommend where I could find, or how I could create, a test suite of data? It would probably consist of a few tens of millions of short reads and one or more larger databases to match them against. I would probably run multiple cases, perhaps 4 runs allowing k=0,1,2,3 mismatches or something.
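
The k=0,1,2,3 sweep described above could be scripted roughly as follows. This is only a sketch: it assumes Bowtie 1 is on the PATH and that `-v` (maximum mismatches) and `-p` (threads) behave as described in the Bowtie manual, so check the flags against the installed version. The index name, read file, and output file names are hypothetical placeholders.

```python
import subprocess
import time
from typing import Callable, Dict, List, Optional, Sequence


def bowtie_command(index: str, reads: str, mismatches: int, threads: int = 1) -> List[str]:
    """Build one Bowtie 1 command line.

    Assumed flags: -v = max mismatches (0..3), -p = worker threads.
    The output file name is a made-up convention for this sketch.
    """
    if not 0 <= mismatches <= 3:
        raise ValueError("this sketch only covers k = 0..3")
    return ["bowtie", "-v", str(mismatches), "-p", str(threads),
            index, reads, f"hits_k{mismatches}.map"]


def run_sweep(index: str, reads: str, ks: Sequence[int] = (0, 1, 2, 3),
              threads: int = 1,
              runner: Optional[Callable[[List[str]], object]] = None) -> Dict[int, float]:
    """Run one alignment per k and return wall-clock seconds keyed by k."""
    if runner is None:
        # Default: actually launch the aligner and fail loudly on errors.
        runner = lambda cmd: subprocess.run(cmd, check=True)
    timings: Dict[int, float] = {}
    for k in ks:
        cmd = bowtie_command(index, reads, k, threads)
        start = time.perf_counter()
        runner(cmd)
        timings[k] = time.perf_counter() - start
    return timings
```

Passing a different `runner` (for example one that pins the process to specific cores, or just records the commands) keeps the timing loop the same while the hardware setup varies, which is the comparison being asked about.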

    I notice the 2008 Bowtie paper takes samples from the 1000 Genomes Project, trims them to 35 bases, and aligns them against the human genome reference. Is this a good test case, typical of real use of aligners? Would typical uses involve longer input sequences, shorter ones, or a mix? Again, I'm just looking for typical workloads where the software speed is measurable and I can see where the hardware sensitivities are.
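
For reference, trimming reads to a fixed length as described above might look like the sketch below. It assumes plain four-line FASTQ records (header, sequence, `+`, qualities) and is not the paper's actual preprocessing script; the function name `trim_fastq` is made up for illustration.

```python
from typing import Iterable, Iterator, List


def trim_fastq(lines: Iterable[str], length: int = 35) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality cut to `length` bases.

    Assumes strict 4-line records; multi-line FASTQ is not handled.
    """
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield header
            yield seq[:length]   # keep the first `length` bases
            yield plus
            yield qual[:length]  # keep the matching quality values
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
```

Since it works on any iterable of lines, it can be fed an open file handle and the output written straight to a new file.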

    I appreciate any help in getting these baseline benchmarks run on my hardware!

  • #2
    Try here for the readset



    I suggest typing "yoruban" in search to get some human data.

    If you use bowtie try

    36bp isn't really state of the art any more. Try 100bp paired end data.

    Novoalign, Shrimp2 or bwa are also very fast and nice aligners which can handle gapped alignments of reads (typical with longer reads), in contrast to bowtie.



    • #3
      If you look at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ you can see what data is currently available from 1000genomes. The sequence.index file will be most useful; if you look at the read and base count columns you can work out the read length as well.
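
Working out read length from those two columns is just base count divided by read count. A minimal sketch, assuming the index is tab-delimited with a header row and that the columns are named `FASTQ_FILE`, `READ_COUNT`, and `BASE_COUNT` (the names are an assumption here; check them against the real file's header):

```python
import csv
from typing import Dict, Iterable


def mean_read_lengths(index_lines: Iterable[str],
                      key: str = "FASTQ_FILE",
                      reads_col: str = "READ_COUNT",
                      bases_col: str = "BASE_COUNT") -> Dict[str, float]:
    """Return mean read length (bases / reads) per file listed in the index.

    Column names are assumed, not verified against the live file.
    """
    reader = csv.DictReader(index_lines, delimiter="\t")
    lengths: Dict[str, float] = {}
    for row in reader:
        try:
            reads = int(row[reads_col])
            bases = int(row[bases_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric counts
        if reads > 0:
            lengths[row[key]] = bases / reads
    return lengths
```

Filtering the result for values near 100 would be one way to pick out the longer-read runs suggested in this thread.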

      As the previous responder mentioned, you probably want to go for 75-100bp reads to be closer to how things work now.

      You might want to decide if you want a mix of technologies or just stick with one



      • #4
        Thanks for the suggestions! Especially about the fact that using longer reads is more typical of actual use.

        Laura, thanks also for the FTP link. It looks like there's a ton of info there.


        Colin, are gapped short read alignments usually due to paired-end reads? So the short read you're trying to search for in the database would be something like a known 50bp, a gap of unknown size (300-600 bp), and another known 50bp? [My numbers there are made up; I just want to test my understanding.]
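
The layout described in that question (two known ends of one fragment with an unsequenced stretch between them) can be sketched as below. This only models the paired-end layout itself, with the common assumption that the second mate is read from the opposite strand; the function names and the numbers in the usage note are illustrative, not from any tool.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_reads(reference: str, start: int, fragment_len: int,
                 read_len: int = 50) -> Tuple[str, str, int]:
    """Cut one fragment from `reference` and read both of its ends.

    Returns (mate1, mate2, inner_gap), where inner_gap is the number of
    unsequenced bases between the two mates.
    """
    if fragment_len < 2 * read_len:
        raise ValueError("fragment too short for two non-overlapping reads")
    fragment = reference[start:start + fragment_len]
    if len(fragment) < fragment_len:
        raise ValueError("fragment runs past the end of the reference")
    mate1 = fragment[:read_len]
    # Assumption: mate 2 comes from the far end, on the opposite strand.
    mate2 = reverse_complement(fragment[-read_len:])
    return mate1, mate2, fragment_len - 2 * read_len
```

For example, a 400bp fragment with 50bp reads leaves a 300bp inner gap, matching the made-up numbers in the question.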


        And another question: what mismatch level is typically used for reads? I notice the Bowtie paper shows results with k=0, 1, or 2, but it becomes exponentially slower as k increases. What k is typically used? Would higher k>2 matches be useful, or are they so noisy that they wouldn't be used even if they were fast? Finally, are all of these mismatch choices based on k transcription differences (a single bp mismatch with the reference), or are they an edit difference, additionally allowing spurious insertions and deletions?
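
To make the two notions in that last question concrete: counting per-position mismatches is the Hamming distance, while an edit difference that also allows insertions and deletions is the Levenshtein distance. The sketch below is a textbook version of each, not how any particular aligner scores reads.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


# One substituted base: both measures agree.
print(hamming("ACGTACGT", "ACGAACGT"), edit_distance("ACGTACGT", "ACGAACGT"))  # 1 1
# A one-base shift: every position mismatches, but only two edits are needed.
print(hamming("ACGTACGT", "CGTACGTA"), edit_distance("ACGTACGT", "CGTACGTA"))  # 8 2
```

The second example shows why the distinction matters for a benchmark: a single indel looks like many mismatches to a mismatch-only search.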

        And a final "what's typical use?" question:
        How many reads are typical for a researcher to run? 10 million a day or something? [I have no idea, that's a wild guess, and may be orders of magnitude off in either direction.]


        Thanks again!



        • #5
          You raise some interesting questions, but it would be useful to do the evaluation with someone actually working with the data. It becomes really difficult to run the tools and compare at various levels.

          I have been looking at ways to compare aligners as well, using simulated data or the 1kg sequenced sample with genotyped variants, but at multiple levels it does not remain an apples vs apples comparison due to various limitations, such as a tool not calling variants.
          --
          bioinfosm



          • #6
            Originally posted by bioinfosm:
            but at multiple levels it does not remain an apples vs apples comparison due to various limitations, like a tool does not call variants, etc.
            Agreed. I found this thread very useful in case you want to go for the simulated option.
            -drd
