Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Endiel
    Junior Member
    • Sep 2012
    • 6

    Profiling Bowtie Performance: Good Reference Dataset?

    Good morning,

    I posted this already in the Genomic Resequencing forum but can't figure out how to move it, so I am reposting it here. Sorry for the double post.

    I am an HPC computer engineer interested in doing some Bowtie profiling, using both single threaded, multi-threaded (-p option) and using pMap on our cluster to distribute reads across multiple nodes/processors. To do so, I'm trying to find a good reference dataset/index combo that is both representative of an actual meaningful result, and that will take a long time to complete so that I can get maximum variance between runs as I vary threads and other parameters.

    However, as I am not a geneticist, I was wondering if I could get a sanity check here on whether my methodology is correct.

    I downloaded the Homo Sapien index from the Illumina iGenomes project here:



    in both the Homo_sapiens_Ensembl_GRCh37 and the Homo_sapiens_NCBI_build37.2 formats. Both contain a BowtieIndex directory, and I've been using the NCBI version for my runs.

    Next, I needed a read file for the test. After some digging, I found the NCBI page with reads in SRA format:



    and looked for a large one [hoping that the bigger the file, the longer the test would run) and ended up going with a SRA file out of the "BySample" section that was approximately 1.66 Gb:

    ftp://ftp-trace.ncbi.nlm.nih.gov/sra.../DRR000363.sra

    (I will admit: I have absolutely no idea what this is, and have been unable to ascertain this data's exact meaning, so I'm not sure if it's even something you would try to align against the NCBI homo sapien bowtie index. :-)

    I downloaded SRAToolkit and used fastq-dump to convert the SRA file to a fastq file, which took the 1.7 GB SRA file and produced a 7.6 GB fastq file. Nice.

    So at this point I figured I was all set. I did a test run on my Macbook using the (-p 4) just to make sure I had all the parameters correct. The first run gave me a bunch of memory errors so I increased the --chunkmb param which seemed to have solved it. Here is the output:

    $ ./bowtie -t --best --chunkmbs 256 -p 4 genome ~/genomics/DRR000363.fastq ~/genomics/experiment3.map
    Time loading forward index: 00:00:16
    Time loading mirror index: 00:00:30
    Seeded quality full-index search: 01:02:46
    # reads processed: 43900909
    # reads with at least one reported alignment: 27703616 (63.10%)
    # reads that failed to align: 16197293 (36.90%)
    Reported 27703616 alignments to 1 output stream(s)
    Time searching: 01:03:33
    Overall time: 01:03:33


    Now, here is where I need your help. Again, I'm not even sure if I'm using the right kind of reads to reference against the homo sapien bowtie index-- and I'm not sure exactly what it means that 36.90% of the reads didn't align. From what I've read, that could be because those reads were of poor quality, but I don't know how to validate this and want to make sure I'm doing it right.

    I was hoping somebody could comment on my methodology so that I know I'm doing a run representative of what an actual geneticist might do, and not running some fantastical data experiment that doesn't mimic a real-world scenario. I plan on making this profile study public on our website when done, along with reproducible downloads/commands.

    If anybody has suggestions on different read files I could use against these indexes that would be representative of large, long runs, I would be much appreciative. I also don't know if I should be using the Ensemble or the NCBI index, as by all accounts [according to the index file sizes] they are almost identical, so I decided to go with NCBI.

    Anyways-- thanks for your time. I really appreciate the help and look forward to hopefully being able to contribute in the near future with some helpful profiling numbers for Bowtie and eventually other tools like bamtools, SOAP, etc. Suggestions are also welcome!
  • Endiel
    Junior Member
    • Sep 2012
    • 6

    #2
    Just wanted to update, I think I found my own answer after digging a bit more on the ncbi website.

    I dug into the Homo Sapien reads project, and found a read sequence with about 60M spots:



    The sample is described as:

    "BromoUridine-labeled HeLa TO ARE cells transfected with RRP46-1 siRNA were incubated for 24 hours in the absence of BrU (chasing), followed by isolation of BrU-RNA by anti-BrU antibody. Purified BrU-RNAs were analyzed."

    "Species: Homo Sapien"

    So at least now I know I'm doing something relevant. Here's the download SRA:

    ftp://ftp-trace.ncbi.nlm.nih.gov/sra.../DRR000886.sra

    Is 60M spots a pretty high number? For my study, the more the better. If anybody knows of some that are much larger than this, please let me know.

    Thanks.

    Comment

    • srasdk
      Member
      • Jun 2011
      • 19

      #3
      NA12878 is the most studied individual human genome so you may be able to cross-check your results if you need.
      Look for "Whole genome sequencing" to get anticipated even coverage (not so in real life). Look also for HiSeq 2000 instrument to get very large number of reads.
      As an example, check this experiment out (I did not spend much time researching it):

      It has 4 runs, and 170Gbases, which is 60x+ of human genome. This is as high as it gets now in sequencing of non-cancer human genomes.

      Comment

      • srasdk
        Member
        • Jun 2011
        • 19

        #4
        Even bigger one by Broad (MIT and Harvard):


        ...
        And it contains Broad's alignments, so you can compare the results.
        Last edited by srasdk; 09-21-2012, 07:43 PM.

        Comment

        • Endiel
          Junior Member
          • Sep 2012
          • 6

          #5
          Originally posted by srasdk View Post
          NA12878 is the most studied individual human genome so you may be able to cross-check your results if you need.
          Look for "Whole genome sequencing" to get anticipated even coverage (not so in real life). Look also for HiSeq 2000 instrument to get very large number of reads.
          As an example, check this experiment out (I did not spend much time researching it):

          It has 4 runs, and 170Gbases, which is 60x+ of human genome. This is as high as it gets now in sequencing of non-cancer human genomes.
          sradsk-- this is absolutely perfect and what I was looking for. I will post back what my numbers look like once kick off a bunch of bowtie runs.

          Question-- and you may or may not know how to answer this-- but according to the ncbi site for this project, these are paired end reads... which means that I think when I run fastq-dump I have to specify "--split-file" so that it creates two paired complement fastq files to give to bowtie. Does that sound right?

          My second question is how best to run bowtie with this data. This is how I was planning on running it to get maximum coverage and alignment:

          bowtie -t --best --tryhard --chunkmbs 2048 -X 1000 -p 8 genome -1 file1.fastq -2 file2.fastq genome.map

          Should I specify a -S and have it output SAM data? Sorry if these are dumb questions-- I've read the documentation and searched extensively for examples but it's good to have a sanity check.

          Thanks again for your help.

          Comment

          • Endiel
            Junior Member
            • Sep 2012
            • 6

            #6
            Just wanted to update with my progress. Got some pretty good results.

            Comment

            • Chipper
              Senior Member
              • Mar 2008
              • 323

              #7
              Thanks for the report. For 100 bp reads you are likely to get better results with bowtie2. If you have access to a decent graphics card you could test the soap3-dp aligner as well, it's 10x faster than bowtie2 on a GTX580:

              Comment

              • Endiel
                Junior Member
                • Sep 2012
                • 6

                #8
                Chipper-- good to know. I will check it out.

                As far as the soap3-dp aligner, I'm curious how well the memory transfer to/from the card with two 69 GB paired-end read files would work. I will look into it though. Thanks!

                Comment

                Latest Articles

                Collapse

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-05-2026, 10:09 AM
                0 responses
                12 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-04-2026, 08:59 AM
                0 responses
                23 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                28 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 11:40 AM
                0 responses
                22 views
                0 reactions
                Last Post SEQadmin2  
                Working...