Good morning,
I posted this already in the Genomic Resequencing forum but can't figure out how to move it, so I am reposting it here. Sorry for the double post.
I am an HPC computer engineer interested in doing some Bowtie profiling, using both single threaded, multi-threaded (-p option) and using pMap on our cluster to distribute reads across multiple nodes/processors. To do so, I'm trying to find a good reference dataset/index combo that is both representative of an actual meaningful result, and that will take a long time to complete so that I can get maximum variance between runs as I vary threads and other parameters.
However, as I am not a geneticist, I was wondering if I could get a sanity check here on whether my methodology is correct.
I downloaded the Homo Sapien index from the Illumina iGenomes project here:
in both the Homo_sapiens_Ensembl_GRCh37 and the Homo_sapiens_NCBI_build37.2 formats. Both contain a BowtieIndex directory, and I've been using the NCBI version for my runs.
Next, I needed a read file for the test. After some digging, I found the NCBI page with reads in SRA format:
and looked for a large one [hoping that the bigger the file, the longer the test would run) and ended up going with a SRA file out of the "BySample" section that was approximately 1.66 Gb:
ftp://ftp-trace.ncbi.nlm.nih.gov/sra.../DRR000363.sra
(I will admit: I have absolutely no idea what this is, and have been unable to ascertain this data's exact meaning, so I'm not sure if it's even something you would try to align against the NCBI homo sapien bowtie index. :-)
I downloaded SRAToolkit and used fastq-dump to convert the SRA file to a fastq file, which took the 1.7 GB SRA file and produced a 7.6 GB fastq file. Nice.
So at this point I figured I was all set. I did a test run on my Macbook using the (-p 4) just to make sure I had all the parameters correct. The first run gave me a bunch of memory errors so I increased the --chunkmb param which seemed to have solved it. Here is the output:
$ ./bowtie -t --best --chunkmbs 256 -p 4 genome ~/genomics/DRR000363.fastq ~/genomics/experiment3.map
Time loading forward index: 00:00:16
Time loading mirror index: 00:00:30
Seeded quality full-index search: 01:02:46
# reads processed: 43900909
# reads with at least one reported alignment: 27703616 (63.10%)
# reads that failed to align: 16197293 (36.90%)
Reported 27703616 alignments to 1 output stream(s)
Time searching: 01:03:33
Overall time: 01:03:33
Now, here is where I need your help. Again, I'm not even sure if I'm using the right kind of reads to reference against the homo sapien bowtie index-- and I'm not sure exactly what it means that 36.90% of the reads didn't align. From what I've read, that could be because those reads were of poor quality, but I don't know how to validate this and want to make sure I'm doing it right.
I was hoping somebody could comment on my methodology so that I know I'm doing a run representative of what an actual geneticist might do, and not running some fantastical data experiment that doesn't mimic a real-world scenario. I plan on making this profile study public on our website when done, along with reproducible downloads/commands.
If anybody has suggestions on different read files I could use against these indexes that would be representative of large, long runs, I would be much appreciative. I also don't know if I should be using the Ensemble or the NCBI index, as by all accounts [according to the index file sizes] they are almost identical, so I decided to go with NCBI.
Anyways-- thanks for your time. I really appreciate the help and look forward to hopefully being able to contribute in the near future with some helpful profiling numbers for Bowtie and eventually other tools like bamtools, SOAP, etc. Suggestions are also welcome!
I posted this already in the Genomic Resequencing forum but can't figure out how to move it, so I am reposting it here. Sorry for the double post.
I am an HPC computer engineer interested in doing some Bowtie profiling, using both single threaded, multi-threaded (-p option) and using pMap on our cluster to distribute reads across multiple nodes/processors. To do so, I'm trying to find a good reference dataset/index combo that is both representative of an actual meaningful result, and that will take a long time to complete so that I can get maximum variance between runs as I vary threads and other parameters.
However, as I am not a geneticist, I was wondering if I could get a sanity check here on whether my methodology is correct.
I downloaded the Homo Sapien index from the Illumina iGenomes project here:
in both the Homo_sapiens_Ensembl_GRCh37 and the Homo_sapiens_NCBI_build37.2 formats. Both contain a BowtieIndex directory, and I've been using the NCBI version for my runs.
Next, I needed a read file for the test. After some digging, I found the NCBI page with reads in SRA format:
and looked for a large one [hoping that the bigger the file, the longer the test would run) and ended up going with a SRA file out of the "BySample" section that was approximately 1.66 Gb:
ftp://ftp-trace.ncbi.nlm.nih.gov/sra.../DRR000363.sra
(I will admit: I have absolutely no idea what this is, and have been unable to ascertain this data's exact meaning, so I'm not sure if it's even something you would try to align against the NCBI homo sapien bowtie index. :-)
I downloaded SRAToolkit and used fastq-dump to convert the SRA file to a fastq file, which took the 1.7 GB SRA file and produced a 7.6 GB fastq file. Nice.
So at this point I figured I was all set. I did a test run on my Macbook using the (-p 4) just to make sure I had all the parameters correct. The first run gave me a bunch of memory errors so I increased the --chunkmb param which seemed to have solved it. Here is the output:
$ ./bowtie -t --best --chunkmbs 256 -p 4 genome ~/genomics/DRR000363.fastq ~/genomics/experiment3.map
Time loading forward index: 00:00:16
Time loading mirror index: 00:00:30
Seeded quality full-index search: 01:02:46
# reads processed: 43900909
# reads with at least one reported alignment: 27703616 (63.10%)
# reads that failed to align: 16197293 (36.90%)
Reported 27703616 alignments to 1 output stream(s)
Time searching: 01:03:33
Overall time: 01:03:33
Now, here is where I need your help. Again, I'm not even sure if I'm using the right kind of reads to reference against the homo sapien bowtie index-- and I'm not sure exactly what it means that 36.90% of the reads didn't align. From what I've read, that could be because those reads were of poor quality, but I don't know how to validate this and want to make sure I'm doing it right.
I was hoping somebody could comment on my methodology so that I know I'm doing a run representative of what an actual geneticist might do, and not running some fantastical data experiment that doesn't mimic a real-world scenario. I plan on making this profile study public on our website when done, along with reproducible downloads/commands.
If anybody has suggestions on different read files I could use against these indexes that would be representative of large, long runs, I would be much appreciative. I also don't know if I should be using the Ensemble or the NCBI index, as by all accounts [according to the index file sizes] they are almost identical, so I decided to go with NCBI.
Anyways-- thanks for your time. I really appreciate the help and look forward to hopefully being able to contribute in the near future with some helpful profiling numbers for Bowtie and eventually other tools like bamtools, SOAP, etc. Suggestions are also welcome!
Comment