Here I will present a small comparison between different read alignment programs, including eland, soap, rmap and maq. The comparison is designed as follows. From human reference chrX, I simulated 1 million pairs of 32bp reads (i.e. 2 million reads) with maq. Maq first simulated a diploid sequence (two haploid sequences) by adding 135351 substitutions, 7650 1bp insertions and 7510 1bp deletions, and then generated reads from the diploid sequence. The base quality were generated based on real data. Sequencing errors were added based on qualities. As we know the exact position where the reads come from, we can exactly count the number of reads that are wrongly mapped by a certain alignment program.
To facilitate comparison, I wrote a Perl script 'paf_utils.pl' that converts various alignment format to the socalled PAF format (Pairwise Alignment Format), which is only for my own use at the moment.
Important Notes
===============
* This benchmark favours maq because I have not squeezed the best performance out of other software. Here are the reasons.
* eland: eland is able to make use of paired end (PE) information and generate accurate mapping qualities. However, these functionality are assisted by several scripts in the GAPipeline-0.3.0 and for the moment I do not know how to run these scripts independently without invoking the full Gerald module of the pipeline. I was running eland in the single end mode without mapping qualities.
* rmap: rmapq is able to use base qualities to improve the alignment. However, it requires _prb.txt format which is rarely used at the Sanger Institute and in Illumina as well. Maq does not support this format in simulation and therefore rmapq is not evaluated. I am sure rmapq would work much better than rmap if base qualities were available.
* soap: soap also comes with the PE alignment mode. Unfortunately, it segfaults when I try to use this mode. Perhaps my input file is buggy or I was doing stupid things, but anyway, I could not evaluate soap in PE alignment mode. If this mode worked, soap would get higher mapping accuracy.
* shrimp: I tuned the parameters for faster speed but with lower sensitivity. The overall accuracy would be affected. In addition, I was using version 1.04 instead of the latest 1.05. The latest version has been optimized a bit for mapping Illumina reads.
* In simulation we assume that base qualities are accurate, but this is not the case on real data. When base qualities are inaccurate, maq's mapping qualities will be less accurate, which will also reduce the accuracy of alignments. rmapq will suffer from the similar problem more or less.
* The CPU time is only a rough approximate. It is subjected to the conditions (e.g. memory and cache uses) of the machine where the program runs. On different conditions, the speed may vary up to 30%.
Results
=======
* eland (GAPipeline-0.3.0, mapping quality ROUGHLY estimated assuming uniform quality Q25)
- CMD: ./eland_32 r12.fa chrX eland.txt
- CPU time: 284.64 sec
- res memory: 383 MB
- # Q0;Q10;Q20 mapped reads: 1,686,129;1678776;1622577
- # Q0;Q10;Q20 wrongly mapped reads: 7,406;4234;819 (p_err: 0.439%;0.252%;0.0505%)
- Other features: PE alignment (not evaluated); mapping qualities (not evaluated); counts of the alternative places.
* rmap (0.2)
- CMD: ./rmap r12.fa -m 2 -o rmap.txt -c chrX.fa -w 32 -v
- CPU time: 2230.07 sec
- res memory: 894 MB
- # mapped reads: 1,686,129
- # wrongly mapped reads: 7,408 (p_err: 0.4393%)
- Other features: mapping using base qualities (not evaluated); 3 or more mismatches (not evaluated)
* maq-se (0.6.3-33, almost identical to 0.6.4)
- CMD: maq map maq_se.map chrX.bfa r12.bfq
- CPU time: 1039.89 sec
- res memory: 819 MB
- # Q0;Q10;Q20 mapped reads: 1946152; 1665959; 1614122
- # Q0;Q10;Q20 wrongly mapped reads: 200563; 1317; 320 (p_err: 10.33%; 0.0790%; 0.0198%)
* maq-pe (0.6.3-33, almost identical to 0.6.4)
- CMD: maq map maq_pe.map chrX.bfa r1.bfq r2.bfq
- CPU time: 1256.13 sec
- res memory: 674 MB
- # Q0;Q10;Q20 mapped reads: 1949177; 1756368; 1728480
- # Q0;Q10;Q20 wrongly mapped reads: 68542; 281; 153 (p_err: 3.516%; 0.0160%; 0.0088%)
- # reads mapped with indel: 2277
- # wrongly mapped indel reads: 0 (p_err = 0/2277 = 0%)
* soap-c0r0g0s (1.07, CPU time is measured with 1.05)
- CMD: ./soap -c 0 -r 0 -g 0 -a r12.fa -d chrX.fa -o soap_c0r0g0s.sop
- CPU time: 1228.120 sec (or 1417.29 user time/372.44 real time if '-p 4' is applied)
- res memory: 963 MB
- # mapped reads: 1,686,034
- # wrongly mapped reads: 7,840 (p_err: 0.4650%)
* soap-c0r0g3s (1.07, CPU time with 1.05)
- CMD: ./soap -c 0 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c0r0g3s.sop
- CPU time: 1398.95 sec
- res memory: 963 MB
- # mapped reads: 1,687,887
- # wrongly mapped reads: 7,854 (p_err: 0.4653%)
- # reads mapped with indel: 1853
- # wrongly mapped indel reads: 14 (p_err: 0.756%)
* soap-c42r0g3s (1.07, CPU time with 1.05)
- CMD: ./soap -c 42 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c42r0g3s.sop
- CPU time: 1517.41 sec
- res memory: 963 MB
- # mapped reads: 1,698,212
- # wrongly mapped reads: 8,131 (p_err: 0.4788%)
- # reads mapped with indel: 2207
- # wrongly mapped indel reads: 41 (p_err: 1.86%)
* shrimp (1.04)
- CMD: ./rmapper-ls -s 111111011111 -w 35 -n 3 -r 32 -o 20 -d -1 -h 2675 r12.fa chrX.fa
- CPU time: 11875.75 sec
- res memory: 3085MB
- # all;normodds>0.5 mapped reads: 1930350;1587003
- # all;normodds>0.5 wrongly mapped reads: 204355;2414 (p_err: 10.6%;0.152%)
- # all;normodds>0.5 reads mapped with indel: 2062;1817
- # all;normodds>0.5 wrongly mapped indel reads: 212;24 (p_err: 10.3%;1.38%)
Additional Notes
================
* eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.
* rmap: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq-se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.
* soap: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.
* shrimp: Actually I was not expecting that a program using seeding+Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP's normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to "nan" normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.
* maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq's huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq's random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.
Update
======
* 2008-03-13
- Use maq version 0.6.3-33, which is faster than the public version 0.6.3.
- Compile all the programs with Intel C/C++ compiler and run all the programs on the same 8-core 64bit machine (Xeon-L5320 1.86GHz) with 8GB memory. Multi-threading is NOT used. Previously I run some programs on a faster machine (Xeon-5150 2.66GHz) and some on a slower machine (1.86GHz), which is unfair.
- I invoked two programs at a time. The combination is: (maq_se,maq_pe), (eland,eland-multi), (rmap,soap_c0r0g0s), (soap_c0r0g3s,soap_c42r0g3s). Note that processes on the same machine may compete with the memory bandwidth and cache uses, which may affect the speed.
* 2008-03-14
- Test multithreading of SOAP.
* 2008-03-22
- Use soap_1.07. But PE mode did not work for me unfortunately.
- Add comments for the additional wrong alignments of soap (in comparison to eland).
- Add "fake" eland mapping quality (done by my paf_utils.pl script).
- Evaluate indel alignment. Note that I am using soap's single-end mode. PE mode would give better accuracy.
* 2008-03-27
- Evaluate SHRiMP-1.04
To facilitate comparison, I wrote a Perl script 'paf_utils.pl' that converts various alignment format to the socalled PAF format (Pairwise Alignment Format), which is only for my own use at the moment.
Important Notes
===============
* This benchmark favours maq because I have not squeezed the best performance out of other software. Here are the reasons.
* eland: eland is able to make use of paired end (PE) information and generate accurate mapping qualities. However, these functionality are assisted by several scripts in the GAPipeline-0.3.0 and for the moment I do not know how to run these scripts independently without invoking the full Gerald module of the pipeline. I was running eland in the single end mode without mapping qualities.
* rmap: rmapq is able to use base qualities to improve the alignment. However, it requires _prb.txt format which is rarely used at the Sanger Institute and in Illumina as well. Maq does not support this format in simulation and therefore rmapq is not evaluated. I am sure rmapq would work much better than rmap if base qualities were available.
* soap: soap also comes with the PE alignment mode. Unfortunately, it segfaults when I try to use this mode. Perhaps my input file is buggy or I was doing stupid things, but anyway, I could not evaluate soap in PE alignment mode. If this mode worked, soap would get higher mapping accuracy.
* shrimp: I tuned the parameters for faster speed but with lower sensitivity. The overall accuracy would be affected. In addition, I was using version 1.04 instead of the latest 1.05. The latest version has been optimized a bit for mapping Illumina reads.
* In simulation we assume that base qualities are accurate, but this is not the case on real data. When base qualities are inaccurate, maq's mapping qualities will be less accurate, which will also reduce the accuracy of alignments. rmapq will suffer from the similar problem more or less.
* The CPU time is only a rough approximate. It is subjected to the conditions (e.g. memory and cache uses) of the machine where the program runs. On different conditions, the speed may vary up to 30%.
Results
=======
* eland (GAPipeline-0.3.0, mapping quality ROUGHLY estimated assuming uniform quality Q25)
- CMD: ./eland_32 r12.fa chrX eland.txt
- CPU time: 284.64 sec
- res memory: 383 MB
- # Q0;Q10;Q20 mapped reads: 1,686,129;1678776;1622577
- # Q0;Q10;Q20 wrongly mapped reads: 7,406;4234;819 (p_err: 0.439%;0.252%;0.0505%)
- Other features: PE alignment (not evaluated); mapping qualities (not evaluated); counts of the alternative places.
* rmap (0.2)
- CMD: ./rmap r12.fa -m 2 -o rmap.txt -c chrX.fa -w 32 -v
- CPU time: 2230.07 sec
- res memory: 894 MB
- # mapped reads: 1,686,129
- # wrongly mapped reads: 7,408 (p_err: 0.4393%)
- Other features: mapping using base qualities (not evaluated); 3 or more mismatches (not evaluated)
* maq-se (0.6.3-33, almost identical to 0.6.4)
- CMD: maq map maq_se.map chrX.bfa r12.bfq
- CPU time: 1039.89 sec
- res memory: 819 MB
- # Q0;Q10;Q20 mapped reads: 1946152; 1665959; 1614122
- # Q0;Q10;Q20 wrongly mapped reads: 200563; 1317; 320 (p_err: 10.33%; 0.0790%; 0.0198%)
* maq-pe (0.6.3-33, almost identical to 0.6.4)
- CMD: maq map maq_pe.map chrX.bfa r1.bfq r2.bfq
- CPU time: 1256.13 sec
- res memory: 674 MB
- # Q0;Q10;Q20 mapped reads: 1949177; 1756368; 1728480
- # Q0;Q10;Q20 wrongly mapped reads: 68542; 281; 153 (p_err: 3.516%; 0.0160%; 0.0088%)
- # reads mapped with indel: 2277
- # wrongly mapped indel reads: 0 (p_err = 0/2277 = 0%)
* soap-c0r0g0s (1.07, CPU time is measured with 1.05)
- CMD: ./soap -c 0 -r 0 -g 0 -a r12.fa -d chrX.fa -o soap_c0r0g0s.sop
- CPU time: 1228.120 sec (or 1417.29 user time/372.44 real time if '-p 4' is applied)
- res memory: 963 MB
- # mapped reads: 1,686,034
- # wrongly mapped reads: 7,840 (p_err: 0.4650%)
* soap-c0r0g3s (1.07, CPU time with 1.05)
- CMD: ./soap -c 0 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c0r0g3s.sop
- CPU time: 1398.95 sec
- res memory: 963 MB
- # mapped reads: 1,687,887
- # wrongly mapped reads: 7,854 (p_err: 0.4653%)
- # reads mapped with indel: 1853
- # wrongly mapped indel reads: 14 (p_err: 0.756%)
* soap-c42r0g3s (1.07, CPU time with 1.05)
- CMD: ./soap -c 42 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c42r0g3s.sop
- CPU time: 1517.41 sec
- res memory: 963 MB
- # mapped reads: 1,698,212
- # wrongly mapped reads: 8,131 (p_err: 0.4788%)
- # reads mapped with indel: 2207
- # wrongly mapped indel reads: 41 (p_err: 1.86%)
* shrimp (1.04)
- CMD: ./rmapper-ls -s 111111011111 -w 35 -n 3 -r 32 -o 20 -d -1 -h 2675 r12.fa chrX.fa
- CPU time: 11875.75 sec
- res memory: 3085MB
- # all;normodds>0.5 mapped reads: 1930350;1587003
- # all;normodds>0.5 wrongly mapped reads: 204355;2414 (p_err: 10.6%;0.152%)
- # all;normodds>0.5 reads mapped with indel: 2062;1817
- # all;normodds>0.5 wrongly mapped indel reads: 212;24 (p_err: 10.3%;1.38%)
Additional Notes
================
* eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.
* rmap: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq-se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.
* soap: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.
* shrimp: Actually I was not expecting that a program using seeding+Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP's normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to "nan" normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.
* maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq's huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq's random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.
Update
======
* 2008-03-13
- Use maq version 0.6.3-33, which is faster than the public version 0.6.3.
- Compile all the programs with Intel C/C++ compiler and run all the programs on the same 8-core 64bit machine (Xeon-L5320 1.86GHz) with 8GB memory. Multi-threading is NOT used. Previously I run some programs on a faster machine (Xeon-5150 2.66GHz) and some on a slower machine (1.86GHz), which is unfair.
- I invoked two programs at a time. The combination is: (maq_se,maq_pe), (eland,eland-multi), (rmap,soap_c0r0g0s), (soap_c0r0g3s,soap_c42r0g3s). Note that processes on the same machine may compete with the memory bandwidth and cache uses, which may affect the speed.
* 2008-03-14
- Test multithreading of SOAP.
* 2008-03-22
- Use soap_1.07. But PE mode did not work for me unfortunately.
- Add comments for the additional wrong alignments of soap (in comparison to eland).
- Add "fake" eland mapping quality (done by my paf_utils.pl script).
- Evaluate indel alignment. Note that I am using soap's single-end mode. PE mode would give better accuracy.
* 2008-03-27
- Evaluate SHRiMP-1.04
Comment