Hi everyone,
I'm starting a project to compare the various NGS alignment tools on the market for speed, accuracy, and CPU consumption. Before I head down this path: I've found a few enlightening papers on the subject, but wanted some input from folks with a lot more experience in the field.
I am thinking of taking a two-pronged approach: a layered analysis with increasing levels of complexity. The first half would be simulated data based on the chromosome 16 hg19 reference sequence. I would stick to one coverage level but add sequencing errors, repeats, mismatches (1-3 nt), and SNP introductions.
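To make the simulated half concrete, here is a rough Python sketch of the mismatch part only: it samples reads from a reference string and substitutes 0-3 bases per read, keeping the true start position so an aligner can be scored afterwards. The function name, parameters, read length, and the toy random reference are all my own assumptions for illustration; a dedicated simulator would also model quality scores, indels, and repeats.

```python
import random

def simulate_reads(reference, n_reads, read_len, max_mismatches=3, seed=0):
    """Sample reads from `reference` and add 0..max_mismatches substitutions.

    Returns a list of (read_sequence, true_start, n_mismatches) tuples so the
    true origin of every read is known when scoring an aligner later.
    (Hypothetical helper, not taken from any existing simulator.)
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        n_mm = rng.randint(0, max_mismatches)
        # Pick distinct positions so each one contributes exactly one mismatch
        for pos in rng.sample(range(read_len), n_mm):
            # Substitute with a base that differs from the original
            bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
        reads.append(("".join(bases), start, n_mm))
    return reads

# Toy random reference standing in for chromosome 16 (assumption: a real run
# would load the hg19 chr16 FASTA instead)
toy_rng = random.Random(42)
toy_reference = "".join(toy_rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(toy_reference, n_reads=5, read_len=36)
```

Keeping the true start and mismatch count alongside each read is what makes the later accuracy metrics cheap to compute.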
I would then follow this with real, published data from ENCODE, probably also using only one aligned chromosome.
Tools:
Simulation: (please suggest ones that can add repeats/multimaps)
- GemSIM
- samtools
Alignment tools: (if I'm missing some please let me know)
- BWA
- Bowtie
- Novoalign
- SOAP
- ELAND
- RMAP
- SHRiMP
- MAQ
Potential metrics: (please add ones that you use or think would be useful)
- Really mapped genes (easier for simulated data): the core set of genes that all alignment tools map as true
- Negative mapped genes (unmapped): the total number of genes that were not mapped by any tool
- Comparisons between tools (mapped/unmapped)
- False Positives
- False Negatives
- MultiMap scoring
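
For the simulated data, the metrics above could be computed roughly like this. This is only a sketch under my own assumptions: it works per read rather than per gene, a read mapped to the wrong position counts as a false positive, an unmapped read counts as a false negative, and the tool names and positions are made up.

```python
def score_aligner(truth, calls, tolerance=0):
    """Count true positives, false positives, and false negatives for one tool.

    truth: dict of read_id -> true start position (known from the simulation)
    calls: dict of read_id -> reported start position (absent means unmapped)
    """
    tp = fp = fn = 0
    for read_id, true_pos in truth.items():
        if read_id not in calls:
            fn += 1          # read was not mapped at all
        elif abs(calls[read_id] - true_pos) <= tolerance:
            tp += 1          # mapped to the true origin
        else:
            fp += 1          # mapped, but to the wrong place
    return {"true_positives": tp, "false_positives": fp, "false_negatives": fn}

def core_and_unmapped(truth, calls_by_tool, tolerance=0):
    """Return (reads mapped correctly by every tool, reads mapped by no tool)."""
    correct_sets = [
        {r for r, pos in calls.items()
         if r in truth and abs(pos - truth[r]) <= tolerance}
        for calls in calls_by_tool.values()
    ]
    core = set.intersection(*correct_sets) if correct_sets else set()
    mapped_by_any = set().union(*(calls.keys() for calls in calls_by_tool.values()))
    return core, set(truth) - mapped_by_any

# Made-up example: four simulated reads and two hypothetical tools
truth = {"r1": 100, "r2": 250, "r3": 400, "r4": 900}
calls_by_tool = {
    "tool_a": {"r1": 100, "r2": 250, "r3": 410},
    "tool_b": {"r1": 100, "r2": 300, "r3": 400},
}
scores = {name: score_aligner(truth, calls) for name, calls in calls_by_tool.items()}
core, unmapped = core_and_unmapped(truth, calls_by_tool)
```

In this toy example only r1 lands in the core set and only r4 is unmapped by every tool. The `tolerance` argument is there because some aligners report soft-clipped or shifted start coordinates, so an exact match may be too strict.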
I plan on doing some QC on the simulated reads prior to running the aligners, as a confirmation.
Finally, the tools will be available publicly and I may create some Galaxy workflows. I'd like this analysis to be useful to less technical people and to create a better decision tree for scientists to decide which tools are best for their applications.
Thanks in advance, and I hope others can help me fill in some of the gaps with their experience. Are there any simulation tools you use that would be good here? What alignment tools have I missed that you use? Are there holes in my metrics or theory?
Thanks,