I'm just jumping into bioinformatics, especially the algorithms and analysis tools end of the pool (my background is databases and compression). I'm now most interested in massive alignment (both to a reference and de novo). So I'm simultaneously reading forum posts here, studying algorithms texts like Gusfield, breezing through the Cartoon Guide to Genetics, but mostly reading the scientific papers for all the assembly tools. (Side note: the biology community has much better organization and accessibility of scientific papers! The CS community has a big ACM and IEEE bias, and both discourage online paper sharing since they want their per-paper download fees.)
Anyway, here's a ton of questions from my notes, and I'd love any simple answers or pointers. It's fine to just give some answers, or a pointer or link, or even quick four-word answers, which would be enough for me to branch my search off of. I appreciate any help!
In no particular order, here's my shotgun blast of newbie questions. These are mostly about next-generation sequencing reads, and mapping them to a genome and/or doing a de novo assembly with them.
- When a sequencer spits out its sequences, they must be huge. Are they just saved as big flat files of ASCII ATTCGTAGCA characters, or are they compressed (two bits per bp, with a header)? Does each sequencer (SOLiD, 454, etc.) have its own format? Or is SAM/BAM now standardized and most common? Are many reads put into one file? (Probably, or you'd have 1B files of a few hundred bytes, ugh!)
- If you've made a big sequencing run, how do you move the data around? Isn't it tens of gigabytes for a reasonably complex run? Does everyone just use FTP over their fiber net connections?
- Do people tend to try to compress these big lists of sequences using standard tools like gzip or 7zip? Or is it not worth it?
- Are the sequencers themselves driven by a standard PC, maybe running Linux or something? Is it possible for users to add their own processing steps in the sequencer's workflow? I.e., do some analysis/compression when the data has been read in but before the data files are saved out? I realize this may have different answers for different machines.
- How many next generation sequencers are out there? I mean absolute machine counts. Are there like 100 machines in the world, 1000, 10000, 100000?
- In assembly/matching, what are typical single-bp error rates in reads (not counting bad ends)? Is it like 0.1%? 1%? Are there ways of changing the sequencer's behavior, maybe getting faster reads but with more error? This again is very likely different for the various machines.
- When you do get single-bp errors, does each sequencing strategy have its own error behavior? Maybe some error matrix that says, for this machine, C is sometimes misread as T with probability X, C is sometimes misread as G with probability Y...
- How common are gaps in sequence reads? Are the gaps totally random, like from two totally different parts of the DNA strand, or are they just small slips like somehow 10bp are just missing?
- How often is there contamination, meaning sequences which somehow don't even belong to the genome you're trying to measure? How do you detect these?
- Is mixed source DNA ever deliberately sampled? Something like taking samples of gut bacteria and analyzing the mix of random sequences to estimate the diversity of the flora?
- Can sequence sampling be guided at all, or is it truly a random sample from the whole genome? Can you try to just analyze one chromosome somehow from the very start? Or maybe you have an assembly and you just want some more samples in one general area; can you try to boost the probability of samples occurring there in the next sequencer run?
- Is there some classic standard genome, and maybe raw example sequence samples from each brand of sequencer, that people use to compare different software against? Something like BAliBASE, which tries to just present a standardized problem for software to be judged against. It'd be interesting to see how different tools can either align to the standard, or create a de novo assembly, given different data sources and error rates.
- Is a de novo assembly always preferred over one that used an existing sequence as a framework? (I would think so.) Would you choose an align-to-reference analysis instead of de novo just because of speed? Or is it common to make runs with so few repeats that the de novo assembly can't really connect them, whereas you can still get some good science out of the align-to-reference?
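
To make the "two bits per bp, with a header" idea from my first question concrete, here is a minimal sketch of the kind of packing I have in mind. The header layout (a 4-byte big-endian base count) and the names `pack` and `unpack` are made up for illustration and are not any real sequencer's format. It also assumes reads contain only A, C, G and T, so it ignores ambiguous bases and any per-base quality values a real file would carry.

```python
import struct

# Hypothetical 2-bit codes; any fixed assignment of the four bases would work.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an ACGT-only string as a 4-byte length header plus 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))  # made-up header: base count
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so decoding can always read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes) -> str:
    """Invert pack(): read the base count, then decode 2 bits at a time."""
    (n,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])  # drop the padding in the last byte


if __name__ == "__main__":
    read = "ATTCGTAGCA"
    packed = pack(read)
    print(len(read), "ASCII bytes ->", len(packed), "packed bytes (incl. header)")
    print(unpack(packed) == read)
```

For a single short read the header dominates, so this only pays off once many reads share one file, which is partly why I'm asking how the files are organized in practice.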
Yes, I know my questions are all over the map. I really appreciate your help getting me at least initially oriented.