  • Many, many dumb newbie questions

    I'm just jumping into bioinformatics, especially the algorithms and analysis-tools end of the pool (my background is databases and compression). I'm now most interested in massive alignment (both to a reference and de novo). So I'm simultaneously reading forum posts here, studying algorithms texts like Gusfield, breezing through the Cartoon Guide to Genetics, but mostly reading the scientific papers for all the assembly tools. (Side note: the biology community has much better organization and accessibility of scientific papers! The CS community has a big ACM and IEEE bias, and both discourage online paper sharing since they want their per-paper download fees.)

    Anyway, here's a ton of questions from my notes, and I'd love for any simple answers or pointers. It's fine to just give some answers, or a pointer or link, or even quick four word answers which would be enough for me to branch my search off of. I appreciate any help!

    In no particular order, here's my shotgun blast of newbie questions. These are mostly about next-generation sequencing, and about mapping reads to a genome and/or doing a de novo assembly with them.
    • When a sequencer spits out its sequences, they must be huge.. are they just saved as big flat files of ASCII ATTCGTAGCA characters, or are they compressed (two bits per bp, with a header)? Does each sequencer (SOLiD, 454, etc.) have its own format? Or is SAM/BAM now standardized and most common?
      Are many runs put into one file? (Probably.. or you'd have 1B files of a few hundred bytes, ugh!)
    • If you've made a big sequencing run, how do you move the data around? Isn't it tens of gigabytes for a reasonably complex run? Everyone just uses FTP over their fiber net connections?
    • Do people tend to try to compress these big lists of sequences using standard tools like gzip or 7zip? Or is it not worth it?
    • Are the sequencers themselves driven by a standard PC.. maybe running Linux or something? Is it possible for users to add their own processing steps in the sequencer's workflow? Ie, do some analysis/compression when the data has been read in but before the data files are saved out? I realize this may have different answers for different machines.
    • How many next generation sequencers are out there? I mean absolute machine counts. Are there like 100 machines in the world, 1000, 10000, 100000?
    • In assembly/matching what are typical error rates for single bp reads? (not counting bad ends.) is it like 0.1%? 1%? Are there ways of changing the sequencer's behavior, maybe getting faster reads but with more error?
      This again is very likely different for the various machines.
    • When you do get single bp errors, does each sequencing strategy have its own error behavior? Maybe some error matrix that says for this machine, C is sometimes misread as T with a probability of X, C is sometimes misread as G with a probability of Y...
    • How common are gaps in sequence reads? Are the gaps totally random, like from two totally different parts of the DNA strand, or are they just small slips like somehow 10bp are just missing?
    • How often are there contaminations.. sequences which somehow don't even belong to the genome you're trying to measure? How do you detect these?
    • Is mixed source DNA ever deliberately sampled? Something like taking samples of gut bacteria and analyzing the mix of random sequences to estimate the diversity of the flora?
    • Can sequence sampling be guided at all, or is it truly a random sample from the whole genome? Can you try to just analyze one chromosome somehow from the very start? Or, if you have an assembly and you just want some more samples in one general area, can you try to boost the probability of samples occurring there in the next sequencer run?
    • Is there some classic standard genome, and maybe raw example sequence samples from each brand of sequencer, that people use to compare different software against? Something like BAliBASE, which tries to present a standardized problem for software to be judged against. It'd be interesting to see how different tools can either align to the standard or create a de novo assembly, given different data sources and error rates.
    • Is a de novo assembly always preferred over one that used an existing sequence as a framework? (I would think so..) Would you choose an align-to-reference analysis instead of de novo just because of speed? Or is it common to make runs with so few repeats that the de novo assembly can't really connect them, whereas you can still get some good science out of the align-to-reference?



    Yes, I know my questions are all over the map. I really appreciate your help getting me at least initially oriented.
    Last edited by GerryB; 05-06-2009, 12:48 AM.

  • #2
    Whee! A bunch of dumb newbie questions that I can answer from my dumb less-newbie standpoint. BTW: I am the bioinformatics person for a small sequencing facility at Purdue University. We have two 3730 (Sanger) sequencers, one 454 and one SOLiD.

    Originally posted by GerryB View Post
    When a sequencer spits out its sequences, they must be huge.. are they just saved as big flat files in ASCII ATTCGTAGCA characters, or are they compressed (two bits per bp, with a header)?

    Does each sequencer (Solid, 454, etc) have its own format?

    Or is SAM/BAM now standardized and most common?

    Are many runs put into one file (probably.. or you'd have 1B files of a few
    hundred bytes, ugh!)
    Yes, they are large, each in its own format, and multi-run per file. SAM/BAM is far from a standard. Sequencing file standards come and go. FASTA, as poor as it is, rules.

    Our most recent 454 Titanium run put out a total of 25 GB in 834 raw images in 'pif' format. After image processing this was reduced to 3.2 GB in 2 SFF files. The data in the SFF files is also available as ~1.5 GB of sequence (ACGT) and quality (just as important!) FASTA-style files. From those files assembly, mapping, etc. can be done.
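
    Since so much downstream work starts from those FASTA-style sequence and quality files, here is a minimal sketch (in Python) of what reading one looks like. The file name is hypothetical, and in practice you would reach for an established parser such as Biopython rather than this toy.

    Code:
    def read_fasta(path):
        """Read a FASTA file into {record name: sequence}, joining wrapped lines."""
        records, name, chunks = {}, None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if name is not None:
                        records[name] = "".join(chunks)
                    name, chunks = line[1:].strip(), []
                elif line:
                    chunks.append(line)
        if name is not None:
            records[name] = "".join(chunks)
        return records

    # Usage (hypothetical file name):
    # seqs = read_fasta("454_run.fna")
    # The matching .qual file holds space-separated integers and needs a slightly different join.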

    Our most recent SOLiD run (version 2 chemistry) produced 15 GB in 2 color-space FASTA-format files plus 30 GB in 2 quality files. I don't even bother saving the image files since they are so large. From the color-space files further color-space files are created. Because of the way color-space works it is rare that we convert them to base-space (ACGT) files.
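
    To make the color-space point concrete, here is a rough sketch of the 2-base encoding idea: each "color" records the transition between adjacent bases rather than a base itself, which is why a single color error shifts everything decoded downstream of it, and why the reads are best kept (and error-corrected) in color space. Treat this as an illustrative sketch of the published SOLiD scheme, not production code.

    Code:
    # Dibase -> color lookup used by the SOLiD 2-base encoding.
    COLOR = {
        "AA": "0", "CC": "0", "GG": "0", "TT": "0",
        "AC": "1", "CA": "1", "GT": "1", "TG": "1",
        "AG": "2", "GA": "2", "CT": "2", "TC": "2",
        "AT": "3", "TA": "3", "CG": "3", "GC": "3",
    }

    def to_color_space(seq):
        """Encode a base-space sequence as the colors between adjacent bases."""
        return "".join(COLOR[a + b] for a, b in zip(seq, seq[1:]))

    # Example: to_color_space("ATGGCA") -> "31031"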

    The SOLiD pipeline is very verbose and throws off a lot of temporary files. I can easily use 500 GB of disk space and then later get a project down to less than 100 GB of data to give to the customer.

    If you've made a big sequencing run, how do you move the data around? Isn't it tens of gigabytes for a reasonably complex run? Everyone just uses FTP over their fiber net connections?
    FTP is not secure and thus most of our networks have shut it down. We either transfer the files via ssh/scp or via a direct NFS mount (also not secure, but at least more controllable). Generally all on 1-Gbit twisted-pair connections.

    I have also sent out the data on a portable hard drive to one of our customers.

    Since processing has to be done on multiple nodes reading from a central file server, said server can bog down with all of the reads and writes. My sysadmin generally has a frown on his face when he sees me. :-(


    Do people tend to try to compress these big lists of sequences using standard tools like gzip or 7zip? Or is it not worth it?
    Sometimes gzip or another compression is used. Most of the time this is not worth it until the very end when handing off information to the customer.

    Are the sequencers themselves driven by a standard PC.. maybe running Linux or something? Is it possible for users to add their own processing steps in the sequencer's workflow? Ie, do some analysis/compression when the data has been read in but before the data files are saved out? I realize this may have different answers for different machines.
    Our 454 has a dual-core PC running Linux. It is capable of doing the image processing within a reasonable time frame; however, subsequent processing would take days. We just take the images and put them on a 16-core, large-memory machine to do the image processing and the rest of the analysis steps.

    Our SOLiD has 5 quad-core blades. One blade is a Windows front-end; the other 4 run a Linux cluster. The Linux cluster does image processing in more-or-less real time (one neat feature of the SOLiD is that it is possible to back up and redo a sequencing step if problems are detected). For subsequent analysis we put the color-space files on a 64-node cluster. Not that I get to use all 64 without lots of complaints from the other users! Our cluster happens to run Solaris but it could easily be Linux.

    For all of them it is possible to add additional processing steps, at least at some level.

    How many next generation sequencers are out there? I mean absolute machine counts. Are there like 100 machines in the world, 1000, 10000, 100000?
    As a swag, I'd say in the low thousands. Probably most are 454s, then Solexas, then SOLiDs (and yes, Polonators and other off-beat varieties).

    In assembly/matching what are typical error rates for single bp reads? (not counting bad ends.) is it like 0.1%? 1%? Are there ways of changing the sequencer's behavior, maybe getting faster reads but with more error?
    This again is very likely different for the various machines.
    I don't have a good percentage number. Probably 1%.

    The 3730s (Sangers) could be modified to give shorter or longer reads with more or less error on either end of the read. The quality scores help in stitching individual reads together in a reasonable way.

    For the 2nd-gen sequencers it often does not pay to try to decrease run time at the expense of error. A 454 run will take 8 hours at a cost of ~$10,000, and prep for the run will take longer, so there is not much need to reduce run time. A SOLiD run will take 1 week, or 2 weeks for paired runs. Reducing that by a day or two at the cost of increased errors does not seem reasonable.

    When you do get single bp errors, does each sequencing strategy have its own error behavior? Maybe some error matrix that says for this machine, C is sometimes misread as T with a probability of X, C is sometimes misread as G with a probability of Y...
    I am unaware of such a matrix. Certainly the different sequencers have different biases. 454 does poorly on homo-polymer runs. The SOLiD seems to have a higher raw error rate but due to the color-space processing catches and fixes most of the errors before further processing goes on. The Sanger sequencers will have clonal biases plus other running biases depending on the sequencer.

    But this is why quality values were invented. One should not evaluate a sequence from a sequencer without the attached quality information. Unless it is a self-correcting SOLiD run in color-space.

    How common are gaps in sequence reads? Are the gaps totally random, like from two totally different parts of the DNA strand, or are they just small slips like somehow 10bp are just missing?
    Gaps in the reads themselves should never occur. Or if they do then the quality of the read would be very low.

    Gaps in matching reads to reference sequences will occur often, unless you are sequencing and comparing the exact same individual. Indels and SNPs are some of the items that make individuals individual within the same species.



    How often are there contaminations.. sequences which somehow don't even belong to the genome you're trying to measure? How do you detect these?
    Welcome to the headache. Answer: often or at least it seems that way. In Sanger sequencing there can be cross-well contamination. There can be cross-bead and cross-well problems in all of the 2nd gen sequencers. Although, supposedly, the image processing will take care of these problems. Always there can be sample prep errors.

    Knowing what sequence you are working with helps. Running VecScreen or other BLAST-type programs on your sequence against the general databases helps as well.

    Actually there is probably not much contamination. But what there is can percolate into finished genomes if it is not caught. There is one such unpublished genome that my group is currently looking at in preparation for publication. Not our sequencing, but ... we have found small pieces of human DNA in parts of the genome. We told the people involved; they had screened for the typical bacterial contamination but not for other contamination. Oops! What happened? Did someone grab the wrong sample somewhere? Not wear gloves? And why didn't we catch it before now? The problem is not simple. Maybe this organism has, somehow, incorporated human genes? Cross-species gene sharing is not unknown.

    Is mixed source DNA ever deliberately sampled? Something like taking samples of gut bacteria and analyzing the mix of random sequences to estimate the diversity of the flora?
    Yes. See Venter's work as a famous example. (http://www.jcvi.org/)

    Can sequence sampling be guided at all, or is it truly a random sample from the whole genome? Can you try to just analyze one chromosome somehow from the very start? Or maybe you have an assembly and you just want some more samples in one general area, can you try to boost the probability of samples occuring there in the next sequencer run?
    Yes, yes, and yes. The details are not simple. But yes, non-directed whole genome sequencing is actually rare. It is only with 2nd gen sequencers that it becomes feasible to do whole genome sequencing and then throw away a large portion of your data in order to focus in on part of the genome.

    Ah. Recent paper from Purdue. I haven't read it aside from the abstract, but it talks about ignoring normalized cDNA libraries and just using the raw (whole) cDNA libraries via 454 sequencing in order to do 'rarefaction', or normalization via data tossing -- at least that is how I would put it (a toy sketch of the idea follows the abstract below).

    Background: Next-generation sequencing technologies have been applied most often to model organisms or species closely related to a model. However, these methods have the potential to be valuable in many wild organisms, including those of conservation concern. We used Roche 454 pyrosequencing to characterize gene expression in polyploid lake sturgeon (Acipenser fulvescens) gonads.

    Results: Titration runs on a Roche 454 GS-FLX produced more than 47,000 sequencing reads. These reads represented 20,741 unique sequences that passed quality control (mean length = 186 bp). These were assembled into 1,831 contigs (mean contig depth = 4.1 sequences). Over 4,000 sequencing reads (~19%) were assigned gene ontologies, mostly to protein, RNA, and ion binding. A total of 877 candidate SNPs were identified from > 50 different genes. We employed an analytical approach from theoretical ecology (rarefaction) to evaluate depth of sequencing coverage relative to gene discovery. We also considered the relative merits of normalized versus native cDNA libraries when using next-generation sequencing platforms. Not surprisingly, fewer genes from the normalized libraries were rRNA subunits. Rarefaction suggests that normalization has little influence on the efficiency of gene discovery, at least when working with thousands of reads from a single tissue type.

    Conclusion: Our data indicate that titration runs on 454 sequencers can characterize thousands of expressed sequence tags which can be used to identify SNPs, gene ontologies, and levels of gene expression in species of conservation concern. We anticipate that rarefaction will be useful in evaluations of gene discovery and that next-generation sequencing technologies hold great potential for the study of other non-model organisms.
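
    To illustrate the "rarefaction, or normalization via data tossing" idea mentioned above, here is a toy sketch that subsamples reads at increasing depths and counts how many distinct genes show up at each depth. The gene labels and numbers are made up purely for illustration and are not taken from the paper.

    Code:
    import random

    def rarefaction_curve(read_gene_labels, depths, trials=10, seed=1):
        """Average number of unique genes discovered at each sampling depth."""
        rng = random.Random(seed)
        curve = {}
        for depth in depths:
            hits = []
            for _ in range(trials):
                sample = rng.sample(read_gene_labels, min(depth, len(read_gene_labels)))
                hits.append(len(set(sample)))
            curve[depth] = sum(hits) / float(trials)
        return curve

    # Toy data: 10,000 reads drawn from 2,000 hypothetical genes. A flattening
    # curve means additional sequencing depth is discovering few new genes.
    reads = ["gene_%d" % random.randint(1, 2000) for _ in range(10000)]
    print(rarefaction_curve(reads, depths=[100, 1000, 5000, 10000]))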



    Is there some classic standard genome and maybe raw example sequence samples from each brand of sequencer that people use to compare different software against? Something like BAliBASE, which tries to just present a standardized problem for software to be judged against. It'd be interesting to see how different tools can either align to the standard, or create a denovo alignment, given different data sources and error rates.
    An on-going question. Certainly one could obtain raw image files and then compare the various programs. But ... one program may be optimized for bacterial genomes. Another for highly repetitive genomes. One program may be optimized for memory constraints while another may be optimized for CPU usage. Which one is better? I have a 192 GB memory machine so I'd rather see a quicker program at the expense of memory. And I don't work with bacteria. But that may not be applicable to you.



    Is a de novo assembly always preferred over one that used an existing sequence as a framework? (I would think so..) Would you choose an align-to-reference analysis instead of de novo just because of speed? Or is it common to make runs with so few repeats that the de novo assembly can't really connect them, whereas you can still get some good science out of the align-to-reference?
    SOLiD and Polonator reads currently have to be mapped to a reference; with 35-base or shorter reads it is impossible to realistically do de novo assembly. Solexa, with 75-base (or so) reads, could be used for de novo. 454, with 300+ bp reads, can be used for de novo.

    But even with Sanger sequencing (800+ bases) there are problems with assembly, especially at low coverage. So having a framework to hang one's reads off of helps out a lot. Of course you are then stuck with working with known information. So a mixed mapping and de-novo approach can be useful.

    Really it does depend on your project and what you wish to discover.


    Yes, I know my questions are all over the map. I really appreciate your help getting me at least initially oriented.
    Hope I helped at least a bit. Bioinformatics is a constantly evolving and complex field. Have fun with it! And don't forget that biology is messy. Computer type people (and I am one) hate this fact.



    • #3
      Hi guys,

      As I'm a newbie as well, this post helped me a lot to clarify things. Thanks for all the work!



      • #4
        Ok, I'll also give a few questions a try:

        Originally posted by GerryB View Post
        Or is SAM/BAM now standardized and most common?
        Not really, but we are getting there. In particular, SAMtools comes with a collection of Perl scripts to convert many of the other formats to SAM. Many downstream analysis tools now expect SAM, so these converters are quite useful.

        If you've made a big sequencing run, how do you move the data around? Isn't it tens of gigabytes for a reasonably complex run? Everyone just uses FTP over their fiber net connections?
        A few GB over a LAN is not such a big deal. If you want to send things around between institutes, it is more of a headache. One solution is to use a UDP-based file transfer protocol (FTP is unnecessarily slow because it is TCP-based and keeps waiting for confirmation before sending the next chunk), but people don't seem to be able to agree on a standard. A nice low-tech solution is to send hard disks by mail. Buy something like this, and you don't need to open your PC anymore to plug in a bare hard disk.

        Do people tend to try to compress these big lists of sequences using standard tools like gzip or 7zip? Or is it not worth it?
        Many tools run input files transparently through gzip. This is useful. SAMtools suggests a new standard for zipped files that allows random access, which is very useful. (The zip library has had this feature all along, but the gzip file format does not expose it.)
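
        As a concrete example of running input "transparently through gzip", here is a minimal sketch of opening a read file whether or not it is compressed. The file name is hypothetical, and the random-access zipped format mentioned above needs dedicated tooling (e.g. the SAMtools utilities) rather than plain gzip.

        Code:
        import gzip

        def open_maybe_gzipped(path):
            """Return a text-mode handle, decompressing on the fly if the file is gzipped."""
            if path.endswith(".gz"):
                return gzip.open(path, "rt")
            return open(path, "r")

        # Usage (hypothetical file name):
        # with open_maybe_gzipped("reads.fastq.gz") as fh:
        #     for line in fh:
        #         process(line)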

        In assembly/matching what are typical error rates for single bp reads? (not counting bad ends.) is it like 0.1%? 1%?
        The quality scores in the FASTQ files are meant to give you the base caller's best estimate for the error probability. The probability that a base with quality score Q is called wrongly is (if the base caller is calibrated well) 10^(-Q/10).
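
        To make that formula concrete, here is a small sketch converting quality scores to error probabilities. The ASCII offset is an assumption: Sanger-style FASTQ uses 33, while Solexa/Illumina files of this era used 64 (and, early on, a slightly different score definition), so check which variant your files use.

        Code:
        def phred_to_error_prob(q):
            """P(base call is wrong) = 10^(-Q/10)."""
            return 10 ** (-q / 10.0)

        def decode_quality_string(qual, offset=33):
            """Turn a FASTQ quality string into integer scores (offset is an assumption)."""
            return [ord(c) - offset for c in qual]

        # Q10 -> 0.1, Q20 -> 0.01, Q30 -> 0.001
        for q in (10, 20, 30):
            print(q, phred_to_error_prob(q))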

        When you do get single bp errors, does each sequencing strategy have its own error behavior? Maybe some error matrix that says for this machine, C is sometimes misread as T with a probability of X, C is sometimes misread as G with a probability of Y...
        For Solexa, read up on the "cross-talk matrix": The Solexa uses two lasers and four colour filters to scan the flow cell. Two bases share one laser wavelength and are hence more easily mistaken for each other. (They might have changed this recently, though.)
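
        Here is what such a per-machine "error matrix" could look like in code. The probabilities below are invented purely for illustration (real cross-talk values depend on the instrument, chemistry, and base-caller calibration), but they capture the idea that bases sharing a laser channel get confused more often.

        Code:
        # Hypothetical P(called base | true base); each row sums to 1. Numbers are made up.
        CONFUSION = {
            "A": {"A": 0.995, "C": 0.004, "G": 0.0005, "T": 0.0005},
            "C": {"A": 0.004, "C": 0.995, "G": 0.0005, "T": 0.0005},
            "G": {"A": 0.0005, "C": 0.0005, "G": 0.995, "T": 0.004},
            "T": {"A": 0.0005, "C": 0.0005, "G": 0.004, "T": 0.995},
        }

        def miscall_probability(true_base, called_base):
            return CONFUSION[true_base][called_base]

        # e.g. miscall_probability("A", "C") -> 0.004 under this toy matrix,
        # versus 0.0005 for the cross-channel confusions.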

        How common are gaps in sequence reads? Are the gaps totally random, like from two totally different parts of the DNA strand, or are they just small slips like somehow 10bp are just missing?
        Gaps are unlikely. You have very many copies of the DNA molecule in a cluster, and if some of them miss a few base incorporations and fall behind, the base caller will simply fail to get a clear signal.

        In library preparation, it seems unlikely as well that two fragments get joined.

        However, it is well possible that a large chunk of sequence is really missing in your cells that is present in your reference genome, or something is present in your cells and missing in the reference. These so-called "structural variations" are currently a hot topic of research.

        Is mixed source DNA ever deliberately sampled? Something like taking samples of gut bacteria and analyzing the mix of random sequences to estimate the diversity of the flora?
        Sure. This is called "metagenomics" and is very trendy.

        Can sequence sampling be guided at all, or is it truly a random sample from the whole genome?
        Usually, people take everything. In RNA-Seq, however, considerable effort is required to get rid of ribosomal RNA. For an example of targeted sequencing, Google for "exon capture".

        Can you try to just analyze one chromosome somehow from the very start?
        That's what people often did in the old days, before high-throughput sequencing. Google for "chromosome walking".

        Cheers
        Simon

