  • Where to start

    Hi All.

    I'm a mathematician, hoping to do a PhD on the data analysis (statistics) of NGS data at the University of Ghent (Roche / Illumina).
    Unfortunately, the topic has not been specified yet, so it is not clear to me what kind of data I will be presented with (ChIP, de novo, ...).

    This also means that, to this day, I have no data to work on, nor a clear view of what will be expected.
    I've simply been reading up on NGS and statistics (finding strangely few articles linking them). On top of that, I am quite new to biotechnology, so it is not easy to find a focus.

    So here's my question: I would like to prepare myself somewhat for when the 'real' questions come (I expect these within the next few months), so I'd like to try out some data analysis. Do any of you have pointers on:
    * which type of analysis would be a good starter?
    * where could I find sample data (ideally with a matching article on how somebody else analysed it)?
    * what are the statistical challenges brought on by NGS (as opposed to classical sequencing), apart from sheer volume?
    * which 'general' statistical subjects would be a good read (books/subjects welcome), e.g.: would bootstrap do me any good (and why)?

    Thanks in advance for any suggestions!

  • #2
    research topics for nextgen sequencing

    The field of RNA-seq data analysis is somewhat young... I think many people intend to use some of the statistics developed for microarrays to detect differential expression. Some data were presented in recent Nature Methods and Genome Research papers, and I believe some are posted at the Short Read Archive. Furthermore, ABI is pretty open about sharing data from their research labs.

    If you are open to combinatorial problems and not just statistics, there are some problems related to de novo fragment assembly that we could discuss.

    cheers,
    -mark



    • #3
      Hello Mark,

      Thanks for the input. For me, ABI is not an option (UGent has only recently acquired the FLX and the GA). Some of my colleagues who have a history in microarrays are indeed hoping to extend their findings to NGS, but I prefer not to swim in the same lanes.

      I know by now (I am 'only the statistician') that the first runs on our FLX have been amplicon sequencing runs, and some de novo runs (bacteria) are soon to come, so for now I'm trying to get a picture of how these are processed nowadays. I'm hoping that once I understand the current mechanisms, I will at least be able to find some articles describing their statistics (e.g.: I have not been able to find a single article giving a proper explanation for the 'habit' of having a coverage of 20, though everybody seems to agree that this works).

      But yes, I am also interested in combinatorics, and always open to suggestions.

      Nick.



      • #4
        NextGen Coverage vs. Lander Waterman Statistics

        The fact that people use at least 20X coverage points to some of the difficulties in getting accurate statistics for sequencing. Say the FLX sequencer were only producing 100-base reads, and coverage is 20X. In de novo assembly, most (all) assemblers have a minimum overlap length (either explicitly stated or as a k-value in a de Bruijn graph). With, say, k=25, a quarter of each read is needed just to detect an overlap, so the effective coverage is 25% lower. Still, at 15X effective coverage, the Lander-Waterman statistics predict a single contig, yet when mapping reads back to the genome there are still usually gaps. This is worse with Illumina GAI sequencers, where I have found that it takes 80X coverage with 35-base reads to finally begin to overcome sample bias and get rid of gaps in assembly.
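
        The arithmetic above can be checked with a small sketch. It uses the standard Lander-Waterman approximation for the expected number of contigs, N * exp(-c * sigma), with c = N*L/G and sigma = 1 - T/L (T being the minimum overlap). The 5 Mb genome size is an assumption chosen to look like a bacterium; it is not a figure from this thread.

```python
import math

def effective_coverage(coverage, read_len, min_overlap):
    """Coverage discounted by the fraction of each read needed to detect an overlap."""
    return coverage * (1.0 - min_overlap / read_len)

def expected_contigs(genome_len, coverage, read_len, min_overlap):
    """Lander-Waterman expected number of contigs: N * exp(-c * sigma)."""
    n_reads = coverage * genome_len / read_len
    return n_reads * math.exp(-effective_coverage(coverage, read_len, min_overlap))

# 100-base reads, 20X coverage, k = 25, as in the example above
print(effective_coverage(20, 100, 25))           # 15.0
# Assumed 5 Mb genome: the expectation falls well below one contig boundary
print(expected_contigs(5_000_000, 20, 100, 25))
```

        Under uniform sampling the model therefore predicts essentially no gaps, which is why the gaps seen in practice point to sampling bias rather than to a lack of raw coverage.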

        I'm not saying this is an open field for research -- rather something to steer clear of. 20X coverage seems to compensate for amplification bias in 454 sequencing, which is difficult to model. In Illumina sequencing projects, this will probably be overcome by adding scaffolding methods to assemblers. I imagine the latest release of Velvet has this, given some of the results I've seen, and I'm working on making this a standard part of euler.

        A more reasonable avenue for statistical development, at least in de novo assembly, is regarding repeat coverage. All short read assemblers resolve repeats by using ends of mate-pairs that span the repeat.

        So, if a genome has:

        ABCDpqrEFGHIJKpqrLMNpqrSTUV

        and reads are 3 characters long, then the
        mate-pairs BCD---EFG, IJK---LMN, and LMN---STU are required to determine whether the genome is
        ABCDpqrEFGHIJKpqrLMNpqrSTUV
        versus
        ABCDpqrLMNpqrEFGHIJKpqrSTUV
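
        As a quick illustration of the toy example, the sketch below checks each mate-pair against both candidate arrangements. The helper `supports` and the fixed gap of 3 characters (the length of the repeat pqr) are assumptions made for this illustration; real mate-pairs have a distribution of insert sizes rather than an exact gap.

```python
def supports(genome, left, right, gap):
    """True if `left` occurs in `genome` followed by `right` exactly `gap` characters later."""
    i = genome.find(left)
    while i != -1:
        j = i + len(left) + gap
        if genome[j:j + len(right)] == right:
            return True
        i = genome.find(left, i + 1)
    return False

GENOME_A = "ABCDpqrEFGHIJKpqrLMNpqrSTUV"
GENOME_B = "ABCDpqrLMNpqrEFGHIJKpqrSTUV"
MATE_PAIRS = [("BCD", "EFG"), ("IJK", "LMN"), ("LMN", "STU")]
GAP = 3  # assumed: the two reads are separated by exactly one repeat copy

for left, right in MATE_PAIRS:
    print(left, right,
          supports(GENOME_A, left, right, GAP),
          supports(GENOME_B, left, right, GAP))
```

        Each of the three pairs is consistent with the first arrangement and with none of the second, so together they pin down the order of the unique segments around the repeat.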

        So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
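
        One rough way to get a feel for this question is sketched below. It is a simplified model, not a solution: clone start positions are assumed uniform over the genome, the repeat copies are assumed to sit far apart with unique flanking sequence, a clone "spans" a copy only if both of its end reads lie entirely outside that copy, and the copies are treated as independent. The function names, the even spacing of the copies, and the parameter `n_pairs` (N/2 if every read belongs to a pair) are all hypothetical choices for illustration.

```python
import random

def p_all_spanned(G, r, m, L, l, n_pairs):
    """Approximate P(each of m repeat copies is spanned by at least one mate-pair)."""
    window = L - 2 * l - r + 1        # clone start positions that bridge one copy
    if window <= 0:
        return 0.0                    # clone too short to bridge the repeat at all
    p_hit = window / (G - L + 1)      # a single clone bridges one given copy
    p_copy = 1.0 - (1.0 - p_hit) ** n_pairs
    return p_copy ** m                # independence across copies is an approximation

def simulate(G, r, m, L, l, n_pairs, trials=2000, seed=0):
    """Monte Carlo estimate of the same probability, with evenly spaced repeat copies."""
    rng = random.Random(seed)
    copies = [(k + 1) * G // (m + 1) for k in range(m)]  # assumed start of each copy
    successes = 0
    for _ in range(trials):
        clone_starts = [rng.randint(0, G - L) for _ in range(n_pairs)]
        # a clone starting at s bridges the copy at p if its left read ends
        # before p and its right read begins at or after p + r
        if all(any(p + r - L + l <= s <= p - l for s in clone_starts) for p in copies):
            successes += 1
    return successes / trials

params = dict(G=10_000, r=100, m=3, L=500, l=50, n_pairs=50)
print(p_all_spanned(**params))
print(simulate(**params))
```

        The closed form and the simulation should agree closely for well-separated copies. The harder parts of the question (non-uniform sampling, variable clone lengths, repeats in the flanking sequence) are exactly what this sketch leaves out.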



        • #5
          as of today

          Originally posted by mchaisso
          So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
          Has this been solved as of today?
