  • nullabee
    Junior Member
    • Feb 2009
    • 3

    Where to start

    Hi All.

    I'm a mathematician hoping to do a PhD on the statistical analysis of NGS data at the University of Ghent (Roche / Illumina).
    Unfortunately, the topic has not been specified yet, so it is not clear to me what kind of data I will be presented with (ChIP, de novo, ...).

    This also means that, to this day, I have no data to work on and no clear view of what will be expected.
    I've simply been reading up on NGS and statistics (finding strangely few articles linking them). On top of that, I am quite new to biotechnology, so it is not easy to find a focus.

    So here's my question: I would like to prepare myself somewhat for when the 'real' questions come (I expect these in the range of the next few months), so I'd like to emulate some data-analysis. Do any of you have pointers on:
    * which type of analysis would be a good starter?
    * where could I find sample data (ideally with a matching article on how somebody else analysed it)?
    * what are the statistical challenges brought on by NGS (as opposed to classical sequencing), apart from sheer volume?
    * which 'general' statistical subjects would be a good read (books/subjects welcome), e.g.: would bootstrap do me any good (and why)?

    Thanks in advance for any suggestions!
  • mchaisso
    Member
    • Apr 2008
    • 84

    #2
    research topics for nextgen sequencing

    The field of RNA-seq data analysis is still fairly young... I think many people intend to use some of the statistics developed for microarrays to detect differential expression. Some data were presented in recent Nature Methods and Genome Research papers, and I believe some are posted at the Short Read Archive. Furthermore, ABI is pretty open about sharing data from their research labs.

    If you are open to combinatorial problems and not just statistics, there are some problems related to de novo fragment assembly that we could discuss.

    cheers,
    -mark

    • nullabee
      Junior Member
      • Feb 2009
      • 3

      #3
      Hello Mark,

      Thanks for the input. For me, ABI is not an option (UGent has only recently acquired the FLX and the GA). Some of my colleagues who have a history in microarrays are indeed hoping to extend their findings to NGS, but I prefer not to swim in the same lanes.

      I know by now (I am 'only the statistician') that the first runs on our FLX have been amplicon sequencing runs, and some de novo runs (bacteria) are soon to come, so for now I'm trying to get a picture of how these are processed nowadays. I'm hoping that once I understand the current mechanisms, I will at least be able to find some articles describing their statistics (e.g. I have not been able to find a single article giving a proper explanation for the 'habit' of using a coverage of 20, though everybody seems to agree that this works).

      But yes, I am also interested in combinatorics, and always open to suggestions.

      Nick.

      • mchaisso
        Member
        • Apr 2008
        • 84

        #4
        NextGen Coverage vs. Lander Waterman Statistics

        The fact that people use at least 20X coverage points to some of the difficulties in getting accurate statistics for sequencing. Say the FLX sequencer were producing only 100-base reads, and coverage is 20X. In de novo assembly, most (all) assemblers have a minimum overlap length (either stated explicitly or as the k-value in a de Bruijn graph); with, say, k=25, the effective coverage is 25% lower. Still, at 15X coverage, Lander-Waterman statistics predict that there will be one contig, yet when reads are mapped back to the genome there are usually still gaps. This is worse with Illumina GAI sequencers, where I have found that 80X coverage with 35-base reads finally begins to overcome sample bias and get rid of gaps in assembly.
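
To make the arithmetic above concrete, here is a minimal sketch of the Lander-Waterman expected contig count, N * exp(-c * sigma), where c is the coverage and sigma = 1 - (minimum overlap / read length). The 5 Mb genome size is a hypothetical bacterial genome chosen only for illustration, and the function name is not from any assembler.

```python
import math


def expected_contigs(genome_size, read_length, coverage, min_overlap):
    """Lander-Waterman expected number of contigs (islands).

    Assumes reads are placed uniformly at random, which is exactly the
    assumption that amplification and sample bias violate in practice.
    """
    n_reads = coverage * genome_size / read_length
    # Fraction of a read that is NOT needed to detect an overlap.
    sigma = 1.0 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)


if __name__ == "__main__":
    # Hypothetical 5 Mb genome, 100-base reads, 20X coverage, k = 25,
    # i.e. an effective coverage of 20 * 0.75 = 15X.
    print(expected_contigs(5_000_000, 100, 20, 25))
    # Same hypothetical genome with 35-base reads at 80X and k = 25.
    print(expected_contigs(5_000_000, 35, 80, 25))
```

Under the uniform model both settings give an expected count well below one, so the theory predicts no gaps at all; the gaps seen in real assemblies come from effects the model leaves out.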

        I'm not saying this is an open field for research -- rather something to steer clear of. 20X coverage seems to compensate for amplification bias in 454 sequencing, which is difficult to model. In Illumina sequencing projects, this will probably be overcome by adding scaffolding methods to assemblers. I imagine the latest release of Velvet has this, given some of the results I've seen, and I'm working on making this a standard part of euler.

        A more reasonable avenue for statistical development, at least in de novo assembly, is regarding repeat coverage. All short read assemblers resolve repeats by using ends of mate-pairs that span the repeat.

        So, if a genome has:

        ABCDpqrEFGHIJKpqrLMNpqrSTUV

        and reads are 3 characters long, then the
        mate-pairs BCD---EFG, IJK---LMN and LMN---STU are required to resolve whether the genome is
        ABCDpqrEFGHIJKpqrLMNpqrSTUV
        versus
        ABCDpqrLMNpqrEFGHIJKpqrSTUV
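
The toy example can be checked mechanically. The sketch below tests whether each mate-pair fits a candidate genome, assuming a fixed gap of 3 characters between the two reads (the length of the pqr repeat) and assuming each left read occurs only once in the candidate. The function name is illustrative only.

```python
def consistent(genome, mate_pairs, gap=3):
    """Return True if every (left, right) mate-pair fits the genome.

    A pair fits when the right read starts exactly `gap` characters
    after the end of the left read. Left reads are assumed unique.
    """
    for left, right in mate_pairs:
        start = genome.find(left)
        if start < 0:
            return False
        right_start = start + len(left) + gap
        if genome[right_start:right_start + len(right)] != right:
            return False
    return True


pairs = [("BCD", "EFG"), ("IJK", "LMN"), ("LMN", "STU")]
candidate_1 = "ABCDpqrEFGHIJKpqrLMNpqrSTUV"
candidate_2 = "ABCDpqrLMNpqrEFGHIJKpqrSTUV"

if __name__ == "__main__":
    print(consistent(candidate_1, pairs))
    print(consistent(candidate_2, pairs))
```

Only the first arrangement satisfies all three pairs; in the second, BCD is followed across the repeat by LMN rather than EFG.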

        So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
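
As a starting point only, here is a rough first-order sketch of that probability. It assumes mate-pairs (N / 2 of them) are placed uniformly at random, that a pair spans a repeat copy when one read lies entirely before it and the other entirely after it, and that the m copies are spanned independently. None of these assumptions come from an assembler, and real data violate the uniformity in particular.

```python
def prob_all_repeats_spanned(genome_size, repeat_len, multiplicity,
                             clone_len, read_len, n_reads):
    """Approximate P(every repeat copy is spanned by some mate-pair).

    Symbols follow the question: G, r, m, L, l, N.
    """
    n_pairs = n_reads // 2
    # Clone start positions that put one read on each side of a copy.
    window = clone_len - repeat_len - 2 * read_len + 1
    n_starts = genome_size - clone_len + 1
    if window <= 0 or n_starts <= 0:
        return 0.0  # clone too short to bridge the repeat
    q = min(1.0, window / n_starts)  # one pair spans one given copy
    p_copy = 1.0 - (1.0 - q) ** n_pairs
    # Independence between copies is an approximation.
    return p_copy ** multiplicity


if __name__ == "__main__":
    # Hypothetical numbers: 5 Mb genome, 1 kb repeat in 3 copies,
    # 3 kb clones, 35-base reads.
    for n in (2_000, 20_000, 200_000):
        print(n, prob_all_repeats_spanned(5_000_000, 1000, 3, 3000, 35, n))
```

A fuller treatment would also require the flanking reads to be uniquely placeable and would model clone-length variation, which this sketch ignores.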

        • mandova
          Member
          • Mar 2010
          • 19

          #5
          as of today

          Originally posted by mchaisso
          So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
          Has this been solved as of today?
