

Does length matter?




  • Does length matter?

    Dear forum members,

    Thanks for this wonderful site and your ever-so-useful contributions.

    I have a question about the mapping of short reads in the context of a metagenomic study:
    What impact does sequencing read length (35 bp vs 50 bp, for instance) have on read mapping?

    I'm thinking of these few aspects, but the list is far from exhaustive:
    - mapping input: number of sequences generated
    - mapping process: quality/speed of alignments
    - mapping output: false positive hits, false negative hits...

    Theoretical answers and experiences you have had are very welcome.


  • #2
    Dependence on read length, based on theoretical predictions:

    Number of sequences generated: no difference, unless you keep your image files, do a post-run identification of clusters, and have very unbalanced sequences

    Run speed: more bases will take longer to sequence. The actual extra time taken depends on how sequencing is done, how reading is done, and how many total reads there are

    Run quality: because sequencing is a stochastic process, bases at the end of reads will necessarily suffer more from phasing error, where a base is added that wasn't expected, or a base isn't added when it should be. This happens to also be one of the reasons why it's difficult to sequence proteins via Edman degradation beyond about 30 amino acids.

    False positive hits: longer reads should reduce the incidence of false positive hits, particularly for end-to-end matching, because there's a bigger target to match. There's also the advantage of being able to span larger repetitive sequences.

    False negative hits: with the same per-base error rate, false negatives may increase, because there's a higher chance of at least one sequencing error occurring somewhere within a longer read. However, this is unlikely to be the case for a mapping tool that understands quality scores and puts less emphasis on excluding reads based on bad matches towards the end of the sequence.
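    The error effect above is easy to put numbers on. A minimal sketch, assuming an idealised uniform per-base error rate (real error rates rise towards the 3' end of a read):

    ```python
    # Probability that a read of length L contains at least one sequencing
    # error, assuming a uniform per-base error rate p. This is an idealised
    # model; the function name and p=0.01 default are illustrative choices.
    def p_any_error(read_len, p=0.01):
        return 1 - (1 - p) ** read_len

    print(round(p_any_error(35), 3))  # ~0.297
    print(round(p_any_error(50), 3))  # ~0.395
    ```

    So at a 1% per-base error rate, moving from 35 bp to 50 bp raises the chance that a read carries at least one error from roughly 30% to roughly 40%, which is why quality-aware mappers matter more for longer reads.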

    And now for some extras...

    Coverage: longer reads will produce more bases in total, so the coverage will increase
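    The coverage point follows directly from the standard formula C = N x L / G. A quick sketch with made-up numbers (read counts and genome size here are illustrative, not from any real run):

    ```python
    # Expected coverage C = N * L / G for N reads of length L over a
    # genome (or metagenome) of total size G.
    def coverage(n_reads, read_len, genome_size):
        return n_reads * read_len / genome_size

    # Same number of reads, longer reads -> proportionally higher coverage:
    print(coverage(20_000_000, 35, 100_000_000))  # 7.0
    print(coverage(20_000_000, 50, 100_000_000))  # 10.0
    ```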

    Proportion of mapped reads: longer reads could reduce the number of mapped reads (e.g. due to a reduction of false positive hits, or an increase of false negative hits), or increase the number (e.g. due to spanning a sequence that is too repetitive to be matched for shorter sequences)

    De-novo assembly: longer reads will give you a wider range of kmer sizes to try for assembly methods, and increase the number of kmers within each read, both of which should improve the quality of the assembly
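    The kmer arithmetic behind that last point: a read of length L contains L - k + 1 kmers of size k, so longer reads both widen the usable k range and yield more kmers per read. A small sketch (the function name is mine):

    ```python
    # Number of k-mers of size k contained in a single read of length L.
    # A read shorter than k contributes no k-mers at all.
    def kmers_per_read(read_len, k):
        return max(read_len - k + 1, 0)

    # 35 bp vs 50 bp reads, for two commonly tried assembly k values:
    for read_len in (35, 50):
        print(read_len, [kmers_per_read(read_len, k) for k in (21, 31)])
    # 35 [15, 5]
    # 50 [30, 20]
    ```

    Note that at k=31 a 35 bp read yields only 5 kmers, while a 50 bp read yields 20, which is a large difference for the assembly graph.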


    • #3
      Many thanks for your answer, gringer.
      I agree with your point: the increase in sequencing error rate with each additional base affects false negatives, false positives, and true positives simultaneously, and is therefore central to determining an 'ideal' read length.

      Does anyone have such experience? Would you favor longer short reads?


      • #4
        It's not the length of your read but how you align it. Or at least that's what guys with short reads tell their girlfriends.


        • #5
          It's not the length of your read but how you align it.
          Once reads get beyond 25-30bp, a perfect read that is random enough should map uniquely to a target genome/transcriptome (assuming no duplication). The extra bases mean that you can get away with mapping less random sequences, and compensate for more read/sequencing error.
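    The uniqueness claim above can be sanity-checked with a back-of-the-envelope model. Assuming (idealised) uniformly random bases, the expected number of purely chance occurrences of a read of length L in a genome of size G, counting both strands, is about 2G / 4^L; the figures below use a human-sized 3 Gbp genome purely as an example:

    ```python
    # Expected number of chance occurrences of a random read of length L
    # in a genome of size G (both strands), under a uniform-base model.
    # Real genomes are repetitive and biased, which is why in practice
    # unique mapping needs more like 25-30 bp rather than the ~17 bp this
    # idealised model suggests.
    def expected_random_hits(read_len, genome_size=3_000_000_000):
        return 2 * genome_size / 4 ** read_len

    print(expected_random_hits(16))  # ~1.4: 16 bp is still ambiguous
    print(expected_random_hits(30))  # ~5e-9: effectively unique
    ```

    Every extra base divides the expectation by 4, so the bases beyond the uniqueness threshold are what buy tolerance for low-complexity sequence and read errors.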

          As an alternative to longer reads, paired-end mapping gives you two chances to hit a high-complexity region, and makes it much easier to detect PCR duplicates.


          • #6
            I'd definitely go for the longest reads you can get hold of for metagenomics.

            36bp is not pretty for de novo assemblies, which are what many metagenomic projects boil down to at some stage.

            There are a few papers about this, for example

            Rodrigue, S.; Materna, A. C.; Timberlake, S. C.; Blackburn, M. C.; Malmstrom, R. R.; Alm, E. J. & Chisholm, S. W. (2010). Unlocking short read sequencing for metagenomics. PLoS One, 5, e11840.

            Abstract: BACKGROUND: Different high-throughput nucleic acid sequencing platforms are currently available but a trade-off currently exists between the cost and number of reads that can be generated versus the read length that can be achieved. METHODOLOGY/PRINCIPAL FINDINGS: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read. CONCLUSIONS/SIGNIFICANCE: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.

            Also see a 36bp metagenomic run hidden in this paper:

            Sorber, K.; Chiu, C.; Webster, D.; Dimon, M.; Ruby, J. G.; Hekele, A. & DeRisi, J. L. (2008). The long march: a sample preparation technique that enhances contig length and coverage by high-throughput short-read sequencing. PLoS ONE, 3, e3495.

            Abstract: High-throughput short-read technologies have revolutionized DNA sequencing by drastically reducing the cost per base of sequencing information. Despite producing gigabases of sequence per run, these technologies still present obstacles in resequencing and de novo assembly applications due to biased or insufficient target sequence coverage. We present here a simple sample preparation method termed the "long march" that increases both contig lengths and target sequence coverage using high-throughput short-read technologies. By incorporating a Type IIS restriction enzyme recognition motif into the sequencing primer adapter, successive rounds of restriction enzyme cleavage and adapter ligation produce a set of nested sub-libraries from the initial amplicon library. Sequence reads from these sub-libraries are offset from each other with enough overlap to aid assembly and contig extension. We demonstrate the utility of the long march in resequencing of the Plasmodium falciparum transcriptome, where the number of genomic bases covered was increased by 39%, as well as in metagenomic analysis of a serum sample from a patient with hepatitis B virus (HBV)-related acute liver failure, where the number of HBV bases covered was increased by 42%. We also offer a theoretical optimization of the long march for de novo sequence assembly.