  • preseq: predicting the complexity of genomic sequencing libraries

    The Smith lab at USC is pleased to announce the publication of the preseq manuscript in Nature Methods, currently available as an Advance Online Publication ( link ).

    The preseq software is a tool designed to predict the complexity of a genomic library, quantified as the number of distinct reads obtained for a hypothetical sequencing depth. It takes as input a small mapped initial sequencing experiment in either BED or BAM format sorted by mapping location. The idea is to use the information gained from the number of times each read is observed to predict the number of new, currently unobserved, reads that will be gained from additional sequencing. The underlying model is taken from capture-recapture statistics and is shown to be applicable to a broad range of sequencing experiments, including RNA-seq, ChIP-seq, BS-seq, and Exome-seq. Our lab and our collaborators currently use preseq as part of our bisulfite sequencing pipeline to prevent deep sequencing of low complexity libraries and to optimise the depth of sequencing.

    For more information or to download the software, please visit the preseq website:

    http://smithlab.usc.edu/software/librarycomplexity
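As a quick illustration of the expected input, here is a toy sketch that builds and position-sorts a tiny BED file; the c_curve/lc_extrap invocations in the comments are illustrative only, so check the manual for the exact options:

```shell
# Toy example: a tiny unsorted BED of mapped reads (made-up data)
cat > reads.bed <<'EOF'
chr1	300	350	read3	0	+
chr1	100	150	read1	0	+
chr2	200	250	read4	0	-
chr1	100	150	read2	0	+
EOF

# preseq expects input sorted by mapping location
sort -k1,1 -k2,2n reads.bed > reads.sorted.bed

# Then (illustrative invocations; see the manual for exact flags):
#   ./c_curve -o complexity_curve.txt reads.sorted.bed
#   ./lc_extrap -o future_yield.txt reads.sorted.bed
head -n 1 reads.sorted.bed
```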

  • #2
    Problem running with BAM file

    Hi timydaley,

    I have a problem running your program with a BAM file (Scientific Linux 6.3 = RHEL 6.3).

    GSL 1.15 and bamtools are installed, and your program also compiled:
    Code:
    make all BAMTOOLS_ROOT=/home/ws/SW_install/bamtools/bamtools
    Code:
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -c -o continued_fraction.o continued_fraction.cpp -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o lc_extrap lc_extrap.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o continued_fraction.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o c_curve c_curve.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
    but running the executables yields the following error:

    Code:
    ./c_curve: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
    ./lc_extrap: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
    What am I missing?

    thank you!



    • #3
      Ah, yes. This is a common problem we encounter with bamtools, since we do not do static linking, with all the problems that entails. I know of two quick solutions: set your LD_LIBRARY_PATH to include BAMTOOLS_ROOT/lib/ (though this can cause problems with other programs), or copy all the library files in BAMTOOLS_ROOT/lib/ to a directory already on your LD_LIBRARY_PATH, typically ~/lib/.
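A minimal sketch of the two fixes (the paths are illustrative; substitute your own BAMTOOLS_ROOT):

```shell
# Option 1: point the dynamic linker at the bamtools shared libraries
# (path is an example; use your actual BAMTOOLS_ROOT/lib)
export LD_LIBRARY_PATH="$HOME/SW_install/bamtools/bamtools/lib:$LD_LIBRARY_PATH"

# Option 2: copy the shared libraries to a directory already searched,
# e.g. ~/lib/ (uncomment to use):
# mkdir -p ~/lib && cp "$HOME/SW_install/bamtools/bamtools/lib/"libbamtools.so* ~/lib/
```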

      Thank you so much for using our software and I hope it is of use in your projects.



      • #4
        This seems like a very interesting tool, thanks for sharing.



        • #5
          Dear Timothy

          Thank you for developing the method and for implementing and sharing the tool. It is a great opportunity to assess library complexity and to have statistical support for what to expect with deeper sequencing.

          I've run your tool to produce c_curve and lc_extrap output (using a .bam from an RNA-seq experiment). It worked just fine with default parameters.

          I have 2 comments/questions:
          1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute this, but I could not find how in the manual.pdf)

          2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS / MAX COUNT / COUNTS OF 1 / MAX TERMS"?
          (my example)
          TOTAL READS = 41159280
          DISTINCT READS = 16910777
          DISTINCT COUNTS = 1258
          MAX COUNT = 16177
          COUNTS OF 1 = 1.1352e+07
          MAX TERMS = 1000
          OBSERVED COUNTS (16178)

          Thank you for your time



          • #6
            Originally posted by kristofit

            1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute this, but I could not find how in the manual.pdf)
            To compute estimates in terms of distinct loci identified, you need to produce the counts of duplicate events in terms of the loci of interest. For example, you may be interested in counting distinct exons for an RNA-seq experiment. The UMI would then be the exon at which a read starts (or ends, or within which it is completely contained; either way, the UMI needs to be consistent across the initial and full experiment). Duplicate events would be reads that map to the same exon, regardless of whether they themselves are duplicates or not.

            Or take the example from the paper, where we looked at fixed non-overlapping 1kb bins for a ChIP-seq experiment. There we used the mapped starting location of a read to identify the bin to which it belonged. Duplicate events include both distinct reads that map to the same bin and duplicate reads.

            Once you have the duplicate counts, the newest version of preseq (available on our website smithlab.usc.edu/software/librarycomplexity ) allows you to input these as a text file, one count per line, as detailed in the manual.
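A hedged sketch of producing such a counts file for the 1kb-bin case (the file names and toy coordinates are made up for illustration):

```shell
# Toy reads (chrom, start, end); real input would be the mapped BED
cat > reads.bed <<'EOF'
chr1	100	150
chr1	900	950
chr1	1200	1250
chr2	500	550
EOF

# Identify each read's 1kb bin by its start position, then count
# reads per bin and emit one count per line (the text input format)
awk '{ print $1 "\t" int($2 / 1000) }' reads.bed \
  | sort | uniq -c | awk '{ print $1 }' > counts.txt
cat counts.txt
```

Here the first bin on chr1 receives two reads, so counts.txt contains the lines 2, 1, 1.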


            Originally posted by kristofit

            2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS / MAX COUNT / COUNTS OF 1 / MAX TERMS"?
            (my example)
            TOTAL READS = 41159280
            DISTINCT READS = 16910777
            DISTINCT COUNTS = 1258
            MAX COUNT = 16177
            COUNTS OF 1 = 1.1352e+07
            MAX TERMS = 1000
            OBSERVED COUNTS (16178)
            Well, you had ~41M mapped reads in the experiment, ~17M distinct reads, ~11M singleton reads, and the read with the largest number of duplicates had 16177 copies.
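A runnable toy illustration of where those verbose fields come from: treat identical mapping positions as duplicates and summarize the duplicate-count histogram (data and script are made up; preseq computes these internally).

```shell
# Five toy reads; three are duplicates of the same position
cat > reads.bed <<'EOF'
chr1	100	150
chr1	100	150
chr1	100	150
chr1	200	250
chr2	300	350
EOF

# Collapse duplicates, then summarize: total reads, distinct reads,
# singletons (count == 1), and the largest duplicate count
sort reads.bed | uniq -c | awk '
  { total += $1; distinct++; if ($1 == 1) singletons++; if ($1 > max) max = $1 }
  END { print "TOTAL READS = " total
        print "DISTINCT READS = " distinct
        print "COUNTS OF 1 = " singletons
        print "MAX COUNT = " max }' > summary.txt
cat summary.txt
```

For this toy input: 5 total, 3 distinct, 2 singletons, and a maximum count of 3.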

            Thank you for using the software and if you have anymore questions, feel free to contact me.
            Last edited by timydaley; 03-29-2013, 03:22 PM.



            • #7
              Comparison with Picard EstimateLibraryComplexity?

              Hi Tim,

              Congratulations on a very nice paper; I thought you laid out the problem very nicely and provided enough detail of the method to be very convincing. I'm curious how much the estimates from your method differ from those provided by tools commonly used today, such as EstimateLibraryComplexity.jar in picard. It uses a formula based only on the total number of reads in the sample and the total number of unique reads: http://picard.sourceforge.net/javado...e(long,%20long)

              I'm guessing picard's estimate is probably even more biased than the ZTNB model, but I don't know how far off it actually is. It would be very interesting to see how it performs in your comparisons.

              Thanks for the nice paper/software!



              • #8
                Thank you, Mr. Zhao. It appears that estimateLibrarySize assumes a simple Lander-Waterman model, which corresponds to a simple Poisson model. The ZTNB model is a much broader class that includes the simple Poisson model (taking alpha -> 0). Therefore, the estimates from such a model can only be more biased than the ZTNB estimates.
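A one-line sketch of that limit, using the common mean-dispersion parameterization of the negative binomial (my notation, not necessarily the paper's):

```latex
% Negative binomial with mean \mu and dispersion \alpha:
% as \alpha \to 0, the variance collapses to the mean,
% recovering the Poisson mean-variance relation.
\mathrm{Var}(X) = \mu + \alpha\mu^{2}
  \xrightarrow[\alpha \to 0]{} \mu = \mathrm{E}[X]
```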

                If you have any more questions or would like to discuss the model, feel free to contact me. I am happy to answer any questions that I am able to.



                • #9
                  Ok, so my question is a practical one: our pipeline uses picard MarkDuplicates and EstimateLibraryComplexity. The estimate may not be very accurate, but it's basically free computationally. preseq is more accurate, but running it on BAMs with ~100 million paired-end reads takes a long time on my machine. I could downsample the BAM file, or do some other workaround, but knowing exactly how far off picard's complexity estimate is will help me decide how much effort to put into it. If it's off by 30-40% then it's probably not worth the trouble, but if it's off by 3-4x then I'll have to integrate preseq into the analysis pipeline.



                  • #10
                    100M reads is quite a lot. Downsampling to 20M or so, you should get extremely accurate results extrapolating pretty much as far as you want to go, while cutting down on the computational time significantly.

                    As far as the bias for the Poisson, it depends a lot on the library. All libraries will appear worse than they actually are under the Poisson model, with poorer complexity libraries appearing even worse than they actually are.



                    • #11
                      Thanks! Good to know the Poisson model always underestimates library complexity. From your paper, ZTNB sometimes overestimates, Fig. 2B for example. I wasn't sure if that could also happen with the Poisson model.

                      So the overestimate of ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?



                      • #12
                        Dear Timothy

                        Thank you very much for your answer and help. I have used the newest version of preseq (maybe it could be named v1.1). It's great that paired-end data can now be used (option -P). It seems (in my hands) that c_curve (from the newest preseq) using input.bam did not work properly (compiled as 'make all BAMTOOLS_ROOT=$bamtools'). However, input.bed (make all) works fine.

                        Is the most accurate library complexity result returned as the last line of the c_curve output file?

                        As lc_extrap takes a very long time to run for 100M reads, you recommend downsampling to 20M reads from the BAM (or BED) file to get an accurate result. From a position-sorted .bam or .bed, taking 20M out of 100M, not all chromosomes will be sampled; does that matter?
                        Many thanks in advance



                        • #13
                          Originally posted by kristofit
                          From an ordered .bam or .bed, using 20M out of 100M, all chromosomes will not be sampled, does that matter ?
                          By downsampling, I mean sampling 20M reads at random without replacement from the set of all reads. As you point out, sampling by truncation (taking the first 20M reads) will possibly result in a biased sample. I'm not entirely certain, but I would imagine that different chromosomes can have very different statistical properties.
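A runnable toy of sampling without replacement, scaled down to 2 of 5 reads (replace with 20M of 100M on real data; the file names are illustrative, and samtools view -s can subsample BAM files directly):

```shell
# Five toy reads; real input would be the full BED file
cat > reads.bed <<'EOF'
chr1	100	150
chr1	200	250
chr2	300	350
chr2	400	450
chr3	500	550
EOF

# Random sample without replacement, then re-sort by position,
# since preseq expects position-sorted input
shuf -n 2 reads.bed | sort -k1,1 -k2,2n > reads.sub.bed
wc -l < reads.sub.bed
```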



                          • #14
                            Hi Timothy,

                            Thanks very much for this tool. I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.

                            Thanks in advance!



                            • #15
                              Originally posted by lchong
                              Hi Timothy,
                              I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.
                              I have no idea. I am fairly confident that's a problem with the BAM file, or possibly some interface problem with bamtools. How was the BAM file generated?
