  • timydaley
    Member
    • Jun 2010
    • 26

    preseq: predicting the complexity of genomic sequencing libraries

It is the pleasure of the Smith lab at USC to announce the publication of the preseq manuscript in Nature Methods, currently available as an Advanced Online Publication ( link ).

    The preseq software is a tool designed to predict the complexity of a genomic library, quantified as the number of distinct reads obtained for a hypothetical sequencing depth. It takes as input a small mapped initial sequencing experiment in either BED or BAM format sorted by mapping location. The idea is to use the information gained from the number of times each read is observed to predict the number of new, currently unobserved, reads that will be gained from additional sequencing. The underlying model is taken from capture-recapture statistics and is shown to be applicable to a broad range of sequencing experiments, including RNA-seq, ChIP-seq, BS-seq, and Exome-seq. Our lab and our collaborators currently use preseq as part of our bisulfite sequencing pipeline to prevent deep sequencing of low complexity libraries and to optimise the depth of sequencing.

    For more information or to download the software, please visit the preseq website:

  • dietmar13
    Senior Member
    • Mar 2010
    • 107

    #2
    problem running with BAM-file

    hi timydaley,

I have a problem running your program with a BAM file (Scientific Linux 6.3 = RHEL 6.3).

GSL 1.15 and bamtools are installed. Your program also builds:
    Code:
    make all BAMTOOLS_ROOT=/home/ws/SW_install/bamtools/bamtools
    Code:
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -c -o continued_fraction.o continued_fraction.cpp -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o lc_extrap lc_extrap.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o continued_fraction.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
    g++ -Wall -fPIC -fmessage-length=50 -DHAVE_BAMTOOLS -o c_curve c_curve.cpp /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//GenomicRegion.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_os.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//smithlab_utils.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//OptionParser.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//MappedRead.o /home/ws/SW_install/presec/PRE-seq/smithlab_cpp//RNG.o -I/home/ws/SW_install/presec/PRE-seq/smithlab_cpp/ -I/home/ws/SW_install/bamtools/bamtools/include -lgsl -lgslcblas  -L/home/ws/SW_install/bamtools/bamtools/lib -lz -lbamtools
but running the executables yields the following error:

    Code:
    ./c_curve: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
    ./lc_extrap: error while loading shared libraries: libbamtools.so.2.2.3: cannot open shared object file: No such file or directory
What am I missing?

    thank you!


    • timydaley
      Member
      • Jun 2010
      • 26

      #3
Ah, yes. This is a common problem we encounter with bamtools, since we do not do static linking, with all the problems that entails. I know of two quick solutions: set your LD_LIBRARY_PATH to include BAMTOOLS_ROOT/lib/ (though that can then interfere with other programs), or copy all the library files in BAMTOOLS_ROOT/lib/ to a directory already on your LD_LIBRARY_PATH, typically ~/lib/.
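A minimal sketch of both workarounds, using the BAMTOOLS_ROOT path from the build command quoted above (adjust for your own install):

```shell
# Workaround 1: add the bamtools lib directory to the runtime linker
# search path for the current shell session.
export LD_LIBRARY_PATH=/home/ws/SW_install/bamtools/bamtools/lib:$LD_LIBRARY_PATH

# Workaround 2: copy the shared libraries into a directory that is already
# searched (commented out here so the sketch has no side effects):
# mkdir -p ~/lib
# cp /home/ws/SW_install/bamtools/bamtools/lib/libbamtools.so* ~/lib/
```

To make workaround 1 permanent, the export line can go in ~/.bashrc.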

      Thank you so much for using our software and I hope it is of use in your projects.


      • kopi-o
        Senior Member
        • Feb 2008
        • 319

        #4
        This seems like a very interesting tool, thanks for sharing.


        • kristofit
          Junior Member
          • Apr 2012
          • 3

          #5
          Dear Timothy

Thank you for developing the method and for implementing and sharing the tool. It is a great opportunity to assess library complexity and to have statistical support for what to expect with deeper sequencing.

I've run your tool to produce c_curve and lc_extrap output (using a .bam from an RNA-seq experiment). It worked just fine with default parameters.

          I have 2 comments/questions:
1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute, but I could not find how to do it in the manual.pdf.)

2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS/MAX COUNT/COUNTS OF 1/MAX TERMS"?
          (my example)
          TOTAL READS = 41159280
          DISTINCT READS = 16910777
          DISTINCT COUNTS = 1258
          MAX COUNT = 16177
          COUNTS OF 1 = 1.1352e+07
          MAX TERMS = 1000
          OBSERVED COUNTS (16178)

          Thank you for your time


          • timydaley
            Member
            • Jun 2010
            • 26

            #6
Originally posted by kristofit

1- How can we estimate the library complexity in terms of distinct loci identified? (It seems possible to compute, but I could not find how to do it in the manual.pdf.)
To compute estimates in terms of distinct loci, you need to produce the counts of duplicate events in terms of the loci of interest. For example, you may be interested in counting distinct exons in an RNA-seq experiment. The UMI would then be the exon at which a read starts (or ends, or within which it is completely contained; either way, the UMI needs to be consistent across the initial and full experiments). Duplicate events would be reads that map to the same exon, regardless of whether the reads themselves are duplicates.

Or take the example from the paper, where we looked at fixed non-overlapping 1kb bins for a ChIP-seq experiment. For that we used the mapped starting location of the read to identify the bin to which a read belonged. Duplicate events include distinct reads that map to the same bin as well as duplicate reads.

Once you have the duplicate counts, the newest version of preseq (available on our website smithlab.usc.edu/software/librarycomplexity ) allows you to input them as a text file, one count per line, as detailed in the manual.
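For the 1kb-bin example above, such a counts file can be produced from a BED file of mapped reads with standard tools. A minimal sketch (the toy reads.bed contents and the 1 kb bin size are illustrative):

```shell
# Build a toy BED file of mapped reads (chrom, start, end).
printf 'chr1\t100\t150\nchr1\t120\t170\nchr1\t1500\t1550\n' > reads.bed

# Assign each read to a fixed non-overlapping 1 kb bin by its start position,
# then count reads per bin; the result is one count per line, as preseq's
# text-input mode expects.
awk '{ print $1, int($2 / 1000) }' reads.bed | sort | uniq -c \
  | awk '{ print $1 }' > counts.txt
```

Here the first two reads fall in the same bin, so counts.txt holds a 2 and a 1.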


Originally posted by kristofit

2- Could you help me interpret the c_curve output file (-verbose) fields "DISTINCT COUNTS/MAX COUNT/COUNTS OF 1/MAX TERMS"?
            (my example)
            TOTAL READS = 41159280
            DISTINCT READS = 16910777
            DISTINCT COUNTS = 1258
            MAX COUNT = 16177
            COUNTS OF 1 = 1.1352e+07
            MAX TERMS = 1000
            OBSERVED COUNTS (16178)
Well, you had ~41M mapped reads in the experiment, ~17M distinct reads, ~11M singleton reads, and the read with the largest number of duplicates had 16177 copies.
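For concreteness, these summary statistics can be recomputed from a file of per-read copy numbers (one copy number per distinct read; the five-read toy file below is illustrative):

```shell
# Toy copy-number file: three singletons plus reads seen 3 and 5 times.
printf '1\n1\n1\n3\n5\n' > copies.txt

total=$(awk '{ s += $1 } END { print s }' copies.txt)               # TOTAL READS
distinct=$(awk 'END { print NR }' copies.txt)                       # DISTINCT READS
singletons=$(awk '$1 == 1 { n++ } END { print n + 0 }' copies.txt)  # COUNTS OF 1
maxcount=$(sort -n copies.txt | tail -n 1)                          # MAX COUNT
echo "$total $distinct $singletons $maxcount"   # prints: 11 5 3 5
```

DISTINCT COUNTS is then presumably the number of distinct copy-number values observed (here 3: the values 1, 3, and 5).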

Thank you for using the software and if you have any more questions, feel free to contact me.
            Last edited by timydaley; 03-29-2013, 03:22 PM.


            • Emo.Zhao
              Junior Member
              • Oct 2011
              • 3

              #7
Comparison with Picard EstimateLibraryComplexity?

              Hi Tim,

Congratulations on a very nice paper; I thought you laid out the problem very clearly and provided enough details of the method to be convincing. I'm curious how much the estimates from your method differ from those provided by commonly used tools, such as EstimateLibraryComplexity.jar in Picard. It uses a formula based only on the total number of reads in the sample and the total number of unique reads: http://picard.sourceforge.net/javado...e(long,%20long)

I'm guessing Picard's estimate is probably even more biased than the ZTNB model, but I don't know how far off it actually is. It'll be very interesting to see how it performs in your comparisons.

              Thanks for the nice paper/software!


              • timydaley
                Member
                • Jun 2010
                • 26

                #8
Thank you, Mr. Zhao. It appears that estimateLibrarySize assumes a simple Lander-Waterman model, which corresponds to a simple Poisson model. The ZTNB model is a much broader class that includes the simple Poisson model (taking alpha -> 0). Therefore, estimates from the Poisson model can only be more biased than the ZTNB estimates.
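For intuition, the Lander-Waterman/Poisson estimate solves distinct = L * (1 - exp(-total/L)) for the library size L. A bisection sketch of that relation (this is a re-derivation for illustration, not Picard's actual code; the input numbers are the c_curve totals quoted earlier in the thread):

```shell
# Solve distinct = L * (1 - exp(-total/L)) for L by bisection.
# f(L) is increasing in L and approaches total as L grows, so a root
# above the distinct-read count exists whenever distinct < total.
est=$(awk -v total=41159280 -v distinct=16910777 'BEGIN {
  lo = distinct; hi = distinct * 1e6
  for (i = 0; i < 200; i++) {
    mid = (lo + hi) / 2
    if (mid * (1 - exp(-total / mid)) < distinct)
      lo = mid
    else
      hi = mid
  }
  printf "%.0f", (lo + hi) / 2
}')
echo "Poisson library-size estimate: $est"
```

For these numbers the estimate comes out around 19M distinct molecules; an overdispersed (ZTNB-like) library would be larger than this Poisson figure suggests.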

                If you have any more questions or would like to discuss the model, feel free to contact me. I am happy to answer any questions that I am able to.


                • Emo.Zhao
                  Junior Member
                  • Oct 2011
                  • 3

                  #9
Ok, so my question is a practical one: our pipeline uses Picard MarkDuplicates and EstimateLibraryComplexity. The estimate may not be very accurate, but it's basically free computationally. preseq is more accurate, but running it on BAMs with ~100 million paired-end reads takes a long time on my machine. I could downsample the BAM file, or use some other workaround, but knowing exactly how far off Picard's complexity estimate is will help me decide how much effort to put into it. If it's off by 30-40% then it's probably not worth the trouble, but if it's off by 3-4x then I'll have to integrate preseq into the analysis pipeline.


                  • timydaley
                    Member
                    • Jun 2010
                    • 26

                    #10
100M reads is quite a lot. By downsampling and using 20M or so, you should get extremely accurate results extrapolating pretty much as far as you want to go, while cutting down the computation time significantly.

                    As far as the bias for the Poisson, it depends a lot on the library. All libraries will appear worse than they actually are under the Poisson model, with poorer complexity libraries appearing even worse than they actually are.


                    • Emo.Zhao
                      Junior Member
                      • Oct 2011
                      • 3

                      #11
Thanks! Good to know the Poisson model always underestimates library complexity. From your paper, ZTNB sometimes overestimates, e.g. Fig. 2B. I wasn't sure whether that could also happen with the Poisson model.

So the overestimate of the ZTNB must be the result of the second parameter of the negative binomial not being estimated correctly from the data?


                      • kristofit
                        Junior Member
                        • Apr 2012
                        • 3

                        #12
                        Dear Timothy

Thank you very much for your answer and help. I have used the newest version of preseq (maybe it could be named v1.1). It's great that paired-end data can now be used (option -P). It seems (in my hands) that c_curve from the newest preseq did not work properly with input.bam (compiled as 'make all BAMTOOLS_ROOT=$bamtools'). However, input.bed (make all) works fine.

Is the most accurate library complexity estimate the one returned as the last line of the c_curve output file?

As lc_extrap takes a very long time to run for 100M reads, you recommend downsampling and using 20M reads from the BAM (or BED) file to get an accurate result. From a sorted .bam or .bed, using 20M out of 100M, not all chromosomes will be sampled; does that matter?
Many thanks in advance


                        • timydaley
                          Member
                          • Jun 2010
                          • 26

                          #13
Originally posted by kristofit
From a sorted .bam or .bed, using 20M out of 100M, not all chromosomes will be sampled; does that matter?
By downsampling, I mean sampling 20M reads at random without replacement from the set of all reads. As you point out, sampling by truncation (taking the first 20M reads) could result in a biased sample. I'm not entirely certain, but I would imagine that different chromosomes may have very different statistical properties.
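A sketch of random (rather than truncated) downsampling; the toy file below is illustrative. For a real BED file you might run `shuf -n 20000000 input.bed > sub20M.bed`, and for BAM input `samtools view -b -s 42.2 input.bam` subsamples a random fraction (seed 42, fraction 0.2) while keeping both mates of a pair together:

```shell
# Toy demonstration of sampling without replacement using GNU shuf:
seq 1 100 > all_reads.txt            # stand-in for 100 mapped reads
shuf -n 20 all_reads.txt > sub.txt   # keep a random 20, not the first 20
```

Because shuf samples without replacement, the 20 kept lines are all distinct.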


                          • lchong
                            Junior Member
                            • Feb 2012
                            • 2

                            #14
                            Hi Timothy,

Thanks very much for this tool. I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I get the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? It seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analyses.

                            Thanks in advance!


                            • timydaley
                              Member
                              • Jun 2010
                              • 26

                              #15
Originally posted by lchong
                              Hi Timothy,
                              I'm trying to run c_curve on a large BAM file. With adequate memory allocated it seems to run fine, but after a while I'm getting the output "ERROR: no chrom with id: -1". Any idea what could be causing this error? Seems like it may be a problem with the BAM, but I haven't run into any issues in my previous analysis.
I have no idea. I am fairly confident it's a problem with the BAM file, or possibly some interface problem with bamtools. How was the BAM file generated?

