  • mdimon
    Member
    • Jan 2010
    • 10

    HMMSplicer : new software for finding splice junctions in RNA-Seq data

    The DeRisi lab is pleased to release HMMSplicer to the community. This open-source software package discovers splice junctions in RNA-Seq datasets without using gene models. HMMSplicer was benchmarked on publicly available A. thaliana, H. sapiens, and P. falciparum datasets and performed well in all cases. The software was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. In addition, HMMSplicer provides a score for every predicted junction allowing the user to set a threshold to tune false positive rates depending on the needs of the experiment. Information about the datasets tested, including the exact command parameters and the final results, is provided. HMMSplicer is implemented in Python and is freely available for all. The manuscript is currently under review.
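The per-junction score threshold mentioned above could be applied with a few lines of Python. This is only a sketch: it assumes the junctions are in a tab-delimited, BED-style file with the score in the fifth column, which should be checked against the actual HMMSplicer output. The function name `filter_junctions` and the example threshold are hypothetical.

```python
def filter_junctions(lines, min_score, score_col=4):
    """Yield tab-delimited junction lines whose score is at least min_score.

    score_col=4 assumes a BED-style layout (score in the fifth column).
    That layout is an assumption, not taken from the HMMSplicer docs.
    """
    for line in lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue  # skip blank lines, track headers, and comments
        fields = line.rstrip("\n").split("\t")
        if float(fields[score_col]) >= min_score:
            yield line
```

Raising `min_score` trades sensitivity for fewer false positives; lowering it does the opposite.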

    The code and documentation can be found at: http://derisilab.ucsf.edu/software/hmmsplicer

    We would love to get your feedback on our software. If you have any problems running HMMSplicer, or any suggestions for improvements, please let me know.

    Michelle Dimon
    DeRisi Lab, UCSF
    mdimon [at] gmail [dot] com
  • malachig
    Senior Member
    • Aug 2010
    • 117

    #2
    I have been experimenting with this tool lately. The results seem promising.

    I have the following suggestions. From a performance perspective, the major bottleneck we are encountering is disk space usage. Processing a lane of data appears to require approximately 25-30 GB of disk space (including the fastq input file). This is no problem for 1 lane of data, but when processing many hundreds of lanes it quickly becomes an issue.

    I would request the following:
    1.) Support for compressed input fastq files. We do not store uncompressed versions of any read data. Would it be possible to decompress it on the fly without ever having an uncompressed version on disk?

    2.) You have an option to delete the temp files at the end of the job. It appears that this happens all at once at the end of the job. If the user selects this option, would it be possible to delete individual files as soon as they are no longer needed?
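The second request, deleting each temporary file as soon as it is no longer needed, can be sketched as follows. This is not HMMSplicer's code; `run_pipeline` and the step signature `(in_path, out_path)` are hypothetical names used only to illustrate the idea.

```python
import os
import tempfile


def run_pipeline(input_path, steps, keep_temp=False):
    """Run steps in order, deleting each intermediate once it is consumed.

    Each step is a function (in_path, out_path) that writes out_path.
    The original input file is never deleted. Returns the final output path.
    """
    current = input_path
    for step in steps:
        fd, out_path = tempfile.mkstemp(suffix=".tmp")
        os.close(fd)
        step(current, out_path)
        # The previous intermediate has been consumed by this step.
        if current != input_path and not keep_temp:
            os.remove(current)
        current = out_path
    return current
```

With this layout, peak disk usage is roughly the input plus two intermediates, instead of the input plus every intermediate.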

    • Lee Sam
      Member
      • Oct 2008
      • 57

      #3
      I've been playing with splice discovery tools for a while. How does the tool perform time-wise? I can't get supersplat to finish, for example.

      • malachig
        Senior Member
        • Aug 2010
        • 117

        #4
        What kind of data are we talking about? One lane? What read length? Total number of reads per lane?

        In my experience, it takes anywhere from a few hours to a week to process a lane of Illumina paired end data (depending on number of reads mostly, but also possibly read length and of course hardware). Currently it seems that only the alignment step of hmmSplicer is parallel. So, committing multiple CPUs will improve one major step but other steps can still take a while. You can segment your data to an arbitrary degree and run on a cluster if you have those resources available. I'm using ~100 cpus and 10TB of disk space, but I am processing a lot of data...
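Segmenting the data for a cluster can be as simple as cutting the FASTQ file into fixed-size chunks. The sketch below assumes four lines per record (no wrapped sequence lines); the function name `split_fastq` and the chunk naming scheme are hypothetical.

```python
def split_fastq(in_path, reads_per_chunk, prefix):
    """Split a FASTQ file into chunks of at most reads_per_chunk reads.

    Assumes exactly four lines per record. Streams line by line, so memory
    use stays small. Returns the list of chunk file names written.
    """
    chunk_paths = []
    out = None
    with open(in_path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % (4 * reads_per_chunk) == 0:
                # Start a new chunk at every record boundary multiple.
                if out is not None:
                    out.close()
                path = "%s.%04d.fastq" % (prefix, len(chunk_paths))
                out = open(path, "w")
                chunk_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk can then be submitted as its own job and the resulting junction files merged afterwards.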

        • malachig
          Senior Member
          • Aug 2010
          • 117

          #5
          Lee Sam, it would be great to hear your thoughts on the other splice discovery tools you have been experimenting with.

          • malachig
            Senior Member
            • Aug 2010
            • 117

            #6
            Also, perhaps the author can comment on the advisability of partitioning the data and then merging the results given that an HMM training step is involved...

            I'm also curious about how the sampling is done for training. If I request a sample of 100k, are the first 100k reads selected? Or are they selected randomly from the input file? If the former is the case, it seems that it would be unwise to combine multiple lanes for a single hmmSplicer run (as these lanes may have distinct characteristics such as read length, error rate, etc.)

            • Lee Sam
              Member
              • Oct 2008
              • 57

              #7
              Originally posted by malachig:
              What kind of data are we talking about? One lane? What read length? Total number of reads per lane?

              In my experience, it takes anywhere from a few hours to a week to process a lane of Illumina paired end data (depending on number of reads mostly, but also possibly read length and of course hardware). Currently it seems that only the alignment step of hmmSplicer is parallel. So, committing multiple CPUs will improve one major step but other steps can still take a while. You can segment your data to an arbitrary degree and run on a cluster if you have those resources available. I'm using ~100 cpus and 10TB of disk space, but I am processing a lot of data...
              I have a few dozen lanes to run (PE 2x50 GA2 runs). I suppose I can set it up to run on a cluster I have access to. My experience has mostly been with SAW (published by some people I know), MapSplice, SpliceMap, and supersplat. Run times have been a continuing concern, but I can send jobs out to a cluster with a lot of 12-core i7 nodes. So far it has all been exploratory.

              • ilivyatan
                Junior Member
                • Aug 2010
                • 7

                #8
                RFC.
                If it doesn't yet support SOLiD file formats, please ...

                • zukey
                  Junior Member
                  • May 2009
                  • 5

                  #9
                  I am trying HMMSplicer. Could anyone let me know how to load paired-end (Illumina) data into it?

                  Thanks a lot,

                  Qi

                  • mdimon
                    Member
                    • Jan 2010
                    • 10

                    #10
                    The manuscript with a more complete description of the tool will be published soon; hopefully that will answer many of your questions.

                    In terms of performance, HMMSplicer is comparable to TopHat, depending on the size of the genome and the size of the dataset. For a human test set with about 10 million paired-end reads, running across 4 processors, HMMSplicer took 14 hours to complete on my setup. As for using multiple processors: if you have a small genome, then both the alignment and the splice junction detection steps are parallelized. If you use the 'large genome' option, then only the alignment steps are parallelized.

                    Splitting the input reads into multiple groups for processing on a cluster shouldn't have an adverse effect on the HMM training, as long as each group is large enough to sample. The trained HMM parameters are printed to the log file, so you can always check the differences in the trained parameters to make sure each subset is training to approximately the same values. The sampling is done randomly -- so if you select 100k reads to sample, they will be spread randomly throughout the input file.
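One standard way to draw a uniform random sample spread throughout a file in a single pass is reservoir sampling. The sketch below illustrates that general technique; it is not taken from the HMMSplicer source, which may sample differently, and the function names are hypothetical.

```python
import random


def sample_reads(records, k, seed=None):
    """Return up to k records chosen uniformly at random (reservoir sampling).

    Works in one pass over any iterable, so the input size need not be known.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ handle."""
    while True:
        record = tuple(handle.readline().rstrip("\n") for _ in range(4))
        if not record[0]:
            return  # end of file
        yield record
```

For example, `sample_reads(read_fastq(open("reads.fastq")), 100000)` would return 100k records drawn from the whole file rather than only its first 100k reads (`reads.fastq` is a placeholder name).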

                    • mdimon
                      Member
                      • Jan 2010
                      • 10

                      #11
                      Qi,
                      As far as running HMMSplicer with paired end data, HMMSplicer does not do any special processing for paired ends yet, so simply concatenate the read files and use the combined reads as input.
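Concatenating the two mate files needs nothing more than a byte-for-byte copy, equivalent to `cat` on the command line. A minimal sketch, assuming each input file ends with a newline; the function name and the file names in the usage note are hypothetical.

```python
import shutil


def concatenate_reads(input_paths, output_path):
    """Concatenate read files into one file, in the order given.

    Copies bytes unchanged, so every input must end with a newline or the
    last record of one file will run into the first record of the next.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)
```

For instance, `concatenate_reads(["lane1_1.fastq", "lane1_2.fastq"], "lane1_all.fastq")` produces a single file that can be passed as the input reads.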

                      • mdimon
                        Member
                        • Jan 2010
                        • 10

                        #12
                        malachig,

                        Thanks for the feedback and the suggestions. Python has good tools for handling compressed files, so this should be a relatively straightforward addition for the next release. I really like the idea of deleting tmp files along the way, also.
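As an illustration of the Python tooling referred to here, the standard-library `gzip` module can stream a compressed FASTQ file without ever writing an uncompressed copy to disk. This is a sketch of the general approach, not the code that went into HMMSplicer; `open_reads` and `count_reads` are hypothetical helper names.

```python
import gzip


def open_reads(path):
    """Open a FASTQ file for text reading, decompressing .gz input on the fly."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records in a 4-line-per-record FASTQ file, compressed or not."""
    with open_reads(path) as handle:
        return sum(1 for _ in handle) // 4
```

Any code that iterates over the returned handle line by line works the same way for plain and gzip-compressed input.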

                        Thanks!
                        Michelle

                        • mdimon
                          Member
                          • Jan 2010
                          • 10

                          #13
                          The manuscript is now available for HMMSplicer:


                          There should also be a new version of the software available later today with some of the suggestions here as well as improved descriptions on how to use the helper scripts that are included as part of HMMSplicer.

                          • proteomania
                            Member
                            • Sep 2010
                            • 11

                            #14
                            Thanks for developing the tool. The default output doesn't seem to contain the alignments of the spliced reads. It would be great if the software could output the spliced read alignments in SAM format.

                            • darked89
                              Member
                              • Jun 2009
                              • 38

                              #15
                              @mdimon

                              I have a few lanes of paired RNA-Seq reads from a few tissues (plant, novel genome). Do you recommend concatenating all of them into a single giant file or running individual lanes (or even the _1 and _2 reads) separately, or will it not make a big difference? I want to get as many reliable splice junctions as possible.
                              If combining reads: I assume individual read names must be unique in the entire file?

                              My other question: can I feed HMMSplicer just the unmapped reads to speed things up? I already have mapping results in a non-spliced mode for several lanes.
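Whether this is advisable is not answered in the thread, but mechanically the unmapped reads can be pulled out of an existing alignment. The sketch below assumes the non-spliced mapping results are in SAM format, where bit 0x4 of the FLAG field marks an unmapped read; the function name is hypothetical.

```python
def unmapped_to_fastq(sam_lines):
    """Yield FASTQ records for SAM lines whose FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & 0x4:
            yield "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

Writing the yielded records to a file gives a FASTQ input containing only the reads that failed to align in the non-spliced run.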

                              Thanks a lot for developing it.

                              P.S. HMMSplicer works OK with bowtie 0.12.7 / Python 2.6.4 on Linux Fedora 8.
