  • HMMSplicer : new software for finding splice junctions in RNA-Seq data

    The DeRisi lab is pleased to release HMMSplicer to the community. This open-source software package discovers splice junctions in RNA-Seq datasets without using gene models. HMMSplicer was benchmarked on publicly available A. thaliana, H. sapiens, and P. falciparum datasets and performed well in all cases. The software was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. In addition, HMMSplicer provides a score for every predicted junction, so the user can set a threshold that tunes the false-positive rate to the needs of the experiment. Information about the datasets tested, including the exact command parameters and the final results, is provided. HMMSplicer is implemented in Python and is freely available to all. The manuscript is currently under review.

    The code and documentation can be found at: http://derisilab.ucsf.edu/software/hmmsplicer

    We would love to get your feedback on our software. If you have any problems running HMMSplicer, or any suggestions for improvements, please let me know.

    Michelle Dimon
    DeRisi Lab, UCSF
    mdimon [at] gmail [dot] com
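
The per-junction score mentioned in the announcement lends itself to a simple post-filter. The following is only a minimal sketch: it assumes the junctions are stored as tab-delimited, BED-like lines with the score in the fifth column, and the function name `filter_junctions`, the example records, and the threshold of 600 are all hypothetical. Check the HMMSplicer documentation for the actual output layout and score range.

```python
def filter_junctions(lines, min_score):
    """Yield BED-like junction lines whose score (column 5) is >= min_score."""
    for line in lines:
        # Pass over blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[4]) >= min_score:
            yield line


# Hypothetical example records (assumed layout: chrom, start, end, name, score, strand).
example = [
    "chr1\t100\t500\tjunc1\t1250.5\t+\n",
    "chr1\t900\t1500\tjunc2\t310.0\t-\n",
]
kept = list(filter_junctions(example, min_score=600))
```

Raising `min_score` trades sensitivity for fewer false positives; a sensible value depends on the dataset.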

  • #2
    I have been experimenting with this tool lately. The results seem promising.

    I have the following suggestions. From a performance perspective, the major bottleneck we are encountering is disk space usage. Processing a lane of data appears to require approximately 25-30 GB of disk space (including the fastq input file). This is no problem for 1 lane of data, but when processing many hundreds of lanes it quickly becomes an issue.

    I would request the following:
    1.) Support for compressed input fastq files. We do not store uncompressed versions of any read data. Would it be possible to decompress them on the fly without ever having an uncompressed version on disk?

    2.) You have an option to delete the temp files at the end of the job. It appears that this happens all at once at the end of the job. If the user selects this option, would it be possible to delete individual files as soon as they are no longer needed?
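
For the first request, a rough sketch of on-the-fly decompression using Python's standard `gzip` module is shown here. This is not HMMSplicer's code: the helper names `open_reads` and `iter_fastq` are hypothetical, the sketch is written for a current Python 3, and it assumes plain four-line FASTQ records.

```python
import gzip
import os
import tempfile


def open_reads(path):
    """Open a FASTQ file as text, decompressing .gz input on the fly."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_fastq(path):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    with open_reads(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header.rstrip("\n")[1:], seq, qual


# Small demonstration: only the compressed file ever exists on disk.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "reads.fastq.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n")
    records = list(iter_fastq(gz_path))
```

Because the reads are streamed record by record, memory use stays flat regardless of file size.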


    • #3
      I've been playing with splice discovery tools for a while. How does the tool perform time-wise? I can't get supersplat to finish, for example.


      • #4
        What kind of data are we talking about? One lane? What read length? Total number of reads per lane?

        In my experience, it takes anywhere from a few hours to a week to process a lane of Illumina paired-end data (depending mostly on the number of reads, but possibly also on read length and, of course, hardware). Currently it seems that only the alignment step of HMMSplicer is parallel. So, committing multiple CPUs will improve one major step, but other steps can still take a while. You can segment your data to an arbitrary degree and run on a cluster if you have those resources available. I'm using ~100 CPUs and 10 TB of disk space, but I am processing a lot of data...
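
Segmenting a lane for a cluster run, as described above, can be sketched roughly as follows. This is an illustration only: `split_fastq`, the chunk file naming, and the chunk size are hypothetical, and the sketch assumes four-line FASTQ records with no line-wrapped sequences.

```python
import itertools
import os
import tempfile


def split_fastq(path, out_dir, reads_per_chunk):
    """Write consecutive groups of reads_per_chunk reads to numbered files."""
    chunk_paths = []
    with open(path) as handle:
        while True:
            # Each FASTQ record occupies four lines.
            lines = list(itertools.islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            name = "chunk_%04d.fastq" % len(chunk_paths)
            chunk_path = os.path.join(out_dir, name)
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths


# Demonstration: five reads split into chunks of two reads each.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "lane.fastq")
    with open(src, "w") as out:
        for i in range(5):
            out.write("@read%d\nACGT\n+\nIIII\n" % i)
    chunks = split_fastq(src, tmp, reads_per_chunk=2)
    reads_per_file = []
    for chunk in chunks:
        with open(chunk) as handle:
            reads_per_file.append(sum(1 for _ in handle) // 4)
```

Each chunk can then be submitted as a separate job, with chunk sizes chosen to suit the cluster.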


        • #5
          Lee Sam, it would be great to hear your thoughts on the other splice discovery tools you have been experimenting with.


          • #6
            Also, perhaps the author can comment on the advisability of partitioning the data and then merging the results given that an HMM training step is involved...

            I'm also curious about how the sampling is done for training. If I request a sample of 100k, are the first 100k reads selected? Or are they selected randomly from the input file? If the former is the case, it seems that it would be unwise to combine multiple lanes for a single HMMSplicer run (as these lanes may have distinct characteristics such as read length, error rate, etc.).


            • #7
              Originally posted by malachig:
              What kind of data are we talking about? One lane? What read length? Total number of reads per lane?

              In my experience, it takes anywhere from a few hours to a week to process a lane of Illumina paired-end data (depending mostly on the number of reads, but possibly also on read length and, of course, hardware). Currently it seems that only the alignment step of HMMSplicer is parallel. So, committing multiple CPUs will improve one major step, but other steps can still take a while. You can segment your data to an arbitrary degree and run on a cluster if you have those resources available. I'm using ~100 CPUs and 10 TB of disk space, but I am processing a lot of data...

              I have a few dozen lanes to run (PE 2x50 GA2 runs). I suppose I can set it up to run on a cluster I have access to. My experience has mostly been with SAW (published by some people I know), MapSplice, SpliceMap, and supersplat. Run times have been a continuing concern, but I can send jobs out to a cluster with a lot of 12-core i7 nodes. So far it's been exploratory.


              • #8
                RFC.
                If it doesn't yet support SOLiD file formats, please ...


                • #9
                  I am trying HMMSplicer. Could anyone let me know how to load paired-end (Illumina) data into it?

                  Thanks a lot,

                  Qi


                  • #10
                    The manuscript with a more complete description of the tool will be published soon; hopefully that will answer many of your questions.

                    In terms of performance, HMMSplicer is comparable to TopHat, depending on the size of the genome and the size of the dataset. For a human test set with about 10 million paired-end reads, running across 4 processors, HMMSplicer took 14 hours to complete on my setup. As for using multiple processors: with a small genome, both the alignment and the splice junction detection steps are parallelized. With the 'large genome' option, only the alignment steps are parallelized.

                    Splitting the input reads into multiple groups for processing on a cluster shouldn't have an adverse effect on the HMM training, as long as each group is large enough to sample. The trained HMM parameters are printed to the log file, so you can always check the differences in the trained parameters to make sure each subset is training to approximately the same values. The sampling is done randomly -- so if you select 100k reads to sample, they will be spread randomly throughout the input file.
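
To illustrate what sampling reads spread randomly throughout an input file can look like, here is a generic reservoir-sampling sketch. It is not taken from the HMMSplicer source, which may sample differently; the function name `reservoir_sample` and the `seed` parameter are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep this record with probability k / (i + 1),
            # overwriting a uniformly chosen reservoir slot.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Demonstration on integers standing in for reads.
picked = reservoir_sample(range(1000), 10, seed=1)
small = reservoir_sample(range(3), 10)
```

A single pass like this never needs the total read count up front, which suits large read files.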


                    • #11
                      Qi,
                      As far as running HMMSplicer with paired end data, HMMSplicer does not do any special processing for paired ends yet, so simply concatenate the read files and use the combined reads as input.
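
Concatenating the two read files can be done with any tool; a minimal Python sketch is given here for completeness. The function name `concatenate_files` and the file names are hypothetical, and the sketch assumes each input file ends with a newline so that records do not run together.

```python
import os
import shutil
import tempfile


def concatenate_files(inputs, output):
    """Append each input file to output, in order, without loading it into memory."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demonstration with two tiny stand-ins for the _1 and _2 read files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "reads_1.fastq")
    second = os.path.join(tmp, "reads_2.fastq")
    combined = os.path.join(tmp, "reads_combined.fastq")
    with open(first, "w") as out:
        out.write("@read1/1\nACGT\n+\nIIII\n")
    with open(second, "w") as out:
        out.write("@read1/2\nTTGA\n+\nIIII\n")
    concatenate_files([first, second], combined)
    with open(combined) as handle:
        combined_text = handle.read()
```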


                      • #12
                        malachig,

                        Thanks for the feedback and the suggestions. Python has good tools for handling compressed files, so this should be a relatively straightforward addition for the next release. I really like the idea of deleting tmp files along the way, also.

                        Thanks!
                        Michelle
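
The idea of deleting temporary files along the way could look roughly like the sketch below. It does not reflect HMMSplicer's internals: `run_steps`, the step functions, and the `step_N.tmp` naming are all hypothetical, and it assumes a linear pipeline in which each step reads only the previous step's output.

```python
import os
import tempfile


def run_steps(input_path, steps, work_dir):
    """Run steps in order, deleting each intermediate file once it has been consumed."""
    current = input_path
    for index, step in enumerate(steps):
        output = os.path.join(work_dir, "step_%d.tmp" % index)
        step(current, output)
        # The previous intermediate is no longer needed.
        # The original input is never deleted.
        if current != input_path:
            os.remove(current)
        current = output
    return current


def to_upper(src, dst):
    """Toy step: copy src to dst in upper case."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.read().upper())


def add_marker(src, dst):
    """Toy step: copy src to dst and append a marker line."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.read() + "done\n")


with tempfile.TemporaryDirectory() as tmp:
    start = os.path.join(tmp, "input.txt")
    with open(start, "w") as out:
        out.write("acgt\n")
    final = run_steps(start, [to_upper, add_marker], tmp)
    remaining = sorted(os.listdir(tmp))
    with open(final) as handle:
        final_text = handle.read()
```

With this pattern, peak disk usage is bounded by roughly two intermediate files rather than the sum of all of them.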


                        • #13
                          The manuscript is now available for HMMSplicer:
                          HMMSplicer was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. Because HMMSplicer does not rely on pre-built gene models, the products of inexact splicing are also detected. For H. sapiens, w …


                          There should also be a new version of the software available later today with some of the suggestions here as well as improved descriptions on how to use the helper scripts that are included as part of HMMSplicer.


                          • #14
                            Thanks for developing the tool. The default output doesn't seem to contain the alignments of the spliced reads. It would be great if the software could output the spliced read alignments in SAM format.


                            • #15
                              @mdimon

                              I have a few lanes of paired RNA-Seq reads from a few tissues (plant, novel genome). Do you recommend concatenating all of them into a single giant file, running individual lanes, or even running the _1 and _2 reads separately? Or will it not make a big difference? I want to get as many reliable splice junctions as possible.
                              If combining reads: I assume individual read names must be unique in the entire file?

                              My other question: can I feed HMMSplicer just the unmapped reads to speed things up? I already have mapping results in a non-spliced mode for several lanes.

                              Thanks a lot for developing it.

                              PS: HMMSplicer works OK with Bowtie 0.12.7 / Python 2.6.4 on Linux Fedora 8.
