**malachig** · 10-15-2010, 11:36 AM

I have been experimenting with this tool lately. The results seem promising.

I have the following suggestions. From a performance perspective, the major bottleneck we are encountering is disk space usage. Processing a lane of data appears to require approximately 25-30 GB of disk space (including the FASTQ input file). This is no problem for one lane of data, but it quickly becomes an issue when processing many hundreds of lanes.

I would request the following:
1.) Support for compressed input FASTQ files. We do not store uncompressed versions of any read data. Would it be possible to decompress the input on the fly without ever having an uncompressed version on disk?

2.) You have an option to delete the temp files at the end of the job. It appears that this happens all at once at the end of the job. If the user selects this option, would it be possible to delete individual files as soon as they are no longer needed?
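The second request can be sketched as follows. This is a minimal illustration, not HMMSplicer's code: `run_stages` and its `keep_temp` flag are hypothetical names, and the sketch assumes the pipeline is a simple chain in which each step reads only the previous step's output.

```python
import os
import tempfile


def run_stages(input_path, stages, workdir, keep_temp=False):
    """Run stages in order; each stage is a function (in_path, out_path).

    Hypothetical helper: removes each intermediate file as soon as the
    following stage has finished with it, rather than at the end of the job.
    The original input file is never deleted.
    """
    current = input_path
    for index, stage in enumerate(stages):
        fd, out_path = tempfile.mkstemp(prefix="stage%d_" % index, dir=workdir)
        os.close(fd)
        stage(current, out_path)
        # The previous intermediate is no longer needed once this stage is done.
        if current != input_path and not keep_temp:
            os.remove(current)
        current = out_path
    return current
```

With this layout, peak disk usage is roughly the input plus two intermediates, instead of the input plus every intermediate.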

**Lee Sam** · 10-19-2010, 12:23 PM

I've been playing with splice discovery tools for a while. How does the tool perform time-wise? I can't get supersplat to finish, for example.

**malachig** · 10-19-2010, 12:38 PM

What kind of data are we talking about? One lane? What read length? Total number of reads per lane?

In my experience, it takes anywhere from a few hours to a week to process a lane of Illumina paired-end data (depending mostly on the number of reads, but possibly also on read length and, of course, hardware). Currently it seems that only the alignment step of HMMSplicer is parallel, so committing multiple CPUs will speed up one major step, but the other steps can still take a while. You can segment your data to an arbitrary degree and run on a cluster if you have those resources available. I'm using ~100 CPUs and 10 TB of disk space, but I am processing a lot of data...
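Segmenting a lane for a cluster run could look roughly like this. It is only a sketch: `split_fastq` and the chunk naming scheme are made up for illustration, and it assumes standard four-line FASTQ records with unwrapped sequence and quality lines.

```python
import itertools


def split_fastq(path, reads_per_chunk, prefix):
    """Write consecutive chunks of a 4-line-per-record FASTQ file.

    Returns the list of chunk file names. Assumes sequence and quality
    strings are not wrapped across lines.
    """
    chunk_paths = []
    with open(path) as handle:
        for index in itertools.count():
            # Pull the next block of complete records from the open file.
            lines = list(itertools.islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = "%s.%04d.fastq" % (prefix, index)
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be submitted as its own job; only one chunk's worth of reads is held in memory at a time.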

**malachig** · 10-19-2010, 12:39 PM

Lee Sam, it would be great to hear your thoughts on the other splice discovery tools you have been experimenting with.

**malachig** · 10-19-2010, 12:43 PM

Also, perhaps the author can comment on the advisability of partitioning the data and then merging the results given that an HMM training step is involved...

I'm also curious about how the sampling is done for training. If I request a sample of 100k, are the first 100k reads selected? Or are they selected randomly from the input file? If the former is the case, it seems that it would be unwise to combine multiple lanes for a single hmmSplicer run (as these lanes may have distinct characteristics such as read length, error rate, etc.)

**Lee Sam** · 10-19-2010, 12:47 PM

Originally posted by malachig:

> What kind of data are we talking about? One lane? What read length? Total number of reads per lane?
>
> In my experience, it takes anywhere from a few hours to a week to process a lane of Illumina paired-end data (depending mostly on the number of reads, but possibly also on read length and, of course, hardware). Currently it seems that only the alignment step of HMMSplicer is parallel, so committing multiple CPUs will speed up one major step, but the other steps can still take a while. You can segment your data to an arbitrary degree and run on a cluster if you have those resources available. I'm using ~100 CPUs and 10 TB of disk space, but I am processing a lot of data...

I have a few dozen lanes to run (PE 2x50 GA2 runs). I suppose I can set it up to run on a cluster I have access to. My experience has mostly been with SAW (published by some people I know), MapSplice, SpliceMap, and supersplat. Run times have been a continuing concern, but I can send jobs out to a cluster with a lot of 12-core i7 nodes. So far it has all been exploratory.

**ilivyatan** · 10-20-2010, 03:15 AM

RFC.
If it doesn't yet support SOLiD file formats, please ...

**zukey** · 10-21-2010, 07:02 AM

I am trying out HMMSplicer. Could anyone let me know how to load paired-end (Illumina) data into it?

Thanks a lot,

Qi

**mdimon** · 10-21-2010, 09:15 AM

The manuscript with a more complete description of the tool will be published soon; hopefully that will answer many of your questions.

In terms of performance, HMMSplicer is comparable to TopHat, depending on the size of the genome and the size of the dataset. For a human test set with about 10 million paired-end reads, running across 4 processors, HMMSplicer took 14 hours to complete on my setup. As for using multiple processors: if you have a small genome, both the alignment and the splice junction detection steps are parallelized. If you use the 'large genome' option, only the alignment steps are parallelized.

Splitting the input reads into multiple groups for processing on a cluster shouldn't have an adverse effect on the HMM training, as long as each group is large enough to sample. The trained HMM parameters are printed to the log file, so you can always check the differences in the trained parameters to make sure each subset is training to approximately the same values. The sampling is done randomly -- so if you select 100k reads to sample, they will be spread randomly throughout the input file.
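The post confirms that reads are sampled randomly but does not say how. One standard way to draw a uniform sample of k reads in a single pass, without knowing the file length in advance, is reservoir sampling (Algorithm R). The sketch below illustrates that technique only; it is an assumption, not HMMSplicer's actual sampling code, and `read_fastq` and `reservoir_sample` are hypothetical names.

```python
import random


def read_fastq(handle):
    """Yield (name, sequence, plus, quality) tuples from a 4-line FASTQ stream."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield tuple(record)


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for n, record in enumerate(records):
        if n < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep record n with probability k / (n + 1).
            j = rng.randint(0, n)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

Because every read has the same chance of ending up in the sample, a mixed-lane input would be represented in proportion to each lane's read count.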

**mdimon** · 10-21-2010, 09:17 AM

Qi,
As for running HMMSplicer with paired-end data: it does not do any special processing for paired ends yet, so simply concatenate the read files and use the combined reads as input.
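A rough sketch of that concatenation step, assuming four-line FASTQ records. The function name `concatenate_mates` is hypothetical, and tagging read names with `/1` and `/2` so they stay unique in the combined file is an extra precaution assumed here, not a documented HMMSplicer requirement; pass `tag_names=False` for a plain concatenation.

```python
def concatenate_mates(path_1, path_2, out_path, tag_names=True):
    """Write the reads of path_1 followed by the reads of path_2 to out_path.

    If tag_names is true, '/1' or '/2' is appended to each read identifier
    (the first whitespace-delimited token of every header line).
    """
    with open(out_path, "w") as out:
        for suffix, path in (("/1", path_1), ("/2", path_2)):
            with open(path) as handle:
                for line_number, line in enumerate(handle):
                    # Header lines are every fourth line, starting at line 0.
                    if tag_names and line_number % 4 == 0:
                        parts = line.rstrip("\n").split(None, 1)
                        if parts and not parts[0].endswith(suffix):
                            parts[0] += suffix
                            line = " ".join(parts) + "\n"
                    out.write(line)
```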

**mdimon** · 10-21-2010, 09:19 AM

malachig,

Thanks for the feedback and the suggestions. Python has good tools for handling compressed files, so this should be a relatively straightforward addition for the next release. I really like the idea of deleting tmp files along the way, also.

Thanks!
Michelle
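For reference, reading a gzip-compressed FASTQ file as a stream with Python's standard `gzip` module might look like the sketch below. It shows the general approach only, not the code that went into HMMSplicer; `open_reads` and `count_reads` are hypothetical helpers, and compression is detected from the `.gz` extension alone.

```python
import gzip


def open_reads(path):
    """Open a plain or gzip-compressed text file for reading.

    The data is decompressed as it is read, so no uncompressed copy
    is ever written to disk.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count 4-line FASTQ records by streaming through the file once."""
    with open_reads(path) as handle:
        return sum(1 for _ in handle) // 4
```

Any step that iterates over the file line by line can use `open_reads` in place of `open` without other changes.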

**mdimon** · 11-23-2010, 09:50 AM

The manuscript is now available for HMMSplicer:

http://www.ncbi.nlm.nih.gov/pubmed/21079731

There should also be a new version of the software available later today that incorporates some of the suggestions made here, along with improved descriptions of how to use the helper scripts included with HMMSplicer.

**proteomania** · 11-24-2010, 11:51 AM

Thanks for developing the tool. The default output doesn't seem to contain the alignments of the spliced reads. It would be great if the software could output the spliced read alignments in SAM format.

**darked89** · 11-26-2010, 02:13 AM

@mdimon

I have a few lanes of paired RNA-Seq reads from a few tissues (plant, novel genome). Do you recommend concatenating all of them into a single giant file, running individual lanes, or even running the _1 and _2 reads separately? Or will it not make a big difference? I want to get as many reliable splice junctions as possible.
If combining reads: I assume individual read names must be unique in the entire file?

My other question: can I feed HMMSplicer only the unmapped reads to speed things up? I already have mapping results in a non-spliced mode for several lanes.

Thanks a lot for developing it.

PS: HMMSplicer works OK with bowtie 0.12.7 / Python 2.6.4 on Linux Fedora 8.

HMMSplicer : new software for finding splice junctions in RNA-Seq data
