**Jon_Keats** · 09-22-2010, 11:59 AM

Hi,

Nice looking application. Do you have any suggestion for the minimum number of read pairs per sample? For the hypothetical events in the database, would this include all possible exon junctions (i.e., assuming no known transcript or EST support for alternatives) for the following example:

Exon1--Exon2--Exon3--Exon4--Exon5

Canonical transcript/EST supported junctions

Exon1-Exon2
Exon2-Exon3
Exon3-Exon4
Exon4-Exon5

Hypothetical junctions generated

Exon1-Exon3
Exon1-Exon4
Exon1-Exon5
Exon2-Exon4
Exon2-Exon5
Exon3-Exon5

**malachig** · 09-22-2010, 02:28 PM

Hello Jon,

Thanks for the encouraging word. Your two questions are quite different so I will answer them separately.

"Do you have any suggestion for the minimum number of read pairs per sample?"

This is a straightforward and reasonable question to ask, but it is difficult to answer directly. It is by far the number one question I am asked about RNA-seq analysis. It has been discussed in various places in this forum, including by me here: "How much coverage we need?"

The answer really depends on the particulars of your input material (e.g. RNA quality, cell heterogeneity), the type of library construction (e.g. polyA+ RNA vs. ribominus RNA), the tissues they were created from, the goals of the analysis, etc. I would always rather have more data than less. When absolutely forced to give a hard number I say that for alternative expression analysis with ALEXA-seq, the results really started to shine when I had at least 100 million paired 42-mers (of which say ~40-70% map to known transcripts depending on the library). If you have longer paired reads, you can get away with fewer of them.

I have analyzed libraries of highly varying depth and quality and many of these analyses are summarized here. You can browse through these and see what the outcome looks like to get a more hands-on feel for what increasing depth gets you in terms of alternative expression analysis. For example, the REMC, Morgen, and 5-FU datasets have ~100-200 million mapped paired-end reads (36-mers to 75-mers) and produce beautiful alternative expression results. On the other hand the Sutent dataset has only ~10 million mapped reads and is really only good for gene-level analysis. Similarly, the AllenBrain libraries suffered from poor quality input RNA and this caused all kinds of problems with the analysis even though the number of reads was reasonable.

**malachig** · 09-22-2010, 02:42 PM

Your second question is more straightforward. Yes, that is how we create the hypothetical events in the junction databases. Using Ensembl exons as a starting point, we create the combinatorial pairwise connections of these exons. A subset thus corresponds to canonical junctions, but the majority correspond to hypothetical connections. The number of possible junctions for a gene with n known exons is n!/(2!(n - 2)!) = n(n - 1)/2, so your five-exon example gives 10 junctions (4 canonical plus 6 hypothetical).
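As a rough illustration of that pairwise enumeration (not the ALEXA-Seq code itself), here is a minimal Python sketch for the five-exon example. The function name `exon_junctions` is made up for this post, and treating only neighbouring exons as "canonical" is a simplifying assumption for the toy gene; in the real databases, known junctions are determined by Ensembl transcript support.

```python
from itertools import combinations
from math import comb


def exon_junctions(exons):
    """Return every pairwise exon-exon connection as (upstream, downstream).

    `exons` must be listed in transcript order; combinations() preserves
    that order, so the upstream exon always comes first in each pair.
    """
    return list(combinations(exons, 2))


exons = ["Exon1", "Exon2", "Exon3", "Exon4", "Exon5"]
junctions = exon_junctions(exons)

# Toy assumption: a junction is "canonical" when it joins neighbouring exons.
# Every other pair skips at least one exon and is treated as hypothetical.
canonical = [(a, b) for a, b in junctions
             if exons.index(b) - exons.index(a) == 1]
hypothetical = [j for j in junctions if j not in canonical]

# The count matches n! / (2! (n - 2)!) for n exons.
assert len(junctions) == comb(len(exons), 2)

print(len(junctions), len(canonical), len(hypothetical))  # prints: 10 4 6
```

The quadratic growth in n is why the hypothetical junctions far outnumber the known ones genome-wide.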

For the human hg19 transcriptome annotated by Ensembl, this results in 3,305,170 junctions, only 284,796 of which correspond to a known transcript. If you think such a database might be useful to you, please refer to the downloads page. Junction databases are available for human hg18 and hg19 here. See links to 'additional junction DBs' on this page.

Junction databases including the sequences (fasta format) and corresponding annotation info for each are provided for 20 lengths of junction sequences (from 60mers up to 150mers). Included in the annotation files are chromosome coordinates, number of exons skipped, Ensembl support, EST and mRNA support from human and all other species, predicted peptide sequence, etc.

**Lee Sam** · 10-14-2010, 08:36 AM

I'm playing with the ALEXA-Seq image, and I'm wondering what kind of data path the scripts require. I ask because I just point it at a common folder /home/alexa-seq/seq_files with .fastq files named s_n_1/2_sequence.txt (just for test, 2 lanes). Does it need the full pipeline analysis path?

**malachig** · 10-14-2010, 12:40 PM

I assume you mean in the config file where you point to the data... If so, then the data path can be anywhere, but it has to be a complete path to a directory that contains your data files... This doesn't have to be where the data files were originally generated. If you are using fastq files, you will have to change the SeqFileType column to fastq. I recommend using qseq files instead as the first step will be faster.

**Lee Sam** · 10-14-2010, 07:24 PM

Originally posted by malachig:

I assume you mean in the config file where you point to the data... If so, then the data path can be anywhere, but it has to be a complete path to a directory that contains your data files... This doesn't have to be where the data files were originally generated. If you are using fastq files, you will have to change the SeqFileType column to fastq. I recommend using qseq files instead as the first step will be faster.

I figured out my issue. Now I have another question: have you processed any HiSeq data with the pipeline? I started a couple of HiSeq lanes 4 hours ago and it still hasn't finished the read pre-processing step (processRawSolexaReads.sh). The last message was that the Berkeley DB was being created to save memory. Thanks for the help.

**hong_sunwoo** · 10-15-2010, 12:08 AM

Hello malachig,
I checked the ALEXA-Seq web site and found that this tool supports only paired-end data.
Do you have a plan to develop a tool for single-end data?

**malachig** · 10-15-2010, 10:09 AM

Lee Sam. Yes, the support for fastq was added near the end of development to support another user. It still needs some optimization as the initial read processing step is slow. If you are impatient you can convert your fastq file to either qseq or seq format and this step will run faster. We have processed some HiSeq data, and because each lane is so much larger it did tend to take longer for each step (and use more memory).

micrornas, no we don't have a specific plan to develop a tool for single-end data as we never generate single end RNA-seq data... I am aware of another user that processed single end data by creating 'dummy' read pairs (somewhat of a hack but apparently it worked).

**obig** · 11-17-2010, 11:55 AM

single-end data

micrornas. I have processed single-end data with alexa-seq. I created dummy R2 qseq files with sequences of Ns at the same length as the real read and quality strings comprised of all "B" values. This allows the pipeline to run and all dummy reads are filtered out at the first step as "Low Quality" reads. A few of the library summary figures and stats will be affected by this. But, the results I got out were still usable and useful.
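For anyone wanting to try the same trick, here is a minimal Python sketch of how such dummy R2 records could be generated from the real R1 file. It is not the script I actually used. It assumes the usual 11-column tab-separated qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter), and the function names are hypothetical.

```python
def dummy_mate(qseq_line, base="N", qual="B"):
    """Turn one R1 qseq record into a placeholder R2 record.

    Assumes 11 tab-separated columns with the read number in column 8,
    the sequence in column 9 and the quality string in column 10.
    """
    fields = qseq_line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 qseq columns, got {len(fields)}")
    length = len(fields[8])
    fields[7] = "2"              # mark the record as read 2 of the pair
    fields[8] = base * length    # all-N sequence, same length as the real read
    fields[9] = qual * length    # all-"B" quality string
    return "\t".join(fields) + "\n"


def write_dummy_r2(r1_path, r2_path):
    """Write a dummy R2 qseq file with one placeholder per R1 record."""
    with open(r1_path) as src, open(r2_path, "w") as dst:
        for line in src:
            dst.write(dummy_mate(line))
```

You would call `write_dummy_r2` once per lane with your own R1 and R2 file names. Keeping the read order and coordinates identical to R1 is what lets the two files pair up.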

**Lee Sam** · 11-17-2010, 12:00 PM

We're trying to get the heavy lifting (preprocessing, alignment) parts of ALEXA going on our cluster, which uses the Torque scheduler. I know that ALEXA was designed to run on a cluster; was there a particular configuration it was designed to work with? I was hoping to edit some of the configuration and script batch generation code to generate jobs that could be submitted.

**malachig** · 11-17-2010, 12:39 PM

Our cluster uses Sun Grid Engine (sge). Submitting jobs to the cluster is accomplished using a wrapper for the 'qsub' utility of sge. Basically the submission command is just pointing to a batch file containing bash commands (one job per line). I assume this is a somewhat common theme in cluster job submission. If this is the case for you, it shouldn't be too hard to modify the 'createAnalysisCommands' step. You would just need to modify all the lines containing 'mqsub' to match the submission style of your cluster and then when you run createAnalysisCommands use the option '--cluster_commands=1'

**obig** · 11-17-2010, 01:18 PM

alexa-seq cluster

I guess there are too many different cluster configurations for alexa-seq to anticipate. So, simple bash files are produced which can be run serially (for very small libraries) or submitted to your cluster according to its protocols. You will probably have to work with your cluster administrator to get things running optimally.

Our cluster here (lawrencium) uses PBS Torque Resource manager and Moab job scheduler. And, with some work, I have been able to submit Alexa-seq jobs to it. I have processed four projects with over 100 libraries to date. So, it is doable. Instead of trying to edit all those parts of the alexa-seq pipeline code that produce job batch files and submission commands, I created a simple perl script which takes an alexa-seq job batch file (essentially just an sh file with one "task/command" per line) and produces the submission files compatible with our scheduler. I strongly recommend this strategy. Changing the alexa-seq code will be a lot more work.

What I do is run the alexa-seq pipeline as instructed for steps 0 to 5B. Step 5C (submitMapBatch.sh) is the first step that requires submitting to a cluster. That sh file contains a whole bunch of bash commands for additional sh files (e.g., blast_vs_intergenics.sh). It is those files which should be submitted to a cluster, not the parent submitMapBatch.sh file. You can do them individually or cat them into combined files. I create one combined batch file for all libraries separated only by feature type (repeats, transcripts, etc) because they have different memory and runtime requirements. I can thus optimize cluster submission parameters for each of the 6 feature types. This is necessary because our cluster uses wallclock estimates and task number to determine job priority in the queue. Maybe your cluster has a simpler setup and this step will be unnecessary for you.

Once I have combined the bash files I run my submitjobs.pl script on them and wait for it to finish. In later steps, whenever alexa says to submit some jobs to a cluster, the bash file typically contains the tasks/commands (instead of additional bash commands as above). I just run my submitjobs.pl script on each of those bash files. Check .output and .error files for problems and then proceed to the next step.

For each project, once the alexa-seq .commands file is produced, I make a new copy of this file and edit it to add my own commands that are necessary for job submission. This file can then be used as a template for running future projects.
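To give a feel for the batch-file conversion described above, here is a minimal Python sketch. It is not my submitjobs.pl script. It assumes a batch file with one command per line and a Torque-style scheduler that accepts `#PBS` directives. The function names, the `alexa` job prefix, and the walltime and memory defaults are all placeholders you would tune for each feature type.

```python
from pathlib import Path

# Torque-style job script; the resource values are filled in per batch.
PBS_TEMPLATE = """#!/bin/bash
#PBS -N {name}
#PBS -l walltime={walltime}
#PBS -l mem={mem}
#PBS -o {name}.output
#PBS -e {name}.error
cd "$PBS_O_WORKDIR"
{command}
"""


def read_batch(path):
    """Return the commands in a batch file, skipping blanks and comments."""
    commands = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            commands.append(line)
    return commands


def write_pbs_scripts(batch_file, out_dir, prefix="alexa",
                      walltime="12:00:00", mem="4gb"):
    """Write one submission script per command and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    scripts = []
    for i, command in enumerate(read_batch(batch_file), start=1):
        name = f"{prefix}_{i:04d}"
        script = out / f"{name}.pbs"
        script.write_text(PBS_TEMPLATE.format(
            name=name, walltime=walltime, mem=mem, command=command))
        scripts.append(script)
    return scripts
```

Each generated file can then be handed to `qsub`. Running the function once per combined batch file, with different walltime and memory values, mirrors the per-feature-type tuning described above.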

**bioinfosm** · 11-17-2010, 02:58 PM

Originally posted by obig:

micrornas. I have processed single-end data with alexa-seq. I created dummy R2 qseq files with sequences of Ns at the same length as the real read and quality strings comprised of all "B" values. This allows the pipeline to run and all dummy reads are filtered out at the first step as "Low Quality" reads. A few of the library summary figures and stats will be affected by this. But, the results I got out were still usable and useful.

Could you share what advantage you had of tweaking this particular tool and not using any of the specific microRNA tools?

**obig** · 11-17-2010, 03:12 PM

Dear bioinfosm,

I was responding to a question from the user with user name = "micrornas". This thread doesn't actually have anything specifically to do with the biological entity called microRNA. And, I'm afraid I have no experience to share regarding microRNA tools. This is perhaps a cautionary tale for those choosing a user name that has specific meaning and is commonly used and searched for in the forums.

ALEXA-Seq : Alternative expression analysis by RNA sequencing paper