
apfejes · 09-28-2009, 01:16 PM

Hi kmcarr - thanks for the clarification. I was under the impression that Gerald was simply one step in the process, rather than a wrapper around the Eland calls. It's getting harder and harder to keep on top of all of the different aligner formats and pipelines.

For the record, I rarely use Eland output of any form myself. We mainly use Maq here and I expect we'll be moving to SAM/BAM based formats in the future.

kmcarr · 09-28-2009, 01:05 PM

Originally posted by apfejes

Hi ka123$,
I should also mention that the "-aligner" format used sets the format and some of the behaviours of FindPeaks. If you've selected "-aligner eland", then FindPeaks expects the files you provide to be in the Eland format. I don't know what format Gerald uses, but I'm certain it's not the same as the output from the Eland aligner.

Anthony,

Actually the GERALD output is the appropriate place to look. GERALD.pl is a wrapper script which (among other things) calls the Eland aligner. The output from Eland is then placed in the "GERALD_<DD-MM-YYYY>_<USERNAME>" folder. Included in that output are the files s_N_eland_extended.txt, s_N_eland_multi.txt, s_N_export.txt and s_N_sorted.txt. As you stated, the s_N_sorted.txt file should be usable in FindPeaks directly. (I've never done it myself so I can't speak from experience.)

After looking at your link above I think the problem may be that Kal needs to specify elandext as the "-aligner" parameter. While the program is still called "Eland", the standard "eland" invocation is essentially deprecated. The program is now almost always invoked (through GERALD) using "eland_extended".

apfejes · 09-28-2009, 10:46 AM

Hi ka123$,

kmcarr is right - Gerald is an intermediate program along the way from the sequencing machine to getting results. It's not an appropriate place to look for files to work with FindPeaks.

If your problem is with the sorting and pre-processing, you might consider using the s_N_sorted.txt produced by findPeaks. It's pre-sorted, so it should make your life easier.

I should also mention that the "-aligner" format used sets the format and some of the behaviours of FindPeaks. If you've selected "-aligner eland", then FindPeaks expects the files you provide to be in the Eland format. I don't know what format Gerald uses, but I'm certain it's not the same as the output from the Eland aligner.

As for the problem you're seeing, I'm not sure why 2.3M reads would cause an out of memory error. However, I suspect that despite allocating 2 GB of RAM, the machine you're using actually has less than that free. (-Xmx2G sets the maximum the application is allowed to use, not the actual amount available.) I've certainly sorted much larger files than that with the SortFiles program, although I do tend to use a machine with more than 2 GB of RAM, so I don't see that problem myself.

I'm happy to try helping, but I think you need to clarify a few things for me. What aligner are you using, and what commands are you using? If we settle on one aligner, I can point you in the right direction as to the work flow you're using, and if I can see the commands you're using, I can check to see if any of the parameters should be changed.

Cheers,

Anthony

Ka123$ · 09-28-2009, 10:33 AM

Thanks to both kmcarr and apfejes !
I did believe that GERALD generates the Eland format files. But when I ran separate reads from FindPeaks on the GERALD files and gave ELAND as the aligner name, it gave me an error saying that it was a wrong aligner name, hence I needed confirmation as to whether what I thought was actually correct.
I don't know why it said that.
Did I have to use GERALD.fa or the export file? Not sure.

Why did I need to use GERALD instead of the aligned files?
The reason is that when I use FindPeaks to convert my aligned files to wig files, I need to go through the separate and sort steps. When I run separate reads on bowtie-aligned files, I get just one gi|......|.......|.part.bowtie.gz, which contains the contigs, each named gi|.....|.....| etc., along with their positions relative to the reference.

Why did I get only one gi|........ file although I separated it? If I sort this file, either gzipped or gunzipped, I get a memory error:
whenever I run sort files on it I get a memory heap error: at 2300000 lines read.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.String.substring(Unknown Source)
at java.lang.String.subSequence(Unknown Source)
at java.util.regex.Pattern.split(Unknown Source)
at java.lang.String.split(Unknown Source)
at java.lang.String.split(Unknown Source)
at src.lib.ioInterfaces.BowtieIterator.next(BowtieIterator.java:145)
at src.lib.ioInterfaces.BowtieIterator.next(BowtieIterator.java:20)
at src.lib.ioInterfaces.Generic_AlignRead_Iterator.hasNext(Generic_AlignRead_Iterator.java:103)
at src.fileUtilities.SortFiles.main(SortFiles.java:79)

although I use -Xmx2G.

So we thought we could use GERALD to separate the reads into individual chromosomes and then sort each chromosome individually?

Any suggestions?
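
For reference, the "separate into individual chromosomes, then sort each one" idea can be sketched outside FindPeaks. This is only an illustration, not how SeparateReads works internally. It assumes bowtie's default tab-separated output with the reference name in the third column. The function name split_by_reference and the ".part.txt" suffix are made up for this example.

```python
import os
import re


def split_by_reference(lines, out_dir):
    """Write each alignment line to a per-reference file; return {reference: path}.

    Assumption: tab-separated lines with the reference name in column 3
    (bowtie's default output layout). Not the FindPeaks implementation.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        for line in lines:
            ref = line.split("\t")[2]
            if ref not in handles:
                # Names like gi|...|ref|...| contain characters that are
                # awkward in file names, so replace them; the numeric prefix
                # keeps two references from colliding on the same file name.
                safe = re.sub(r"[^A-Za-z0-9._-]+", "_", ref).strip("_")
                name = "%04d_%s.part.txt" % (len(handles), safe)
                handles[ref] = open(os.path.join(out_dir, name), "w")
            handles[ref].write(line if line.endswith("\n") else line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return {ref: handle.name for ref, handle in handles.items()}
```

Only one line is held in memory at a time, but one file handle stays open per reference, so a reference set with thousands of contigs may hit the operating system's open-file limit.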

kmcarr · 09-28-2009, 09:33 AM

Kal,

Bustard and GERALD are not files with a format in the sense you are asking. Bustard and GERALD are pipelines for processing Illumina short-read data. They generate many different output files with many different formats.

The Bustard pipeline performs base calling starting with signal intensity information. The primary output of the Bustard pipeline is a set of qseq files. These files are in a format peculiar to Illumina which contains the read ID, base calls and quality scores for each read on a single line as a set of tab separated values. Bustard may output other files (e.g. qval, prb) depending on options supplied when the pipeline is launched.

GERALD is the pipeline for performing alignments using one of two different aligners supplied with the Pipeline software. The first aligner, PhageAlign, is only useful for very small genomes and data sets and is almost never used, so I will forego any further mention of it. The primary aligner supplied with the Illumina pipeline is Eland. GERALD calls the Eland aligner and passes it a set of configuration parameters. Eland outputs a number of files which all have similar (but slightly different) formats. Some examples of the files generated by Eland are s_N_eland_extended.txt and s_N_eland_multi.txt (where N = lane number from the Illumina run). These files basically list each read, its sequence and quality scores, where it matches the reference sequence and what mismatches exist between the read and the reference. Which files Eland generates and the details of their format depend on the arguments used when invoking Eland. GERALD may also be used to output sequence files in FASTQ format.
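
As a rough illustration of the "one read per line, tab separated" layout described above, here is a small Python sketch that reads a single qseq line. The 11-column order used here (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) is an assumption based on the commonly documented qseq layout and is not stated in this thread, so check it against your own files. The names QseqRead and parse_qseq_line are invented for the example.

```python
from collections import namedtuple

# Hypothetical container for the parts of a qseq record discussed above.
QseqRead = namedtuple("QseqRead", "read_id sequence quality passed_filter")


def parse_qseq_line(line):
    """Parse one qseq line (assumed layout: 11 tab-separated columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 tab-separated fields, got %d" % len(fields))
    machine, run, lane, tile, x, y, index, read_no = fields[:8]
    # Build a single read ID from the first eight positional columns.
    read_id = "%s_%s:%s:%s:%s:%s#%s/%s" % (
        machine, run, lane, tile, x, y, index, read_no)
    # Assumption: qseq marks an uncalled base with '.', shown here as 'N'.
    sequence = fields[8].replace(".", "N")
    return QseqRead(read_id, sequence, fields[9], fields[10] == "1")
```

The quality string is returned unchanged, because its encoding depends on the pipeline version.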

apfejes · 09-28-2009, 07:52 AM

Hi Ka123$,

Gerald and bustard are files produced by the Illumina Pipeline, as far as I know, and neither one should contain useful information about the origin of a fragment. Only output from an aligner can be used in the context of peak finding.

For a list of formats accepted by FindPeaks, please see the following page:

Vancouver Short Read Analysis Package

http://sourceforge.net/apps/mediawiki/vancouvershortr/index.php?title=InputFormats


If you're having an error with Eland files, please let me know what it is, and I'll try to fix it.

Anthony

Ka123$ · 09-28-2009, 03:10 AM

If I directly run separate reads and sort reads on the GERALD alignment files, what type of aligner do I need to specify? GERALD/Eland, if specified, gives me an error in FindPeaks:
Error: Did not recognize aligner type: GERALD/Eland
Error: Please check that you have not made a spelling mistake when providing the alignment type
I get the same error if I specify only Eland, so what aligner type should be used for GERALD files from Solexa?

Ka123$ · 09-28-2009, 02:35 AM

What kind of formats are the BUSTARD and GERALD files from Solexa?

nathan.genome · 09-25-2009, 07:24 AM

hello everybody


I am working on a resequencing project. I have a reference genome and a set of Sanger pairmates from a genotype. I identified a list of structural variations and want to visualize them. Can I use LookSeq?

thanks
nathan

Ka123$ · 09-24-2009, 09:27 AM

Thanks a lot. I will try all the options you gave me and let you know how it worked for me.

apfejes · 09-24-2009, 09:22 AM

I seem to recall that bowtie is able to produce .map files, which would be pre-sorted and directly readable by FindPeaks without breaking them up into chromosomes. That might be a good first pass to try. (Assuming this is SET data. If it's PET data, you'll need to do the pairing anyhow, so SeparateReads wouldn't have been the right path to take.)

I suppose I should also mention that running SortReads.jar on .gz bowtie files *should* work. If you could send me the error you're getting, I may be able to track down the reason why it's not working for you.

And finally, I should probably also mention that bowtie seems to be doing something funny to your chromosome names. I don't use bowtie myself, but someone had previously reported to me that there was an option you can use to get more "sane" chromosome names. I would suggest you take a look - it may help you out downstream.
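
If no such option turns up, the names can also be cleaned after alignment. The sketch below is only an illustration and is not a bowtie or FindPeaks feature: it assumes the NCBI-style pattern gi|<number>|<database>|<accession>| seen in this thread, and the function name simplify_reference_name is made up.

```python
import re

# Matches NCBI-style identifiers such as "gi|12345|ref|NT_000001.12|".
# The accession (fourth pipe-delimited field) is captured.
NCBI_ID = re.compile(r"^gi\|\d+\|[a-z]+\|([^|]+)\|?")


def simplify_reference_name(name):
    """Return just the accession for gi|...|db|accession| names.

    Names that do not match the assumed pattern are returned unchanged.
    """
    match = NCBI_ID.match(name)
    return match.group(1) if match else name
```

For example, the made-up name "gi|12345|ref|NT_000001.12|" becomes "NT_000001.12", while a name such as "chr1" passes through untouched.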

Ka123$ · 09-24-2009, 08:57 AM

sortpeaks

Yeah sure,
I had this huge set of human sequence reads that I aligned using bowtie. I need to convert this bowtie alignment into wig files, so I have been using SeparateReads as the first step. This worked fine and I got a gi|22XXXXXX|ref|NT_XXXXXX.12|.bg.bowtie file, and also the same with .part.bowtie, after I ran SeparateReads.
On this file (uncompressed) I then ran SortFiles with a -Xmx2G heap setting, but after some lines it gives me a memory error.
I tried running SortFiles on the gzipped SeparateReads output, but it did not work. The file was not recognisable or something.

Is it the bowtie mapped reads that is the problem and so I might need to use GERALD instead directly?
Or is it the separate reads/sortreads problem?
Hope this helps. I appreciate any suggestions in this matter.
I found findpeaks very cool but unfortunately not working for me now....

apfejes · 09-24-2009, 08:43 AM

Hi Ka123,

There are other ways to do the sort, including several methods you could try from the Linux command line. However, I'm really not sure why it's taking so much memory. Could you give me a few ideas as to what your workflow is?
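
One generic way to sort a large alignment file without holding it all in memory is an external merge sort: sort fixed-size chunks, write each to a temporary file, then merge the sorted chunks. The Python sketch below is a stand-in for illustration only, not the SortFiles algorithm. It assumes bowtie's default tab-separated layout (reference name in column 3, 0-based offset in column 4), and the names sort_key and external_sort are invented for the example.

```python
import heapq
import os
import tempfile


def sort_key(line):
    # Assumed layout: reference name in column 3, 0-based offset in column 4.
    fields = line.rstrip("\n").split("\t")
    return (fields[2], int(fields[3]))


def external_sort(lines, chunk_size=100000):
    """Yield lines sorted by (reference, position).

    At most chunk_size lines are held in memory during the chunking phase;
    the merge phase reads one line per chunk file at a time.
    """
    chunk_paths = []
    chunk = []

    def flush():
        # Sort the current chunk and spill it to a temporary file.
        chunk.sort(key=sort_key)
        fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(chunk)
        chunk_paths.append(path)
        del chunk[:]

    for line in lines:
        chunk.append(line if line.endswith("\n") else line + "\n")
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    handles = [open(path) for path in chunk_paths]
    try:
        # heapq.merge lazily merges the already-sorted chunk files.
        for line in heapq.merge(*handles, key=sort_key):
            yield line
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)
```

Passing an open file object as lines streams the input, so peak memory is governed by chunk_size and not by the size of the file.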

In the meantime, documentation and an example command for SeparateReads can be found here:

Vancouver Short Read Analysis Package

https://sourceforge.net/apps/mediawiki/vancouvershortr/index.php?title=SeparateReads


Ka123$ · 09-24-2009, 06:51 AM

Solexa findpeaks

Using the FindPeaks sort reads step on a bowtie-mapped alignment is taking up too much memory! So I am trying to use the GERALD mapped reads directly from Solexa to convert to wig files. I believe the Solexa GERALD mapped alignments are in ELAND format?
So will the aligner type be -aligner eland when running separateReads.jar?
Any suggestions?

Dinny · 08-21-2009, 03:53 AM

Hi Anthony,
Thanks again for the advice. Taking the reads directly into .bed would be better. The .map converter in Bowtie needs a library file created in Maq, so it would be easier to limit the number of applications the data goes through...less opportunity to completely jumble it.
Couldn't see a way to align straight to .map, but I'll look again.
Dinny
