

Software packages for next gen sequence analysis




  • sortpeaks

    Yeah sure,
    I have a large set of human sequence reads that I aligned using bowtie, and I need to convert this alignment into wig files. I have been using SeparateReads as the first step of that conversion. This worked fine and gave me gi|22XXXXXX|ref|NT_XXXXXX.12|.bg.bowtie, along with a matching .part.bowtie file.
    I then ran SortFiles on this (uncompressed) file with a -Xmx2G heap specified, but after some lines it gives me a memory error.
    I also tried running SortFiles on the gzipped SeparateReads output, but that did not work either; the file was not recognised, or something along those lines.

    Is the problem with the bowtie-mapped reads, so that I might need to use the GERALD output directly instead?
    Or is it a problem with SeparateReads/SortFiles?
    I hope this helps. I appreciate any suggestions.
    I found FindPeaks very cool, but unfortunately it is not working for me at the moment.


    • I seem to recall that bowtie is able to produce .map files, which would be pre-sorted and directly readable by FindPeaks without being broken up into chromosomes. That might be a good first pass to try. (This assumes SET data. If it's PET data, you'll need to do the pairing anyhow, so SeparateReads wouldn't have been the right path to take.)

      I suppose I should also mention that running SortReads.jar on .gz bowtie files *should* work. If you could send me the error you're getting, I may be able to track down the reason why it's not working for you.

      And finally, I should probably also mention that bowtie seems to be doing something funny to your chromosome names. I don't use bowtie myself, but someone had previously reported to me that there was an option you can use to get more "sane" chromosome names. I would suggest you take a look - it may help you out downstream.
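Separately from whatever bowtie option that is, the long names could also be shortened after the fact. What follows is only a rough sketch, not part of FindPeaks or bowtie: it assumes the names follow the NCBI-style "gi|<number>|ref|<accession>|" pattern seen in the file name above, and that the reference name sits in the third tab-separated column, as in bowtie's default output. The function names are made up for illustration.

```python
import re

# NCBI-style identifiers look like "gi|<number>|<db>|<accession>|".
_NCBI_ID = re.compile(r"^gi\|\d+\|[a-z]+\|([^|]+)\|?")


def simplify_ref_name(name: str) -> str:
    """Return the bare accession from an NCBI-style name, else the name unchanged."""
    match = _NCBI_ID.match(name)
    return match.group(1) if match else name


def simplify_bowtie_line(line: str, ref_column: int = 2) -> str:
    """Rewrite the reference-name column of one tab-separated alignment line.

    ref_column=2 assumes the third field holds the reference name; adjust it
    if your file is laid out differently.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) > ref_column:
        fields[ref_column] = simplify_ref_name(fields[ref_column])
    return "\t".join(fields)
```

Names that do not match the pattern (for example "chr1") pass through untouched, so running this over a file that is already clean should be harmless.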
      The more you know, the more you know you don't know. —Aristotle


      • Thanks a lot. I will try all the options you gave me and let you know how it works out.


        • hello everybody


          I am working on a resequencing project. I have a reference genome and a set of Sanger pairmates from one genotype. I have identified a list of structural variations and want to visualize them. Can I use LookSeq?



          • What kind of formats are the BUSTARD and GERALD files from Solexa?


            • If I run SeparateReads and SortFiles directly on the GERALD alignment files, what aligner type do I need to specify? Specifying GERALD/Eland gives me an error in FindPeaks:
              Error: Did not recognize aligner type: GERALD/Eland
              Error: Please check that you have not made a spelling mistake when providing the alignment type
              I get the same error if I specify only Eland. So which aligner type applies to GERALD files from Solexa?


              • Hi Ka123$,

                Gerald and bustard are files produced by the Illumina Pipeline, as far as I know, and neither one should contain useful information about the origin of a fragment. Only output from an aligner can be used in the context of peak finding.

                For a list of formats accepted by FindPeaks, please see the following page:


                If you're having an error with Eland files, please let me know what it is, and I'll try to fix it.

                The more you know, the more you know you don't know. —Aristotle


                • Kal,

                  Bustard and GERALD are not files with a format in the sense you are asking. Bustard and GERALD are pipelines for processing Illumina short reads data. They generate many different output files with many different formats.

                  The Bustard pipeline performs base calling, starting from signal intensity information. Its primary output is a set of qseq files. This format is peculiar to Illumina and holds the read ID, base calls and quality scores for each read on a single line as a set of tab-separated values. Bustard may output other files (e.g. qval, prb) depending on the options supplied when the pipeline is launched.
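To make the "one read per line" layout concrete, here is a rough sketch that turns a single qseq line into a FASTQ record. The 11-column order used here (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) and the use of "." for uncalled bases are assumptions about the usual qseq layout rather than something stated in this thread, so check them against your own files. The function name is made up for illustration.

```python
def qseq_to_fastq(line: str) -> str:
    """Convert one tab-separated qseq line into a four-line FASTQ record.

    Assumed column order: machine, run, lane, tile, x, y, index, read number,
    sequence, quality, filter flag.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, _filter = fields
    # Build a read ID from the positional columns.
    read_id = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read_no}"
    # Uncalled bases are assumed to be written as "."; FASTQ tools expect "N".
    # The quality string is copied through unchanged (no re-encoding).
    return f"@{read_id}\n{seq.replace('.', 'N')}\n+\n{qual}\n"
```

The filter flag is ignored here; a real conversion would probably want to drop or mark reads that failed filtering.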

                  GERALD is the pipeline for performing alignments using one of two different aligners supplied with the Pipeline software. The first aligner, PhageAlign, is only useful for very small genomes and data sets and is almost never used, so I will forgo any further mention of it. The primary aligner supplied with the Illumina pipeline is Eland. GERALD calls the Eland aligner and passes it a set of configuration parameters. Eland outputs a number of files which all have similar (but slightly different) formats. Some examples of the files generated by Eland are s_N_eland_extended.txt and s_N_eland_multi.txt (where N = lane number from the Illumina run). These files basically list each read, its sequence and quality scores, where it matches the reference sequence and what mismatches exist between the read and the reference. Which files Eland generates, and the details of their format, depend on the arguments used when invoking Eland. GERALD may also be used to output sequence files in FASTQ format.


                  • Thanks to both kmcarr and apfejes!
                    I did believe that GERALD generates the Eland-format files. But when I ran SeparateReads from FindPeaks on the GERALD files with ELAND as the aligner name, it gave me an error saying that the aligner name was wrong, so I wanted confirmation of whether my understanding was correct.
                    I don't know why it said that.
                    Did I have to use GERALD.fa or the export file? I'm not sure.

                    Why did I need to use GERALD output instead of my aligned files?
                    The reason is that when I use FindPeaks to convert my aligned files to wig files, I need to go through the separate and sort steps. When I run SeparateReads on the bowtie-aligned files, I get just one gi|......|.......|.part.bowtie.gz, which contains the contigs, each named gi|.....|.....| etc., along with their positions relative to the reference.

                    Why did I get only one gi|........ file even though I separated the reads? Whether I sort it gzipped or gunzipped, SortFiles gives me a heap memory error at 2300000 lines read:
                    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
                    at java.lang.String.substring(Unknown Source)
                    at java.lang.String.subSequence(Unknown Source)
                    at java.util.regex.Pattern.split(Unknown Source)
                    at java.lang.String.split(Unknown Source)
                    at java.lang.String.split(Unknown Source)
                    at src.lib.ioInterfaces.BowtieIterator.next(BowtieIterator.java:145)
                    at src.lib.ioInterfaces.BowtieIterator.next(BowtieIterator.java:20)
                    at src.lib.ioInterfaces.Generic_AlignRead_Iterator.hasNext(Generic_AlignRead_Iterator.java:103)
                    at src.fileUtilities.SortFiles.main(SortFiles.java:79)

                    even though I use -Xmx2G.

                    So we thought we could use the GERALD output to separate the reads into individual chromosomes and then sort each chromosome separately instead.

                    Any suggestions?


                    • Hi ka123$,

                      kmcarr is right - Gerald is an intermediate program along the way from the sequencing machine to getting results. It's not an appropriate place to look for files to work with FindPeaks.

                      If your problem is with the sorting and pre-processing, you might consider using the s_N_sorted.txt produced by findPeaks. It's pre-sorted, so it should make your life easier.

                      I should also mention that the "-aligner" format used sets the format and some of the behaviours of FindPeaks. If you've selected "-aligner eland", then FindPeaks expects the files you provide to be in the Eland format. I don't know what format Gerald uses, but I'm certain it's not the same as the output from the Eland aligner.

                      As for the problem you're seeing, I'm not sure why 2.3M reads would cause an out-of-memory error. However, I suspect that despite allocating 2 GB of RAM, the machine you're using actually has less than that free. (-Xmx2G sets the maximum the application is allowed to use, not the actual amount available.) I've certainly sorted much larger files than that with the SortFiles program, although I tend to use a machine with more than 2 GB of RAM, so I don't see that problem myself.
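One way to test that suspicion is to compare the requested heap ceiling with what the machine reports as available. The sketch below is only an illustration under a few assumptions: a Linux machine whose /proc/meminfo has a MemAvailable line (reported in kB), and a plain -Xmx flag with an optional k, m or g suffix. The helper names are made up, and a "fits" result does not guarantee the sort will succeed.

```python
import re

# Multipliers for the optional size suffix on a JVM -Xmx flag.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(flag: str) -> int:
    """Convert a JVM heap flag such as '-Xmx2G' into a byte count."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError(f"not a recognised -Xmx flag: {flag!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def mem_available_bytes(meminfo_text: str) -> int:
    """Read the MemAvailable value (given in kB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    raise ValueError("no MemAvailable line found")


def heap_fits(flag: str, meminfo_text: str) -> bool:
    """True if the requested heap ceiling is within the reported available memory."""
    return parse_xmx(flag) <= mem_available_bytes(meminfo_text)
```

On a Linux box this could be called as heap_fits("-Xmx2G", open("/proc/meminfo").read()); a False result would be consistent with the explanation above, though it would not prove it.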

                      I'm happy to try helping, but I think you need to clarify a few things for me. What aligner are you using, and what commands are you using? If we settle on one aligner, I can point you in the right direction as to the work flow you're using, and if I can see the commands you're using, I can check to see if any of the parameters should be changed.


                      The more you know, the more you know you don't know. —Aristotle


                      • Originally posted by apfejes
                        Hi ka123$,
                        I should also mention that the "-aligner" format used sets the format and some of the behaviours of FindPeaks. If you've selected "-aligner eland", then FindPeaks expects the files you provide to be in the Eland format. I don't know what format Gerald uses, but I'm certain it's not the same as the output from the Eland aligner.

                        Actually the GERALD output is the appropriate place to look. GERALD.pl is a wrapper script which (among other things) calls the Eland aligner. The output from Eland is then placed in the "GERALD_<DD-MM-YYYY>_<USERNAME>" folder. Included in that output is the s_N_eland_extended.txt, s_N_eland_multi.txt, s_N_export.txt and s_N_sorted.txt. As you stated the s_N_sorted.txt file should be able to be used in FindPeaks directly. (I've never done it myself so I can't speak from experience.)

                        After looking at your link above, I think the problem may be that Kal needs to specify elandext as the "-aligner" parameter. While the program is still called "Eland", the standard "eland" invocation is essentially deprecated. The program is now almost always invoked (through GERALD) using "eland_extended".
                        Last edited by kmcarr; 09-28-2009, 01:11 PM. Reason: Add bit about eland_extended


                        • Hi kmcarr - thanks for the clarification. I was under the impression that Gerald was simply one step in the process, rather than a wrapper around the Eland calls. It's getting harder and harder to keep on top of all of the different aligner formats and pipelines.

                          For the record, I rarely use Eland output of any form myself. We mainly use Maq here and I expect we'll be moving to SAM/BAM based formats in the future.
                          The more you know, the more you know you don't know. —Aristotle


                          • First off, thanks so much for all your guidance, from both of you!
                            I really appreciate it so much!

                            I previously tried the bowtie aligner. Because it gave me only one separated file (.gz) that I could not make sense of, we went back to using the GERALD alignment directly for separating and sorting. Here are the commands I used with the bowtie aligner:

                            I followed the bowtie commands to do my alignment:
                            ./bowtie -a -v 2 -f h_X_GERALD.fa h_sap (did I have to use the -chr here???)

                            I used these FindPeaks commands:
                            java -jar SeparateReads.jar elandext p_align_copy p_7_ger

                            (Before this I had problems running it on the GERALD files; it said the aligner format was not recognised, so following the blog I used elandext:
                            java -jar SeparateReads.jar elandext p_align_copy p_7_ger
                            Error: Couldn't create log file : p_7_ger/SeparateReads.log)

                            For sorting the reads I previously used this command:
                            java -jar Sort* bowtie g_sort_7 p_7_ger/*.bowtie

                            (although it ran, it sometimes gave me memory problems)


                            • Can anyone tell me why the FindPeaks SeparateReads.jar command cannot create a log file when I use the GERALD-aligned files or the bowtie-aligned files?
                              For the GERALD-aligned files I gave elandext or eland_extended as the aligner type.

                              The bowtie-aligned files were giving me problems in the FindPeaks separate and sort steps, so I am converting the GERALD files to wig files directly, although GERALD is probably not a better choice than the bowtie alignment.
                              Any suggestions?


                              • Hi Ka123$,

                                Once again, it would really help if you tell us what the error is that you're seeing. The most common errors are:

                                - Trying to write to a directory without permissions
                                - missing a parameter (FindPeaks won't start without it, and throws an error)
                                - a parameter is incorrect (FindPeaks won't start with an invalid parameter)

                                If you tell us what error you've got, I might be able to narrow it down.

                                Is the error above the same one? I think this is probably a path problem. You're trying to write to a directory called p_7_ger in the directory from which you're launching the jar program. Does that directory already exist?
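A quick way to rule the path problem in or out before launching the jar is sketched below. This is only an illustration and not part of FindPeaks: the function name is made up, and whether SeparateReads would create a missing directory itself is not something established in this thread.

```python
import os
from typing import List


def output_dir_problems(path: str) -> List[str]:
    """List the reasons a log file could not be created inside `path`."""
    problems = []
    if not os.path.exists(path):
        problems.append(f"{path} does not exist")
    elif not os.path.isdir(path):
        problems.append(f"{path} exists but is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        # Creating a file needs both write and search permission on the directory.
        problems.append(f"{path} is not writable by the current user")
    return problems
```

Calling output_dir_problems("p_7_ger") from the directory where the jar is launched returns an empty list if the directory looks usable, which would point the finger back at the parameters instead.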

                                The more you know, the more you know you don't know. —Aristotle

