Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Jumping library

    Hi all,

    I'm trying to merge 2 genome assemblies using metassembler (

    Unfortunately, I don't have jumping libraries for my assemblies but a colleague told me that I could "simulate" them consistently using the reads I have.
    As a computer scientist, I don't know much about jumping libs, and I have no idea on how to simulate them.

    The genome assemblies I have are in fasta format, and reads in fq.

    It would be very nice to get any help on the matter, even starting with references to get a better understanding of jumping libs (I've already read the wikipedia page several times ).

    Thanks a lot !


  • #2
    You can simulate a 4000bp jump library from your existing data like this (using BBTools):

    cat assembly1.fa assembly2.fa > combined.fa ref=combined.fa reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz

    That may not help much, since it is only recycling information that is already present in your assemblies, unlike a real jump library.

    The orientation will come out forward/reverse; if you need them in a different orientation you can reverse-complement one or both reads with the associated "" tool.
    Last edited by Brian Bushnell; 03-27-2015, 09:24 AM.


    • #3
      Hi Brian,

      Thanks a lot for your answer!
      I'm trying to run BBMap but I get the following error message:
      mjm@mining:~/sequence_assembly/bbmap> ref=A.fa
      java -Djava.library.path=/home/mjm/sequence_assembly/bbmap/jni/ -ea -Xmx39107m -cp /home/mjm/sequence_assembly/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=A.fa
      Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=A.fa]

      BBMap version 34.72
      Retaining first best site only for ambiguous mappings.
      No output file.
      NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
      Writing reference.
      Executing dna.FastaToChromArrays2 [A.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

      Set genScaffoldInfo=true
      Writing chunk 1
      Set genome to 1

      Loaded Reference: 0.015 seconds.
      Loading index for chunk 1-1, build 1
      No index available; generating from reference genome: /home/mjm/sequence_assembly/bbmap/ref/index/1/chr1_index_k13_c5_b1.block
      Indexing threads started for block 0-1
      Indexing threads finished for block 0-1
      Generated Index: 9.051 seconds.
      No reads to process; quitting.

      Total time: 10.184 seconds.

      mjm@mining:~/sequence_assembly/bbmap> reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect q=35 out=jump.fq.gz
      java -ea -Xmx39106m -cp /home/mjm/sequence_assembly/bbmap/current/ align2.RandomReads3 build=1 reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect q=35 out=jump.fq.gz
      Exception in thread "main" java.lang.NumberFormatException: For input string: "true"
      at sun.misc.FloatingDecimal.readJavaFormatString(
      at sun.misc.FloatingDecimal.parseFloat(
      at java.lang.Float.parseFloat(
      at align2.RandomReads3.main(

      My config is as follows:
      mjm@mining:~/sequence_assembly/bbmap> uname -a
      Linux mining 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

      mjm@mining:~/sequence_assembly/bbmap> java -version
      java version "1.8.0_25"
      Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
      Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

      Any idea why it crashes?



      • #4
        Oh, sorry - instead of "perfect" it should be "perfect=1". That's the fraction of reads that are generated with no errors, and for your usage, you don't want any errors. reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz


        • #5
          Hi Brian,

          Thanks a lot for your answer. Now it works fine!
          I'm wondering if I could switch the quality values to a given numerical value?
          For instance, I'd like to replace:



          I've tried ref=A.fa fakefastaquality=2 but it doesn't seem to be the solution.

          Many thanks for your help,



          • #6

            I'm not sure why you would want to do that. Are you aware that the symbol "2" indicates the quality value 17, which is fairly low?

            Anyway, you could do it like this, assuming your fake LMP reads are in "lmp.fq":

   in=lmp.fq out=lmp.fa
   in=lmp.fa out=fixed.fq qfake=17

            A better value for qfake on synthetic data would be something like 30.



            • #7
              Hi Brian,

              Thanks for your solution.
              I'm still trying to use the jumping libs I created with bbmap for running matassembler, and I wanted the libs to be as close as possible to the samples provided by metassember.
              I'll update the thread once it works...



              • #8


                I'm trying to run this exact pipeline on my genomes, and I run into this

                "java.lang.OutOfMemoryError: Java heap space" errors, even when using

                However, if i hit "ctrl+c" while it's still running (but not doing anything), it continues, and quite quickly produces the jump.fq.gz file.

                And the file consist of the right amount of reads, but I'm worried because I had to force it through, and also because all the quality values are all either just A's, B's....G's like this:


                Is this OK, or is there a way to fix this?

                I'm running it on a quite large cluster, so there should be no problem allocating the needed memory resources...



                • #9
                  So.... first off, that's really strange about the ctrl+c thing. Though I suppose that maybe you are allocating enough memory for the reference but not BBMap's index, which is not really necessary for randomreads.

                  How much memory is on the computer, and how big is the reference you are using?

                  I suggest deleting the "ref" folder, which may be corrupt, and then doing something like this (with the specific numbers adjusted as needed):

         ref=reference.fa -Xmx25g reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz

                  As for the fixed quality values, you can remove "q=35" and reads will get random quality values, but that won't affect anything other than making the output file bigger due to worse compression. -Xmx25g will allocate 25 gigabytes of memory; that's probably way more than you need. It should require something like 1.5x the size of the reference file.

                  Please let me know if this still causes problems.


                  • #10
                    Hi Brian!

                    That worked perfectly! I tried up to 10g, as I figured that that should be more than enough, but apparently 25g did the trick.

                    Thanks a lot!

                    Best regards,


                    Latest Articles


                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin

                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin

                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM





                    Topics Statistics Last Post
                    Started by seqadmin, Today, 08:47 AM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    Last Post seqadmin