Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Jumping library

    Hi all,

    I'm trying to merge 2 genome assemblies using metassembler (http://sourceforge.net/projects/metassembler/).

    Unfortunately, I don't have jumping libraries for my assemblies but a colleague told me that I could "simulate" them consistently using the reads I have.
    As a computer scientist, I don't know much about jumping libs, and I have no idea on how to simulate them.

    The genome assemblies I have are in fasta format, and reads in fq.

    It would be very nice to get any help on the matter, even starting with references to get a better understanding of jumping libs (I've already read the wikipedia page several times ).

    Thanks a lot !

    mjm

  • #2
    You can simulate a 4000bp jump library from your existing data like this (using BBTools):

    cat assembly1.fa assembly2.fa > combined.fa
    bbmap.sh ref=combined.fa
    randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz


    That may not help much, since it is only recycling information that is already present in your assemblies, unlike a real jump library.

    The orientation will come out forward/reverse; if you need them in a different orientation you can reverse-complement one or both reads with the associated "reformat.sh" tool.
    Last edited by Brian Bushnell; 03-27-2015, 09:24 AM.

    Comment


    • #3
      Hi Brian,

      Thanks a lot for your answer!
      I'm trying to run BBMap but I get the following error message:
      ==
      mjm@mining:~/sequence_assembly/bbmap> bbmap.sh ref=A.fa
      java -Djava.library.path=/home/mjm/sequence_assembly/bbmap/jni/ -ea -Xmx39107m -cp /home/mjm/sequence_assembly/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=A.fa
      Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=A.fa]

      BBMap version 34.72
      Retaining first best site only for ambiguous mappings.
      No output file.
      NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
      Writing reference.
      Executing dna.FastaToChromArrays2 [A.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

      Set genScaffoldInfo=true
      Writing chunk 1
      Set genome to 1

      Loaded Reference: 0.015 seconds.
      Loading index for chunk 1-1, build 1
      No index available; generating from reference genome: /home/mjm/sequence_assembly/bbmap/ref/index/1/chr1_index_k13_c5_b1.block
      Indexing threads started for block 0-1
      Indexing threads finished for block 0-1
      Generated Index: 9.051 seconds.
      No reads to process; quitting.

      Total time: 10.184 seconds.

      mjm@mining:~/sequence_assembly/bbmap> randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect q=35 out=jump.fq.gz
      java -ea -Xmx39106m -cp /home/mjm/sequence_assembly/bbmap/current/ align2.RandomReads3 build=1 reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect q=35 out=jump.fq.gz
      Exception in thread "main" java.lang.NumberFormatException: For input string: "true"
      at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
      at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
      at java.lang.Float.parseFloat(Float.java:451)
      at align2.RandomReads3.main(RandomReads3.java:296)

      My config is as follows:
      mjm@mining:~/sequence_assembly/bbmap> uname -a
      Linux mining 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

      mjm@mining:~/sequence_assembly/bbmap> java -version
      java version "1.8.0_25"
      Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
      Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

      Any idea why it crashes?

      mjm

      Comment


      • #4
        Oh, sorry - instead of "perfect" it should be "perfect=1". That's the fraction of reads that are generated with no errors, and for your usage, you don't want any errors.

        randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz

        Comment


        • #5
          Hi Brian,

          Thanks a lot for your answer. Now it works fine!
          I'm wondering if I could switch the quality values to a given numerical value?
          For instance, I'd like to replace:
          ====================
          @0_chr1_1_22737261_22737360_185068_scaffold_2
          TATCCAGGCGTTCCTCGATACGGAGACCGATGGCACCTGCAAGCGCAATGCATTCGCGGCCCTCATGTCCATCAGCCACCAAAAGGCCCTGGAGTACCTG
          +
          BGEBEFEHHHJJIGJJJIEJJGIJHGDHGGJJHJHJIHHJJGGJFGJJGIJGHHIIJJHJJHHHIJJHIDIJHHJJJFIGJEIIFHJJIIGIFHCCBBBC
          ====================

          by

          ====================
          @0_chr1_1_22737261_22737360_185068_scaffold_2
          TATCCAGGCGTTCCTCGATACGGAGACCGATGGCACCTGCAAGCGCAATGCATTCGCGGCCCTCATGTCCATCAGCCACCAAAAGGCCCTGGAGTACCTG
          +
          2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
          ====================

          I've tried bbmap.sh ref=A.fa fakefastaquality=2 but it doesn't seem to be the solution.

          Many thanks for your help,

          mjm

          Comment


          • #6
            mjm,

            I'm not sure why you would want to do that. Are you aware that the symbol "2" indicates the quality value 17, which is fairly low?

            Anyway, you could do it like this, assuming your fake LMP reads are in "lmp.fq":

            reformat.sh in=lmp.fq out=lmp.fa
            reformat.sh in=lmp.fa out=fixed.fq qfake=17


            A better value for qfake on synthetic data would be something like 30.

            -Brian

            Comment


            • #7
              Hi Brian,

              Thanks for your solution.
              I'm still trying to use the jumping libs I created with bbmap for running matassembler, and I wanted the libs to be as close as possible to the samples provided by metassember.
              I'll update the thread once it works...

              mjm

              Comment


              • #8
                Issues...

                Hello,

                I'm trying to run this exact pipeline on my genomes, and I run into this

                "java.lang.OutOfMemoryError: Java heap space" errors, even when using
                "-Xmx3200m"

                However, if i hit "ctrl+c" while it's still running (but not doing anything), it continues, and quite quickly produces the jump.fq.gz file.

                And the file consist of the right amount of reads, but I'm worried because I had to force it through, and also because all the quality values are all either just A's, B's....G's like this:

                @0_chr2_0_77814904_77815003_17861_flattened_line_6732
                AGGATCTCCAGCCGCTCCTCGCTCTCCTCCTGATGCTCTCTCTCGCTCCTCGCCAGCTGTGACGGCCGCGACCTGGCCTCCTTACGGGACATCTCCTGCC
                +
                FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

                Is this OK, or is there a way to fix this?

                I'm running it on a quite large cluster, so there should be no problem allocating the needed memory resources...

                Martin

                Comment


                • #9
                  So.... first off, that's really strange about the ctrl+c thing. Though I suppose that maybe you are allocating enough memory for the reference but not BBMap's index, which is not really necessary for randomreads.

                  How much memory is on the computer, and how big is the reference you are using?

                  I suggest deleting the "ref" folder, which may be corrupt, and then doing something like this (with the specific numbers adjusted as needed):

                  randomreads.sh ref=reference.fa -Xmx25g reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz

                  As for the fixed quality values, you can remove "q=35" and reads will get random quality values, but that won't affect anything other than making the output file bigger due to worse compression. -Xmx25g will allocate 25 gigabytes of memory; that's probably way more than you need. It should require something like 1.5x the size of the reference file.

                  Please let me know if this still causes problems.

                  Comment


                  • #10
                    Hi Brian!

                    That worked perfectly! I tried up to 10g, as I figured that that should be more than enough, but apparently 25g did the trick.

                    Thanks a lot!

                    Best regards,
                    Martin

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    13 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    17 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    14 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    43 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X