Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

Non-overlapping read assembly

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Non-overlapping read assembly

    Hi there,

    I'm sequencing amplicons on the MiSeq for soil metagenomics, and some of my amplicons are length variable. As a result, some are longer than 500bp. I'm faced with throwing out some of these reverse reads for ITS1 as they won't merge/ assemble.

    I've thought up a scheme that might enable me to use them and I'm wondering what you think.

    If I could merge them together and have a standard set of gaps in between the forward and reverse reads, then I could align them against a database to get the "proper" gap length.

    I'm having trouble finding software that can do this merging, but first I want to confirm that fastq files actually provide enough information to merge reads based on sequence identifier titles without sequence overlaps.

    Tophat's capabilities are the closest to what I need, but it needs a whole reference genome instead of an amplicon reference library to align against.

    Thanks!

    -Jeff

  • #2
    I haven't used Tophat in a while but I thought that it could use any reference library -- genome or amplicon. The underlying program is bowtie2 which certainly handles odd-ball references.

    Comment


    • #3
      Jeff,

      BBMerge has the capability to do this from custom headers generated by BBMap, but currently it is not easy; I will update it to make it simple (probably later today; I'll let you know). Normal fastq headers do NOT contain the relevant information, and sam files don't provide any guarantee of read ordering, so it's nontrivial.

      Comment


      • #4
        Thanks guys! Worst case scenario I'll just have to throw out my reverse reads. Compared with read lengths three years ago, 250bp isn't awful anyway...

        What are your thoughts on quality filtering, clustering and then BLASTing the forward and reverse reads separately. Then merging the taxonomic results?

        Comment


        • #5
          I've uploaded a new version of BBTools; you can merge them like this:

          bbmap.sh -Xmx7g ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq nodisk po int rbm don

          bbmerge.sh in=mapped.fq out=merged_by_mapping.fq int usemapping parsecustom

          bbmerge.sh in=unmapped.fq out=merged_by_overlap.fq outu=unmerged.fq int

          Then you can concatenate the merged_by_mapping.fq and merged_by_overlap.fq files to get a single file.

          Note that these commands all assume you are using a single interleaved file. For two files, you can use the "in1", "in2", "out1", and "out2" flags.

          Comment


          • #6
            Excellent! Thank you

            Comment


            • #7
              How would you like me to cite BBtools?

              Comment


              • #8
                The website address is fine.

                Comment


                • #9
                  Hi Brian,

                  I finally got around to trying this bbmap code that you suggested.

                  I'm trying to merge fungal ITS1 sequences. My two Illumina read files each contain ~340,000 reads (173MB total size). My reference set is the current fasta release of the UNITE database with ~44,000 sequences (31 MB in size).

                  I'm running the code in a virtualized (Virtualbox) version of Ubuntu Linux on a PC with 8GB of RAM and 8 processors.

                  It's currently using 22% of CPU resources and 80MB of RAM.

                  It's been running for an hour now... Is there any way to speed things up? I've been using UPARSE for my other reads and all steps finish in a manner of seconds.

                  The output files are 5MB (mapped) and 3MB (unmapped) in size so far.

                  Here is the command that I used and the output that has been generated so far:
                  [email protected]:~/bbmap$ ./bbmap.sh -Xmx7g ref=/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta in1=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R1.fastq in2=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R2.fastq outm1=mapped_R1.fq outu1=unmapped_R1.fq outm2=mapped_R2.fq outu2=unmapped_R2.fq nodisk po int rbm don
                  java -ea -Xmx7g -cp /home/qiime/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx7g ref=/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta in1=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R1.fastq in2=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R2.fastq outm1=mapped_R1.fq outu1=unmapped_R1.fq outm2=mapped_R2.fq outu2=unmapped_R2.fq nodisk po int rbm don
                  Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, -Xmx7g, ref=/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta, in1=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R1.fastq, in2=/home/qiime/bbmap/reads/EM1-102-R1-ITS_TRIMM_R2.fastq, outm1=mapped_R1.fq, outu1=unmapped_R1.fq, outm2=mapped_R2.fq, outu2=unmapped_R2.fq, nodisk, po, int, rbm, don]

                  BBMap version 33.21
                  Set INTERLEAVED to true
                  Retaining first best site only for ambiguous mappings.
                  Executing dna.FastaToChromArrays2 [/home/qiime/bbmap/UNITE_release_s_10.09.2014.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

                  Set genScaffoldInfo=true
                  Reset INTERLEAVED to false because paired input files were specified.
                  Set genome to 1

                  Loaded Reference: 1.142 seconds.
                  Loading index for chunk 1-1, build 1
                  Indexing threads started for block 0-1
                  Indexing threads finished for block 0-1
                  Generated Index: 7.765 seconds.
                  Analyzed Index: 6.103 seconds.
                  Started output stream: 2.810 seconds.
                  Started output stream: 0.001 seconds.
                  Cleared Memory: 0.166 seconds.
                  Processing reads in paired-ended mode.
                  Started read stream.
                  Started 1 mapping thread.

                  MANY THANKS!

                  Comment


                  • #10
                    Jeff,

                    That's strange. BBMap is usually very fast with such a small input file, but if the reference is extremely repetitive, it can slow down a lot. You can speed it up by adding the flags "fast" and "t=8". "t=8" forces it to use 8 threads; that's not normally needed because the number of cores is autodetected, but it looks like the virtualization somehow interfered with that. "fast" decreases sensitivity slightly, and typically doubles the speed.

                    But if the speed is still a problem with those settings, you can merge by overlap with BBMerge first and then only map the unmerged reads, which should vastly reduce the number that need to be mapped; THEN merge the mapped reads by mapping information. BBMerge will only take a few seconds. The modified commands would be:

                    bbmerge.sh in=reads.fq out=merged_by_overlap.fq outu=unmerged_A.fq t=8

                    bbmap.sh -Xmx7g ref=reference.fa in=unmerged_A.fq outm=mapped.fq outu=unmapped.fq nodisk po int rbm don fast t=8

                    bbmerge.sh in=mapped.fq out=merged_by_mapping.fq outu=unmerged_B.fq int usemapping parsecustom int t=8


                    So ultimately your merged reads will be in the two 'merged' files and your unmerged reads will be in unmerged_B.fq (interleaved).

                    Comment


                    • #11
                      Thanks Brian, I tried it with 8 threads and it seems just as slow. Can I just run bbmerge.sh from the Windows command line instead to avoid any issues with virtualization?

                      Comment


                      • #12
                        Jeff,

                        You can run BBMap and BBMerge in Windows, you just can't use the shellscript. So:

                        bbmerge.sh
                        becomes
                        java -ea -Xmx200m -cp path\to\bbmap\current\ jgi.BBMerge

                        and
                        bbmap.sh
                        becomes
                        java -ea -Xmx7g -cp path\to\bbmap\current\ align2.BBMap

                        Comment


                        • #13
                          Originally posted by jstrohm View Post
                          Thanks Brian, I tried it with 8 threads and it seems just as slow. Can I just run bbmerge.sh from the Windows command line instead to avoid any issues with virtualization?
                          That you can't use more cores might have something to do with the setup of your virtual machine. If you only hand through one core, it is obvious that you can't work with 8.... Check on that if you want to use your virtual machine instead Windows...

                          Comment


                          • #14
                            OK! It's running through windows now. It's using 99% of CPU resources and 1GB of RAM. It's still running after an hour though...

                            Previously, I tried merging these reads with UPARSE and essentially none of them overlap leading me to try this method. Is there any reason why that would slow things down? I didn't actually try the "fast" command yet. You said it slightly decreases sensitivity so I wanted to see if I could get away without it.

                            I may be forced to give up and just use the 250bp forward reads. Anyway... thanks a bunch for your efforts and assistance!

                            With regards to my VirtualBox, I used to run scripts that would auto-detect PC resources. The OS has been acting a little strange since I upgraded to the latest version of Ubuntu. It's probably time for a fresh install, but that would require some serious moving about of files...

                            Comment


                            • #15
                              Hang on! I made a rookie mistake... I have a quad core processor with 8 "logical processors". So I can't actually run 8 parallel jobs or threads. I dialed it back to t=4 and the job ran in 10 minutes!

                              When I selected t=8 I had to abort it after three hours.

                              Comment

                              Working...
                              X