Speeding up alignment? Do Bowtie first, then BFAST?


  • Speeding up alignment? Do Bowtie first, then BFAST?

    Hi everyone,

    Just wanted to pick your brains and get some opinions on this.

    I'm drowning in more and more sequence data while at the same time being limited in computing resources. I do not want to go to the cloud (not yet, at least)!

    I am running BFAST, which does a great job for my tumor mutation analysis (it allows more mismatches and supports indels, unlike Bowtie). But it is incredibly resource- and time-hungry, because all those local alignments take time.

    My question is: if I were to do a first pass of the reads with Bowtie, say without mismatches in the seeds, and then take the left-over unmatched reads and align them with BFAST, would that be reasonable? Or would I risk losing better alignments that BFAST might have found?

    I am thinking that by getting the "near perfect" matching reads out of the way, I can then feed the rest to BFAST to handle the more complicated reads holding indels and multiple mismatches...

    What do you think? Is this a bad idea?
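    To make the triage I have in mind concrete, here is a hypothetical sketch in plain Python (not real Bowtie/BFAST calls; the function names and toy reference are made up): reads whose best ungapped hit is perfect stay with the fast first pass, and everything else gets queued for the sensitive aligner.

    ```python
    # Hypothetical sketch of the proposed two-pass triage. Reads that match
    # the reference perfectly are settled in pass one; the rest are handed
    # to the slower, more sensitive aligner.

    def hamming(a, b):
        """Mismatch count between two equal-length strings."""
        return sum(x != y for x, y in zip(a, b))

    def first_pass(reads, reference, max_mismatches=0):
        """Split reads into (easy, hard) by best ungapped match to the reference."""
        easy, hard = [], []
        for read in reads:
            k = len(read)
            best = min(
                (hamming(read, reference[i:i + k])
                 for i in range(len(reference) - k + 1)),
                default=k,
            )
            (easy if best <= max_mismatches else hard).append(read)
        return easy, hard

    reference = "ACGTACGTTTGCAACGT"
    reads = ["ACGTACGT", "TTGCAACG", "AAAAAAAA"]  # last one has no good hit
    easy, hard = first_pass(reads, reference)
    print(easy)  # perfect matches, handled by the fast first pass
    print(hard)  # left-over reads for the sensitive aligner
    ```

    In reality the first pass would be Bowtie writing its unmatched reads to a file, but the split is the same idea.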

  • #2
    A similar approach is used by TopHat for splice-junction analysis. In theory there is the possibility that you would lose some better alignments, but I don't think you would lose many in practice. The easiest matches will be made by pretty much all aligners; most of the differences come from difficult-to-match reads. It's also worth noting that if BFAST and Bowtie get different results for a high-quality, low-mismatch read, you will likely have problems with false positives for that read.

    I would not expect this sort of two-stage process to produce much data loss, if any.

    Comment


    • #3
      Originally posted by NGSfan View Post
      Hi everyone,
      My question is: if I were to do a first pass of the reads with Bowtie, say without mismatches in the seeds, and then take the left-over unmatched reads and align them with BFAST, would that be reasonable? Or would I risk losing better alignments that BFAST might have found?

      What do you think? Is this a bad idea?
      I think it is a good idea. One suggestion: also try BWA for the first iteration. It is both accurate and fast.

      Have you done any testing in the cloud already? I'd love to hear more about it.
      -drd

      Comment


      • #4
        Thanks! I will try BWA, since it seems like a good compromise between speed and accuracy and might be better than Bowtie for a first pass. BFAST still does a better job, though, particularly with larger indels (>10bp).

        Btw - do you know if BWA will output the unaligned reads like Bowtie does?

        I also saw a paper showing that Novoalign is pretty accurate, but I don't know how fast it is in comparison.

        Comment


        • #5
          Originally posted by NGSfan View Post
          Thanks! I will try BWA, since it seems like a good compromise between speed and accuracy and might be better than Bowtie for a first pass. BFAST still does a better job, though, particularly with larger indels (>10bp).

          Btw - do you know if BWA will output the unaligned reads like Bowtie does?

          I also saw a paper showing that Novoalign is pretty accurate, but I don't know how fast it is in comparison.
          BWA outputs unaligned reads. Novoalign is more accurate and sensitive than BWA/BFAST/others but is generally slower. How much slower depends on the type of data and the compute infrastructure.

          Comment


          • #6
            I started working with Novoalign and I am pretty impressed. I am still doing some more testing, but I am already seeing what Nils points out. He actually advised me to align the unaligned reads from BFAST with Novoalign, but at this point I am considering starting with Novoalign. We'll see how it scales.
            -drd

            Comment


            • #7
              Novoalign is nearly as fast aligning all the reads as it is aligning just the unaligned reads from Bowtie or BFAST: the unaligned reads are usually the most difficult to align and take the longest, while Novoalign's iterative alignment process flies through the easy-to-align reads.
              The other thing to watch out for is false-positive alignments produced by your first aligner. These won't be in the unaligned file, and they will add noise to your SNV analysis, so we recommend using Novoalign from the start.
              As mentioned by Nils, Novoalign's performance can be affected by the data, especially if you have a bad run with lots of low-quality bases. The latest version of Novoalign has a quality filter (the -p option) that can be used to filter low-quality reads. You can also use the -l option to filter reads that have a lot of very low-quality bases; set -l to about two thirds of the read length.
              By default, Novoalign allows a very high level of mismatches and long indels, especially with long paired-end reads. This slows down alignment, especially for the reads that just won't align.
              If you want faster alignment, try decreasing the alignment threshold. We have users aligning a lane of 45bp paired-end reads in 20 minutes on a 32-core server at -t 180. The default threshold is around 5*(l-15), where l is the read length (summed over both reads for pairs); try setting -t to about 3*(l-15) or even lower. A threshold of 250 would allow a 10bp indel and a couple of SNPs.
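              To make the threshold arithmetic concrete, here is my own worked example of the formulas quoted above (this is just arithmetic on the numbers in this post, not Novoalign documentation):

              ```python
              # Worked example of the threshold formulas quoted above.
              # l is the read length, summed over both ends for paired reads.

              def default_threshold(l):
                  return 5 * (l - 15)

              def fast_threshold(l):
                  return 3 * (l - 15)

              l = 45 + 45  # a 45bp paired-end lane: l is the sum of both reads
              print(default_threshold(l))  # 375
              print(fast_threshold(l))     # 225, in the same ballpark as -t 180 above
              ```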

              Comment


              • #8
                The experience from G1K was that, given decent reads under default options, Novoalign was about 2-3X slower than BWA for 100bp reads, but >10X slower for 40bp reads. G1K opted out of Novoalign in the end, partly because of this (G1K has produced a lot of 36bp reads) and partly because its free version (at least at that time) did not support multi-threading while taking more than 6.5GB of memory.

                As to accuracy, it depends on your application. If you do ChIP-seq/RNA-seq, even Bowtie is fine. If you want to find SNPs, the accuracy of BWA is acceptable. For indels, Novoalign will be better, but not by much, I guess (no proof). For SVs, I think one should consider taking the intersection of two distinct aligners, like what hydra-sv recommends.
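                As a toy illustration of the intersection idea (the loci below are fabricated for the example): keep only the SV calls reported by both aligners' pipelines.

                ```python
                # Intersection of SV call sets from two aligner pipelines.
                # Each call is (chromosome, position, type); all values are made up.

                calls_aligner_a = {("chr1", 10500, "DEL"), ("chr2", 22340, "INV"), ("chr5", 90010, "DEL")}
                calls_aligner_b = {("chr1", 10500, "DEL"), ("chr5", 90010, "DEL"), ("chr7", 555, "DUP")}

                # Calls supported by both aligners; singletons are more likely artifacts.
                consensus = calls_aligner_a & calls_aligner_b
                print(sorted(consensus))
                ```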

                Comment


                • #9
                  Originally posted by drio View Post
                  Have you done any testing in the cloud already? I'd love to hear more about it.
                  Sorry, I did not answer your question. No, I haven't tried it, but it was a suggestion. What worries me is the idea of having to transfer all that data over some limited-bandwidth connection.

                  I have seen some rather fast transfers (e.g. 500GB transferred overnight) between cities on a company's internal network. So it's possible. I just don't know if transferring to the Amazon cloud would be as fast.

                  I imagine FASTQs and a reference genome + index might not be too bad... I have to look into it some more. Lots of conflicting opinions on the whole thing (cloud vs. in-house)!

                  Comment


                  • #10
                    For our structural variation analyses with Hydra, we use a similar, tiered approach using BWA followed by Novoalign. Using even default settings, BWA is very fast and reasonably sensitive. We use Novoalign as a second pass on the discordant/aberrant pairs that BWA claims are not concordant with the reference genome. As discordant pairs are a primary signal for SV, one wants to be as sensitive as possible when deciding whether or not a given pair is discordant (else a burdensome load of false positives). We find that Novoalign does the best job we've seen at detecting "cryptic" concordant pairs that are otherwise missed by other aligners.

                    In addition, as lh3 mentions, Novoalign's speed improves substantially as read lengths and accuracy increase. As I understand it, it has also undergone some algorithmic improvements that further expedite alignment. We've recently found that Novoalign is acceptably fast as both a first (less sensitive settings) and second (crank up the sensitivity: -r E 1000) tier with recent 100bp paired-end human data having overall error rates of less than 2%.

                    In short, substantial work has gone into improving alignment speed and sensitivity. The fact remains that alignment is everything when analyzing NGS data. In my experience, shortcuts during alignments lead to painful and artefactual analyses.

                    I hope this helps.
                    Aaron

                    Comment


                    • #11
                      A question specifically concerning RNA-seq analysis: the new tools for aligning spliced reads (such as MapSplice) are shown to be more sensitive than TopHat (mentioned earlier), and I was wondering whether, in this case, using these packages can provide better results (or at least drastically reduce the number of unaligned reads) compared to using a sensitive read aligner?

                      I am also curious whether anyone has figures on how many extra reads can be aligned using a second, more sensitive aligner (I realize this is highly situational, but it would still be interesting to see).

                      Comment


                      • #12
                        The two-aligner process is basically flawed: the first aligner, being less sensitive and less specific, will also align some reads in the wrong location (false positives), and no amount of aligning the unmapped reads will get rid of these incorrect alignments from the first aligner.

                        Comment


                        • #13
                          I think the common practice is to remap paired-end reads that are not mapped "properly" by the first aligner. When a read pair is mapped properly, the chance of seeing a wrong alignment is pretty low. Sometimes, one may also want to remap reads with too many mismatches, which is also a sign of misalignment.
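                          A minimal sketch of that selection, assuming SAM-formatted output from the first aligner (in the SAM spec, flag bit 0x2 means "read mapped in proper pair" and 0x4 means "read unmapped"; the records below are made up for illustration):

                          ```python
                          # Pull out reads the first aligner did not map "properly",
                          # so they can be remapped by a more sensitive aligner.

                          PROPER_PAIR = 0x2  # SAM flag bit: read mapped in proper pair

                          def needs_remap(flag):
                              """Remap unless the pair was aligned properly."""
                              return not (flag & PROPER_PAIR)

                          sam_records = [
                              ("read1", 99),  # paired, proper pair, mate on reverse strand
                              ("read2", 4),   # unmapped
                              ("read3", 97),  # paired and mapped, but not a proper pair
                          ]

                          to_remap = [name for name, flag in sam_records if needs_remap(flag)]
                          print(to_remap)  # ['read2', 'read3']
                          ```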

                          Comment


                          • #14
                            Originally posted by sparks View Post
                              The two-aligner process is basically flawed: the first aligner, being less sensitive and less specific, will also align some reads in the wrong location (false positives), and no amount of aligning the unmapped reads will get rid of these incorrect alignments from the first aligner.
                              Would you mind elaborating on this? I cannot see the difference between having Novoalign run a fast first-pass alignment to map the majority of reads (if reads are mapped incorrectly here, wouldn't they remain incorrectly mapped?) and then aligning the rest with more sensitive parameters, compared to running a fast aligner with stringent match criteria followed by a more sensitive one.

                            Comment


                            • #15
                              Originally posted by lh3 View Post
                              I think the common practice is to remap paired-end reads that are not mapped "properly" by the first aligner. When a read pair is mapped properly, the chance of seeing a wrong alignment is pretty low. Sometimes, one may also want to remap reads with too many mismatches, which is also a sign of misalignment.
                                If you are really worried about this, you could set the cutoffs so that only reads which couldn't possibly map better with a different algorithm are kept from the first pass. The obvious case is perfect matches, but certainly some alignments with mismatches will also be accepted by any algorithm.

                              Comment
