Slider - Maximum use of probability information for alignment of short sequence reads

  • Slider - Maximum use of probability information for alignment of short sequence reads

    A new paper describing an improved Solexa aligner and SNP caller just came out. Looks interesting.

    *****************************

    Slider - Maximum use of probability information for alignment of short sequence reads and SNP detection.


    Malhis N, Butterfield Y, Ester M, Jones SJ.

    Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada.

    MOTIVATION: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this paper, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base's probabilities given in the prb files. RESULTS: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNP predictions are more accurate than other SNP prediction approaches used today that start from the most probable sequence, including those using base quality. CONTACT: nmalhis *(<AT>)*bcgsc.ca Supplementary information and availability: http://www.bcgsc.ca/platform/bioinfo/software/slider.

  • #2
    Looks interesting: it uses the .prb files instead of the FASTQ. There are tools that optionally take .prb files as input, but I am not sure whether they use the probability information for each base!
    --
    bioinfosm
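For readers who have not looked inside a .prb file, here is a minimal sketch of reading one. It assumes the commonly described legacy layout: one read per line, four integer scores per cycle in A, C, G, T order, each on the Solexa log-odds scale Q = 10*log10(p/(1-p)). The function names are made up for illustration and are not taken from Slider.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumed column order within each cycle


def solexa_to_prob(q: int) -> float:
    """Convert a Solexa-scaled score Q = 10*log10(p/(1-p)) to a probability."""
    odds = 10 ** (q / 10.0)
    return odds / (1.0 + odds)


def parse_prb_line(line: str) -> List[Tuple[float, ...]]:
    """Return per-cycle (A, C, G, T) probabilities for one read."""
    values = [int(tok) for tok in line.split()]
    if len(values) % 4 != 0:
        raise ValueError("expected four scores per cycle")
    cycles = []
    for i in range(0, len(values), 4):
        probs = [solexa_to_prob(q) for q in values[i:i + 4]]
        total = sum(probs)
        # Normalise so the four bases sum to 1 (an assumption of this sketch).
        cycles.append(tuple(p / total for p in probs))
    return cycles


def ranked_bases(cycle):
    """Bases for one cycle, most probable first."""
    return sorted(zip(BASES, cycle), key=lambda bp: bp[1], reverse=True)


# Two cycles: a confident A, then an ambiguous call where C barely beats A.
example = "40 -40 -40 -40\t-5 3 -10 -40"
for cycle in parse_prb_line(example):
    print(ranked_bases(cycle)[:2])
```

The second cycle is the interesting case: a FASTQ keeps only the top call (C) and its quality, while the prb line still records that A was the runner-up.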



    • #3
      from the author

      This release of Slider was prepared for the Oxford Bioinformatics paper reviewers as a proof of concept:
      http://bioinformatics.oxfordjournals...urcetype=HWCIT

      I’m now working on a beta release with many improvements and new capabilities. This new release should be ready by the end of this month (Nov. 2008).

      Nawar Malhis
      Last edited by nmalhis; 11-05-2008, 09:15 AM.



      • #4
        SliderII: High Quality SNP Calling Using Illumina Data at Shallow Coverage:

        is now available from:

        http://www.bcgsc.ca/platform/bioinfo/software/SliderII

        Sorry for the delay,

        Nawar



        • #5
          Also going to follow up via email, but just in case: Illumina seems to be moving towards a change in the .prb files; the new workflow does not seem to produce the four-channel probabilities anymore.

          Is there a workaround? This would also affect other probabilistic aligners.

          -- Oliver



          • #6
            Oliver,

            You can rerun the base calling, starting the pipeline with Bustard using the intensity files generated by RTA. Bustard will accept as optional arguments --with-seq, --with-qval, --with-sig2 and --with-prb which will instruct Bustard to generate these legacy files. You can also add these arguments to the goat.py command line if you are restarting the pipeline from the image analysis step.



            • #7
              Glad to hear, thanks for the information! Going to report back on how SliderII handles very deep sequence coverage soon-ish.

              -- Oliver



              • #8
                Hi,

                Novoalign will take prb format read files. It will use prb values as probabilities both when generating seeds and in calculating penalties for the Needleman-Wunsch alignment. This usually gives more alignments than running off the fastq files but has been criticised by some as the Illumina fastq files have been quality calibrated but the prb files are not. I have never seen any test comparing SNP calls with Genotype that would show whether using prb files improves SNP calls.

                Colin
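To make the idea of using prb values as alignment penalties concrete, here is a rough sketch. It is not Novoalign's actual scoring scheme, only one plausible illustration: each read position is charged -10*log10 of the probability that the read base equals the reference base, and the charges are summed over an ungapped window (no gaps, so this is not a full Needleman-Wunsch). The function names and the probability floor are assumptions.

```python
import math

BASES = "ACGT"  # assumed order of the four per-cycle probabilities


def mismatch_penalty(cycle_probs, ref_base, floor=1e-4):
    """Phred-like penalty for aligning one read cycle to ref_base.

    The floor (hypothetical value) keeps the penalty finite when the
    probability of the reference base is zero.
    """
    p = max(cycle_probs[BASES.index(ref_base)], floor)
    return -10.0 * math.log10(p)


def ungapped_score(read_probs, ref_window):
    """Total penalty for an ungapped placement; lower is better."""
    return sum(mismatch_penalty(c, b) for c, b in zip(read_probs, ref_window))


# Three cycles; the middle one is ambiguous between C (0.60) and G (0.30).
read = [
    (0.97, 0.01, 0.01, 0.01),
    (0.05, 0.60, 0.30, 0.05),
    (0.01, 0.01, 0.01, 0.97),
]
print(ungapped_score(read, "ACT"))  # most probable sequence
print(ungapped_score(read, "AGT"))  # second choice at the middle cycle
print(ungapped_score(read, "ATT"))  # unlikely base at the middle cycle
```

With this kind of scoring, a reference G under the ambiguous cycle costs far less than a reference T, whereas a scheme based only on the called base and its quality would treat both as the same mismatch.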



                • #9
                  Wouldn't it be better in the long run to use calibrated base calls rather than second-guessing with the PRB base calls?
                  The 1000 genomes project recalibrated their FASTQ files using prior alignment information to improve the data quality.


                  Originally posted by sparks
                  gives more alignments than running off the fastq files but has been criticised by some as the Illumina fastq files have been quality calibrated but the prb files are not. I have never seen any test comparing SNP calls with Genotype that would show whether using prb files improves SNP calls.
                  Colin



                  • #10
                    Colin, good meeting you at ISMB! Should have some comparative data for FASTQ vs PRB files soon. Zee, tend to agree, but we are looking at data with 2+ SNPs per read on average, and in many cases at high frequency, and from more than two clones. Was hoping that in these cases the underlying PRB data might be informative.



                    • #11
                      I’d like to add that Slider II calibrates the prb data before calling SNPs.
                      Regarding the storage space of prb files: since these files contain repetitive data, compressing them to .gz will reduce their size by 7 to 10 times. Slider II reads .gz files.
                      When we have more than 2 SNPs in a read, Slider II, like other SNP calling tools, filters out dense SNPs, so the results might not be good.

                      Nawar
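As a small illustration of the storage point, the sketch below reads a prb file the same way whether or not it has been gzipped. This is not Slider II code; the helper names and the file name are hypothetical, and the synthetic data in the demo is far more repetitive than a real run, so the sizes it reports say nothing about real compression ratios.

```python
import gzip
import os
import tempfile


def open_prb(path):
    """Open a prb file as text, transparently handling a .gz suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count non-empty lines (one read per line is assumed)."""
    with open_prb(path) as handle:
        return sum(1 for line in handle if line.strip())


def demo():
    """Write the same synthetic reads plain and gzipped, then compare."""
    line = "40 -40 -40 -40\t-40 40 -40 -40\n"
    tmpdir = tempfile.mkdtemp()
    plain = os.path.join(tmpdir, "s_1_0001_prb.txt")  # hypothetical name
    packed = plain + ".gz"
    with open(plain, "w") as out:
        out.write(line * 1000)
    with gzip.open(packed, "wt") as out:
        out.write(line * 1000)
    return (count_reads(plain), count_reads(packed),
            os.path.getsize(plain), os.path.getsize(packed))


if __name__ == "__main__":
    print(demo())
```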



                      • #12
                        Originally posted by nmalhis
                        When we have more than 2 SNPs in a read, Slider II, like other SNPs calling tools, filter dense SNPs so results might not be good.

                        Nawar
                        Yep, that's going to be a problem no matter what tool we use -- four to five SNPs per read on average. Having said that, as we are only aligning against 10kb of reference sequence, most reads should still be alignable. Now, if we could stop the genomics center from deleting the intensity and PRB files after each run...



                          • #13
                            "Four to five SNPs per read on average" and "10kb of reference sequence": this means about 10% of the reference is effectively unknown. I would assemble these reads, since the reference is short enough not to have repeat issues.

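The "about 10%" figure can be checked with quick arithmetic. The thread does not state the read length, so the lengths below are assumptions chosen to be typical of short Illumina reads of the time; the function name is made up for illustration.

```python
def divergent_fraction(snps_per_read: float, read_length: int) -> float:
    """Fraction of positions expected to differ from the reference."""
    return snps_per_read / read_length


REFERENCE_LENGTH = 10_000  # "10kb of reference sequence" from the thread

# 4.5 is the midpoint of "four to five SNPs per read"; read lengths are assumed.
for read_length in (36, 45, 50):
    frac = divergent_fraction(4.5, read_length)
    sites = frac * REFERENCE_LENGTH
    print(f"{read_length} bp reads: {frac:.1%} divergent, ~{sites:.0f} sites in 10 kb")
```

At these read lengths the estimate stays close to one differing position in ten, which is where the suggestion to assemble rather than align comes from.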


                          • #14
                            Interesting. Hadn't even thought about reference-based or de novo assemblies as an alternative. Will keep it in mind, thanks again!



                            • #15
                              Very useful... I had heard about using .prb instead of the fastq. Now working on it.
