  • mathieu
    Junior Member
    • Aug 2009
    • 9

    SHRiMP vs BFAST

    Hi all,

    I am working with 50-base SOLiD RNA-seq data and want to do both genotyping and RNA quantification. I am currently hesitating between SHRiMP and BFAST for the alignments. Both seem equivalent to me in terms of mapping strategy. Could someone who has experience with both aligners give me their opinion?
    best,
    Mathieu
    Last edited by mathieu; 10-15-2010, 05:53 AM.
  • poisson200
    Member
    • Feb 2010
    • 63

    #2
    Hi Mathieu,
    I can only contribute a little. My data are fly genomic reads (nucleosome mapping), and all I can say is that SHRiMP seemed slow in my hands compared to Bowtie. I have also used NovoalignCS, which can deal with small indels; to my knowledge, Bowtie cannot. You could also try BWA, which I think handles colour space reads.

    Have you tried Bioscope? I assume you have access to this software if you have a SOLiD sequencer. I have thought about trying BFAST too but we are currently comparing Bioscope read mapping with bowtie/novoalign. (I think you need a licence for novoalign). RNA-seq reads may benefit from TopHat (now handles colour space) as it can also map reads that span splice junctions/introns.

    Kind regards,

    John.


    • epigen
      Senior Member
      • May 2010
      • 101

      #3
      I had 50 bp human RNA-Seq data and tried BWA, BioScope, NovoalignCS, BFAST, and MOSAIK. I recommend BFAST for its high mapping rate and ease of use (once you've created the indexes). BWA, NovoalignCS and MOSAIK gave very low mapping rates. BioScope with the whole transcriptome pipeline can find splice junctions and gets rid of repeats, but like Bowtie it does not do gapped alignment, and it is a pain to install on a cluster.


      • mathieu
        Junior Member
        • Aug 2009
        • 9

        #4
        Hi epigen & John,
        Thanks for your advice. I tried to install BioScope on our cluster but gave up... BWA was the first aligner I tried, and since it had been highly recommended I was quite disappointed by its low mapping rate (22.6%). My first results with BFAST and SHRiMP are similar in terms of mapping rate (57.5% and 51.2% respectively), but SHRiMP was a bit faster.

        @epigen: For SNP and indel calling I have been using samtools so far, but I am not very satisfied: there are too many miscalls. What would you advise?
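
For anyone comparing aligners the same way, here is a minimal sketch of how a mapping rate like the ones quoted above could be computed from SAM output. It assumes plain-text SAM records and relies only on the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name and the example records are made up for illustration, and each aligner may report its own statistics differently.

```python
def mapping_rate(sam_lines):
    """Return the fraction of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Ignore secondary (0x100) and supplementary (0x800) records so
        # that each read is counted once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 set means the read is unmapped.
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical records: r1 and r3 mapped, r2 and r4 unmapped,
# plus one secondary alignment of r3 that is not counted.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t200\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t256\tchr2\t300\t0\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(f"mapping rate: {mapping_rate(example):.1%}")  # mapping rate: 50.0%
```

Counting only primary records keeps the comparison fair between aligners that do and do not report multiple hits per read.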


        • epigen
          Senior Member
          • May 2010
          • 101

          #5
          Yes, BFAST might give a lot of false positives, so the developer advises doing local realignment first. I didn't, because I was interested in SNPs that are already annotated in dbSNP, so I filtered for those. I also used samtools, but required SNPs to be present in at least 20 reads, have a score of at least 20, and not be at the end of a read. The most recent version of samtools has improved SNP calling compared to the previous one.
          Now we want to find unknown, somatic SNPs, for which we use SomaticCall from the Broad, which of course only works if you have tumor-normal pairs. Otherwise, VarScan would be an option. For indels we use the indel genotyper from the Broad and Pindel.
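
The three samtools-side criteria described above (at least 20 reads, a score of at least 20, not at the end of a read) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `SnpCall` record, its fields, and the `end_margin` parameter are hypothetical, and "not at the end of a read" is interpreted here as "supported by enough reads in which the SNP is not the first or last base". It is not the actual script used in the post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnpCall:
    # Hypothetical record; real pileup/VCF output would need parsing first.
    chrom: str
    pos: int
    score: int                # SNP quality reported by the caller
    read_offsets: List[int]   # 0-based offset of the SNP in each supporting read
    read_length: int = 50


def passes_filter(call, min_reads=20, min_score=20, end_margin=1):
    """Keep a SNP only if it is well scored and supported away from read ends."""
    if call.score < min_score:
        return False
    # Count supporting reads in which the SNP is at least `end_margin`
    # bases away from both ends of the read.
    interior = [
        offset for offset in call.read_offsets
        if end_margin <= offset < call.read_length - end_margin
    ]
    return len(interior) >= min_reads


good = SnpCall("chr1", 1000, score=30, read_offsets=list(range(5, 30)))
low_score = SnpCall("chr1", 2000, score=10, read_offsets=list(range(5, 30)))
end_heavy = SnpCall("chr1", 3000, score=30, read_offsets=[0] * 10 + [25] * 15)

print(passes_filter(good))       # True
print(passes_filter(low_score))  # False
print(passes_filter(end_heavy))  # False
```

The thresholds are exposed as parameters because suitable values depend on coverage, and for RNA-seq coverage varies with expression level.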


          • zee
            NGS specialist
            • Apr 2008
            • 249

            #6
            I think it is important to weigh mapping accuracy above the number of reads aligned. Consider looking at how well the aligner does in terms of concordance with dbSNP or any other set of known reference SNP/indel positions.
            We developed NovoalignCS with the aim of getting the best alignment for each read, which does come at a cost in speed. That said, if you have enough cores, slower aligners like MOSAIK and Novoalign can run in a very short time and still give you more reliable alignments that lower the false discovery rate. This should also be tested on a case-by-case basis, as read quality and the repeat content of the reference genome can influence how an aligner performs.
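
As a rough illustration of the concordance check suggested above, the sketch below computes the fraction of called positions that also appear in a known reference set. It assumes both sets have already been reduced to (chromosome, position) pairs; the function name and the example coordinates are invented, and a real comparison would also want to check that the alleles agree.

```python
def concordance(called_sites, known_sites):
    """Fraction of called (chrom, pos) sites that are present in the known set."""
    called = set(called_sites)
    known = set(known_sites)
    if not called:
        return 0.0
    return len(called & known) / len(called)


# Hypothetical positions: two of the four calls are in the known set.
called = [("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr2", 75)]
known = [("chr1", 100), ("chr2", 50), ("chr3", 10)]

print(concordance(called, known))  # 0.5
```

Running the same calculation on the calls from each aligner gives a comparison based on accuracy rather than on the number of reads placed.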

            Originally posted by epigen:
            I had human RNA-Seq data 50 bp and tried BWA, BioScope, NovoalignCS, BFAST, and MOSAIK. I recommend BFAST for its high mapping rate and easy use (once you've created the indexes). BWA, NovoalignCS and MOSAIK have very low mapping rates. BioScope with the whole transcriptome pipeline can find splice junctions and gets rid of repeats but does not do gapped alignment (as Bowtie) and is a pain to install on a cluster.


            • mathieu
              Junior Member
              • Aug 2009
              • 9

              #7
              Thanks for the advice. Unfortunately I am working with an organism for which no SNPs are known yet, so I have to rely on the deep sequencing data alone. I am currently testing the GATK pipeline and... it is very demanding in terms of resources, but the first results seem far more realistic than the samtools ones. I will give VarScan a try. Epigen: did you ever compare GATK and VarScan?


              • zee
                NGS specialist
                • Apr 2008
                • 249

                #8
                I have used GATK and samtools. Samtools has a new base alignment quality (BAQ) feature which Heng Li claims will greatly improve your ability to call SNPs more reliably.
                Both tools are very good. They can have a steep learning curve, but I think it's worth it. I have not used VarScan, but I have heard good things about it.
                Have you tried using NovoalignCS?


                • epigen
                  Senior Member
                  • May 2010
                  • 101

                  #9
                  @mathieu: Personally I have not compared GATK and VarScan, but my colleague has. She says GATK is much better, which is no wonder since it uses sophisticated algorithms whereas VarScan just filters the output of samtools pileup. GATK is indeed very demanding. We run it for each chromosome separately.

                  @zee: I tried NovoalignCS but it was by far the slowest and still had a very low mapping rate. Now I have PE data and I'm thinking about trying it again. BFAST also becomes very slow for PE due to the localalign step.
                  Last edited by epigen; 10-26-2010, 08:01 AM. Reason: making clear what I refer to


                  • lh3
                    Senior Member
                    • Feb 2008
                    • 686

                    #10
                    On Illumina data, the choice of mapper does not matter too much for SNP calling. A mapper that is 1000X better on simulated data may only lead to a difference of a few percent in SNP accuracy. On SOLiD, I do not know. But be aware that bwa's defaults are not designed for SOLiD: one must increase the tolerance for mismatches (-n) to get acceptable results.

                    As to samtools' SNP calling, are you following the steps listed here:

                    Download SAM tools for free. SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignment. SAMtools provide efficient utilities on manipulating alignments in the SAM format.


                    The SAMtools caller has been used in a few Nature/PLoS Genetics papers. If you count the papers using maq, from which samtools is derived, many more. They cannot all be wrong.

                    So far as I know, VarScan is not a Bayesian model.

                    The BAQ computation is *strongly* recommended for SNP calling. Almost everyone I know (Umich, Broad/GATK, Sanger) who has tried it once immediately incorporates it into the production pipeline.
                    Last edited by lh3; 10-26-2010, 10:34 AM.


                    • Michael.James.Clark
                      Senior Member
                      • Apr 2009
                      • 207

                      #11
                      Originally posted by epigen:
                      @mathieu: Personally I have not compared GATK and VarScan, but my colleague. She says GATK is much better - no wonder since it uses sophisticated algorithms whereas VarScan just filters the output of samtools pileup. GATK is indeed very demanding. We run it for each chromosome separately.

                      Indeed, I've used all three (GATK, samtools and VarScan) and VarScan is basically a filtering/annotation tool, not a variant caller. GATK and samtools are both good. I found GATK to give even better variant counts than samtools pileup, but samtools is still good.

                      Originally posted by epigen:
                      @zee: I tried NovoalignCS but it was by far the slowest and still had a very low mapping rate. Now I have PE data and I'm thinking about trying it again. BFAST also becomes very slow for PE due to the localalign step.

                      If BFAST is slow for you and you have access to a strong distributed cluster, try the bfast.submit.pl script that comes with it to make it more parallel and save a lot of wallclock time.
                      Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                      Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                      Projects: U87MG whole genome sequence [Website] [Paper]


                      • mathieu
                        Junior Member
                        • Aug 2009
                        • 9

                        #12
                        My results and your recommendations are in favor of a BFAST+GATK pipeline. I have to say that I really like the GATK UnifiedGenotyper. Moreover, it seems that a robust indel genotyper is being integrated into the UnifiedGenotyper, which will make the tool even more valuable.
                        The trick now is to do some good filtering after the raw SNP calls. Do you guys have any advice?


                        • lh3
                          Senior Member
                          • Feb 2008
                          • 686

                          #13
                          GATK comes with the most sophisticated filtering. That is one of the reasons why it is good.


                          • mathieu
                            Junior Member
                            • Aug 2009
                            • 9

                            #14
                            @lh3: I agree. My main difficulty is that I do not have any prior knowledge of SNPs in the organism I am working on, so I cannot use the VariantRecalibrator... After applying basic filtering and indel masking, it is trickier to filter well. Do you have any advice?


                            • lh3
                              Senior Member
                              • Feb 2008
                              • 686

                              #15
                              I see. Perhaps you could play around to get the expected ts/tv. I think all the recalibrator needs is an expected ts/tv. If you have to do manual filtering, strand bias is believed to be the most effective filter. Depth filtering is also necessary. Also, run BAQ. The GATK group also applies BAQ in its projects and is planning to reimplement it in GATK.
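
To make the ts/tv, strand bias and depth suggestions above concrete, here is a minimal sketch. It assumes SNPs are available as (ref, alt) base pairs over A/C/G/T and that per-site counts of alternate-allele reads on each strand are known. All function names and thresholds are hypothetical placeholders, and the strand check is a simple fraction rather than a statistical test.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Transition/transversion ratio for (ref, alt) pairs over A/C/G/T."""
    changes = [(r, a) for r, a in snps if r.upper() != a.upper()]
    ts = sum(1 for r, a in changes if is_transition(r, a))
    tv = len(changes) - ts
    return ts / tv if tv else float("inf")


def passes_manual_filter(fwd_alt, rev_alt, depth,
                         min_depth=8, max_depth=200, min_strand_fraction=0.1):
    """Depth window plus a crude strand-bias check (placeholder thresholds)."""
    if not min_depth <= depth <= max_depth:
        return False
    alt_total = fwd_alt + rev_alt
    if alt_total == 0:
        return False
    # Require the less-represented strand to carry a minimum share of
    # the alternate-allele reads.
    return min(fwd_alt, rev_alt) / alt_total >= min_strand_fraction


snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(ts_tv_ratio(snps))                 # 1.5 (3 transitions, 2 transversions)
print(passes_manual_filter(10, 9, 40))   # True
print(passes_manual_filter(20, 0, 40))   # False: all support on one strand
print(passes_manual_filter(5, 5, 500))   # False: depth above the window
```

One way to use this is to tighten or loosen the filter thresholds and watch how the ts/tv of the surviving calls moves toward the value expected for the organism.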

