
**poisson200** · 10-16-2010, 06:24 AM

Hi Mathieu,
I can only contribute a little; my data is fly genomic reads (nucleosome mapping), and the little I can say is that SHRiMP seemed slow in my hands compared to Bowtie. I have also used NovoalignCS, which can deal with small indels, which to my knowledge Bowtie cannot. You could also try BWA, which I think handles colour space reads.

Have you tried BioScope? I assume you have access to this software if you have a SOLiD sequencer. I have thought about trying BFAST too, but we are currently comparing BioScope read mapping with Bowtie/Novoalign. (I think you need a licence for Novoalign.) RNA-seq reads may benefit from TopHat (which now handles colour space), as it can also map reads that span splice junctions/introns.

Kind regards,

John.

**epigen** · 10-20-2010, 10:32 AM

I had 50 bp human RNA-Seq data and tried BWA, BioScope, NovoalignCS, BFAST, and MOSAIK. I recommend BFAST for its high mapping rate and ease of use (once you've created the indexes). BWA, NovoalignCS and MOSAIK had very low mapping rates. BioScope with the whole transcriptome pipeline can find splice junctions and gets rid of repeats, but it does not do gapped alignment (like Bowtie) and is a pain to install on a cluster.

**mathieu** · 10-20-2010, 11:14 PM

Hi epigen & John,
Thanks for your advice. I tried to install BioScope on our cluster but I gave up... Concerning BWA, it was the first one I tried and I was quite disappointed by the results: it had been highly recommended, but it gave a low mapping rate (22.6%). The first results I have using BFAST and SHRiMP are almost the same in terms of mapping rate (57.5% and 51.2% respectively). However, SHRiMP was a bit faster.

@epigen: For SNP and indel calling I am using samtools so far, but I am not very satisfied because there are too many miscalls. What would you advise?

**epigen** · 10-26-2010, 07:06 AM

Yes, BFAST might give a lot of false positives, which is why the developer advises doing local realignment first. I didn't, because I was interested in SNPs that are already annotated in dbSNP, so I filtered for them. I also used samtools, but required SNPs to be present in at least 20 reads, have a score of at least 20, and not be at the end of a read. The most recent version of samtools has improved SNP calling compared to the previous one.
Now we want to find unknown, somatic SNPs, for which we use SomaticCall from the Broad, which of course only works if you have tumor-normal pairs. Otherwise, VarScan would be an option. For indels we use the indel genotyper from the Broad and Pindel.
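As a rough illustration of the filter described above (at least 20 supporting reads, score of at least 20, not at a read end, optionally restricted to known dbSNP sites), here is a minimal Python sketch. The `SnpCall` record, its field names and the `at_read_end` flag are hypothetical and do not correspond to any samtools output format; in practice these values would have to be derived from the pileup output.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class SnpCall:
    """Hypothetical record for one candidate SNP (not a samtools format)."""
    chrom: str
    pos: int
    supporting_reads: int  # reads showing the variant allele
    qual: int              # SNP quality score from the caller
    at_read_end: bool      # assumed flag: variant seen only at read ends


def filter_snps(
    calls: Iterable[SnpCall],
    min_reads: int = 20,
    min_qual: int = 20,
    known_sites: Optional[Set[Tuple[str, int]]] = None,
) -> List[SnpCall]:
    """Keep calls that pass the read-count, quality and read-end criteria.

    If known_sites is given (e.g. positions taken from dbSNP), calls
    outside that set are dropped as well.
    """
    kept = []
    for call in calls:
        if call.supporting_reads < min_reads:
            continue
        if call.qual < min_qual:
            continue
        if call.at_read_end:
            continue
        if known_sites is not None and (call.chrom, call.pos) not in known_sites:
            continue
        kept.append(call)
    return kept


if __name__ == "__main__":
    candidates = [
        SnpCall("chr1", 100, 35, 40, False),
        SnpCall("chr1", 200, 12, 50, False),  # too few reads
        SnpCall("chr1", 300, 40, 15, False),  # score too low
        SnpCall("chr1", 400, 40, 40, True),   # at read end
    ]
    for call in filter_snps(candidates):
        print(call.chrom, call.pos)
```

The thresholds are the ones quoted in the post; they are parameters here because suitable values depend on coverage and on the caller's scoring scale.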

**zee** · 10-26-2010, 07:35 AM

I think it is important to consider mapping accuracy over the number of reads aligned. Consider looking at how well the aligner does in terms of concordance with dbSNP or any other set of known reference SNP/indel positions.
We have developed NovoalignCS for this purpose of trying to get the best alignment for a read, and this does come at a cost to performance. That said, if you have enough cores, the slower aligners like MOSAIK and Novoalign can run in a very short time and still give you more reliable alignments that lower the false discovery rate. This should also be tested on a case-by-case basis, as the read quality and repeat content of the reference genome can influence how the aligner performs.
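One simple way to put a number on the concordance idea is the fraction of called variant positions that also appear in a known reference set. The sketch below assumes positions are given as `(chromosome, position)` tuples; the function name and the toy data are illustrative only and do not come from any of the tools discussed here.

```python
from typing import Iterable, Tuple

Site = Tuple[str, int]


def concordance(called: Iterable[Site], known: Iterable[Site]) -> float:
    """Fraction of called sites that are present in the known set.

    Returns 0.0 when nothing was called, to avoid dividing by zero.
    """
    called_set = set(called)
    if not called_set:
        return 0.0
    return len(called_set & set(known)) / len(called_set)


if __name__ == "__main__":
    # Toy data: a known reference set and the calls from two aligners.
    known_sites = [("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 90)]
    calls_aligner_a = [("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr3", 7)]
    calls_aligner_b = [("chr1", 100), ("chr3", 7), ("chr3", 8), ("chr3", 9)]

    print("aligner A:", concordance(calls_aligner_a, known_sites))
    print("aligner B:", concordance(calls_aligner_b, known_sites))
```

A higher value suggests fewer spurious calls, although novel true variants also lower it, so it is best read as a relative measure between aligners run on the same data.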

Originally posted by epigen:

> I had 50 bp human RNA-Seq data and tried BWA, BioScope, NovoalignCS, BFAST, and MOSAIK. I recommend BFAST for its high mapping rate and ease of use (once you've created the indexes). BWA, NovoalignCS and MOSAIK had very low mapping rates. BioScope with the whole transcriptome pipeline can find splice junctions and gets rid of repeats, but it does not do gapped alignment (like Bowtie) and is a pain to install on a cluster.

**mathieu** · 10-26-2010, 07:50 AM

Thanks for the advice. Unfortunately I am working with an organism for which no SNPs are known yet. Therefore, I have to rely only on the deep sequencing data. I am currently testing the GATK pipeline and... it is very demanding in terms of resources, but the first results seem far more realistic than the samtools ones. I will give VarScan a try. Epigen: did you ever compare GATK with VarScan?

**zee** · 10-26-2010, 07:55 AM

I have used GATK and samtools. Samtools has a new base alignment quality (BAQ) feature which Heng Li claims will greatly improve your ability to call SNPs reliably.
Both tools are very good and can have a steep learning curve, but I think it's worth it. I have not used VarScan but I have heard good things about it.
Have you tried using NovoalignCS?

**epigen** · 10-26-2010, 07:55 AM

@mathieu: Personally I have not compared GATK and VarScan, but my colleague has. She says GATK is much better, which is no wonder since it uses sophisticated algorithms whereas VarScan just filters the output of samtools pileup. GATK is indeed very demanding. We run it for each chromosome separately.

@zee: I tried NovoalignCS but it was by far the slowest and still had a very low mapping rate. Now I have PE data and I'm thinking about trying it again. BFAST also becomes very slow for PE due to the localalign step.

**lh3** · 10-26-2010, 10:27 AM

On Illumina data, the choice of mapper does not matter too much for SNP calling. A mapper that is 1000X better on simulated data may only lead to a difference of a few percent in SNP accuracy. On SOLiD, I do not know. But be aware that bwa's defaults are not designed for SOLiD. One must increase the tolerance for mismatches (-n) to get acceptable results.

As to samtools' SNP calling, are you following the steps listed here:

http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol#Basic_Protocol_3:_Variant_Calling_with_SAMtools

The SAMtools caller has been used in a few Nature/PLoS Genetics papers. If you count the papers using maq, which samtools is derived from, there are many more. They cannot all be wrong.

So far as I know, VarScan is not a Bayesian model.

The BAQ computation is *strongly* recommended for SNP calling. Almost everyone I know (Umich, Broad/GATK, Sanger) who has tried it once immediately incorporates it into the production pipeline.

**Michael.James.Clark** · 10-26-2010, 11:50 AM

Originally posted by epigen:

> @mathieu: Personally I have not compared GATK and VarScan, but my colleague has. She says GATK is much better, which is no wonder since it uses sophisticated algorithms whereas VarScan just filters the output of samtools pileup. GATK is indeed very demanding. We run it for each chromosome separately.

Indeed, I've used all three (GATK, samtools and VarScan) and VarScan is basically a filtering/annotation tool, not a variant caller. GATK and samtools are both good. I found GATK to give even better variant counts than samtools pileup, but samtools is still good.

> @zee: I tried NovoalignCS but it was by far the slowest and still had a very low mapping rate. Now I have PE data and I'm thinking about trying it again. BFAST also becomes very slow for PE due to the localalign step.

If BFAST is slow for you and you have access to a strong distributed cluster, try the bfast.submit.pl script that comes with it to make it more parallel and save a lot of wallclock time.

**mathieu** · 10-29-2010, 02:10 AM

My results and your recommendations are in favor of using a BFAST+GATK pipeline. I have to say that I really like the GATK UnifiedGenotyper. Moreover, it seems that the integration of a robust indel genotyper within the UnifiedGenotyper is in preparation. That will make the tool even more valuable.
The trick now is to apply some good filtering to the raw SNP calls. Do you have any advice?

**lh3** · 10-29-2010, 04:13 AM

GATK comes with the most sophisticated filtering. That is one of the reasons why it is good.

**mathieu** · 10-29-2010, 04:33 AM

@lh3: I agree. My main difficulty is that I do not have any prior knowledge of SNPs in the organism I am working on. Therefore, I cannot use the VariantRecalibrator... So, after applying basic filtering and indel masking, it is trickier to get the filtering right... Do you have any advice?

**lh3** · 10-29-2010, 04:47 AM

I see. Perhaps you could play around to get the expected ts/tv. I think all the recalibrator needs is an expected ts/tv. If you have to do manual filtering, strand bias is believed to be the most effective filter. Depth filtering is also necessary. Also, run BAQ. The GATK group also apply BAQ to their projects and are planning to reimplement it in GATK.
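For the manual route, the three quantities mentioned above (ts/tv, strand bias, depth) are straightforward to compute once the calls are in hand. The following Python sketch is only an illustration: the `Site` record, its field names and all thresholds are assumptions chosen for the example, not values recommended by GATK or samtools.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# A transition swaps purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def ts_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over (ref, alt) base pairs.

    Pairs where ref equals alt are ignored. Returns infinity when
    there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue
        if frozenset((ref, alt)) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")


@dataclass
class Site:
    """Hypothetical per-site summary used for manual filtering."""
    depth: int        # total read depth at the site
    alt_forward: int  # variant-supporting reads on the forward strand
    alt_reverse: int  # variant-supporting reads on the reverse strand


def passes_manual_filters(
    site: Site,
    min_depth: int = 10,
    max_depth: int = 200,
    max_strand_fraction: float = 0.9,
) -> bool:
    """Apply a depth window and a simple strand-bias check.

    A site fails if nearly all variant reads come from one strand.
    All thresholds are illustrative assumptions.
    """
    if not (min_depth <= site.depth <= max_depth):
        return False
    alt_total = site.alt_forward + site.alt_reverse
    if alt_total == 0:
        return False
    forward_fraction = site.alt_forward / alt_total
    return (1 - max_strand_fraction) <= forward_fraction <= max_strand_fraction


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print("ts/tv:", ts_tv_ratio(calls))
    print(passes_manual_filters(Site(depth=50, alt_forward=10, alt_reverse=12)))
```

One possible use is to tighten or relax the depth and strand thresholds until the ts/tv of the surviving calls approaches the value expected for the organism, which is the "play around" approach suggested in the post.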

SHRiMP vs BFAST
