
**aleferna** · 08-18-2010, 07:12 AM

Nope, I calculate the score as matches - mismatches - gapbases.
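As a rough illustration of the scoring rule above, here is a minimal Python sketch. The function names are hypothetical, and the choice to read the counts from the first eight columns of a BLAT PSL line (with gap bases taken as qBaseInsert + tBaseInsert and repMatches ignored) is an assumption, not something stated in the post.

```python
def alignment_score(matches: int, mismatches: int, gap_bases: int) -> int:
    """Score as described in the post: matches - mismatches - gapbases."""
    return matches - mismatches - gap_bases


def psl_score(psl_line: str) -> int:
    """Compute the same score from one tab-separated PSL record.

    Assumption: gap bases are the sum of the query and target insert bases
    (PSL columns 6 and 8); repeat matches (column 3) are not counted.
    """
    fields = psl_line.rstrip("\n").split("\t")
    matches = int(fields[0])        # column 1: matches
    mismatches = int(fields[1])     # column 2: misMatches
    q_base_insert = int(fields[5])  # column 6: qBaseInsert
    t_base_insert = int(fields[7])  # column 8: tBaseInsert
    return alignment_score(matches, mismatches, q_base_insert + t_base_insert)
```

Whether repeat matches should count toward the score depends on the data, so treat that part as a choice to revisit.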

**aleferna** · 08-18-2010, 07:13 AM

Also, if I may venture a suggestion: when I implemented the blat MapQ value using S1 - S2, I noticed a big effect on MinMapQ, the MapQ threshold needed to achieve 99% specificity. As reads get longer you need a lower threshold, and as the error rate decreases you need a lower threshold as well. I've been wondering whether, instead of using S1 to normalize the value, you could normalize by some combination of the error rate and the read length. I think you can approximate MapQ better by combining these two components than by trying to summarize them in S1.
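To make the two normalizations concrete, here is a small Python sketch. It is only one possible reading of the suggestion: the function names are hypothetical, no scaling constant is applied, and the "expected score" of read_len * (1 - 2 * error_rate) is an assumption that follows from scoring +1 per match and -1 per mismatch with no gaps.

```python
def gap_over_best(s1: float, s2: float) -> float:
    """Score gap between best (S1) and second-best (S2) hit, divided by S1."""
    if s1 <= 0:
        return 0.0
    return max(s1 - s2, 0.0) / s1


def gap_over_expected(s1: float, s2: float, read_len: int, error_rate: float) -> float:
    """Same gap, but divided by a score expected from read length and error rate.

    Assumption: with +1 per match and -1 per mismatch, a read of length L with
    per-base error rate e is expected to score about L * (1 - 2 * e).
    """
    expected = read_len * (1.0 - 2.0 * error_rate)
    if expected <= 0:
        return 0.0
    return max(s1 - s2, 0.0) / expected
```

With the second form, two reads with the same S1 - S2 gap get different values if their lengths or error rates differ, which is the dependence described above.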

**aleferna** · 08-18-2010, 07:21 AM

Here's the behavior of a simple Blat MapQ value

Attached Files

Selection_006.jpg (20.5 KB, 70 views)

**lh3** · 08-18-2010, 07:55 AM

I now recall how I chose the parameters for blat. My focus was more on >=500bp reads, and for these reads blat -fastMap is similar to blat default in accuracy but tens of times faster. However, for the shorter reads you are focusing on, blat default is much more accurate than blat -fastMap (though still much slower). Your table would largely agree with mine for blat default.

Actually, for 454 I would highly recommend ssaha2. Ssaha2 was designed from day one for mapping sequencing data and calling SNPs, and it has been thoroughly validated. Blat, although one of the best tools for mapping ESTs, was not originally intended for SNP finding and has not been heavily evaluated for it. From what I have heard, blat does not refine the final alignment, which may leave gaps positioned suboptimally and cause problems for indel finding. The default blat mode is also much slower and less accurate than ssaha2. In my view, it is a common mistake to overlook the superiority of ssaha2 for longer reads. The 1000 Genomes Project chooses every program for a reason.

**aleferna** · 08-18-2010, 08:46 AM

@Adamo

Sorry if I'm spamming you; I don't understand how the private messages work here. Send me a message at afer at kth.se and I can send you the script to join the 2 BWA files, if you still want to use bwa.

**aleferna** · 08-18-2010, 08:52 AM

@Heng Li

Well, I don't care too much about SNPs; what I work with actually resembles ChIP-seq technology more. All I need to know is the position, not the alignment. I need to work with both 454 and HiSeq data and compare them, so I prefer BWA because it seems to be able to handle both. Does Ssaha2 handle high throughput?

**lh3** · 08-18-2010, 09:14 AM

Originally posted by aleferna View Post

@Heng Li

Well, I don't care too much about SNPs; what I work with actually resembles ChIP-seq technology more. All I need to know is the position, not the alignment. I need to work with both 454 and HiSeq data and compare them, so I prefer BWA because it seems to be able to handle both. Does Ssaha2 handle high throughput?

Ssaha2 is designed for high throughput sequencing. As I said, it is usually faster than blat, although less easy to use, I would say.

**SoftGenetics** · 08-19-2010, 04:59 AM

Originally posted by query View Post

What is the best tool available to map 454 reads to a reference genome? What is the method used by gs reference Mapper (analysis tool that comes with 454) and does it do a decent job of mapping and identifying variants?

You may wish to try the mapper in NextGENe. It is especially robust for the detection of indels, using a 3-step process. You can obtain a free time-limited trial on the SoftGenetics web site.

**aleferna** · 08-20-2010, 01:27 AM

@Adamo

Here is the script that I've been using. DISCLAIMER: I made this for my own data and it has not been tested on regular sequence data, so please read the code and make sure you understand what the script does before using it. It is tuned to join BWASW Z 100 with ALN N 4 sam files.

Also, it's a Python script, but the system wouldn't upload it with the extension .py.

Attached Files

JoinBWA_ALN_BWASW.py.pl (1.6 KB, 28 views)
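For readers who cannot open the attachment, the joining idea described in this thread (prefer the long aligner's alignment, fall back to the short aligner's when the long aligner has none) can be sketched as below. This is not the attached script: the function names are hypothetical, it keeps only the first mapped record per read name, and it treats a read as unmapped when SAM flag bit 0x4 is set.

```python
def is_mapped(sam_line: str) -> bool:
    """True if the SAM record does not have the 'unmapped' flag bit (0x4)."""
    fields = sam_line.rstrip("\n").split("\t")
    return (int(fields[1]) & 0x4) == 0


def index_mapped(sam_lines):
    """Map read name -> first mapped SAM record, skipping headers and blanks."""
    best = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        if is_mapped(line):
            qname = line.split("\t", 1)[0]
            best.setdefault(qname, line)  # keep the first mapped record only
    return best


def join_alignments(long_lines, short_lines):
    """Prefer the long aligner's record; fall back to the short aligner's."""
    merged = index_mapped(short_lines)
    merged.update(index_mapped(long_lines))  # long-aligner hits override
    return merged
```

In practice the two inputs would be the lines of the two SAM files, for example `join_alignments(open(long_path), open(short_path))`, where both paths are placeholders.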

**Adamo** · 08-20-2010, 02:00 AM

Originally posted by aleferna View Post

@Adamo

Here is the script that I've been using. DISCLAIMER: I made this for my own data and it has not been tested on regular sequence data, so please read the code and make sure you understand what the script does before using it. It is tuned to join BWASW Z 100 with ALN N 4 sam files.

Also, it's a Python script, but the system wouldn't upload it with the extension .py.

Ok, I didn't notice you'd posted here!
Thanks a lot, I'm gonna see what's in it now.

**robs** · 08-21-2010, 03:35 PM

Instead of using Z=100 on the whole data set, it might be a better (meaning faster) idea to first align the data set with Z=1 (default value) and then realign the ones that do not satisfy your alignment criteria with a higher value for Z. This should speed up the process if you assume that a high number of the reads will map to the reference.
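The selection step of that two-pass idea might look like the sketch below: after the first (default) pass, collect the names of reads that fail the alignment criteria so that only those are realigned with a higher Z. The function name and the specific criterion used here (unmapped, or MAPQ below a threshold) are assumptions; substitute whatever criteria apply to your data.

```python
def reads_to_realign(sam_lines, min_mapq: int = 20):
    """Return read names, in first-seen order, with no acceptable alignment.

    Assumption: an alignment is acceptable when the read is mapped
    (flag bit 0x4 unset) and its MAPQ is at least min_mapq.
    """
    status = {}  # read name -> True once any acceptable record is seen
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        acceptable = (flag & 0x4) == 0 and mapq >= min_mapq
        status[qname] = status.get(qname, False) or acceptable
    return [name for name, ok in status.items() if not ok]
```

The returned names can then be used to extract the corresponding reads from the original input for the second, slower pass.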

**robs** · 08-21-2010, 03:40 PM

Originally posted by aleferna View Post

The first time I ran BWA with the long aligner, I didn't realize that there was a short/long option, and since I have both in my library I was very disappointed with BWA. I started testing algorithm after algorithm and finally reviewed BWA again. This time I made a small script that just joins 2 sam files, one from the short aligner and one from the long aligner. It chooses the alignment from the short aligner if it cannot find one in the long aligner; this was the winning combination.

I've mentioned this chart in another thread, but here you can see that BWA is the only one that can cover the full range of read sizes in 454 datasets (or in 100bp Solexa data after you remove the paired-end adapters!)

http://www.nada.kth.se/~afer/benchmark.jpeg

Moreover, I know that using Z=100 seems like overkill, but with 454 data and a decent computer BWA takes just a few minutes, and I did measure Z=1, 10, 25, 50, 100, 250 and even 500. Z=100 seems to be the peak; beyond this I cannot squeeze any more specificity out of the algorithm, but you do see a change from Z=10 to Z=100.

Looking at your chart, you actually get better sensitivity for longer reads with low error rates using the default settings instead of using Z=100. Any idea what causes a higher Z-best value to result in lower sensitivity?

**boyzoe** · 08-22-2010, 06:57 PM

Originally posted by lh3 View Post

Ssaha2 is designed for high throughput sequencing. As I said, it is usually faster than blat, although less easy to use, I would say.

Actually, I couldn't install it on Ubuntu. After extraction, I could see the files (read me, ssaha2, ssaha2build, ssaha snp). However, when I typed the command into the terminal, it told me the command could not be found. This has bothered me for a week.

My RNA-seq data is from a species whose genome has not been sequenced, but the zebrafish genome may be suitable, since the samples are fish closely related to zebrafish. The goal is to analyze SNPs and recombination in hybrids and their parents. Does anyone have any ideas?

I really appreciate your help!

**maubp** · 08-23-2010, 04:26 AM

Originally posted by boyzoe View Post

Actually, I couldn't install it on Ubuntu. After extraction, I could see the files (read me, ssaha2, ssaha2build, ssaha snp). However, when I typed the command into the terminal, it told me the command could not be found. This has bothered me for a week.

Try:

./ssaha2

(assuming the file is in the current directory, indicated by the dot in Unix). If you tried this:

ssaha2

it would look for an installed copy of ssaha2 on the system path, but it would not try the current directory. At least, that is how recent versions of Ubuntu are configured.

**aleferna** · 08-23-2010, 07:28 AM

Originally posted by robs View Post

Looking at your chart, you actually get better sensitivity for longer reads with low error rates using the default settings instead of using Z=100. Any idea what causes a higher Z-best value to result in lower sensitivity?

@robs

You mean like 200bp at 0% error, where Z=100 gives 97.29% and the default gives 97.30%?
