
  • aleferna
    replied
    Yes, actually the point of the study is to determine how they behave at different read lengths. Most mappers are tuned for, say, 50bp, but what about 55, or 45, or 60? The problem is that my type of reads (from 4C / Hi-C / 3C) don't come in a single, uniform size. That's why I'm testing each aligner to see where it is strongest and where it is weakest. My hope is that I can combine two or more programs into a complete solution.
    My question about BFAST is because I thought that a longer read would generate more CALs and therefore, all things being equal, would be easier to map, which is what happens in BLAT. Anyway, I will try to mess around with BFAST a bit more.
    Last edited by aleferna; 08-01-2010, 10:47 AM.
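    The "combine two or more programs" idea above can be sketched very simply: run a primary aligner, then fill in reads it failed to map from a second aligner's output. This is a minimal illustration; the read IDs and the dict-of-hits representation are hypothetical, not any tool's actual output format.

```python
# Combine per-read results from two mappers: keep the primary aligner's
# hit when it has one, and fall back to the second aligner otherwise.
# Hits are modeled as (chromosome, position) tuples keyed by read ID.

def combine(primary_hits, fallback_hits):
    """Prefer the primary aligner; use the fallback for reads it missed."""
    combined = dict(primary_hits)
    for read_id, hit in fallback_hits.items():
        if read_id not in combined:
            combined[read_id] = hit
    return combined

bwa_hits = {"read1": ("chr1", 1000), "read2": ("chr2", 500)}
blat_hits = {"read2": ("chr2", 500), "read3": ("chr7", 42)}
print(combine(bwa_hits, blat_hits))  # read3 comes from the fallback aligner
```

    In practice the interesting part is deciding which aligner is "primary" for a given read length, which is exactly what the benchmark is meant to establish.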



  • nilshomer
    replied
    I like the presentation (heatmap table). Could you try varying the "-K" and "-M" parameters? Alternatively, you could design indexes with a greater key size (more ones in the mask). There is a lot of flexibility, though I haven't thought much about longer read lengths. The only criticism I have is that short-read aligners are designed for short reads, and you are using them in a non-standard way, just as BLAT would not be used for 35bp error-prone reads.
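    To make the "more ones in the mask" suggestion concrete: a BFAST index mask is a string of 1s and 0s, and the key size is the number of 1s (the sampled positions). A small sketch, with a made-up example mask rather than any recommended BFAST layout:

```python
# A spaced-seed index mask samples the positions marked with '1'.
# The key size is simply the count of 1s; a larger key size is more
# specific but less tolerant of mismatches inside the seed.

def key_size(mask):
    return mask.count("1")

def sampled_positions(mask):
    return [i for i, bit in enumerate(mask) if bit == "1"]

mask = "1111011101"          # hypothetical example, not a BFAST default
print(key_size(mask))        # 8 sampled positions
print(sampled_positions(mask))
```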



  • aleferna
    replied
    Here's a table with some comparisons I'm doing. The bmr column corresponds to the number of mismatches; a read with a bmr of 1% at 50bp will typically have 2 mismatches.

    http://www.nada.kth.se/~afer/benchmark.jpeg
    Last edited by aleferna; 08-01-2010, 10:27 AM.



  • aleferna
    replied
    I'm just optimizing for specificity right now, not worrying too much about speed. I'm using the 10 indexes that you mention in the manual and no options at all. I'm comparing how different algorithms perform at different read lengths / mismatch rates. It is similar to the study you did in the BFAST paper, but with 25 to 500bp read lengths.



  • nilshomer
    replied
    Originally posted by aleferna View Post
    I just finished the analysis of BFAST and the results are very strange. I get really good performance at 50 and 75bp, but this degrades significantly with 150, 200 and 500bp reads. Is there anything that you need to adjust in BFAST when you have longer reads? In the case of BLAT you get better specificity and sensitivity as reads get longer; I thought BFAST would outperform BLAT, but it doesn't.
    What performance metrics are you using (running time, accuracy, sensitivity)? I haven't tried BFAST with longer reads (>200bp), so some thought would be needed on how to make it work for long reads. Remember there are short-read and long-read aligners. Have you tried the BWA-SW module? It performs very well on longer reads.
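    Since the benchmark uses simulated reads, the true origin of each read is known and accuracy metrics can be computed directly. A minimal sketch of the evaluation (the alignment tuple format and the 5bp tolerance are assumptions for illustration): a mapping counts as correct when it lands near the simulated position, sensitivity is correct mappings over all simulated reads, and precision (often reported as specificity in these comparisons) is correct mappings over all reads the aligner placed.

```python
# Evaluate an aligner's calls against known simulated positions.
# truth and calls map read IDs to (chromosome, position) tuples.

def evaluate(truth, calls, tol=5):
    correct = 0
    mapped = 0
    for read_id, (chrom, pos) in calls.items():
        mapped += 1
        true_chrom, true_pos = truth[read_id]
        if chrom == true_chrom and abs(pos - true_pos) <= tol:
            correct += 1
    sensitivity = correct / len(truth)                 # correct / all simulated reads
    precision = correct / mapped if mapped else 0.0    # correct / reads placed
    return sensitivity, precision

truth = {"r1": ("chr1", 100), "r2": ("chr1", 900), "r3": ("chr2", 50)}
calls = {"r1": ("chr1", 102), "r2": ("chr5", 10)}      # r2 mismapped, r3 unmapped
print(evaluate(truth, calls))
```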



  • aleferna
    replied
    I just finished the analysis of BFAST and the results are very strange. I get really good performance at 50 and 75bp, but this degrades significantly with 150, 200 and 500bp reads. Is there anything that you need to adjust in BFAST when you have longer reads? In the case of BLAT you get better specificity and sensitivity as reads get longer; I thought BFAST would outperform BLAT, but it doesn't.



  • nilshomer
    replied
    Originally posted by aleferna View Post
    This is very odd. I reran localalign using -t 24 and it's been running for 2 days now, whereas with -t 16 it only took a few hours. Has anybody else seen this problem?

    Also why does it say endReadNum: 2147483647 when there are only 3 million reads?
    The threading option is "-n", not "-t". Threading is not perfectly scalable, and the actual speed-up can depend on many factors (an OS & architecture course offers a good introduction).

    If not specified, the start/end read numbers default to 1 and infinity (in this case (2^31)-1, the largest 32-bit signed integer), respectively. Use the "-p" option to see the program parameters.
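    The printed endReadNum is simply that sentinel, as a quick check shows:

```python
# "endReadNum: 2147483647" is the largest 32-bit signed integer,
# meaning "process reads to the end of the file", not an actual count.
INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647, matching the value in the log
```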



  • aleferna
    replied
    bfast localalign takes longer with 24 threads than with 16???

    This is very odd. I reran localalign using -t 24 and it's been running for 2 days now, whereas with -t 16 it only took a few hours. Has anybody else seen this problem?

    Also why does it say endReadNum: 2147483647 when there are only 3 million reads?



  • aleferna
    replied
    Sensitivity / Specificity study

    Hi Bioinfosm,

    Sure, I hope to have the results ready soon. I've been struggling with MAQ, but I finally realized that it needs reads to be exactly the same size. Since I'm simulating the reads, they usually vary by 2 or 3 bases in length; that was giving me really bad MAQ sensitivity, but now I have it working.

    I will post my results, but I'm working on a very weird dataset; I don't think many people have these types of problems. I'm focusing on errors due to high mutation rates, not on sequencing errors. We work with cancer stem cell lines that have abnormal mutation rates, and therefore the MapQ value breaks down very often. To make things worse, all the reads are chimeric (it's a 4C experiment) and therefore they are really tricky to map. Basically my thesis is how to combine maq, blat, bfast, bwa aln and bwa bwasw to get > 99% sensitivity with > 99.5% specificity. So far it has been impossible to achieve this level with a single algorithm, so I decided to apply each algorithm where it has the best results.

    Hope I can share some of this soon.
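    One simple workaround for the fixed-length requirement mentioned above: trim every simulated read, and its quality string, down to the length of the shortest read before feeding them to the aligner. This sketch models FASTQ records as plain tuples rather than using a real parser:

```python
# Trim variable-length simulated reads to a uniform length so that
# aligners requiring identical read sizes (as MAQ does here) accept them.

def trim_to_uniform(reads):
    """reads: list of (name, sequence, quality) tuples; returns trimmed copies."""
    min_len = min(len(seq) for _, seq, _ in reads)
    return [(name, seq[:min_len], qual[:min_len]) for name, seq, qual in reads]

reads = [("r1", "ACGTACGTAC", "IIIIIIIIII"),
         ("r2", "ACGTACGT",   "IIIIIIII"),
         ("r3", "ACGTACGTA",  "IIIIIIIII")]
print(trim_to_uniform(reads))  # every read trimmed to 8bp
```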



  • nilshomer
    replied
    Originally posted by epigen View Post
    Hi aleferna and Nils,

    Your thread already answered most of the questions I would have asked Nils. But I still have two:
    1. To reduce non-parallelizable I/O, would it be possible to replace the large temp files that bfast match produces by keeping the info in the memory?
    Yes, if enough memory is available. Storing on disk is a consequence of not having enough RAM (1TB of RAM would solve a lot of this).

    2. Could I pipe the indexes from gunzip and would that make loading them faster?
    Probably not, since the underlying system calls are using zlib (gzip). My suggestion would be to get a faster disk.

    And something for the wish list: Why do the bfast programs not output any information when their input comes from standard input? It would be nice to have the info in case the pipeline crashes at some point to know why.
    They do! Each command initially prints its program parameters. See the "readsFileName:" line in "bfast match", for example; it will show either the file name or STDIN.



  • lh3
    replied
    I would go for SSE2 first before considering CUDA. As Nils said, it would be good for someone to take this on as a research project, but in the near future CUDA will not deliver a performance boost significant enough to make it practically attractive and cost-effective. When you look into the details, CUDA is not as good as it appears: hmmerGPU, mummerGPU and swGPU are all far from their theoretical speeds due to technical difficulties that are hard to overcome.
    Last edited by lh3; 07-20-2010, 05:31 PM.



  • bioinfosm
    replied
    aleferna,

    I am interested in the sensitivity/specificity study between aligners. Do you have any updates, resources, a blog, or a paper to point to?

    thanks!



  • epigen
    replied
    Hi aleferna and Nils,

    Your thread already answered most of the questions I would have asked Nils. But I still have two:
    1. To reduce non-parallelizable I/O, would it be possible to replace the large temp files that bfast match produces by keeping the info in the memory?
    2. Could I pipe the indexes from gunzip and would that make loading them faster?

    And something for the wish list: Why do the bfast programs not output any information when their input comes from standard input? It would be nice to have the info in case the pipeline crashes at some point to know why.

    BFAST for CUDA sounds like a really good idea. A parallel merge sort would be great too, because the merging step is the most time-consuming. Unfortunately I'm not a good programmer, so I can't offer my help with optimizing the code. But I always stumble across bugs, so I'd at least make a good beta tester.

    I'd also like to take the opportunity to thank you all for your support!

    Barbara



  • nilshomer
    replied
    Originally posted by aleferna View Post
    Wow, thanks for the instant reply. I love SeqAnswers; where else can you talk to the man himself? Cool!
    Without users, a developer is nothing.

    2. I didn't quite understand your response regarding the sensitivity of running BFAST on GPUs. I see a trend of new aligners being made to run in the cloud, but I think it will take longer to upload the data to the cloud than to process it locally on a GPU architecture such as NVIDIA's CUDA.
    Implementation is important, and GPU vs. cloud vs. FPGA vs. a solution customized to the problem are all important things to consider. I don't weigh in on this topic for good reason: I need more data to form an opinion.

    3. I always get 255 for the MapQ value; am I doing something wrong? What is a typical value for --avgMismatchQuality in the post-process step?

    Will check the source code, Thanks!!
    A 255 is returned if there is no second-best hit, which happens when the read is uniquely mapped; see the BFAST source code for the details.
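    The intuition behind that 255 can be sketched as follows. This is a simplified illustration, not BFAST's actual formula: mapping quality is derived from the gap between the best and second-best alignment scores, and when no second-best hit exists the aligner emits the sentinel value 255.

```python
# Toy mapping-quality calculation: the score gap between the best and
# second-best hits drives the quality; a missing second-best hit means
# the read mapped uniquely and the sentinel 255 is reported.

def mapq(best_score, second_best_score=None, scale=0.5):
    if second_best_score is None:
        return 255  # uniquely mapped: no second-best hit to compare against
    return max(0, int(scale * (best_score - second_best_score)))

print(mapq(100))      # unique hit -> 255
print(mapq(100, 80))  # close second-best hit -> small, finite quality
```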



  • aleferna
    replied
    Wow, thanks for the instant reply. I love SeqAnswers; where else can you talk to the man himself? Cool!

    1. About the 2^N issue: maybe I'm mistaken. At some point I ran one of the processes with the number of threads set to 24 and it started working; I came back to check on the process some hours later and it said that the number of threads must be a power of 2. It might have been the index creation.

    2. I didn't quite understand your response regarding the sensitivity of running BFAST on GPUs. I see a trend of new aligners being made to run in the cloud, but I think it will take longer to upload the data to the cloud than to process it locally on a GPU architecture such as NVIDIA's CUDA.

    3. I always get 255 for the MapQ value; am I doing something wrong? What is a typical value for --avgMismatchQuality in the post-process step?

    Will check the source code, Thanks!!
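    On the power-of-2 thread count mentioned in point 1 above: if a tool enforces that constraint, the standard check is a bit-twiddling one-liner, shown here as a small sketch.

```python
# n is a power of two exactly when it is positive and has a single bit
# set, i.e. n & (n - 1) clears the lowest set bit and leaves zero.

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

for n in (16, 24, 32):
    print(n, is_power_of_two(n))  # 24 threads would be rejected
```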

