**Wallysb01** · 06-27-2012, 08:48 AM

I have done this before and I think it took a day or so, and that was with more like 100M reads, not 2B! Whatever you are trying to do, it will probably be easier a different way. At the very least, you should probably be using mpiBLAST, as any querying of that database will also take a very long time.

What is it you are trying to do anyway? The community here might have some better suggestions for you.

**wdemos** · 06-27-2012, 09:05 AM

One aspect of my project is to recover the original read names. I have data from DNASTAR, but it changed the original read names. I also want to identify the unmatched reads and verify the sequences the original analyst found interesting.

**Wallysb01** · 06-27-2012, 09:28 AM

Ok, I'm still somewhat confused by what data you have and what you're doing with it.

Are you trying to, for example, find the reads that support a SNP from an alignment? You could more easily reduce a data set like that with samtools mpileup to just reads that overlap the SNP. Then blasting, or even just inspection, could work a lot faster.

**wdemos** · 06-27-2012, 09:35 AM

Sorry for not being clear. Thank you for your help and patience. Here is the main purpose of my task:

I have some Illumina sequences that were identified as chimeric by another analyst. He came to this conclusion using another software package that trimmed and renamed the reads. I want to blast these 'chimeric' reads against a custom blast database populated with all the reads from the entire dataset. My hope is that the chimeric reads from the other analyst will align with the original reads, so that I can view the entire (untrimmed) read along with its header and recover the original read name. Then I can use the original data for further analysis.
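A minimal sketch of the name-recovery step described above, assuming the search is run with BLAST tabular output (`-outfmt 6`), where column 1 is the query ID, column 2 the subject ID, and column 12 the bit score. The function name `best_hits` is hypothetical, and keeping only the top-scoring hit per trimmed read is an assumption about what "the original read" means here, not something stated in the thread.

```python
import csv


def best_hits(tabular_lines):
    """Return {query_id: (subject_id, bitscore)}, keeping the top-scoring hit per query.

    tabular_lines is any iterable of tab-separated BLAST outfmt 6 lines,
    for example an open file handle.
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        # Skip blank lines and comment lines (outfmt 7 style headers).
        if not row or row[0].startswith("#"):
            continue
        query_id, subject_id, bitscore = row[0], row[1], float(row[11])
        if query_id not in best or bitscore > best[query_id][1]:
            best[query_id] = (subject_id, bitscore)
    return best
```

Here the query ID would be the renamed, trimmed read and the subject ID the original read header, so the returned mapping gives a candidate original name for each chimeric read. Trimmed reads with no entry in the mapping are the unmatched ones.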

**GenoMax** · 06-27-2012, 09:44 AM

Provided you have access to a machine with a good amount of RAM, doing a blat search may also be a possibility. Are your sequences already in a multi-fasta format?

**wdemos** · 06-27-2012, 09:49 AM

Yes, the files are concatenated into one large fasta file.

**Wallysb01** · 06-27-2012, 10:12 AM

Hmm, I see. That is a little more problematic then, isn't it?

It does sound like blat or blast are your only options. You could try splitting up that huge file of reads into some number of chunks and running makeblastdb on each chunk individually, then searching each chunk database and concatenating the results later. Blat does have the advantage that you don't need to format your database. If you have Kent's src, you could use a few of the tools to speed this job up (e.g., faSplit), then again run multiple queries on individual chunks.
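The chunking idea above can be sketched in plain Python. This is a stand-in for illustration, not faSplit itself: the function name `split_fasta`, the round-robin assignment of records, and the `chunk000.fa` style output names are all assumptions.

```python
import os
import tempfile


def split_fasta(fasta_path, n_chunks, out_prefix):
    """Distribute FASTA records round-robin into n_chunks files; return their paths."""
    paths = [f"{out_prefix}{i:03d}.fa" for i in range(n_chunks)]
    handles = [open(path, "w") for path in paths]
    try:
        record_index = -1  # no record seen yet
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    record_index += 1  # a header line starts a new record
                if record_index >= 0:  # ignore anything before the first header
                    handles[record_index % n_chunks].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Each chunk keeps the original headers untouched, so hits against any chunk still report the original read names. The file is streamed line by line, so memory use stays small even for a very large input.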

If you have access to a cluster, you could take advantage of hundreds of cores with blat. Just be clever about the faSplit output names, then use qsub -t and the $PBS_ARRAYID variable in your command so that each array task picks up its own chunk. Blat also has the advantage that the resulting .psl files are easy to merge and filter.
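As a rough illustration of the array-job idea, the sketch below maps the task index from `$PBS_ARRAYID` to one chunk file and builds the matching blat command. The chunk naming scheme (`chunk000.fa`), the query file name `chimeric.fa`, and the function name `blat_command` are hypothetical; the only blat usage assumed is the basic `blat database query output.psl` form.

```python
import os


def blat_command(task_id, chunk_prefix="chunk", query="chimeric.fa"):
    """Build the blat argument list for one array task (one chunk of reads)."""
    database = f"{chunk_prefix}{task_id:03d}.fa"  # this task's chunk of original reads
    output = f"{chunk_prefix}{task_id:03d}.psl"   # one .psl result file per chunk
    return ["blat", database, query, output]


if __name__ == "__main__":
    # PBS_ARRAYID is set for jobs submitted with qsub -t;
    # fall back to 0 so the script can be dry-run outside the scheduler.
    task_id = int(os.environ.get("PBS_ARRAYID", "0"))
    print(" ".join(blat_command(task_id)))
```

The script only prints the command, which makes it easy to check the file names before committing cluster time. Passing the returned list to `subprocess.run` would execute it instead.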

It doesn't sound like fun to me, but it might just be done within a day if you have the right resources and a well thought out plan.

**wdemos** · 06-27-2012, 11:07 AM

Thank you. I was just approved for access to a cluster. I will let you know how it turns out.

**swbarnes2** · 06-27-2012, 11:26 AM

How about making a bwa or bowtie index of the full reads and aligning to them, instead of blasting? After all, that's the virtue of those algorithms: aligning short reads.

**wdemos** · 06-28-2012, 11:01 AM

I left the makeblastdb command running while I tried working with the cluster and it actually finished populating last night. I currently am conducting my BLAST search. I appreciate all the help and if my blast fails I may consider the bwa/bowtie suggestion. Thank you.

**Wallysb01** · 06-28-2012, 11:12 AM

Sounds good, I'm happy to help.

Build BLAST db with illumina reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News