I am running BLAST locally and would like to populate a database with a set of paired-end Illumina reads. There are 2,132,034,004 reads that I would like to load into the database, and building it is taking a very long time. I am using the makeblastdb command. Has anyone tried this before? How long did it take, and do you have any general advice or other suggestions? Thank you.
-
I have done this before and I think it took a day or so, and that was with more like 100M reads, not 2B! Whatever you are trying to do, it will probably be easier a different way. At the very least, you should probably be using mpiblast, as any querying of that database will also take a very long time.
What is it you are trying to do anyway? The community here might have some better suggestions for you.
-
Ok, I'm still somewhat confused by what data you have and what you're doing with it.
Are you trying to, for example, find the reads that support a SNP from an alignment? You could more easily reduce a data set like that with samtools mpileup to just reads that overlap the SNP. Then blasting, or even just inspection, could work a lot faster.
-
Sorry for not being clear. Thank you for your help and patience. Here is the main purpose of my task:
I have some Illumina sequences that another analyst identified as chimeric. He came to this conclusion using other software that trimmed and renamed the reads. I want to blast these 'chimeric' reads against a custom BLAST database populated with all the reads from the entire dataset. My hope is that the chimeric reads from the other analyst will align with the original reads, so that I will be able to view each entire (untrimmed) read along with its header information and recover the original read name. Then I can use the original data for further analysis.
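One way to recover the original read names from such a search is to run BLAST with tabular output (`-outfmt 6`) and keep the best-scoring hit per query. The following is a minimal sketch, assuming the default column order of that format; the read names in the example are made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Map each query read name to the subject (original read) with the highest bitscore."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        score = float(hit["bitscore"])
        if hit["qseqid"] not in best or score > best[hit["qseqid"]][1]:
            best[hit["qseqid"]] = (hit["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}


if __name__ == "__main__":
    # Hypothetical hits: renamed chimeric reads (queries) vs. original reads (subjects).
    example = io.StringIO(
        "chim_1\tREAD_A/1\t100.000\t80\t0\t0\t1\t80\t11\t90\t1e-38\t148\n"
        "chim_1\tREAD_B/1\t95.000\t80\t4\t0\t1\t80\t11\t90\t1e-30\t120\n"
        "chim_2\tREAD_C/2\t100.000\t75\t0\t0\t1\t75\t1\t75\t1e-35\t139\n"
    )
    for query, subject in best_hits(example).items():
        print(query, subject)
```

If the database was built with `-parse_seqids`, `blastdbcmd -entry` should then be able to pull the full original sequence by name.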
-
Hmm, I see. That is a little more problematic then, isn't it?
It does sound like blat or blast are your only options. You could try splitting that huge file of reads into a number of chunks, running makeblastdb on each chunk individually, and then concatenating the search results from each chunk afterwards. Blat has the advantage that you don't need to format your database. If you have Kent's src, you could use a few of its tools (e.g. faSplit) to speed this job up, again running multiple queries on individual chunks.
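The chunking idea can be sketched roughly as follows. This assumes the reads have already been converted to FASTA (makeblastdb does not take FASTQ), and the file names and chunk size are placeholders. The script only builds the makeblastdb command for each chunk rather than running it.

```python
import os
import tempfile


def split_fasta(path, records_per_chunk, out_prefix):
    """Write consecutive chunks of at most records_per_chunk FASTA records; return chunk paths."""
    chunk_paths = []
    out = None
    count = 0
    with open(path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Start a new chunk file every records_per_chunk headers.
                if count % records_per_chunk == 0:
                    if out:
                        out.close()
                    chunk_path = f"{out_prefix}{len(chunk_paths):04d}.fa"
                    out = open(chunk_path, "w")
                    chunk_paths.append(chunk_path)
                count += 1
            if out:
                out.write(line)
    if out:
        out.close()
    return chunk_paths


def makeblastdb_command(chunk_path):
    """Build (but do not run) a makeblastdb call for one nucleotide chunk."""
    db_name = os.path.splitext(chunk_path)[0]
    return ["makeblastdb", "-in", chunk_path, "-dbtype", "nucl", "-out", db_name]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    reads = os.path.join(workdir, "reads.fa")  # placeholder input file
    with open(reads, "w") as fh:
        for i in range(5):
            fh.write(f">read_{i}\nACGTACGT\n")
    for chunk in split_fasta(reads, 2, os.path.join(workdir, "chunk_")):
        print(" ".join(makeblastdb_command(chunk)))
```

Each chunk database can then be searched separately and the tabular results concatenated. Note that e-values depend on database size, so hits from separate chunks are not strictly comparable by e-value alone.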
If you have access to a cluster, you could take advantage of hundreds of cores with blat. Just be clever about the faSplit output names, then use qsub -t and the $PBS_ARRAYID variable in your command so that each array task picks up its own chunk. Blat also has the advantage that the resulting .psl files are easy to merge and filter.
It doesn't sound like fun to me, but it might just be done within a day if you have the right resources and a well-thought-out plan.
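As a rough illustration of the array-job idea, each task could map its $PBS_ARRAYID to one chunk and build the matching blat call. This is only a sketch: the chunk naming scheme (`chunk_0000.fa`), the query file name, and the output names are assumptions, and should be adjusted to whatever names faSplit actually produced.

```python
import os


def chunk_for_task(task_id, prefix="chunk_", width=4, suffix=".fa"):
    """Return the chunk file name for an array task (naming scheme is an assumption)."""
    return f"{prefix}{task_id:0{width}d}{suffix}"


def blat_command(database_chunk, query_fasta, task_id):
    """Build a 'blat database query output.psl' call with a per-task output file."""
    out_psl = f"hits_{task_id:04d}.psl"
    return ["blat", database_chunk, query_fasta, out_psl]


if __name__ == "__main__":
    # PBS_ARRAYID is set by the scheduler for jobs submitted with qsub -t;
    # default to 0 so the script can also be tried outside the cluster.
    task_id = int(os.environ.get("PBS_ARRAYID", "0"))
    cmd = blat_command(chunk_for_task(task_id), "chimeric_reads.fa", task_id)
    print(" ".join(cmd))
```

Submitted with something like `qsub -t 0-99`, each task would then search one chunk, leaving one .psl file per chunk to merge afterwards.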
-
I left the makeblastdb command running while I tried working with the cluster, and it actually finished populating last night. I am currently running my BLAST search. I appreciate all the help, and if my blast fails I may consider the bwa/bowtie suggestion. Thank you.