Hello everyone. I’m analyzing some transcriptome data. I’m not quantifying expression at all; I only want to search for sequences based on a query file of my genes of interest. I have paired-end reads, and I’m working with the reads directly rather than a fully assembled transcriptome. I’m new to this area, so I’d appreciate some advice.
What I have done so far is merge the R1/R2 files and convert them from FASTQ to FASTA using the FASTX-Toolkit. I then used the makeblastdb command to build the BLAST database from the resulting FASTA file. I know that I’m supposed to get .nhr, .nin, and .nsq files, but I think the database is so big that I got something like this:
DB.00.nsq, DB.00.nin, DB.00.nhr
DB.01.nsq, DB.01.nin, DB.01.nhr
and so on.
So here’s the first question: is this a problem? Or will I have to BLAST my query file against each volume (00, 01, and so on) one at a time?
Also, before I get too far into this, I’d like to know whether there is any reason I shouldn’t be merging the read files and building a single database from them.
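In case it helps to see what I mean by the merge and conversion step, here is a minimal Python sketch of it. This is not the FASTX-Toolkit itself, just an illustration of the same idea. The function name `fastq_to_fasta` and the `/1` and `/2` suffixes are my own assumptions, added so the two mates of a pair don't end up with identical IDs in the merged file. It also assumes plain four-line FASTQ records.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle, suffix=""):
    """Convert four-line FASTQ records to FASTA.

    `suffix` (hypothetical, e.g. "/1" or "/2") is appended to each read ID.
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        header = header.rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"Expected a FASTQ header, got: {header!r}")
        # Keep only the read ID (text before the first whitespace).
        read_id = header[1:].split()[0]
        fasta_handle.write(f">{read_id}{suffix}\n{seq}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the R1 and R2 files.
    r1 = io.StringIO("@read1 1:N:0\nACGT\n+\nIIII\n")
    r2 = io.StringIO("@read1 2:N:0\nTTGA\n+\nIIII\n")
    merged = io.StringIO()
    fastq_to_fasta(r1, merged, suffix="/1")
    fastq_to_fasta(r2, merged, suffix="/2")
    print(merged.getvalue(), end="")
```

With real data, the in-memory handles would be replaced by opened R1/R2 files and one output FASTA, which is then what gets passed to makeblastdb.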
Thank you for taking the time to read this!