**maubp** · 09-26-2013, 12:40 PM

What do you mean by a 'binary of COGS'? You need a (plain text) FASTA file to make a BLAST database, or to use as BLAST queries.

**milo0615** · 09-26-2013, 05:33 PM

Originally posted by maubp:

What do you mean by a 'binary of COGS'? You need a (plain text) FASTA file to make a BLAST database, or to use as BLAST queries.

Hi maubp,

By binary of COGS I mean a folder containing 350 FASTA files that I want to BLAST against 20 different contigs.fa assembly files generated by Abyss, and then, based on the results, pick the best alignment. Is there a way to batch BLAST all COGS against all assembly files? I would really appreciate your help.

**mike.t** · 09-26-2013, 06:11 PM

I guess what you want to know is which assembly has all 350 of your COGs? If that's the case, then you'd want to make 20 blast databases, one for each assembly. Then concatenate (join) all of the COG sequences into one fasta file somehow (easy to do on the Linux or Mac command line). Then on the command line you can run one batch blast of all 350 sequences at once. Do that for each blast database and you will have 20 huge blast reports...
You can maybe limit the blast report to show only one hit per COG sequence and also format it for tab-delimited text which you could import into a spreadsheet to examine somehow.
Or you could use a desktop application like Geneious to do this.
If it were me I would either use Geneious or write a script to analyze the blast reports.
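As a rough illustration of the "write a script" option, here is a minimal Python sketch that keeps only the highest-bitscore hit per query from a tab-delimited report. It assumes the report was written with the default 12-column -outfmt 6 layout; the function name `best_hits` and the example rows are made up for illustration and are not real results.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query id: row dict}, keeping the highest-bitscore row per query."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the comment lines of -outfmt 7
        row = dict(zip(COLUMNS, fields))
        query = row["qseqid"]
        score = float(row["bitscore"])
        if query not in best or score > float(best[query]["bitscore"]):
            best[query] = row
    return best


# Hypothetical rows for illustration only.
example = io.StringIO(
    "COG1\tcontig_7\t91.2\t300\t26\t0\t1\t300\t10\t309\t1e-90\t410\n"
    "COG1\tcontig_9\t80.0\t120\t24\t0\t1\t120\t5\t124\t1e-20\t95.3\n"
    "COG2\tcontig_3\t99.0\t200\t2\t0\t1\t200\t1\t200\t1e-100\t365\n"
)
hits = best_hits(example)
print(sorted((query, row["sseqid"]) for query, row in hits.items()))
```

If you ask for a custom column list in -outfmt, `COLUMNS` would need to be changed to match it.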

**Kennels** · 09-26-2013, 06:24 PM

Yes you can do the blast in one go.

1. Combine your 350 fasta files into 1 file. On the command line you can simply use the 'cat' command. Make sure the header for each sequence is unique.

2. Do the same for your assemblies from Abyss. Note, however, that contigs in different assemblies might have the same header name if the assemblies were created separately. You will need to rename the headers in each assembly so that they are specific to that assembly,
e.g. from >k40_000001 to something like >asm1_k40_000001 for 1st assembly, >asm2_....
etc.

If however the headers are already unique don't worry about this.

3. Create a blast database on the combined assemblies from Abyss

4. Run blast. You might want to make it output only 1 match per sequence with the '-max_target_seqs' parameter (set it to -max_target_seqs 1), and output a table format for easy parsing with the '-outfmt' parameter (set it to -outfmt 6).
(Note, however, that if you make it output only the best hit, you might be missing out on other information. Play around with the outfmt parameter to get a format you like.)

You can get a full explanation of the blastn commands by typing 'blastn -help'
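The header renaming in step 2 could be sketched like this in Python. This is only an illustration: the function names and the `asm1`/`asm2` labels are assumptions, and the sequences are dummy data. It works on lists of lines so it can be tried without any files.

```python
def prefix_headers(fasta_lines, prefix):
    """Yield FASTA lines with prefix + "_" inserted right after each ">"."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield ">" + prefix + "_" + line[1:]
        else:
            yield line  # sequence lines pass through unchanged


def merge_assemblies(assemblies):
    """assemblies maps a label to an iterable of FASTA lines; returns merged lines."""
    merged = []
    for label, lines in assemblies.items():
        merged.extend(prefix_headers(lines, label))
    return merged


# Two dummy assemblies that share a contig name.
demo = {
    "asm1": [">k40_000001\n", "ACGTACGT\n"],
    "asm2": [">k40_000001\n", "TTGACCA\n"],
}
merged = merge_assemblies(demo)
headers = [line.strip() for line in merged if line.startswith(">")]
print(headers)
```

For real files you would pass open file handles instead of lists and write `merged` out to a single FASTA file before building the database.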

**milo0615** · 10-03-2013, 03:12 PM

Thank you all for your help. I will give it a try and let you know if I have any problems.

**milo0615** · 10-05-2013, 11:31 AM

Hello,

So I am having issues running the following blastx commands:

1.) "blastx -db ../gjhk34 -query verifyCOSIIcombined.fasta -evalue 0.00001 -max_target_seqs 1 -num_threads 8 -outfmt '10 qseqid qacc qlen qframe qstart qend qseq sseqid sacc slen sframe sstart send sseq pident nident length mismatch positive ppos gapopen gaps evalue bitscore score' -out output.blastx.csv", but I get the following error:

BLAST Database error: No alias or index file found for protein database [../gjhk34] in search path [/home/youngsook/Documents/blast::]

2.)"blastx -db "../gjhsk34/gjk34-contigs.fa" -query verifyCOSIIcombined.fasta -evalue 0.00001 -max_target_seqs 1 -num_threads 8 -outfmt '10 qseqid qacc qlen qframe qstart qend qseq sseqid sacc slen sframe sstart send sseq pident nident length mismatch positive ppos gapopen gaps evalue bitscore score' -out output.blastx.csv" and runs perfectly.

My question is: after creating the database, do I run it against the .fa file or against any of the ".phr", ".pin", ".psq", ".pal" database files? I noticed in other examples that the database name in the command has no extension, like "nr", and it still works.

I really appreciate your help.

Thank you,

-Milo

**maubp** · 10-05-2013, 12:00 PM

If your database files are named gjhk34.phr, gjhk34.pin, etc, the database name is just gjhk34 only.

If your database files are named gjhk34.fa.phr, gjhk34.fa.pin, etc, the database name is gjhk34.fa instead.

You can have either of these situations from a FASTA file gjhk34.fa depending on the options you used for makeblastdb.
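To make the naming rule concrete, here is a small Python sketch that works out the database name(s) from a list of file names by stripping the index extension. The helper name `database_names` is made up, and it only handles the common single-volume case (.phr/.pin/.psq for protein, .nhr/.nin/.nsq for nucleotide), not multi-volume databases.

```python
import os

# Index-file extensions written by makeblastdb (protein: p*, nucleotide: n*).
INDEX_EXTENSIONS = {".phr", ".pin", ".psq", ".nhr", ".nin", ".nsq"}


def database_names(filenames):
    """Return the sorted names to pass to -db, given the files in a directory."""
    names = set()
    for name in filenames:
        base, ext = os.path.splitext(name)
        if ext in INDEX_EXTENSIONS:
            names.add(base)  # everything before the index extension
    return sorted(names)


print(database_names(["gjhk34.phr", "gjhk34.pin", "gjhk34.psq"]))
print(database_names(["gjhk34.fa", "gjhk34.fa.phr", "gjhk34.fa.pin"]))
```

In practice you could feed it `os.listdir()` of the directory that holds the database files.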

**milo0615** · 10-07-2013, 11:19 AM

Hello,

The command that I used to create the database is:

makeblastdb -in gjhk34-contigs.fa -dbtype prot -parse_seqids

However, I am running the following command:

"blastx -db "../gjhsk34/gjk34-contigs.fa" -query verifyCOSIIcombined.fasta -evalue 0.00001 -max_target_seqs 1 -num_threads 8 -outfmt '10 qseqid qacc qlen qframe qstart qend qseq sseqid sacc slen sframe sstart send sseq pident nident length mismatch positive ppos gapopen gaps evalue bitscore score' -out output.blastx.csv"

After blastx is done running, I get a csv file but it is empty, even if I change the output to a .out file. Do you know what I am doing wrong or why it is generating an empty output?

Once again, thank you for your help.

**GenoMax** · 10-07-2013, 11:25 AM

Why are you including the quotes here?

-db "../gjhsk34/gjk34-contigs.fa"

Just checking to confirm that "gjk34-contigs.fa" contains protein sequences (since you are doing blastx).

When debugging a problem like this, make a test file with just one query sequence. This way you can debug problems rapidly instead of waiting for the full set to go through.

**milo0615** · 10-07-2013, 12:01 PM

Hi GenoMax,

I am including quotes to point to my database location, but even if I don't include the quotes I still get an empty output.

I am pretty sure "gjk34-contigs.fa" is a protein sequence. I created the database with "-dbtype prot". So basically what you are saying is that if my contigs.fa file is not a protein sequence, I first need to translate it to protein and then create the database? Below is a screenshot of the gjk34-contigs.fa file...

**GenoMax** · 10-07-2013, 12:09 PM

The screenshot did not come through. Use the "Go advanced" button as you are editing the message. That will allow you to attach PNG files to your post.

If "gjk34-contigs.fa" has DNA sequence (which, looking at the name, may be the case) you will need to use "tblastx" if you want to do a translated query/db search.

NOTE: Just checked the screenshot link in your post (https://www.dropbox.com/s/3uznelmn1d...contigs.fa.png). That is indeed DNA sequence. So that is the reason you are not getting anything in the output. You can't do a "blastx" search against a DNA database.
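A quick way to sanity-check this before building a database is sketched below in Python. The 0.9 threshold and the function names are assumptions for illustration, and the guess is only a heuristic based on how much of the sequence is made of A, C, G, T, U or N. The lookup table reflects which BLAST+ program goes with which query/database combination.

```python
# Program to use for each (query type, database type) combination.
PROGRAMS = {
    ("nucl", "nucl"): "blastn",
    ("nucl", "prot"): "blastx",
    ("prot", "nucl"): "tblastn",
    ("prot", "prot"): "blastp",
}


def guess_type(sequence, threshold=0.9):
    """Heuristically return "nucl" or "prot" for a sequence string."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return "nucl" if nucleotide_like / len(letters) >= threshold else "prot"


def choose_program(query_type, db_type, translated=False):
    """Pick a BLAST+ program; translated=True selects tblastx for DNA vs DNA."""
    if translated and (query_type, db_type) == ("nucl", "nucl"):
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]


contig = "ACGTTGCANNACGTTAGC"   # dummy DNA
protein = "MKVLAAGIVGLLLAQ"     # dummy protein
print(guess_type(contig), guess_type(protein))
print(choose_program("nucl", guess_type(contig), translated=True))
```

The same check tells you which -dbtype to give makeblastdb: "nucl" for contigs like these, "prot" only for real protein sequences.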

**maubp** · 10-08-2013, 12:43 AM

Originally posted by milo0615:

After blastx is done running, I get a csv file but it is empty, even if I change the output to a .out file. Do you know what I am doing wrong or why it is generating an empty output?

If there are no BLAST hits, then the tabular and csv output would be empty.

Try asking for commented tabular, commented csv, or the default plain text output to double check this.

**milo0615** · 10-12-2013, 07:45 PM

Hi All,

Yes, I had to re-create my database and now it works perfectly. However, I do have a few more questions:

- What would be the best way or the best practice to analyze all of the blast results to check for the assembly with the most hits?

- Is there a free application that would help with the analysis?

I was thinking about exporting all the blast results into excel and then analyze them from there....

Thank you

**rhinoceros** · 10-13-2013, 03:44 AM

Originally posted by milo0615:

Hi All,

Yes, I had to re-create my database and now it works perfectly. However, I do have a few more questions:

- What would be the best way or the best practice to analyze all of the blast results to check for the assembly with the most hits?

- Is there a free application that would help with the analysis?

I was thinking about exporting all the blast results into excel and then analyze them from there....

Thank you

The CLI is by far the most efficient way to handle large tables. Google: man sort, man awk, man sed, man grep, man cut, and man paste. There are related threads in this forum too. Then use R for statistical analysis and plotting.
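If you would rather script it than chain sort/cut/awk, the same kind of tally can be sketched in Python. This assumes one tabular (-outfmt 6 or 7) report per assembly with the query id in the first column; the function names and the example rows are made up for illustration and are not real results.

```python
def queries_with_hits(tabular_text):
    """Return the set of query ids that have at least one hit in a report."""
    return {
        line.split("\t")[0]
        for line in tabular_text.splitlines()
        if line and not line.startswith("#")  # skip blanks and -outfmt 7 comments
    }


def rank_assemblies(reports):
    """reports maps an assembly label to its report text; most hits first."""
    counts = {label: len(queries_with_hits(text)) for label, text in reports.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reports for two assemblies.
reports = {
    "asm1": "COG1\tcontig_7\t91.2\nCOG1\tcontig_9\t80.0\n",
    "asm2": "# BLAST comment line\nCOG1\tcontig_2\t95.0\nCOG2\tcontig_5\t88.1\n",
}
ranking = rank_assemblies(reports)
print(ranking)
```

Counting distinct queries rather than rows keeps one COG with many hits from inflating an assembly's score.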


Standalone Blast+ Database Help
