  • Standalone Blast+ Database Help

    Hello,

    I have installed Blast+ on my local computer; however, I am confused about the database creation. I have a binary of COGS that I want to blast against different alignments generated from Abyss to select the best one. Therefore I have the following questions:

    1. When creating the database, is there a way to create it from the whole binary at once, or do I have to do it file by file?

    2. How do I batch blast all the COGS against the different alignments generated from Abyss?

    I will really appreciate your help.

    Thank you

  • #2
    What do you mean by a 'binary of COGS'? You need a (plain text) FASTA file to make a BLAST database, or to use as BLAST queries.

    • #3
      Originally posted by maubp
      What do you mean by a 'binary of COGS'? You need a (plain text) FASTA file to make a BLAST database, or to use as BLAST queries.
      Hi maubp,

      By binary of COGS I mean a folder containing 350 FASTA files that I want to blast against 20 different contigs.fa assembly files generated by Abyss, and then based on the results pick the best alignment. Is there a way to batch BLAST all COGS against all assembly files? I would really appreciate your help.

      • #4
        I guess what you want to know is which assembly has all 350 of your COGs? If that's the case, then you'd want to make 20 blast databases, one for each assembly. Then concatenate (join) all of the COG sequences into one fasta file (easy to do on the linux or mac command line). Then on the command line you can run one batch blast of all 350 sequences at once. Do that for each blast database and you will have 20 huge blast reports...
        You can maybe limit the blast report to show only one hit per COG sequence and also format it for tab-delimited text which you could import into a spreadsheet to examine somehow.
        Or you could use a desktop application like Geneious to do this.
        If it were me I would either use Geneious or write a script to analyze the blast reports.
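A minimal sketch of the one-database-per-assembly approach described above. The directory layout (`assemblies/<name>/contigs.fa`), the combined query file `all_cogs.fa`, the output folders `db/` and `results/`, and the choice of `blastn` are all assumptions for illustration. `blastn` only fits if the COG sequences are nucleotide; for protein queries against DNA contigs the matching program would be `tblastn`. The script only writes the commands to a job list so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: write one makeblastdb + blastn command pair per assembly.
# Assumed (hypothetical) layout: assemblies/<name>/contigs.fa plus one query file.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Toy stand-ins for the real ABySS output directories.
for name in asm1 asm2; do
    mkdir -p "assemblies/$name"
    printf '>k40_1\nACGTACGTAC\n' > "assemblies/$name/contigs.fa"
done
printf '>cog1\nACGTACGT\n' > all_cogs.fa

joblist=blast_jobs.sh
echo "mkdir -p db results" > "$joblist"
for fasta in assemblies/*/contigs.fa; do
    name=$(basename "$(dirname "$fasta")")
    # -dbtype nucl because the contigs are DNA; prot is only for protein FASTA.
    echo "makeblastdb -in $fasta -dbtype nucl -out db/$name" >> "$joblist"
    echo "blastn -query all_cogs.fa -db db/$name -evalue 1e-5 -outfmt 6 -out results/$name.tsv" >> "$joblist"
done

# Review the generated commands; run them later with: sh blast_jobs.sh
cat "$joblist"
```

With BLAST+ installed, running the job list would leave one tabular report per assembly in `results/`.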

        • #5
          Yes you can do the blast in one go.

          1. Combine your 350 fasta files into 1 file. On the command line you can simply use the 'cat' command. Make sure the header for each sequence is unique.

          2. Do the same for your assemblies from Abyss. Note however your contigs in your assemblies might have the same header name if they were created separately. You will need to somehow rename the header for each assembly to be specific for that assembly
          e.g. from >k40_000001 to something like >asm1_k40_000001 for 1st assembly, >asm2_....
          etc.

          If however the headers are already unique don't worry about this.

          3. Create a blast database on the combined assemblies from Abyss

          4. Run blast. You might want to make it only output 1 match per sequence by the '-max_target_seqs' parameter (set it to -max_target_seqs 1), and output a table format for easy parsing using the '-outfmt' parameter (set it to -outfmt 6).
          (note however if you make it output only the best hit, you might be missing on other information. Play around with the outfmt parameter to get a format you like).

          You can get a full explanation of the blastn commands by typing 'blastn -help'
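Steps 1 and 2 above can be sketched as follows. The folder names (`cogs`, `asm1`, `asm2`) and output file names are hypothetical, and the toy sequences only exist so the snippet runs on its own. Steps 3 and 4 are shown as comments because they need BLAST+ installed and assume nucleotide queries.

```shell
#!/bin/sh
# Sketch: combine FASTA files and make contig headers unique per assembly.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs (hypothetical names): two COG files and two assemblies.
mkdir cogs asm1 asm2
printf '>cog1\nACGT\n' > cogs/cog1.fasta
printf '>cog2\nGGCC\n' > cogs/cog2.fasta
printf '>k40_000001\nACGTACGT\n' > asm1/contigs.fa
printf '>k40_000001\nTTTTACGT\n' > asm2/contigs.fa

# Step 1: one query file from all COG files.
# (Each input must end with a newline, or records get glued together.)
cat cogs/*.fasta > all_cogs.fa

# Step 2: prefix every header with the assembly name, then combine.
: > all_assemblies.fa
for dir in asm1 asm2; do
    sed "s/^>/>${dir}_/" "$dir/contigs.fa" >> all_assemblies.fa
done

# Any header printed here is still duplicated and needs renaming.
grep '^>' all_assemblies.fa | sort | uniq -d

# Steps 3 and 4 would then be, assuming nucleotide queries and BLAST+ on PATH:
#   makeblastdb -in all_assemblies.fa -dbtype nucl -out all_assemblies
#   blastn -query all_cogs.fa -db all_assemblies -max_target_seqs 1 -outfmt 6 -out hits.tsv
```

Because the assembly name is now part of every subject ID, a single tabular report is enough to tell which assembly each hit came from.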

          • #6
            Thank you all for your help. I will give it a try and let you know if I have any problems.

            • #7
              Hello,

              So I am having issues running the following blastx commands:

              1.) "blastx -db ../gjhk34 -query verifyCOSIIcombined.fasta -evalue 0.00001 -max_target_seqs 1 -num_threads 8 -outfmt '10 qseqid qacc qlen qframe qstart qend qseq sseqid sacc slen sframe sstart send sseq pident nident length mismatch positive ppos gapopen gaps evalue bitscore score' -out output.blastx.csv", but I get the following error:

              BLAST Database error: No alias or index file found for protein database [../gjhk34] in search path [/home/youngsook/Documents/blast::]


              2.)"blastx -db "../gjhsk34/gjk34-contigs.fa" -query verifyCOSIIcombined.fasta -evalue 0.00001 -max_target_seqs 1 -num_threads 8 -outfmt '10 qseqid qacc qlen qframe qstart qend qseq sseqid sacc slen sframe sstart send sseq pident nident length mismatch positive ppos gapopen gaps evalue bitscore score' -out output.blastx.csv" and runs perfectly.

              My question is: After creating the database, do I run it against the .fa file or against any of the ".phr", ".pin", ".psq", ".pal" database files? I noticed in other examples that the database named in the command, such as "nr", has no file extension and still works.

              I really appreciate your help.

              Thank you,

              -Milo

              • #8
                If your database files are named gjhk34.phr, gjhk34.pin, etc, the database name is just gjhk34 only.

                If your database files are named gjhk34.fa.phr, gjhk34.fa.pin, etc, the database name is gjhk34.fa instead.

                You can have either of these situations from a FASTA file gjhk34.fa depending on the options you used for makeblastdb.
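A small sketch of how to read the database name off the files that makeblastdb wrote: strip the index extension and pass what remains to `-db`. The file names below are empty stand-ins created only for illustration. By default makeblastdb names the database after the input file (extension included), and the `-out` option changes that.

```shell
#!/bin/sh
# Sketch: list the names that -db would accept in a directory,
# by stripping the index extension from the header files.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Toy stand-ins for real makeblastdb output (empty files, names only).
touch gjhk34.phr gjhk34.pin gjhk34.psq
touch contigs.fa.nhr contigs.fa.nin contigs.fa.nsq

# .phr marks a protein database, .nhr a nucleotide one.
for f in *.phr *.nhr; do
    [ -e "$f" ] || continue          # skip glob patterns that matched nothing
    case "$f" in
        *.phr) echo "${f%.phr} (protein)" ;;
        *.nhr) echo "${f%.nhr} (nucleotide)" ;;
    esac
done > db_names.txt

cat db_names.txt
```

The extension also shows which kind of database was built, which is worth checking before choosing between blastn, blastx, tblastn and tblastx.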

                • #9
                  Hello,

                  The command that I used to create the database is:

                  makeblastdb -in gjhk34-contigs.fa -dbtype prot -parse_seqids

                  However, I am running the following command:

                  "blastx -db "../gjhsk34/gjk34-contigs.fa" -query verifyCOSIIcombined.fasta -evalue 0.00001 -max_target_seqs 1 -num_threads 8 -outfmt '10 qseqid qacc qlen qframe qstart qend qseq sseqid sacc slen sframe sstart send sseq pident nident length mismatch positive ppos gapopen gaps evalue bitscore score' -out output.blastx.csv"

                  After blastx is done running, I get a csv file but it is empty, even if I change the output to a .out file. Do you know what I am doing wrong or why it is generating an empty output?

                  Once again, thank you for your help.

                  • #10
                    Why are you including the quotes here?

                    -db "../gjhsk34/gjk34-contigs.fa"
                    Just checking to confirm that "gjk34-contigs.fa" contains protein sequences (since you are doing blastx).

                    When debugging a problem like this, make a test file with just one query sequence. This way you can debug problems rapidly instead of waiting for the full set to go through.
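One way to make such a single-sequence test file is to keep only the first FASTA record with awk. The file names `queries.fa` and `test_query.fa` are placeholders, and the toy input is there only so the snippet runs on its own.

```shell
#!/bin/sh
# Sketch: copy only the first FASTA record into a small test file.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Toy multi-record query file (hypothetical content).
printf '>q1 first\nACGT\nGGCC\n>q2 second\nTTTT\n' > queries.fa

# n counts header lines; lines are printed only while n is 1,
# i.e. the first header and every sequence line that belongs to it.
awk '/^>/ { n++ } n == 1' queries.fa > test_query.fa

cat test_query.fa
```

The resulting file can be passed to `-query` in place of the full set while the command line is being debugged.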
                    Last edited by GenoMax; 10-07-2013, 11:38 AM.

                    • #11
                      Hi GenoMax,

                      I am including quotes to point to my database location, but even if I don't include the quotes I still get an empty output.

                      I am pretty sure "gjk34-contigs.fa" is a protein sequence. I created the database with the "-dbtype prot." So basically what you are saying is that if my contigs.fa file is not a protein sequence, I first need to translate it to protein and then create the database? Below is a screenshot of the gjk34-contigs.fa file...

                      • #12
                        The screenshot did not come through. Use the "Go advanced" button as you are editing the message. That will allow you to attach PNG files to your post.

                        If "gjk34-contigs.fa" has DNA sequence (which, looking at the name, may be the case) you will need to use "tblastx" if you want to do a translated query/db search.

                        NOTE: Just checked the screenshot link in your post (https://www.dropbox.com/s/3uznelmn1d...contigs.fa.png). That is indeed DNA sequence. So that is the reason you are not getting anything in the output. You can't do a "blastx" search against a DNA database.
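A rough way to check a FASTA file before choosing `-dbtype` is to count how many residues are nucleotide letters. This is only a heuristic sketch: the function name `guess_type` and the 0.9 cutoff are assumptions, and a short protein made mostly of A, C, G and T residues would fool it.

```shell
#!/bin/sh
# Sketch: guess whether a FASTA file holds nucleotide or protein sequence.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs (hypothetical content).
printf '>contig1\nACGTNNACGTAC\nGGCCTTAA\n' > dna.fa
printf '>prot1\nMKVLAAGIVGLLQ\nPEFWHDRS\n' > prot.fa

guess_type() {
    # On non-header lines, count all residues and those that are A, C, G, T, U or N.
    awk '!/^>/ {
             seq = toupper($0)
             total += length(seq)
             nuc += gsub(/[ACGTUN]/, "", seq)   # gsub returns the number of matches
         }
         END {
             if (total == 0)             print "empty"
             else if (nuc / total > 0.9) print "nucleotide"
             else                        print "protein"
         }' "$1"
}

guess_type dna.fa
guess_type prot.fa
```

A file reported as nucleotide belongs in a database built with `-dbtype nucl`, which is then searched with blastn, tblastn or tblastx rather than blastx.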
                        Last edited by GenoMax; 10-07-2013, 12:16 PM.

                        • #13
                          Originally posted by milo0615
                          After blastx is done running, I get a csv file but it is empty, even if I change the output to a .out file. Do you know what I am doing wrong or why it is generating an empty output?
                          If there are no BLAST hits, then the tabular and csv output would be empty.

                          Try asking for commented tabular, commented csv, or the default plain text output to double check this.

                          • #14
                            Hi All,

                            Yes, I had to re-create my database and now it works perfectly. However, I do have a few more questions:

                            - What would be the best way or the best practice to analyze all of the blast results to check for the assembly with the most hits?

                            - Is there a free application that would help with the analysis?

                            I was thinking about exporting all the blast results into excel and then analyze them from there....

                            Thank you

                            • #15
                              Originally posted by milo0615
                              Hi All,

                              Yes, I had to re-create my database and now it works perfectly. However, I do have a few more questions:

                              - What would be the best way or the best practice to analyze all of the blast results to check for the assembly with the most hits?

                              - Is there a free application that would help with the analysis?

                              I was thinking about exporting all the blast results into excel and then analyze them from there....

                              Thank you
                              The CLI is by far the most efficient way to handle large tables. Google: man sort, man awk, man sed, man grep, man cut, and man paste. There are related threads in this forum too. Then use R for statistical analysis and plotting.
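As a sketch of that command-line approach, the loop below counts how many distinct queries produced at least one hit in each assembly's tabular report. It assumes one `-outfmt 6` file per assembly in a hypothetical `results/` folder, with the query ID in column 1 (the default column order). The toy tables are made-up rows used only to make the snippet runnable.

```shell
#!/bin/sh
# Sketch: rank assemblies by the number of distinct queries with a hit.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
mkdir results

# Toy outfmt 6 tables (made-up rows); column 1 is the query ID.
printf 'cog1\tc1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n'  > results/asm1.tsv
printf 'cog2\tc2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150\n' >> results/asm1.tsv
printf 'cog2\tc3\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-20\t90\n'     >> results/asm1.tsv
printf 'cog1\tc9\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-48\t170\n'  > results/asm2.tsv

for table in results/*.tsv; do
    name=$(basename "$table" .tsv)
    # Distinct query IDs, so several hits for one COG count once.
    hits=$(cut -f1 "$table" | sort -u | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$name" "$hits"
done | sort -k2,2nr > summary.tsv

# Assembly with the most COGs recovered is listed first.
cat summary.tsv
```

The two-column summary can be opened in a spreadsheet or read into R for plotting.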
                              Last edited by rhinoceros; 10-13-2013, 04:35 AM.
                              savetherhino.org
