
  • #16
    If you run "blastx -help" on the command line you will get all options for blastx. One of the sections is for output formats. Default format is "0".

     -outfmt <String>
       alignment view options:
         0 = pairwise,
         1 = query-anchored showing identities,
         2 = query-anchored no identities,
         3 = flat query-anchored, show identities,
         4 = flat query-anchored, no identities,
         5 = XML Blast output,
         6 = tabular,
         7 = tabular with comment lines,
         8 = Text ASN.1,
         9 = Binary ASN.1,
        10 = Comma-separated values,
        11 = BLAST archive format (ASN.1) 
        12 = JSON Seqalign output 
       Options 6, 7, and 10 can be additionally configured to produce
       a custom format specified by space delimited format specifiers.
       The supported format specifiers are:
                qseqid means Query Seq-id
                   qgi means Query GI
                  qacc means Query accesion
               qaccver means Query accesion.version
                  qlen means Query sequence length
                sseqid means Subject Seq-id
             sallseqid means All subject Seq-id(s), separated by a ';'
                   sgi means Subject GI
                sallgi means All subject GIs
                  sacc means Subject accession
               saccver means Subject accession.version
               sallacc means All subject accessions
                  slen means Subject sequence length
                qstart means Start of alignment in query
                  qend means End of alignment in query
                sstart means Start of alignment in subject
                  send means End of alignment in subject
                  qseq means Aligned part of query sequence
                  sseq means Aligned part of subject sequence
                evalue means Expect value
              bitscore means Bit score
                 score means Raw score
                length means Alignment length
                pident means Percentage of identical matches
                nident means Number of identical matches
              mismatch means Number of mismatches
              positive means Number of positive-scoring matches
               gapopen means Number of gap openings
                  gaps means Total number of gaps
                  ppos means Percentage of positive-scoring matches
                frames means Query and subject frames separated by a '/'
                qframe means Query frame
                sframe means Subject frame
                  btop means Blast traceback operations (BTOP)
               staxids means unique Subject Taxonomy ID(s), separated by a ';'
                             (in numerical order)
             sscinames means unique Subject Scientific Name(s), separated by a ';'
             scomnames means unique Subject Common Name(s), separated by a ';'
            sblastnames means unique Subject Blast Name(s), separated by a ';'
                             (in alphabetical order)
            sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                             (in alphabetical order) 
                stitle means Subject Title
            salltitles means All Subject Title(s), separated by a '<>'
               sstrand means Subject Strand
                 qcovs means Query Coverage Per Subject
               qcovhsp means Query Coverage Per HSP
       When not provided, the default value is:
       'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
       evalue bitscore', which is equivalent to the keyword 'std'
       Default = `0'
    When using a multi-FASTA input file, each sequence will produce its own output block, which starts with a "Query=" line and ends with the "Effective search space…" line.
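
A minimal sketch of how those per-query blocks could be pulled apart in Python, assuming the report has already been read into a list of lines. The function name `split_query_blocks` and the sample text are made up for illustration; real reports contain much more between the two marker lines.

```python
from typing import Iterator, List


def split_query_blocks(lines: List[str]) -> Iterator[List[str]]:
    """Yield one list of lines per query in a pairwise (outfmt 0) report."""
    block: List[str] = []
    in_block = False
    for line in lines:
        if line.startswith("Query="):
            # A new query starts; begin collecting its lines.
            block = [line]
            in_block = True
        elif in_block:
            block.append(line)
            if line.startswith("Effective search space"):
                # Last line of this query's block.
                yield block
                in_block = False
    if in_block:
        # Report ended before the closing line; keep what was collected.
        yield block


# Tiny made-up report, used only to exercise the function.
sample = """BLASTX 2.2.30+
Query= seq1
 (some alignment text)
Effective search space used: 1000
Query= seq2
***** No hits found *****
Effective search space used: 2000
""".splitlines()

blocks = list(split_query_blocks(sample))
print(len(blocks))      # number of query blocks found
print(blocks[1][0])     # first line of the second block
```

This only relies on the two marker lines mentioned above, so it is about as fragile as any other hand-rolled parser for the pairwise format.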


    • #17
      Originally posted by hlyates
      Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.
      It has been so long that I don't recall whether I ever read any formal documentation on the pairwise (default, -outfmt 0) format or just figured it out through trial and error.

      As for parsing these plain-text reports, I have in the past used both BioPerl's Bio::SearchIO and hand-rolled code; there are pluses and minuses to each. As a general recommendation, though, if you foresee needing to parse BLAST reports regularly, I would avoid the default pairwise output format altogether. The format wasn't really designed for automated parsing, so parsing code is easily broken. If you plan to do a lot of parsing, the two better choices are tabular and XML. If your needs are simple (e.g. hit IDs, e-values, start and end locations), tabular is the way to go: the output is small and very simple to parse. If you have more complex needs (e.g. capturing query/target alignments), have your BLAST job output XML and parse it using the available modules from BioPerl, Biopython, etc. Since XML is such a structured format, automated parsing is more robust. Also, since the XML format retains all of the information present in the pairwise format, it is possible to convert the interesting bits of the XML output into human-readable form (again using the 'Bio' modules).
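
To give an idea of how little code the tabular route needs, here is a rough sketch using only the Python standard library. It assumes the default 'std' column order from the help text above; the helper names `read_tabular` and `best_hits` and the example lines are invented for illustration and are not real BLAST results.

```python
import csv
import io

# Column order of the default tabular format ('std'), as listed in the help text.
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_tabular(handle):
    """Yield one dict per hit line, skipping the '#' comment lines of outfmt 7."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    for fields in csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE):
        hit = dict(zip(STD_COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit


def best_hits(hits):
    """Keep the highest-bitscore hit for each query."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up example lines standing in for a real output file.
example = io.StringIO(
    "# BLASTX comment line\n"
    "read1\tprotA\t98.5\t200\t3\t0\t1\t600\t10\t209\t1e-50\t250\n"
    "read1\tprotB\t80.0\t150\t30\t1\t1\t450\t5\t154\t1e-20\t120\n"
    "read2\tprotC\t91.2\t100\t9\t0\t4\t303\t1\t100\t2e-30\t180\n"
)
best = best_hits(read_tabular(example))
for query, hit in best.items():
    print(query, hit["sseqid"], hit["evalue"])
```

With a real file you would pass `open("hits.tsv")` (a placeholder name) instead of the `io.StringIO` object. If the job was run with a custom list of format specifiers, `STD_COLUMNS` has to be changed to match.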


      • #18
        Thank you GenoMax and kmcarr. I think I may run my job again and use tabular/XML output, so that I can more easily apply a script to it. If I understand you both, it seems Biopython can parse tabular output quite easily. I might go that route because I just need basic information such as:

        hit id
        input sequence (the input sequence that aligned with something in the nt database)
        target sequence organism id and name
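
One possible way to request just those fields is a custom tabular format built from the specifiers in the -outfmt help text earlier in the thread. The sketch below only assembles the command line in Python. The column choice, the function name `blastx_command`, and the file and database names are assumptions for illustration. The taxonomy columns (staxids, sscinames) typically come back as N/A unless the database carries taxonomy information.

```python
import shlex

# Specifiers taken from the -outfmt help text; which ones to request is an assumption.
COLUMNS = ["qseqid", "sseqid", "staxids", "sscinames", "evalue", "bitscore"]


def blastx_command(query, db, out, evalue="1e-5"):
    """Return the blastx argument list for a tabular run with custom columns."""
    # Format number first, then the space-delimited format specifiers.
    outfmt = "6 " + " ".join(COLUMNS)
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-evalue", evalue,
        "-outfmt", outfmt,
        "-out", out,
    ]


# File and database names below are placeholders.
cmd = blastx_command("reads.fasta", "my_protein_db", "hits.tsv")
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list keeps the multi-word -outfmt value together as a single argument, which is the part that is easy to get wrong when typing the command by hand.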

        You pros are great and if I knew you in person I would buy you both some drinks.


        • #19
          Provided you have access to a cluster, and if you are going to do this over, I suggest you break up your original file into multiple smaller ones and run the BLAST jobs in parallel. It would speed things up significantly.
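
A rough sketch of the splitting step in plain Python, in case it helps. The function names, the chunk size, and the output file naming are all arbitrary choices for illustration; how the resulting files are submitted as parallel jobs depends on the cluster's scheduler and is not shown.

```python
from typing import Iterable, Iterator, List


def fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            # A new header closes the previous record.
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], per_chunk: int) -> Iterator[List[str]]:
    """Group records into lists of at most per_chunk records."""
    chunk: List[str] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Made-up sequences; in practice iterate over an open file instead.
demo = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAAAA\n".splitlines(keepends=True)
chunks = list(chunk_records(fasta_records(demo), per_chunk=2))
for i, chunk in enumerate(chunks, start=1):
    # Each chunk would be written to its own file, e.g. chunk_1.fasta (hypothetical name).
    print(f"chunk_{i}.fasta: {len(chunk)} record(s)")
```

Because both functions are generators, the input file is streamed rather than loaded into memory at once.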
          Last edited by GenoMax; 05-11-2015, 02:08 PM.


          • #20
            Originally posted by GenoMax
            Order of operations:

            1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
            2. Make blast database using the fasta file.
            3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence IDs that show a hit from this table.
            4. Use the faFilter utility to get a subset that contains the sequences from your list that hit Drosophila proteins.
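
Steps 3 and 4 can be sketched in a few lines of Python: take the first column of the tabular output as the set of query IDs with a hit, then keep only those records from the original FASTA file. This is a plain-Python stand-in for the filtering step, not the faFilter utility itself, and the function names and sample data are made up for illustration.

```python
def hit_query_ids(tabular_lines):
    """Collect the query IDs (first column) that appear in tabular BLAST output."""
    ids = set()
    for line in tabular_lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t", 1)[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield only the lines of records whose ID (first word after '>') is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            yield line


# Made-up data standing in for the blastx table and the original query file.
table = ["read1\tFBpp0001\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-40\t200\n"]
fasta = [">read1 sample\n", "ACGTACGT\n", ">read2\n", "TTTTGGGG\n"]

kept = list(filter_fasta(fasta, hit_query_ids(table)))
print("".join(kept), end="")
```

The match is on the first whitespace-delimited word of each FASTA header, which assumes the query IDs in the table were derived the same way.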
            Since I am downloading a fasta file of proteins, when I make the database should the -dbtype option be:
            1. nucl? (I think this is nucleotide)
            2. prot (this is for proteins and hence what I should choose)?

            I feel stupid for not having thought of that question in advance. Only ran into it as I actually started reading the docs:
            makeblastdb -help
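
For reference, a FASTA file of protein sequences calls for -dbtype prot; nucl is for nucleotide sequences. Below is a small sketch that assembles the makeblastdb call in Python. The function name and the file and database names are placeholders, and only the -in, -dbtype and -out options are used.

```python
def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Return the makeblastdb argument list; dbtype is 'prot' or 'nucl'."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


# Placeholder names for the downloaded protein file and the database to create.
cmd = makeblastdb_command("drosophila_proteins.fasta", "drosophila_prot")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The name given to -out is what would later be passed to blastx as -db.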


            • #21
              Originally posted by hlyates
              Since I am downloading a fasta file of proteins, when I make the database should the -dbtype option be:
              1. nucl? (I think this is nucleotide)
              2. prot (this is for proteins and hence what I should choose)?

              I feel stupid for not having thought of that question in advance. Only ran into it as I actually started reading the docs
              makeblastdb -help
              Thank you so much for your patience while I learn. As I indicated, I am a one dog show, so this is the only true outlet I have to learn. I am very humbled by your assistance.

