  • #16
    If you run "blastx -help" on the command line you will get all options for blastx. One of the sections is for output formats. Default format is "0".

    Code:
     -outfmt <String>
       alignment view options:
         0 = pairwise,
         1 = query-anchored showing identities,
         2 = query-anchored no identities,
         3 = flat query-anchored, show identities,
         4 = flat query-anchored, no identities,
         5 = XML Blast output,
         6 = tabular,
         7 = tabular with comment lines,
         8 = Text ASN.1,
         9 = Binary ASN.1,
        10 = Comma-separated values,
        11 = BLAST archive format (ASN.1) 
        12 = JSON Seqalign output 
       
       Options 6, 7, and 10 can be additionally configured to produce
       a custom format specified by space delimited format specifiers.
       The supported format specifiers are:
                qseqid means Query Seq-id
                   qgi means Query GI
                  qacc means Query accesion
               qaccver means Query accesion.version
                  qlen means Query sequence length
                sseqid means Subject Seq-id
             sallseqid means All subject Seq-id(s), separated by a ';'
                   sgi means Subject GI
                sallgi means All subject GIs
                  sacc means Subject accession
               saccver means Subject accession.version
               sallacc means All subject accessions
                  slen means Subject sequence length
                qstart means Start of alignment in query
                  qend means End of alignment in query
                sstart means Start of alignment in subject
                  send means End of alignment in subject
                  qseq means Aligned part of query sequence
                  sseq means Aligned part of subject sequence
                evalue means Expect value
              bitscore means Bit score
                 score means Raw score
                length means Alignment length
                pident means Percentage of identical matches
                nident means Number of identical matches
              mismatch means Number of mismatches
              positive means Number of positive-scoring matches
               gapopen means Number of gap openings
                  gaps means Total number of gaps
                  ppos means Percentage of positive-scoring matches
                frames means Query and subject frames separated by a '/'
                qframe means Query frame
                sframe means Subject frame
                  btop means Blast traceback operations (BTOP)
               staxids means unique Subject Taxonomy ID(s), separated by a ';'
                             (in numerical order)
             sscinames means unique Subject Scientific Name(s), separated by a ';'
             scomnames means unique Subject Common Name(s), separated by a ';'
            sblastnames means unique Subject Blast Name(s), separated by a ';'
                             (in alphabetical order)
            sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                             (in alphabetical order) 
                stitle means Subject Title
            salltitles means All Subject Title(s), separated by a '<>'
               sstrand means Subject Strand
                 qcovs means Query Coverage Per Subject
               qcovhsp means Query Coverage Per HSP
       When not provided, the default value is:
       'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
       evalue bitscore', which is equivalent to the keyword 'std'
       Default = `0'
    When using a multi-FASTA input file, each sequence will produce an output block that starts with a "Query=" line and ends with the "Effective search space…" line.
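The block structure described above can be sketched in a few lines of Python. This is only a rough illustration, assuming every per-query block begins with a line starting with "Query=" as stated in the post; the function name `split_pairwise_report` and the sample text are made up for the example and are not real BLAST output.

```python
def split_pairwise_report(text):
    """Split a pairwise (-outfmt 0) report into one text block per query.

    Assumes each block starts at a line beginning with "Query=".
    Anything before the first such line (the report header) is dropped.
    """
    blocks = []
    current = None
    for line in text.splitlines():
        if line.startswith("Query="):
            if current is not None:
                blocks.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        blocks.append("\n".join(current))
    return blocks


# Mock report, heavily abbreviated and invented for illustration only.
sample = """(report header lines)

Query= seq1

(alignments for seq1)

Effective search space used: 1000

Query= seq2

***** No hits found *****

Effective search space used: 2000
"""

blocks = split_pairwise_report(sample)
for block in blocks:
    print(block.splitlines()[0])
```

Alignment lines inside a block begin with "Query" followed by spaces rather than "Query=", so they should not be mistaken for the start of a new block.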



    • #17
      Originally posted by hlyates
      Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.
      It was so long ago that I don't recall whether I ever read any formal documentation on the Pairwise (default output 0) format or just figured it out through trial and error.

      As for parsing these plain-text reports, I have in the past used both BioPerl's Bio::SearchIO and hand-rolled code; there are pluses and minuses to each. As a general recommendation, though, if you foresee needing to parse BLAST reports regularly, I would avoid the default "Pairwise" output format altogether. That format wasn't really designed for automated parsing, so parsing code breaks easily. If you plan to do a lot of parsing, the two better choices are tabular and XML. If your needs are simple (e.g. hit IDs, e-values, start and end locations), tabular is the way to go: the output is small and very simple to parse. If you have more complex needs (e.g. capturing query/target alignments), have your BLAST job output XML and parse it with the available modules from BioPerl, Biopython, etc. Because XML is such a structured format, automated parsing is more robust. And since the XML output retains all of the information present in the Pairwise format, you can convert the interesting bits of it into human-readable form (again using the 'Bio' modules).
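To show how little code the tabular route needs, here is a minimal sketch in plain Python. It assumes tab-separated output from `-outfmt 6` (or 7, whose comment lines start with "#") with the default 'std' columns listed in the help text earlier in the thread. The function name `parse_tabular` and the sample rows are invented for illustration.

```python
# Default column order for -outfmt 6/7, as given by the 'std' keyword.
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(lines, columns=STD_COLUMNS):
    """Yield one dict per hit line, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield dict(zip(columns, line.split("\t")))


# Made-up rows standing in for a real results file.
sample = [
    "# comment line, as -outfmt 7 would emit\n",
    "read1\tprotA\t98.5\t200\t3\t0\t1\t600\t10\t209\t1e-50\t250\n",
    "read2\tprotB\t75.0\t120\t30\t1\t4\t363\t55\t174\t2e-30\t110\n",
]

hits = list(parse_tabular(sample))
# All values arrive as strings; convert the ones you need.
best = min(hits, key=lambda h: float(h["evalue"]))
print(best["qseqid"], best["sseqid"], best["evalue"])
```

With a real file you would pass an open file handle instead of the `sample` list, e.g. `parse_tabular(open("results.tsv"))`, where the file name is whatever you gave to `-out`.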



      • #18
        Thank you GenoMax and kmcarr. I am thinking I may run my job again and use tabular/XML output, so that I can more easily apply a script to it. If I understand you both, it seems Biopython can parse tabular output quite easily. I might go that route because I just need basic information such as:

        hit id
        e-value
        input sequence (the input sequence that aligned with something in the nt database)
        target sequence organism id and name

        You pros are great and if I knew you in person I would buy you both some drinks.
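The fields listed in this post map fairly directly onto the format specifiers from the blastx help text: `qseqid` (input sequence ID), `sseqid` (hit ID), `evalue`, `staxids` and `sscinames`. A rough sketch of building such a custom `-outfmt` string and parsing one resulting row follows. The helper names and the sample row are hypothetical, and as far as I know the taxonomy columns are only filled in when BLAST can find its taxonomy database files, so treat that part as an assumption to verify.

```python
FIELDS = ["qseqid", "sseqid", "evalue", "staxids", "sscinames"]


def make_outfmt(specifiers, fmt=6):
    """Build the value passed to -outfmt, e.g. '6 qseqid sseqid evalue'."""
    return f"{fmt} " + " ".join(specifiers)


def parse_row(line, fields=FIELDS):
    """Turn one tab-separated output line into a dict keyed by specifier."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return dict(zip(fields, values))


outfmt = make_outfmt(FIELDS)
print(outfmt)

# Invented example row; 'subject1' is a placeholder ID.
row = parse_row("read1\tsubject1\t3e-42\t7227\tDrosophila melanogaster\n")
print(row["qseqid"], row["staxids"], row["sscinames"])
```

The string in `outfmt` would be passed as a single quoted argument on the command line. If the aligned query residues are wanted rather than just the ID, `qseq` from the same specifier list can be appended to `FIELDS`.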



        • #19
          If you are going to redo this and have access to a cluster, I suggest you break up your original file into multiple smaller ones and run the BLAST jobs in parallel. That would speed things up significantly.
          Last edited by GenoMax; 05-11-2015, 02:08 PM.
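One way to do the splitting, sketched in plain Python: read the multi-FASTA records and deal them out round-robin into a fixed number of chunks, each of which could then be written to its own file and submitted as a separate job. The function names and sample data are made up, and how the jobs get submitted depends entirely on the cluster's scheduler, so that part is left out.

```python
def read_fasta_records(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Tiny invented input standing in for the real multi-FASTA file.
sample = [">seq1\n", "ACGT\n", "ACGT\n", ">seq2\n", "GGCC\n", ">seq3\n", "TTAA\n"]
chunks = split_records(read_fasta_records(sample), 2)
for n, chunk in enumerate(chunks):
    print(n, [header for header, _ in chunk])
```

Round-robin assignment keeps the chunks roughly equal in record count; if sequence lengths vary a lot, balancing by total residues may give more even run times.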



          • #20
            Originally posted by GenoMax
            Order of operations:

            1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
            2. Make blast database using the fasta file.
            3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
            4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.
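Steps 3 and 4 above can also be approximated in plain Python if faFilter is not at hand. The sketch below is not faFilter itself: it collects the query IDs from the first column of tabular output and keeps only the FASTA records whose ID matches. It assumes the ID is the first whitespace-delimited token after ">", and the function names and sample data are invented.

```python
def hit_query_ids(tabular_lines):
    """Collect the unique query IDs (first column) from tabular BLAST output."""
    ids = set()
    for line in tabular_lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t", 1)[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield only the lines of FASTA records whose ID is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Invented inputs: read1 has two hits, read3 has one, read2 has none.
tabular = [
    "read1\tprotA\t98.5\t200\t3\t0\t1\t600\t10\t209\t1e-50\t250\n",
    "read1\tprotC\t80.0\t150\t30\t0\t1\t450\t5\t154\t1e-20\t90\n",
    "read3\tprotB\t75.0\t120\t30\t1\t4\t363\t55\t174\t2e-30\t110\n",
]
fasta = [">read1 sample\n", "ACGT\n", ">read2\n", "GGCC\n", ">read3\n", "TTAA\n"]

ids = hit_query_ids(tabular)
kept = list(filter_fasta(fasta, ids))
print("".join(kept), end="")
```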
            Since I am downloading a fasta file of proteins, when I make the database should the -dbtype option be:
            1. nucl? (I think this is nucleotide)
            2. prot (this is for proteins and hence what I should choose)?


            I feel stupid for not having thought of that question in advance. Only ran into it as I actually started reading the
            Code:
            makeblastdb -help
            docs.
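For a FASTA file of protein sequences, `prot` is the matching `-dbtype`, and blastx then translates the nucleotide queries before searching that protein database. A small sketch of assembling the command from Python is below; it only builds and prints the command rather than running it, and the file and database names are placeholders, not anything from this thread.

```python
import shlex


def makeblastdb_command(fasta_path, dbtype="prot", out=None):
    """Return the makeblastdb argument list for a FASTA file."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype]
    if out:
        cmd += ["-out", out]
    return cmd


# Placeholder names; substitute the downloaded protein FASTA file.
cmd = makeblastdb_command("drosophila_proteins.fasta", "prot", out="drosophila_prot")
print(shlex.join(cmd))
```

To actually run it, the list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.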



            • #21
              Originally posted by hlyates
              Since I am downloading a fasta file of proteins, when I make the database should the -dbtype option be:
              1. nucl? (I think this is nucleotide)
              2. prot (this is for proteins and hence what I should choose)?


              I feel stupid for not having thought of that question in advance. Only ran into it as I actually started reading the
              Code:
              makeblastdb -help
              docs.
              Thank you so much for your patience while I learn. As I indicated, I am a one dog show, so this is the only true outlet I have to learn. I am very humbled by your assistance.

