  • blastn - qcovs field - or how to parse results based on % coverage of query sequence

    I have been playing around with blast+ (blastn), a local installation and various custom databases.

    I thought I had my workflow figured out, but some output is confusing me.

    Specifically the qcovs field. As per the blast manual, 'qcovs means Query coverage per subject', i.e. how much of my query is represented in an alignment. I assumed this to be a percentage value (maximum 100%), and I have used it for filtering.

    But now I have done a local blast against a genome db, where the qcovs value goes up to 400! So clearly it is not calculated as a percentage, which means my previous filtering is probably crap...

    I basically want to do the following:

    Blast a set of sequences against database 1. Filter the blast result for: a) % identity, b) alignment length and c) % of query sequence covered in the alignment.

    I am not interested in alignments that cover 100% of the query, as I am doing breakpoint/insertion mapping. So I want to filter these out and re-blast the rest against database 2.

    Any ideas?
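The filtering described above can be sketched in Python. This is only a minimal sketch under assumptions: the search was run with a custom tabular format whose column order matches `COLUMNS` below, and the cut-offs (`min_pident`, `min_length`, `max_qcovs`) are placeholder values, not recommendations. Reading columns by name rather than by position also makes it harder to mix them up, and the range check flags a qcovs value outside 0-100 early.

```python
import csv
import io

# Assumed column order; it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend", "qcovs"]


def filter_hits(handle, min_pident=90.0, min_length=50, max_qcovs=99):
    """Yield hits passing the identity and length cut-offs that do NOT cover the full query."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if row["qseqid"].startswith("#"):  # skip comment lines (outfmt 7)
            continue
        pident = float(row["pident"])
        length = int(row["length"])
        qcovs = int(row["qcovs"])
        # A value outside 0-100 most likely means the columns are misaligned.
        if not 0 <= qcovs <= 100:
            raise ValueError(f"qcovs={qcovs} for {row['qseqid']}: check the column order")
        if pident >= min_pident and length >= min_length and qcovs <= max_qcovs:
            yield row


# Made-up example rows, tab-separated, in the assumed column order.
example = (
    "read1\tchr1\t98.5\t200\t1\t200\t100\n"  # full coverage, dropped
    "read2\tchr1\t97.0\t120\t1\t120\t60\n"   # partial coverage, kept
    "read3\tchr2\t80.0\t150\t1\t150\t75\n"   # identity too low, dropped
    "read4\tchr2\t99.0\t30\t1\t30\t15\n"     # alignment too short, dropped
)
kept = [row["qseqid"] for row in filter_hits(io.StringIO(example))]
print(kept)
```

The query IDs that survive the filter could then be used to pull the corresponding sequences for the second search against database 2.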

  • #2
    Never mind... User error. I managed to mix up the columns while filtering the blast result...

    • #3
      Using what options will produce the qcovs?

      • #4
        Originally posted by okorist
        Using what options will produce the qcovs?
        Manuals tend to be useful..
        savetherhino.org

        • #5
          Indeed, using the -outfmt parameter you can add any of the fields specified in the manual. From the manual:

          outfmt string 0

          alignment view options:
          0 = pairwise,
          1 = query-anchored showing identities,
          2 = query-anchored no identities,
          3 = flat query-anchored, show identities,
          4 = flat query-anchored, no identities,
          5 = XML Blast output,
          6 = tabular,
          7 = tabular with comment lines,
          8 = Text ASN.1,
          9 = Binary ASN.1
          10 = Comma-separated values
          11 = BLAST archive format (ASN.1)
          Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers.
          The supported format specifiers are:
          qseqid means Query Seq-id
          qgi means Query GI
          qacc means Query accession
          sseqid means Subject Seq-id
          sallseqid means All subject Seq-id(s), separated by a ';'
          sgi means Subject GI
          sallgi means All subject GIs
          sacc means Subject accession
          sallacc means All subject accessions
          qstart means Start of alignment in query
          qend means End of alignment in query
          sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
          evalue means Expect value
          bitscore means Bit score
          score means Raw score
          length means Alignment length
          pident means Percentage of identical matches
          nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
          gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
          frames means Query and subject frames separated by a '/'
          qframe means Query frame
          sframe means Subject frame
          btop means Blast traceback operations (BTOP)
          staxids means unique Subject Taxonomy ID(s), separated by a ';'(in numerical order)
          sscinames means unique Subject Scientific Name(s), separated by a ';'
          scomnames means unique Subject Common Name(s), separated by a ';'
          sblastnames means unique Subject Blast Name(s), separated by a ';' (in alphabetical order)
          sskingdoms means unique Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
          stitle means Subject Title
          salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
          qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
          When not provided, the default value is:
          'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std'
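As an illustration of the custom format described above, here is a minimal sketch that assembles a blastn command line with qcovs and qcovhsp appended to a tabular (format 6) output. The file and database names (`queries.fasta`, `genome_db`, `hits.tsv`) are hypothetical placeholders, and the choice of fields is only an example. The command is built and printed, not executed.

```python
import shlex

# Example field selection; every name comes from the specifier list above.
FIELDS = [
    "qseqid", "sseqid", "pident", "length",
    "qstart", "qend", "evalue", "bitscore",
    "qcovs", "qcovhsp",
]


def build_blastn_command(query, db, out, fields=FIELDS):
    """Return the blastn argument list for tabular output with custom fields."""
    # The format number and the specifiers go into ONE space-delimited argument.
    outfmt = "6 " + " ".join(fields)
    return ["blastn", "-query", query, "-db", db, "-outfmt", outfmt, "-out", out]


# Hypothetical file and database names.
cmd = build_blastn_command("queries.fasta", "genome_db", "hits.tsv")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` would run the search, assuming blastn is on the PATH. The columns in the output file then appear in the same order as `FIELDS`.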

          • #6
            I think qcovs sums up the HSP lengths and divides the total by the query length. If there are repeats in your query, something bigger than 100% can show up, because the same stretch of query is counted once per HSP. Is that your case?

            I have no solution for this problem; it seems complicated to program and to filter the result.
            It will give you a bias towards bigger qcovs values, but I don't mind too much about it.

            I wonder what qcovhsp does, though.
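To make the difference concrete, here is a small sketch comparing two ways of computing query coverage from the HSP coordinates (qstart, qend) of one query/subject pair: summing HSP lengths, as guessed above, versus merging overlapping intervals first so that each query position counts once. It is not taken from the BLAST source and makes no claim about how qcovs is actually computed; the function names and example coordinates are made up for illustration.

```python
def naive_coverage(hsps, query_length):
    """Sum of HSP lengths over query length; can exceed 100% when HSPs overlap."""
    total = sum(end - start + 1 for start, end in hsps)
    return 100.0 * total / query_length


def merged_coverage(hsps, query_length):
    """Merge overlapping query intervals first, so each position counts once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(hsps):
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlaps the running interval: extend it if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_length


# A 100 bp query hitting four copies of a repeat on the same subject.
repeat_hsps = [(1, 100)] * 4
print(naive_coverage(repeat_hsps, 100))   # 400.0
print(merged_coverage(repeat_hsps, 100))  # 100.0

# Two partially overlapping HSPs: positions 1-50 and 41-80.
print(naive_coverage([(1, 50), (41, 80)], 100))   # 90.0
print(merged_coverage([(1, 50), (41, 80)], 100))  # 80.0
```

Coordinates are assumed to be 1-based and inclusive with qstart <= qend. With the merged version the result cannot exceed 100%, so a filter on percent coverage stays meaningful even for repeat-rich queries.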
