Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to output only one best-hit result using standalone blast

    Hello
    I am running standalone blast now to blastp a protein data set against a genome data.

    Although I have set the best-hits filtering algorithm, namely, using -best_hit_overhang, and the value I set was 0.25 (0.1~0/25 was recommended at http://www.ncbi.nlm.nih.gov/books/NB...stHits_filteri)

    Compared the results without -best_hit_overhang, indeed, the number of matches for each inquery was eliminated a lot, however, what I want is to output just one best hit result (according to the bit score and evalue).

    For example, the output result was like
    ***************************************************************
    gi|270118727|emb|CAT18807.1| gi|323452346|gb|EGB08220.1| 26.15 65 48 0 31 95 835 899 0.05 35.4
    gi|270118727|emb|CAT18807.1| gi|323451569|gb|EGB07446.1| 28.77 73 46 2 226 298 14 80 0.42 31.2
    gi|270118727|emb|CAT18807.1| gi|323445796|gb|EGB02232.1| 30.65 62 43 0 349 410 111 172 1.8 29.3
    gi|270118727|emb|CAT18807.1| gi|323449782|gb|EGB05667.1| 29.27 41 29 0 132 172 105 145 2.1 29.6
    gi|270118727|emb|CAT18807.1| gi|323448614|gb|EGB04510.1| 26.92 52 38 0 189 240 15 66 6.5 27.7
    ***************************************************************

    And I just want the top best one,
    ***************************************************************
    gi|270118727|emb|CAT18807.1| gi|323452346|gb|EGB08220.1| 26.15 65 48 0 31 95 835 899 0.05 35.4
    ***************************************************************

    Would anyone please help to tell me how to do that?

  • #2
    Hi- I'm not sure how/if it can be done by Blast. However you can extract the top hit from the blast ouput (blastout.txt) with the code below:

    Code:
    sort -k1,1 -k12,12nr -k11,11n  blastout.txt | sort -u -k1,1 --merge
    The first sort orders the blast output by query name then by the 12th column in descending order (bit score - I think), then by 11th column ascending (evalue I think).
    The second sort picks the first line from each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.

    Hope it helps (make sure it does what you want)...

    Dario

    Comment


    • #3
      Hi Tsuyoshi,

      After fooling around with blast+ today, I think I've managed to achieve your objective. The command I'm using is:

      blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1 &

      Adding the "-max_target_seqs" flag and setting it to "1" yields what appears to be the best hit in terms of e-value and bit score. I haven't done extensive comparisons, but it appears where multiple matches yield the same e-value (e.g., 0), the match with the highest score is retained.

      Perhaps anyone who has more experience with blast+ can provide further insight.

      Comment


      • #4
        Originally posted by dariober View Post
        Hi- I'm not sure how/if it can be done by Blast. However you can extract the top hit from the blast ouput (blastout.txt) with the code below:

        Code:
        sort -k1,1 -k12,12nr -k11,11n  blastout.txt | sort -u -k1,1 --merge
        The first sort orders the blast output by query name then by the 12th column in descending order (bit score - I think), then by 11th column ascending (evalue I think).
        The second sort picks the first line from each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.

        Hope it helps (make sure it does what you want)...

        Dario
        Dear Dario,
        Thank you very much for answering this question, and you helped me a lot to resolve the problems of python several days ago.

        In order to extract the best hit in results.txt, I have tried an alternative way in using macro tool of excel. The script for this purpose was written like
        ****************************************************************
        Sub tophitonly()
        Dim i%
        For i = [a65536].End(3).Row To 1 Step -1
        If Application.CountIf(Range("a:a"), Cells(i, 1)) > 1 Then
        Cells(i, 1).EntireRow.Delete
        End If
        Next i
        End Sub
        ****************************************************************
        Anyway, thank you very much. And I would like to try your method in the following days.

        Comment


        • #5
          Originally posted by NormSci View Post
          Hi Tsuyoshi,

          After fooling around with blast+ today, I think I've managed to achieve your objective. The command I'm using is:

          blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1 &

          Adding the "-max_target_seqs" flag and setting it to "1" yields what appears to be the best hit in terms of e-value and bit score. I haven't done extensive comparisons, but it appears where multiple matches yield the same e-value (e.g., 0), the match with the highest score is retained.

          Perhaps anyone who has more experience with blast+ can provide further insight.
          Dear NormSci,
          Thank you for your kind reply!
          I learned more useful commands in using blast+ from your reply, such as set the minimum identity percentage and especially the max_targe_seqs. It sounds convenient to use to output the only top match. I would like to try it ASAP.

          Alternatively, I managed to extract the top hit match from excel file using macro tools like
          ****************************************************************
          Sub tophitonly()
          Dim i%
          For i = [a65536].End(3).Row To 1 Step -1
          If Application.CountIf(Range("a:a"), Cells(i, 1)) > 1 Then
          Cells(i, 1).EntireRow.Delete
          End If
          Next i
          End Sub
          ****************************************************************
          Since the queries in result.txt were in an ascending order based on the bit score and evalue, the macro worked well to pick the top hit one of each query. Meanwhile, I could count how many matches were obtained from one blast.
          Thank you.

          Comment


          • #6
            @Dariober, one needs to also append the -t, parameter to your code snippet , since it is a comma-separated file and the default delimiter for sort is non-blank to blank.

            Code:
            sort -t, -k1,1 -k12,12nr -k11,11n  blastout.txt | sort -u -t, -k1,1 --merge
            -Carmen
            Last edited by carmeyeii; 05-21-2013, 06:21 PM.

            Comment


            • #7
              Originally posted by dariober View Post
              Hi- I'm not sure how/if it can be done by Blast. However you can extract the top hit from the blast ouput (blastout.txt) with the code below:

              Code:
              sort -k1,1 -k12,12nr -k11,11n  blastout.txt | sort -u -k1,1 --merge
              The first sort orders the blast output by query name then by the 12th column in descending order (bit score - I think), then by 11th column ascending (evalue I think).
              The second sort picks the first line from each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.

              Hope it helps (make sure it does what you want)...

              Dario
              I've lost count on how many times I've used this sort. However, I just realized that it doesn't work as intended. Sort -n compares according to string numerical value, so for example 1e-10 would be smaller than 2e-100. I believe the command below is 'correct'..

              Code:
              sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastout.txt | sort -u -k1,1 --merge > bestHits
              Although the bitscore sort probably works fine with -k12,12nr too (which might make it somewhat faster perhaps??).

              Note, make sure that your locale settings recognize "." as a decimal separator, e.g.

              Code:
              export LC_ALL=en_US.UTF-8
              export LANG=en_US.UTF-8
              Last edited by rhinoceros; 05-17-2014, 03:08 AM.
              savetherhino.org

              Comment


              • #8
                Hello everybody!

                I am in a kind of problem with my standalone blastp.
                I used parameters from NormSci and worked pretty good but there is one point remained in my case. I have 1000 proteins to compare to my query and at these commands, all of them are showed (even with the message "0 hits found"). This represents 998 of my 1000 but I just want to see the 2 which have hits.

                Does anyone have a tip for me?

                Thank you so much!

                Comment


                • #9
                  You guys save my ass, thanks!

                  Comment


                  • #10
                    Hi,

                    Can this -max_target_seqs 1 & also work for commandline blast if I need to output the best hits?

                    kaps

                    Comment


                    • #11
                      Technically you dont need to go that far. Since blastn/blastp scores for latest blast+ packages is already sorted with top hit as the first hit, you can just use the following.

                      Code:
                      blastn -query transcripts.fa -db Targets.fa [other options you like ] -outfmt 7 -out blast_output.txt
                      
                      cat blast_output.txt |awk '/hits found/{getline;print}' | grep -v "#" > top_hits.txt
                      Rahul
                      Last edited by rahulvrane; 03-04-2015, 03:27 PM.

                      Comment


                      • #12
                        Silent total failure on MacOS

                        WARNING: the answers involving `sort -u --merge` fail silently and spectacularly on MacOS (for me at least). Details here https://www.biostars.org/p/144569/#325003

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Exploring the Dynamics of the Tumor Microenvironment
                          by seqadmin




                          The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                          07-08-2024, 03:19 PM
                        • seqadmin
                          Exploring Human Diversity Through Large-Scale Omics
                          by seqadmin


                          In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                          06-25-2024, 06:43 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:53 AM
                        0 responses
                        13 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 07-10-2024, 07:30 AM
                        0 responses
                        34 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 07-03-2024, 09:45 AM
                        0 responses
                        204 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 07-03-2024, 08:54 AM
                        0 responses
                        213 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X