Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Blastn to return only top hit

    Hi,

    I'm trying to use Blastn to blast an entire Unigene file (only cds) in order to get gene names for my reference sequence. I then want to attach these gene names to the FASTA sequence and use this reference for a cufflinks run. Hopefully, this would give me a good idea of the identities of differentially expressed genes, without having to blast them individually.

    Could anyone help me with blastn to ensure that only the top hit is returned, and that this hit contains the full source name?

    I'm currently using the following command...

    $ blastn -db trop -query blast.fa -outfmt 7 -max_target_seqs 1 -out results.out

    ...and get results like this:

    BLASTN 2.2.23+
    # Query: gnl|UG|Xl#S25665548 Xenopus laevis fascin, mRNA (cDNA clone MGC:114829 IMAGE:4970584), complete cds /cds=p(36,1490) /gb=BC097600 /gi=67678339 /ug=Xl.151 /len=2703
    # Database: trop
    # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
    # 2 hits found
    gnl|UG|Xl#S25665548 gi|45360982|ref|NM_203797.1|_Xenopus_(Silurana)_tropicalis_fascin_homolog_1,_actin-bundling_protein_(fscn1),_mRNA 89.17 1672 159 12 9 1677 18 1670 0.0 2065
    gnl|UG|Xl#S25665548 gi|45360982|ref|NM_203797.1|_Xenopus_(Silurana)_tropicalis_fascin_homolog_1,_actin-bundling_protein_(fscn1),_mRNA 83.02 907 92 42 16492527 1675 2547 0.0 765

    Whereas I just want:

    BLASTN 2.2.23+
    # Query: gnl|UG|Xl#S25665548 Xenopus laevis fascin, mRNA (cDNA clone MGC:114829 IMAGE:4970584), complete cds /cds=p(36,1490) /gb=BC097600 /gi=67678339 /ug=Xl.151 /len=2703
    # 1 hit found
    gnl|UG|Xl#S25665548 gi|45360982|ref|NM_203797.1|_Xenopus_(Silurana)_tropicalis_fascin_homolog_1,_actin-bundling_protein_(fscn1),_mRNA 89.17 1672 159 12 9 1677 18 1670 0.0 2065

    Any tips or advice would be hugely appreciated,

    Many thanks,

    N

  • #2
    So you want the top HSP within the top hit?

    Comment


    • #3
      What's the top HSP?

      I've made some progress using:

      blastn -db trop -query blast.fa -outfmt 6 -max_target_seqs 1 -num_alignments 1 -num_descriptions 1 -out results.out

      which makes isolating blast results for each hit easier (however still multiple hits returned...)

      After removing non unique lines ( uniq -u) I am presumably left with only one hit per sequence? After comparing the number of sequences in my original search (31,434) to the number of unique hits (16,862) it's clear that only ~ half return hits.

      Additionally, does anyone know how to return a value for non-hits?

      N

      Comment


      • #4
        I use awk as a post filter. awk '!x[$1]++' <blast.out >blast.uniq.out

        There is probably a better way, but to get to non-hits I cat my reference list from the query fasta file and list of queries in my blast results and then use awk to count them. Count=1 are ones that did not get a hit.

        Comment


        • #5
          Get best alignment of blast results (based on bit score):
          blastn -query your.fa -db your_db.fa -outfmt 6 | sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > best_single_hits.blastn

          This answer was based on previous post: http://seqanswers.com/forums/showthread.php?t=23166



          Ilia

          Comment


          • #6
            Hi nr23,

            the individual lines in your BLAST output are HSPs (high-scoring segment pairs). You can have several of these per query-hit combination. They represent local stretches of aligned sequence between the query and the hit sequence (remember, BLAST is a local alignment algorithm).

            To the best of my knowledge the HSPs for a given hit are sorted in descending order by the BLAST bit score, which means the best (usually the longest) HSP comes first. If you extract just that you may miss out on other significant HSPs, so this may not always be the best solution.

            There are approaches out there that try to remedy this by combining the HSPs to give you an integrated score across all HSPs (e.g. Salse et al, The Plant Cell, Vol. 20: 11–24, January 2008).

            cheers

            Micha

            Comment


            • #7
              I recently had to do something similar, since I use R I wrote a function to automate it.
              Code:
              scan("raw_blast_out",what="list",sep="\n")->Blast
              Blast.sort<-function(x) {
                require(data.table)
                proc.time()->ptm
                as.list(x)->x
                replace(x,grepl("#",x),values=NULL)->x
                as.character(x)->x
                strsplit(x,"\t")->x
                as.data.frame(do.call("rbind",x))->x
                colnames(x)<-c("id", "match", "identity", "align.len", "mismatches", "gaps", "q.start", "q.end", "s.start", "s.end", "E.value", "bit.score")
                data.table(x)->x
                x[,.SD[which.min(E.value)],by=id][]->res
                print(proc.time()-ptm)
                data.frame(res)->res
              }
              Blast.sort(Blast)->res
              write.table(res,file="Best_blast_match",row.names=F,sep="\t",quote=F)

              Comment


              • #8
                Thanks guys - I think I've got it down now:

                this suggestion works well:

                blastn -query your.fa -db your_db.fa -outfmt 6 | sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > best_single_hits.blastn

                But if you just want the names of query and top hit then:

                blastn -db trop -query blast.fa -outfmt 6 -max_target_seqs 1 -num_alignments 1 -num_descriptions 1 -out results.out

                cut -f 1,2 results.out > new.txt

                more new.txt | uniq -u > singlehit.txt

                N

                Comment


                • #9
                  Originally posted by nr23 View Post
                  Thanks guys - I think I've got it down now:

                  this suggestion works well:

                  blastn -query your.fa -db your_db.fa -outfmt 6 | sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > best_single_hits.blastn

                  But if you just want the names of query and top hit then:

                  blastn -db trop -query blast.fa -outfmt 6 -max_target_seqs 1 -num_alignments 1 -num_descriptions 1 -out results.out

                  cut -f 1,2 results.out > new.txt

                  more new.txt | uniq -u > singlehit.txt

                  N
                  A couple of comments about your command pipeline:

                  1. The options '-num_alignments' and '-num_descriptions' are not relevant to tabular output formats, they are only meaningful for the full alignment report formats (e.g. outfmt 0-4, not sure if they apply to the XML (outfmt 5) as well). You only need '-max_target_seqs' for -outfmt 6.

                  2. As it appears you only want the query id and subject (hit) id you can control that with your blastn command by adding format specifiers to your -outfmt parameter, in your case it would be '-outfmt "6 qseqid sseqid"' (be sure to enclose all the parameters to -outfmt in double quotes). This will allow you to do away with your cut -f1,2 ... command.

                  3. It's not a good idea to use 'more' (or 'less') to pipe output to another command as the results may be unpredictable. If you need to stream a file to a pipe use 'cat' instead. However in this case you don't need to do this since you can simply specify the input file as part of the uniq command.

                  4. The '-u' option to uniq tells the program to only output lines which were present only once in the original input; that's not what you seem to want. You want only a single instance of each hit, no matter how many times it was repeated in the original, thus you want to use uniq without any options. Note that uniq assumes the input are sorted as it is only able to compare adjacent lines, thus you could first sort the file and pipe the output through uniq or achieve the same result by using sort with the '-u' option. (Truth be told, since the output of blastn is naturally sorted by query and then by hit this step is unnecessary, but for completeness sake it can be included.)

                  5. If you do not wish to save the direct output from blastn you could pipe its output directly to sort/uniq.

                  Here is a simplified version of what you did above:

                  Code:
                  blastn -db trop -query blast.fa -outfmt "6 qseqid sseqid" -max_target_seqs 1 -out results.txt
                  
                  sort -u results.txt > singlehit.txt
                  Or as a single command not saving the blastn output:

                  Code:
                  blastn -db trop -query blast.fa -outfmt "6 qseqid sseqid" -max_target_seqs 1 | sort -u > singlehit.txt

                  Comment


                  • #10
                    Awesome - thanks a lot! That's really cleared things up.

                    Comment


                    • #11
                      Thanks for the great information. But i still have the question about the top hit. I am running the blastn using the high throughout computer. I just can choose the parameter from http://www.plexdb.org/modules/docume...BIblastall.htm. Can anyone please give me the suggestion which parameter i can use to just get the top hit.

                      And another question abou the metagenomic analysis. We have the 16srRNA miseq data. We want to get the phylogenetic tree about these data. Can anyone give the good suggestions? Thanks. Now i use the miseq data to blastn against the NCBI 16srRNA database, I want to use the Megan display to get the tree. But i don't know how to just get the top blastn hit. Thanks.

                      Comment


                      • #12
                        Assuming you only want a single top hit from the list and you have output the blast result in table format:

                        awk '!x[$1]++' blast.file >tophits.txt

                        Comment


                        • #13
                          Thanks. Because i need to use the Megan to display the file, my output is not the table format. I use the -m 0 output parameter. Do you know any parameter settiing can direct just get the top kit after run the blastn? Thanks.

                          Comment


                          • #14
                            1. MEGAN also accepts tabular blast output
                            2. The sort command mentioned in this thread returns wrong results, it should be "sort -k1,1 -k12,12gr -k11,11g". The n flag is for string numerical value, i.e. 2e-10 would be smaller than 3e-100.
                            3. You should use something completely different for you 16S data, e.g. QIIME or mothur..

                            bonus

                            4. I think MEGAN has some issues with its documentation. Take for example this file, linked to the megan page, right now it states: "Downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/B...a/lproks_0.txt on Mon Dec 2 20:38:23 CET 2013". That file doesn't actually exist in the ncbi ftp and they no longer maintain it. I emailed the author about this many months ago. He responded, but it's still the same. In my opinion MEGAN is a little bit black boxy in some aspects. It's a good tool in a sense that it's very easy to get a lot of information with it, but there are numerous other more open solutions that go a lot further. Some free services like mg-rast even do the data crunching work for you..
                            Last edited by rhinoceros; 12-05-2013, 10:22 AM.
                            savetherhino.org

                            Comment


                            • #15
                              A small optimization

                              Originally posted by kmcarr View Post
                              A couple of comments about your command pipeline:

                              ...

                              Here is a simplified version of what you did above:

                              Code:
                              blastn -db trop -query blast.fa -outfmt "6 qseqid sseqid" -max_target_seqs 1 -out results.txt
                              
                              sort -u results.txt > singlehit.txt
                              Or as a single command not saving the blastn output:

                              Code:
                              blastn -db trop -query blast.fa -outfmt "6 qseqid sseqid" -max_target_seqs 1 | sort -u > singlehit.txt
                              One small optimization to the above... Since the blastn results are returned sorted in field order, if you specify -outfmt "6 qseqid evalue sseqid", you'll get the results sorted by query first then evalue (numeric sort increasing so that the lowest evalue ends up first) then target seq. Combine with -max_target_seqs 1 and no further processing is necessary. In the case of ties on query and evalue, what will be returned is the first target in alphanumeric order.

                              Madeleine

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X