Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to identify blast hits with only 1 HSP (not limit the number of HSP)

    Hi,

    Is there a way to identify blast hits where there was only one HSP to the database?

    i.e. I don't want to limit the number of HSP's displayed, but I want to identify all queries which only had one HSP to the database to begin with in the blast result.

    Maybe there is an accessory script somewhere that parses a blast report and identify the queries with this property?

    thanks

  • #2
    If I understand what your after correctly then I do the same using this..

    cat BlastOutputFile | grep -B 3 -A 1 "# 1 hits found"

    When BlastOutputFile is in outformat 7. Obviously if your using Xml output or any other format it'll be different.

    Comment


    • #3
      Originally posted by whataBamBam View Post
      If I understand what your after correctly then I do the same using this..

      cat BlastOutputFile | grep -B 3 -A 1 "# 1 hits found"

      When BlastOutputFile is in outformat 7. Obviously if your using Xml output or any other format it'll be different.
      Thanks bambam, but that's not what I'm after.

      Yes that will pull out those with 1 match, but what I'm after is those that have only have one alignment 'section' (i.e. high-scoring segment pair).

      e.g. an isoform could match to only one target in the database, but half of it could be aligning to to the 5' end, then a gap, then the other half the 3' end of the target.

      I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).

      Comment


      • #4
        Originally posted by Kennels View Post

        I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).
        So you mean you want to parse for full-length hsp?

        Comment


        • #5
          Yes and no.
          Also looking for partial alignment, but only those with single hsp.

          Comment


          • #6
            I usually use BioPerl for something like this. The following will get you close (the formatting of the output blast file generated by Bio::SearchIO::Writer::TextResultWriter is slightly different, but basically the same).

            Usage would be:
            perl script.pl --blast inputBlastFile.txt --out reducedBlastFile.txt

            Code:
            #!/usr/bin/perl
            
            use strict;
            use warnings;
            use Getopt::Long;
            use Bio::SearchIO;
            use Bio::SearchIO::Writer::TextResultWriter;
            
            my $blastFile;
            my $outFile;
            
            GetOptions("blast=s" => \$blastFile,
                       "out=s"   => \$outFile) || die "Couldn't get options with GetOpt::Long.\n";
            
            my $blastIO = Bio::SearchIO->new(-file => $blastFile,
                                             -format => 'blast'); #you might need to change this
            
            my $searchIOWriter = Bio::SearchIO::Writer::TextResultWriter->new();
            my $out = Bio::SearchIO->new(-writer => $searchIOWriter,
                                         -file => '>outtest.txt');
            
            while (my $result = $blastIO->next_result) {
                if ($result->num_hits() == 1) {
                    my $hit = $result->next_hit();
                    if ($hit->num_hsps() == 1) {
                        print "A single HSP for a single hit found for " . $result->query_name() . "\n";
                        $out->write_result($result);
                    }
                }
            }

            Comment


            • #7
              Thanks I'll give it a try.

              Comment


              • #8
                if you used the XML output of blast, you could use the following XSLT stylesheet:

                Code:
                <?xml version='1.0'  encoding="ISO-8859-1"?>
                <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' >
                <xsl:output method="text"/>
                <xsl:template match="/">
                <xsl:apply-templates select="BlastOutput/BlastOutput_iterations/Iteration/Iteration_hits/Hit"/>
                </xsl:template>
                <xsl:template match="Hit">
                <xsl:if test="count(Hit_hsps/Hsp) = 1">
                <xsl:value-of select="Hit_id"/><xsl:text>	</xsl:text>
                <xsl:value-of select="Hit_def"/><xsl:text>	</xsl:text>
                <xsl:value-of select="Hit_accession"/><xsl:text>
                </xsl:text>
                </xsl:if>
                </xsl:template>
                </xsl:stylesheet>
                usage:

                Code:
                $ xsltproc --novalid  stylesheet.xsl  blast.xml

                Comment


                • #9
                  Originally posted by Kennels View Post
                  Thanks bambam, but that's not what I'm after.

                  Yes that will pull out those with 1 match, but what I'm after is those that have only have one alignment 'section' (i.e. high-scoring segment pair).

                  e.g. an isoform could match to only one target in the database, but half of it could be aligning to to the 5' end, then a gap, then the other half the 3' end of the target.

                  I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).
                  Ah sorry I get you. Yes that would be really useful. I'll have a look at the script posted and see what that's doing.

                  I think Gmap will do something like that.. it isn't really designed to match pairs of transcript length sequences (it's designed to map a transcript length sequence to a genome) but I actually do use it to map transcripts to ESTs and it seems to work really well. I then filter the output based on length of the alignment and can find the matches where the EST aligned along it's whole length without any significant gaps. Which is probably quite similar to what your trying to do.

                  Possibly I really shouldn't be using Gmap to do this so if someone can tell me that then great because I actually do it all the time.

                  Comment


                  • #10
                    Originally posted by atcghelix View Post
                    I usually use BioPerl for something like this. The following will get you close (the formatting of the output blast file generated by Bio::SearchIO::Writer::TextResultWriter is slightly different, but basically the same).
                    Just letting you know the script did what I wanted.
                    I just modified the script a little:

                    Code:
                    #!/usr/bin/perl
                    
                    use strict;
                    use warnings;
                    use Getopt::Long;
                    use Bio::SearchIO;
                    use Bio::SearchIO::Writer::TextResultWriter;
                    
                    my $blastFile = $ARGV[0];     ## added one argument
                    my $outFile;
                    
                    GetOptions("blast=s" => \$blastFile,
                               "out=s"   => \$outFile) || die "Couldn't get options with GetOpt::Long.\n";
                    
                    my $blastIO = Bio::SearchIO->new(-file => $blastFile,
                                                     -format => 'blasttable'); ### changed from 'blast'
                    
                    my $searchIOWriter = Bio::SearchIO::Writer::TextResultWriter->new();
                    my $out = Bio::SearchIO->new(-writer => $searchIOWriter,
                                                 -file => '>outtest.txt');
                    
                    while (my $result = $blastIO->next_result) {
                        if ($result->num_hits() == 1) {
                            my $hit = $result->next_hit();
                            if ($hit->num_hsps() == 1) {
                                
                    	print $result->query_name() . "\n";   ## just print the result ID
                                $out->write_result($result);
                            }
                        }
                    }
                    Since my output was in -outfmt 6 from blastn, I changed the format to 'blasttable'
                    Also I just wanted the ID of the query for single HSP, so I just modified the stdout and redirect to a file to get my list of IDs.
                    I ran as:

                    Code:
                    parse_blastreport_single_hsp.pl Result.blastn6 > singleHSPs.txt

                    For others, also need to make sure the output has '-max_target_seqs' in blastn is set to more than 1.

                    Many thanks!

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Best Practices for Single-Cell Sequencing Analysis
                      by seqadmin



                      While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                      06-06-2024, 07:15 AM
                    • seqadmin
                      Latest Developments in Precision Medicine
                      by seqadmin



                      Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                      Somatic Genomics
                      “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                      05-24-2024, 01:16 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Today, 07:23 AM
                    0 responses
                    8 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 06-17-2024, 06:54 AM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 06-14-2024, 07:24 AM
                    0 responses
                    24 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 06-13-2024, 08:58 AM
                    0 responses
                    18 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X