Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to identify blast hits with only 1 HSP (not limit the number of HSP)

    Hi,

    Is there a way to identify blast hits where there was only one HSP to the database?

    i.e. I don't want to limit the number of HSP's displayed, but I want to identify all queries which only had one HSP to the database to begin with in the blast result.

    Maybe there is an accessory script somewhere that parses a blast report and identify the queries with this property?

    thanks

  • #2
    If I understand what your after correctly then I do the same using this..

    cat BlastOutputFile | grep -B 3 -A 1 "# 1 hits found"

    When BlastOutputFile is in outformat 7. Obviously if your using Xml output or any other format it'll be different.

    Comment


    • #3
      Originally posted by whataBamBam View Post
      If I understand what your after correctly then I do the same using this..

      cat BlastOutputFile | grep -B 3 -A 1 "# 1 hits found"

      When BlastOutputFile is in outformat 7. Obviously if your using Xml output or any other format it'll be different.
      Thanks bambam, but that's not what I'm after.

      Yes that will pull out those with 1 match, but what I'm after is those that have only have one alignment 'section' (i.e. high-scoring segment pair).

      e.g. an isoform could match to only one target in the database, but half of it could be aligning to to the 5' end, then a gap, then the other half the 3' end of the target.

      I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).

      Comment


      • #4
        Originally posted by Kennels View Post

        I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).
        So you mean you want to parse for full-length hsp?

        Comment


        • #5
          Yes and no.
          Also looking for partial alignment, but only those with single hsp.

          Comment


          • #6
            I usually use BioPerl for something like this. The following will get you close (the formatting of the output blast file generated by Bio::SearchIO::Writer::TextResultWriter is slightly different, but basically the same).

            Usage would be:
            perl script.pl --blast inputBlastFile.txt --out reducedBlastFile.txt

            Code:
            #!/usr/bin/perl
            
            use strict;
            use warnings;
            use Getopt::Long;
            use Bio::SearchIO;
            use Bio::SearchIO::Writer::TextResultWriter;
            
            my $blastFile;
            my $outFile;
            
            GetOptions("blast=s" => \$blastFile,
                       "out=s"   => \$outFile) || die "Couldn't get options with GetOpt::Long.\n";
            
            my $blastIO = Bio::SearchIO->new(-file => $blastFile,
                                             -format => 'blast'); #you might need to change this
            
            my $searchIOWriter = Bio::SearchIO::Writer::TextResultWriter->new();
            my $out = Bio::SearchIO->new(-writer => $searchIOWriter,
                                         -file => '>outtest.txt');
            
            while (my $result = $blastIO->next_result) {
                if ($result->num_hits() == 1) {
                    my $hit = $result->next_hit();
                    if ($hit->num_hsps() == 1) {
                        print "A single HSP for a single hit found for " . $result->query_name() . "\n";
                        $out->write_result($result);
                    }
                }
            }

            Comment


            • #7
              Thanks I'll give it a try.

              Comment


              • #8
                if you used the XML output of blast, you could use the following XSLT stylesheet:

                Code:
                <?xml version='1.0'  encoding="ISO-8859-1"?>
                <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' >
                <xsl:output method="text"/>
                <xsl:template match="/">
                <xsl:apply-templates select="BlastOutput/BlastOutput_iterations/Iteration/Iteration_hits/Hit"/>
                </xsl:template>
                <xsl:template match="Hit">
                <xsl:if test="count(Hit_hsps/Hsp) = 1">
                <xsl:value-of select="Hit_id"/><xsl:text>	</xsl:text>
                <xsl:value-of select="Hit_def"/><xsl:text>	</xsl:text>
                <xsl:value-of select="Hit_accession"/><xsl:text>
                </xsl:text>
                </xsl:if>
                </xsl:template>
                </xsl:stylesheet>
                usage:

                Code:
                $ xsltproc --novalid  stylesheet.xsl  blast.xml

                Comment


                • #9
                  Originally posted by Kennels View Post
                  Thanks bambam, but that's not what I'm after.

                  Yes that will pull out those with 1 match, but what I'm after is those that have only have one alignment 'section' (i.e. high-scoring segment pair).

                  e.g. an isoform could match to only one target in the database, but half of it could be aligning to to the 5' end, then a gap, then the other half the 3' end of the target.

                  I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).
                  Ah sorry I get you. Yes that would be really useful. I'll have a look at the script posted and see what that's doing.

                  I think Gmap will do something like that.. it isn't really designed to match pairs of transcript length sequences (it's designed to map a transcript length sequence to a genome) but I actually do use it to map transcripts to ESTs and it seems to work really well. I then filter the output based on length of the alignment and can find the matches where the EST aligned along it's whole length without any significant gaps. Which is probably quite similar to what your trying to do.

                  Possibly I really shouldn't be using Gmap to do this so if someone can tell me that then great because I actually do it all the time.

                  Comment


                  • #10
                    Originally posted by atcghelix View Post
                    I usually use BioPerl for something like this. The following will get you close (the formatting of the output blast file generated by Bio::SearchIO::Writer::TextResultWriter is slightly different, but basically the same).
                    Just letting you know the script did what I wanted.
                    I just modified the script a little:

                    Code:
                    #!/usr/bin/perl
                    
                    use strict;
                    use warnings;
                    use Getopt::Long;
                    use Bio::SearchIO;
                    use Bio::SearchIO::Writer::TextResultWriter;
                    
                    my $blastFile = $ARGV[0];     ## added one argument
                    my $outFile;
                    
                    GetOptions("blast=s" => \$blastFile,
                               "out=s"   => \$outFile) || die "Couldn't get options with GetOpt::Long.\n";
                    
                    my $blastIO = Bio::SearchIO->new(-file => $blastFile,
                                                     -format => 'blasttable'); ### changed from 'blast'
                    
                    my $searchIOWriter = Bio::SearchIO::Writer::TextResultWriter->new();
                    my $out = Bio::SearchIO->new(-writer => $searchIOWriter,
                                                 -file => '>outtest.txt');
                    
                    while (my $result = $blastIO->next_result) {
                        if ($result->num_hits() == 1) {
                            my $hit = $result->next_hit();
                            if ($hit->num_hsps() == 1) {
                                
                    	print $result->query_name() . "\n";   ## just print the result ID
                                $out->write_result($result);
                            }
                        }
                    }
                    Since my output was in -outfmt 6 from blastn, I changed the format to 'blasttable'
                    Also I just wanted the ID of the query for single HSP, so I just modified the stdout and redirect to a file to get my list of IDs.
                    I ran as:

                    Code:
                    parse_blastreport_single_hsp.pl Result.blastn6 > singleHSPs.txt

                    For others, also need to make sure the output has '-max_target_seqs' in blastn is set to more than 1.

                    Many thanks!

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Exploring the Dynamics of the Tumor Microenvironment
                      by seqadmin




                      The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                      07-08-2024, 03:19 PM
                    • seqadmin
                      Exploring Human Diversity Through Large-Scale Omics
                      by seqadmin


                      In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                      06-25-2024, 06:43 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 07-19-2024, 07:20 AM
                    0 responses
                    38 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-16-2024, 05:49 AM
                    0 responses
                    49 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-15-2024, 06:53 AM
                    0 responses
                    61 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-10-2024, 07:30 AM
                    0 responses
                    43 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X