Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • script to calculate percentage identity for BLAT psl

    Hi,
    I am trying to calculate percentage identity for command line BALT output and I am posting the script below. But it is not working right now and the % is always 100%.
    Please let me know the problem. or any other way to caluculate %. This is being followed from BLAT FAQ documentation at http://genome.ucsc.edu/FAQ/FAQblat#blat4

    Code:
    while(<>) { 
            chomp $_; 
            my @v = split(/\t/,$_); 
            get_pid($_); 
    
    } 
    sub get_pid { 
             my @line = @_; 
             my $pid = (100.0 - (&pslCalcMilliBad(@line) * 0.1)); 
             print "The percentage: $pid\n"; 
             #return $pid; 
    } 
    
    sub pslCalcMilliBad { 
             my @cols = @_; 
    
             # sizeNul depens of dna/Prot 
             my $sizeMul = 1; 
             #if ($option{p}) { 
             #        $sizeMul = 3; 
             #} else { 
             #        $sizeMul = 1; 
             #} 
    
             # cols[0]  matches 
             # cols[1]  misMatches 
             # cols[2]  repMaches 
             # cols[4]  qNumInsert 
             # cols[6]  tNumInsert 
             # cols[11] qStart 
             # cols[12] qEnd 
             # cols[15] tStart 
             # cols[16] tEnd 
    
             my $qAliSize = $sizeMul * ($cols[12] - $cols[11]); 
             my $tAliSize = $cols[16] - $cols[15]; 
    
             # I want the minimum of qAliSize and tAliSize 
             my $aliSize; 
             $qAliSize < $tAliSize ? $aliSize = $qAliSize : $aliSize =   
    $tAliSize; 
    
             # return 0 is AliSize == 0 
             return 0 if ($aliSize <= 0); 
    
             # size diff 
             my $sizeDiff = $qAliSize - $tAliSize; 
             if ($sizeDiff < 0) { 
              $sizeDiff = 0; 
                     #if ($option{m}) { 
                             #$sizeDiff = 0; 
                     #} else { 
                             #$sizeDiff = -($sizeDiff); 
                     #} 
             } 
    
             # insert Factor 
             my $insertFactor = $cols[4]; 
             $insertFactor += $cols[6] unless ($option{m}); 
             my $milliBad = (1000 * ($cols[1]*$sizeMul + $insertFactor +   
    &round(3*log( 1 + $sizeDiff)))) / ($sizeMul * ($cols[0] + $cols[2] 
    + $cols[1])); 
                    return $milliBad; 
    
    } 
    
    sub round { 
             my $number = shift; 
             return int($number + .5); 
    }
    The changes I have donw ith respect to the FAQ code are

    Code:
    1
     my $sizeMul = 1; and commented: 
    #if ($option{p}) { 
             #        $sizeMul = 3; 
             #} else { 
             #        $sizeMul = 1; 
             #} 
    2
    if ($sizeDiff < 0) { 
              $sizeDiff = 0; 
                     #if ($option{m}) { 
                             #$sizeDiff = 0; 
                     #} else { 
                             #$sizeDiff = -($sizeDiff); 
                     #} 
             }
    It would be great if there are any other way of calculating % identity. Thanks in advance.

  • #2
    Any other way to calculate? Thanks.

    Comment


    • #3
      Why did you add the "&" in the function call?

      &pslCalcMilliBad(@line)

      Comment


      • #4
        can you try -out=blast9 in blat command-line and use the %identity column to parse? Makes it much simpler that way..
        --
        bioinfosm

        Comment


        • #5
          Originally posted by Chipper View Post
          Why did you add the "&" in the function call?

          &pslCalcMilliBad(@line)
          Why not? It is perfectly legitimate PERL code and, for some people, putting the ampersand makes it obvious that a function is being called.

          Now as far as why the original poster's code is not working. Hum, it is hard to debug without at least some of the data to the original code. Why don't you head or tail the first 5 lines of your input file (the one you get from BLAT.) That way we can make sure that this is simply not a case of GIGO.

          Comment


          • #6
            Yes, Thanks for the info. I finally output the result as blast9 in which the third column is %identity. But then when I compare the results from psl format and blast9 format, the number of output vary. Why is it so? I think in the blast9 format, only the output format changes.

            And hence I wanted to calculate the percentage identity for psl output which is as below:

            Code:
            match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes qStarts   tStarts
                    match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
            ---------------------------------------------------------------------------------------------------------------------------------------------------------------
            94      0       0       0       0       0       0       0       -       query1      180     0       94      chr18   90736837        37867797    37867891        1       94,     86,     37867797,
            85      0       0       0       1       1       0       0       -       query1      180     94      180     chr18   90736837        37867961    37868046        2       24,61,  0,25,   37867961,37867985,
            82      2       0       0       2       2       0       0       -       query1      180     94      180     chr18   90736837        37877482    37877566        3       24,43,17,       0,25,69,        37877482,37877506,37877549,
            61      0       0       0       1       1       0       0       -       query2      100     38      100     chr18   90736837        37868010    37868071        2       36,25,  0,37,   37868010,37868046,
            60      0       0       0       2       2       0       0       -       query2      100     38      100     chr18   90736837        37877531    37877591        3       18,17,25,       0,19,37,        37877531,37877549,37877566,
            Thanks.

            Comment


            • #7
              OK. I believe the problem is in the line that says:

              get_pid($_);

              Change this to:

              get_pid(@v);

              And the line will be passed properly to the other routines. When I do this I get, for the first three data lines:

              The percentage: 100
              The percentage: 96.4705882352941
              The percentage: 91.6666666666667

              Don't know if the numbers above are correct but at least they are not all 100.

              BTW: Writing your perl code with 'use warnings;' and 'use strict;' will tell you if undefined variables are being used. Which is what was happening in this case.

              Comment


              • #8
                Please confirm when you have the results if blast9 and psi actually give you different results from blat!

                I would never expect that; we can check with Jim Kent if thats the case.

                Thanks

                Originally posted by seq_GA View Post
                Yes, Thanks for the info. I finally output the result as blast9 in which the third column is %identity. But then when I compare the results from psl format and blast9 format, the number of output vary. Why is it so? I think in the blast9 format, only the output format changes.

                And hence I wanted to calculate the percentage identity for psl output which is as below:

                Code:
                match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes qStarts   tStarts
                        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
                ---------------------------------------------------------------------------------------------------------------------------------------------------------------
                94      0       0       0       0       0       0       0       -       query1      180     0       94      chr18   90736837        37867797    37867891        1       94,     86,     37867797,
                85      0       0       0       1       1       0       0       -       query1      180     94      180     chr18   90736837        37867961    37868046        2       24,61,  0,25,   37867961,37867985,
                82      2       0       0       2       2       0       0       -       query1      180     94      180     chr18   90736837        37877482    37877566        3       24,43,17,       0,25,69,        37877482,37877506,37877549,
                61      0       0       0       1       1       0       0       -       query2      100     38      100     chr18   90736837        37868010    37868071        2       36,25,  0,37,   37868010,37868046,
                60      0       0       0       2       2       0       0       -       query2      100     38      100     chr18   90736837        37877531    37877591        3       18,17,25,       0,19,37,        37877531,37877549,37877566,
                Thanks.
                --
                bioinfosm

                Comment


                • #9
                  Hi Westerman,
                  Thanks for identifying and now I am able to calculate the pecentage idetity.

                  Hi bioinfosm,

                  I won't say the results are different but the number of output lines are different. Sometimes for one query output in psl format has 2 corresponding blast9 outputs.

                  Yes the number of output lines (wc -l) for psl and blast9 output varies. I actually extracted only the results excluding the commented lines from blast9 output. In the blast9 output I get more results. I am not able to figure out the reason though I believe they output each block which overlap in psl output into separate lines or results. I will check that part and get back to you. When I look at the first line of psl output, it gices as
                  "psLayout Version3". Please let me know if you need furrther details and sorry if I have created any confusion.
                  Regards
                  Last edited by seq_GA; 03-19-2009, 06:59 AM.

                  Comment


                  • #10
                    Hi seq_GA,
                    Thanks for sharing your experience. Did your final code work? Could you paste your code here? This will certainly be of great help for newbies.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-25-2024, 11:49 AM
                    0 responses
                    19 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-24-2024, 08:47 AM
                    0 responses
                    17 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    62 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X