  • nanto
    Member
    • Sep 2012
    • 19

    read quality

    I'm trying to retrieve the average quality of each read to make graphs of read length vs. quality. I don't want to use fastx or FastQC; I want the data so I can make the graphs myself and adjust the scales. I retrieved the sequence lengths, which was trivial. I have the Phred qualities in a qual file, but I have no idea how to turn those numbers into an average. I tried numpy average, but it constantly wants something different, so before giving up I wanted to ask here.
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Just read in the qual file, parse each line (or whatever) into an array, and run the numpy average on that.
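A minimal sketch of the approach in #2, assuming the qual file is FASTA-style: a `>` header line followed by whitespace-separated integer Phred scores, possibly wrapped over several lines. The function name `read_qual` and the sample data are made up for illustration. One common reason averaging functions complain is that the scores are still strings, so the sketch converts them to `int` first; the resulting list could equally be passed to `numpy.average`.

```python
import io

def read_qual(handle):
    """Yield (read_id, [int scores]) for each record in a FASTA-style qual file."""
    read_id, scores = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if read_id is not None:
                yield read_id, scores
            header = line[1:].split()
            read_id, scores = (header[0] if header else ""), []
        else:
            # Scores arrive as text; convert to int before averaging.
            scores.extend(int(tok) for tok in line.split())
    if read_id is not None:
        yield read_id, scores

# Made-up example data standing in for a real qual file.
example = io.StringIO(">read1\n40 40 30\n20 20\n>read2\n10 20 30\n")
for read_id, scores in read_qual(example):
    print(read_id, len(scores), sum(scores) / len(scores))
```

Each record yields both its length and its scores, so the length and the mean quality come from the same pass over the file.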

    • nanto
      Member
      • Sep 2012
      • 19

      #3
      Hmm, I must have forgotten to add the part: please avoid "do it yourself" posts.

      • simonandrews
        Simon Andrews
        • May 2009
        • 870

        #4
        Originally posted by nanto
        I'm trying to retrieve the average quality of each read to make graphs of read length vs. quality. I don't want to use fastx or FastQC; I want the data so I can make the graphs myself and adjust the scales. I retrieved the sequence lengths, which was trivial. I have the Phred qualities in a qual file, but I have no idea how to turn those numbers into an average.
        For a single sequence, just divide the sum of the qualities by the sequence length. Mathematically speaking this is a slightly bizarre thing to do, since Phred scores are log-transformed probabilities and taking the mean of a log-transformed value is somewhat unconventional, but that's what everyone does.
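To illustrate the point about log-transformed values, here is a small sketch comparing the plain arithmetic mean of Phred scores with the Phred score of the mean error probability, using the standard definition Q = -10 * log10(p). The function names and the example scores are invented for illustration.

```python
import math

def mean_phred(scores):
    """Plain arithmetic mean of the Phred scores (what most tools report)."""
    return sum(scores) / len(scores)

def phred_of_mean_error(scores):
    """Convert each score to an error probability, average, convert back."""
    probs = [10 ** (-q / 10) for q in scores]
    return -10 * math.log10(sum(probs) / len(probs))

scores = [40, 40, 40, 10]  # three good bases and one poor one
print(mean_phred(scores))           # 32.5
print(phred_of_mean_error(scores))  # roughly 16.0
```

The single low-quality base dominates the probability-based figure, while the arithmetic mean hides it. Either can be plotted; the arithmetic mean is the conventional choice mentioned above.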

        • nanto
          Member
          • Sep 2012
          • 19

          #5
          No, it's not one sequence. I received a fairly new data set, and sequence quality might be affected by its length. I need some plots to show it, so I have to calculate the average for each sequence along with its corresponding length.

          • simonandrews
            Simon Andrews
            • May 2009
            • 870

            #6
            Originally posted by nanto
            No, it's not one sequence. I received a fairly new data set, and sequence quality might be affected by its length. I need some plots to show it, so I have to calculate the average for each sequence along with its corresponding length.
            Since quality generally diminishes with length anyway, it shouldn't be a surprise to see that the average quality of longer sequences is lower than that of shorter sequences. If you think there is a global effect, then you might want to compare equivalent positions in sequences of different lengths to see if there's a difference (e.g. is the average quality of position 5 in a 50 bp sequence different from the average quality of position 5 in a 100 bp sequence?).
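The position-by-position comparison could be sketched as follows, assuming the reads are already available as lists of integer Phred scores. The function name `per_position_means` and the toy reads are hypothetical.

```python
def per_position_means(records):
    """Map read length -> mean quality at each position, over reads of that length."""
    sums = {}    # length -> running per-position totals
    counts = {}  # length -> number of reads seen with that length
    for scores in records:
        n = len(scores)
        if n not in sums:
            sums[n] = [0] * n
            counts[n] = 0
        for i, q in enumerate(scores):
            sums[n][i] += q
        counts[n] += 1
    return {n: [total / counts[n] for total in totals]
            for n, totals in sums.items()}

# Toy data: two reads of length 2 and one of length 3.
reads = [[30, 20], [10, 40], [40, 30, 20]]
profiles = per_position_means(reads)
print(profiles[2][0], profiles[3][0])  # 20.0 40.0
```

Here `profiles[50][4]` and `profiles[100][4]` would be the two numbers to compare for position 5 of 50 bp and 100 bp reads, if reads of those lengths are present.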

            Coming back to your original question, to keep track of the qualities for different lengths you'd just need to make a 2D dataset where you had something like a hash of arrays, where the hash key was the length and the array held the set of average quality values for sequences with that length. Depending on how wide your range of lengths was you might want to bin them rather than tracking every length separately.
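In Python the "hash of arrays" described here would be a dict of lists. A minimal sketch, assuming the reads are lists of integer Phred scores; `group_means_by_length` and its `bin_width` parameter are names chosen for illustration, with `bin_width=1` meaning no binning.

```python
from collections import defaultdict

def group_means_by_length(records, bin_width=1):
    """Map (binned) read length -> list of per-read mean qualities."""
    groups = defaultdict(list)
    for scores in records:
        if not scores:
            continue  # an empty read has no mean quality
        # Round the length down to the start of its bin.
        key = (len(scores) // bin_width) * bin_width
        groups[key].append(sum(scores) / len(scores))
    return dict(groups)

# Toy data: read lengths 2, 3 and 12, binned into steps of 10.
reads = [[30, 20], [40, 30, 20], [20] * 12]
print(group_means_by_length(reads, bin_width=10))
# {0: [25.0, 30.0], 10: [20.0]}
```

Each list of means can then be handed to a plotting library as one box in a boxplot, with the dict key as the axis label.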

            • nanto
              Member
              • Sep 2012
              • 19

              #7
              The funny thing is that I know what it will look like; I just have to do it to visualize the data.
              The idea of filtering by length and computing per-base qualities is good. I think I will use boxplots for each subset.
              As for the hash of arrays, I don't think that's necessary. My file is sorted, so every result is in a known position. When loading the data to produce the graph, the length at a given position will correspond to the sequence quality at the same position.
              So what I need is a simple piece of a script, in Python, Perl, whatever, that will read my qual file and write to a new file only the average value for each record, separated by \n.

              But it's good to get some new ideas about what I can extract from this data and how to show it so that it looks at least interesting.
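A sketch of the kind of script requested in #7, under the assumption that the qual file is FASTA-style (`>` header lines followed by whitespace-separated integer scores). All function and file names are hypothetical. Because the plan is to match averages to lengths by position, an empty record is written as `nan` rather than skipped, so line N of the output still belongs to record N of the input.

```python
import os
import tempfile

def mean_qualities(lines):
    """Yield one mean Phred score per record, in file order."""
    scores = None  # stays None until the first header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if scores is not None:
                yield _mean(scores)
            scores = []
        elif line and scores is not None:
            scores.extend(int(tok) for tok in line.split())
    if scores is not None:
        yield _mean(scores)

def _mean(scores):
    # NaN for an empty record keeps output lines aligned with input records.
    return sum(scores) / len(scores) if scores else float("nan")

def write_mean_qualities(qual_path, out_path):
    """Write one average per line, separated by newlines."""
    with open(qual_path) as src, open(out_path, "w") as dst:
        for mean in mean_qualities(src):
            dst.write(f"{mean:.2f}\n")

# Demo on a made-up two-record file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    qual_path = os.path.join(tmp, "reads.qual")
    out_path = os.path.join(tmp, "mean_quals.txt")
    with open(qual_path, "w") as fh:
        fh.write(">read1\n40 40 30\n20 20\n>read2\n10 20 30\n")
    write_mean_qualities(qual_path, out_path)
    with open(out_path) as fh:
        print(fh.read(), end="")  # 30.00 and 20.00, one per line
```

The two-decimal format is an arbitrary choice; the output file can be loaded alongside the existing file of lengths, line for line.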
