Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • seqseq
    Junior Member
    • Oct 2009
    • 7

    Unix one-line command for N50 size

    Is there a good one-liner for calculating the N50 size of a list of contig sizes? Average size is easy with awk but I didn't manage to find an easy way for the N50 size so far. Thanks!
  • quinlana
    Senior Member
    • Sep 2008
    • 119

    #2
    If you have the mean, you can clearly grep/awk your relevant file for the correct data to compute N50. Why not just write a little generic statistics program that computes descriptive stats for a file or for data passed via stdin? Here's a Python function for median (or you could use numpy) that you could dump into a little script. This would compute the N50...

    PHP Code:
    def median (numlist):
        
    """
        Abstract: Returns the median value of the passed list of numbers. 
        Usage:    median(numlist)
        """

        
    numlist.sort()
        
    # take the mean of the two middle elements if there are an even number
        # of elements.  otherwise, take the middle element
        
    if len(numlist) % == 0:
            
    medianpos len(numlist)/2  
            median 
    float(numlist[medianpos] + numlist[medianpos-1]) /2
        
    else:
            
    medianpos len(numlist)/2
            median 
    numlist[medianpos]
        return 
    median 
    The upside of writing such a program is that, if written correctly, it can be re-used for asking basic questions of your data over and over again.

    Good luck,
    Aaron

    Comment

    • seqseq
      Junior Member
      • Oct 2009
      • 7

      #3
      Well, I am looking for a Unix command-line solution.. (btw if I understand your code correctly it is calculating the median, but median is not the same as N50 size)

      Comment

      • quinlana
        Senior Member
        • Sep 2008
        • 119

        #4
        My apologies, my recollection was that N50 was merely the median size. By actually looking up the definition, I see it's not.

        Comment

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          You can call Python scripts from the command line, and pipe stdin and stout too - great for use at the Unix command line.

          Are you planning to supply the list of lengths as space separated integers? e.g. "120 2500 745 63 ..."? That would be easy to parse in Python from a string into a list of integers.

          Comment

          • seqseq
            Junior Member
            • Oct 2009
            • 7

            #6
            It comes up in several occasions, e.g. a file containing

            contig1 length=1000 reads=582
            contig2 length=1500 reads=900
            contig3 length=1400 reads=766
            ...

            I want to do something like:
            $ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size

            The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.

            Comment

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              OK - so the stdin is one integer per line. How about a python script like this,
              see also http://seqanswers.com/forums/showthread.php?t=2332

              Code:
              #!/usr/bin/python
              import sys
              
              def N50(numlist):
                  """
                  Abstract: Returns the N50 value of the passed list of numbers. 
                  Usage:    N50(numlist)
              
                  Based on the Broad Institute definition:
                  https://www.broad.harvard.edu/crd/wiki/index.php/N50
                  """
                  numlist.sort()
                  newlist = []
                  for x in numlist :
                      newlist += [x]*x
                  # take the mean of the two middle elements if there are an even number
                  # of elements.  otherwise, take the middle element
                  if len(newlist) % 2 == 0:
                      medianpos = len(newlist)/2  
                      return float(newlist[medianpos] + newlist[medianpos-1]) /2
                  else:
                      medianpos = len(newlist)/2
                      return newlist[medianpos]
              
              assert N50([2, 2, 2, 3, 3, 4, 8, 8]) == 6
              
              lengths = []
              for line in sys.stdin :
                  lengths.append(int(line))
              print N50(lengths)
              Then at the Unix command line, you could use it like this:
              Code:
              $ grep "^>" 454AllContigs.fna | cut -d"=" -f2 | cut -d" " -f1 | ./stdin_N50.py 
              386
              Last edited by maubp; 10-16-2009, 06:23 AM. Reason: Switched from median to N50

              Comment

              • seqseq
                Junior Member
                • Oct 2009
                • 7

                #8
                What about the awk experts out there? Any idea for the direct command-line version without script? (you have to store the script somewhere, and if you are using another computer you don't have it at hand... I actually really prefer code in the command line instead of applying a script - if possible..)

                Comment

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  I rather suspect that an awk solution would be so long that it too would need to be stored in a script. A one liner would be impressive.

                  Comment

                  • sklages
                    Senior Member
                    • May 2008
                    • 628

                    #10
                    Originally posted by seqseq View Post
                    It comes up in several occasions, e.g. a file containing

                    contig1 length=1000 reads=582
                    contig2 length=1500 reads=900
                    contig3 length=1400 reads=766
                    ...

                    I want to do something like:
                    $ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size

                    The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.

                    Code:
                    perl -lanF'[\s=]' -e 'push(@contigs,$F[2]);$total+=$F[2];END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50  : $L";exit;}  ;}}' file
                    This is a quick shot. There might be shorter solution. Test it to see wether I calculate correctly ..

                    cheers,
                    Sven

                    Comment

                    • nhansen
                      Junior Member
                      • Sep 2009
                      • 6

                      #11
                      Here's a stab at a one-liner :

                      sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'

                      For a file "contig_lengths.txt" with one contig size per line, this prints out the length of the N50 contig, along with the number of bases in contigs this size or greater.

                      Comment

                      • maubp
                        Peter (Biopython etc)
                        • Jul 2009
                        • 1544

                        #12
                        Impressive

                        Assuming no errors, these perl and awk solutions do qualify as "Unix one line" solution, but they are too long to be typed by hand, and cut-and-pasting is asking for errors.

                        I personally would go a short script (in user's language of choice), probably reading the FASTA file directly.

                        Comment

                        • sklages
                          Senior Member
                          • May 2008
                          • 628

                          #13
                          Originally posted by maubp View Post
                          I personally would go a short script (in user's language of choice), probably reading the FASTA file directly.
                          I do totally agree, that's probably the way we all do it :-)

                          cheers,
                          Sven

                          Comment

                          • nhansen
                            Junior Member
                            • Sep 2009
                            • 6

                            #14
                            I actually prefer a script too, both for readability and accuracy, but sometimes it's fun to exercise your shell skills. :-)

                            Comment

                            • seqseq
                              Junior Member
                              • Oct 2009
                              • 7

                              #15
                              Originally posted by nhansen View Post
                              sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
                              Great! The first part could be shortened into: awk '{sum += $0; print $0, sum}'

                              So in total:
                              sort -rn contig_lengths.txt | awk '{sum += $0; print $0, sum}' | tac | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 10:09 AM
                              0 responses
                              9 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              26 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...