Is there a good one-liner for calculating the N50 size of a list of contig sizes? Average size is easy with awk but I didn't manage to find an easy way for the N50 size so far. Thanks!
Unconfigured Ad
Collapse
X
-
If you have the mean, you can clearly grep/awk your relevant file for the correct data to compute N50. Why not just write a little generic statistics program that computes descriptive stats for a file or for data passed via stdin? Here's a Python function for median (or you could use numpy) that you could dump into a little script. This would compute the N50...
The upside of writing such a program is that, if written correctly, it can be re-used for asking basic questions of your data over and over again.PHP Code:def median (numlist):
"""
Abstract: Returns the median value of the passed list of numbers.
Usage: median(numlist)
"""
numlist.sort()
# take the mean of the two middle elements if there are an even number
# of elements. otherwise, take the middle element
if len(numlist) % 2 == 0:
medianpos = len(numlist)/2
median = float(numlist[medianpos] + numlist[medianpos-1]) /2
else:
medianpos = len(numlist)/2
median = numlist[medianpos]
return median
Good luck,
Aaron
-
-
You can call Python scripts from the command line, and pipe stdin and stout too - great for use at the Unix command line.
Are you planning to supply the list of lengths as space separated integers? e.g. "120 2500 745 63 ..."? That would be easy to parse in Python from a string into a list of integers.
Comment
-
-
It comes up in several occasions, e.g. a file containing
contig1 length=1000 reads=582
contig2 length=1500 reads=900
contig3 length=1400 reads=766
...
I want to do something like:
$ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size
The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.
Comment
-
-
OK - so the stdin is one integer per line. How about a python script like this,
see also http://seqanswers.com/forums/showthread.php?t=2332
Then at the Unix command line, you could use it like this:Code:#!/usr/bin/python import sys def N50(numlist): """ Abstract: Returns the N50 value of the passed list of numbers. Usage: N50(numlist) Based on the Broad Institute definition: https://www.broad.harvard.edu/crd/wiki/index.php/N50 """ numlist.sort() newlist = [] for x in numlist : newlist += [x]*x # take the mean of the two middle elements if there are an even number # of elements. otherwise, take the middle element if len(newlist) % 2 == 0: medianpos = len(newlist)/2 return float(newlist[medianpos] + newlist[medianpos-1]) /2 else: medianpos = len(newlist)/2 return newlist[medianpos] assert N50([2, 2, 2, 3, 3, 4, 8, 8]) == 6 lengths = [] for line in sys.stdin : lengths.append(int(line)) print N50(lengths)
Code:$ grep "^>" 454AllContigs.fna | cut -d"=" -f2 | cut -d" " -f1 | ./stdin_N50.py 386
Comment
-
-
What about the awk experts out there? Any idea for the direct command-line version without script? (you have to store the script somewhere, and if you are using another computer you don't have it at hand... I actually really prefer code in the command line instead of applying a script - if possible..)
Comment
-
-
Originally posted by seqseq View PostIt comes up in several occasions, e.g. a file containing
contig1 length=1000 reads=582
contig2 length=1500 reads=900
contig3 length=1400 reads=766
...
I want to do something like:
$ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size
The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.
This is a quick shot. There might be shorter solution. Test it to see wether I calculate correctly ..Code:perl -lanF'[\s=]' -e 'push(@contigs,$F[2]);$total+=$F[2];END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50 : $L";exit;} ;}}' file
cheers,
Sven
Comment
-
-
Here's a stab at a one-liner :
sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
For a file "contig_lengths.txt" with one contig size per line, this prints out the length of the N50 contig, along with the number of bases in contigs this size or greater.
Comment
-
-
Impressive
Assuming no errors, these perl and awk solutions do qualify as "Unix one line" solution, but they are too long to be typed by hand, and cut-and-pasting is asking for errors.
I personally would go a short script (in user's language of choice), probably reading the FASTA file directly.
Comment
-
-
Great! The first part could be shortened into: awk '{sum += $0; print $0, sum}'Originally posted by nhansen View Postsort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
So in total:
sort -rn contig_lengths.txt | awk '{sum += $0; print $0, sum}' | tac | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
Comment
-
Latest Articles
Collapse
-
by SEQadmin2
Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.
The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
...-
Channel: Articles
06-02-2026, 10:05 AM -
-
by SEQadmin2
With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.
Introduction
Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...-
Channel: Articles
05-22-2026, 06:42 AM -
ad_right_rmr
Collapse
News
Collapse
| Topics | Statistics | Last Post | ||
|---|---|---|---|---|
|
Started by SEQadmin2, Yesterday, 10:09 AM
|
0 responses
9 views
0 reactions
|
Last Post
by SEQadmin2
Yesterday, 10:09 AM
|
||
|
Started by SEQadmin2, 06-04-2026, 08:59 AM
|
0 responses
17 views
0 reactions
|
Last Post
by SEQadmin2
06-04-2026, 08:59 AM
|
||
|
Started by SEQadmin2, 06-02-2026, 12:03 PM
|
0 responses
26 views
0 reactions
|
Last Post
by SEQadmin2
06-02-2026, 12:03 PM
|
||
|
Started by SEQadmin2, 06-02-2026, 11:40 AM
|
0 responses
21 views
0 reactions
|
Last Post
by SEQadmin2
06-02-2026, 11:40 AM
|
Comment