
**rhinoceros** · 10-05-2013, 04:56 PM

How about

Code:

awk -F_ '$6>=1' file
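A minimal sketch of that filter, assuming header lines shaped like NODE_<n>_length_<len>_cov_<cov>_ID_<id>, so that the coverage is the sixth underscore-separated field. The sample headers and the min_cov variable are invented for illustration. The separator is set with -F_ before any line is read, since an FS assignment inside the main block only applies from the next record on.

```shell
# Hypothetical sample headers (values invented for illustration)
headers='NODE_1_length_1200_cov_0.52_ID_11
NODE_2_length_800_cov_3.10_ID_22
NODE_3_length_450_cov_1.00_ID_33'

min_cov=1   # assumed coverage threshold

# -F_ splits each line on underscores; $6 is the value after "cov"
filtered=$(printf '%s\n' "$headers" | awk -F_ -v min="$min_cov" '$6 >= min')

printf '%s\n' "$filtered"
```

Passing the threshold in with -v keeps the awk program itself unchanged when you want a different cutoff.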

**AdrianP** · 10-05-2013, 05:00 PM

Originally posted by rhinoceros

How about

Code:

awk -F_ '$6>=1' file

You sir, are awesome.

**AdrianP** · 10-07-2013, 07:32 AM

How would I use a similar script to sort the contigs by length and show me the top 10 or 20 lengths?

**winsettz** · 10-07-2013, 07:46 AM

Could try adding to the above:

Code:

sort -r -n file | head

In short:

Code:

sort -r -n file

sorts the contents of file: -r reverses the order (highest to lowest) and -n compares the values numerically. Without -n the values are compared as text, so an ascending sort gives 1, 10, 100, 2, 20, 200, 3, 30, 300.

head displays just the first 10 lines of a file or of stdin (so the pipeline doesn't dump a pile of numbers onto the terminal). To change the number of lines displayed, use -n. For example,

Code:

head -n 20

displays the first 20 lines of a file.
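A small sketch of the difference, using invented numbers rather than real contig data. It runs the same values through a plain text sort and through a reversed numeric sort, then keeps the top three with head. LC_ALL=C is set only to make the text comparison predictable across locales.

```shell
# Invented values, one per line
nums=$(printf '%s\n' 3 100 20 1 300 2 10 200 30)

# Text comparison: "10" sorts before "2" because the character 1 < 2
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort | paste -sd ' ' -)

# Numeric comparison, highest first, keeping only the top three lines
top3=$(printf '%s\n' "$nums" | sort -r -n | head -n 3 | paste -sd ' ' -)

echo "lexical: $lexical"
echo "top 3:   $top3"
```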

**AdrianP** · 10-07-2013, 07:55 AM

Sorry, but the above shows contig names. How will sort know what to sort by when it's given the full name of the contig?

**winsettz** · 10-07-2013, 08:04 AM

Oh, whoops.

Contents of file I wish to process

Code:

NODE_2361_length_509_cov_1.43018_ID_745236
NODE_2361_length_509_cov_5.43018_ID_745236
NODE_2361_length_509_cov_3.43018_ID_745236
NODE_2361_length_509_cov_2.43018_ID_745236
NODE_2361_length_509_cov_21.43018_ID_745236

The one-liner

Code:

sed 's/_/ /g' filename | awk '{print $6}' | sort -r -n | head -n 10

sed replaces every _ with a space and writes the result to stdout; the | pipes that output to awk.

awk splits each line into columns on whitespace, and print $6 tells it to print column six (the number after cov but before ID).
You can then pipe that to sort, and on to head to display just the lines you want.

The output

Code:

21.43018
5.43018
3.43018
2.43018
1.43018

Obviously not the only solution, and there are probably better ones.
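One possible alternative, sketched with the five sample headers from this post: let sort split on the underscore itself, so the full contig name stays attached to its coverage value. -t _ sets the field separator and -k 6,6 restricts the key to the sixth field. The variable names are only for illustration, and LC_ALL=C is there so the decimal point is read the same way in any locale.

```shell
headers='NODE_2361_length_509_cov_1.43018_ID_745236
NODE_2361_length_509_cov_5.43018_ID_745236
NODE_2361_length_509_cov_3.43018_ID_745236
NODE_2361_length_509_cov_2.43018_ID_745236
NODE_2361_length_509_cov_21.43018_ID_745236'

# -k 6,6nr: numeric and reversed, on field 6 only (the coverage)
by_cov=$(printf '%s\n' "$headers" | LC_ALL=C sort -t _ -k 6,6nr | head -n 3)

printf '%s\n' "$by_cov"
```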

--------------
Edit: If you're dealing with fasta, you will have to separate the headers from the sequences first. The headers have the form >NODE_...; and can be captured with

Code:

grep "^>NODE" contigs.fasta > contignames
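To tie this back to the earlier question about lengths, here is a rough sketch that goes from a FASTA file to the longest contigs. The tiny contigs.fasta written below is invented, and the sketch assumes every header follows the >NODE_<n>_length_<len>_cov_<cov>_ID_<id> pattern, with the length in the fourth underscore-separated field.

```shell
workdir=$(mktemp -d)

# Invented three-contig FASTA, just to have something to run on
cat > "$workdir/contigs.fasta" <<'EOF'
>NODE_1_length_12_cov_4.5_ID_1
ACGTACGTACGT
>NODE_2_length_20_cov_1.2_ID_3
ACGTACGTACGTACGTACGT
>NODE_3_length_8_cov_9.0_ID_5
ACGTACGT
EOF

# Keep only the header lines
grep "^>NODE" "$workdir/contigs.fasta" > "$workdir/contignames"

# Sort on field 4 (the length), numeric, longest first; show the top 2
longest=$(sort -t _ -k 4,4nr "$workdir/contignames" | head -n 2)

printf '%s\n' "$longest"
rm -r "$workdir"
```

Change head -n 2 to head -n 10 or head -n 20 for a real assembly.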

**rhinoceros** · 10-07-2013, 08:26 AM

I would just pipe the awk output to:

Code:

sort -r -g -k 4 -t _

When you have an idea which tool might work but don't know exactly how to use it, just google "man sort" (or whatever the tool is called). Also, with sort it's sometimes better to use -g (general numerical value) instead of -n (string numerical value). For example, -n doesn't handle exponents: it ranks 1e-10 below 2e-100, so sorting blast output by e-value would fail. Actually, I don't know if there's any reason why -g shouldn't always be used by default.
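A quick sketch of the difference between the two flags, using invented e-values. It assumes a sort that supports -g, as GNU sort does. As far as I know, the GNU documentation describes -g as slower than -n, which may be one reason not to use it everywhere.

```shell
# Invented e-values, as they might appear in a results column
evalues=$(printf '%s\n' 1e-10 2e-100 5e-3)

# -n reads only the leading 1, 2 and 5 and ignores the exponent
with_n=$(printf '%s\n' "$evalues" | LC_ALL=C sort -n | paste -sd ' ' -)

# -g parses each line as a full floating-point value
with_g=$(printf '%s\n' "$evalues" | LC_ALL=C sort -g | paste -sd ' ' -)

echo "sort -n: $with_n"
echo "sort -g: $with_g"
```

Only the -g result puts 2e-100, the smallest value, first.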


Help with script, sort by coverage
