**nickloman** · 03-19-2012, 11:24 AM

I haven't tried it out, but 'seqretsplit' from the EMBOSS package might do what you want. Otherwise it's a quick script in BioPerl or Biopython, e.g. in Biopython (untested):

Run like python splitgbk.py < input.gbk

Will create a file for each entry in the current directory.

-- splitgbk.py

Code:

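from Bio import SeqIO
import sys

# Read GenBank records from standard input, one file per record (<record id>.gbk)
for rec in SeqIO.parse(sys.stdin, "genbank"):
    # Passing a filename lets SeqIO.write open and close the file itself
    SeqIO.write(rec, rec.id + ".gbk", "genbank")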

**maubp** · 03-19-2012, 11:27 AM

If you want one file per record, try EMBOSS seqret and the -ossingle_outseq option.

http://emboss.open-bio.org/wiki/Appdoc:Seqret

EDIT: That probably does the same as EMBOSS seqretsplit suggested by Nick while I was writing this.

http://emboss.open-bio.org/wiki/Appdoc:Seqretsplit

Do you just want to break it up into batches, say 10 records in each file? Or, do you have a particular order in mind (which could involve either sorting or random access).

**nickloman** · 03-19-2012, 11:31 AM

Come on Peter, I caught you napping again

**Richard Finney** · 03-19-2012, 12:02 PM

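Split GenBank files using awk:

awk -v n=1 '{print > ("out" n)} /^\/\//{close("out" n); n++}' yourfilename.gbk

Splits yourfilename.gbk into multiple files (out1, out2, ...) by splitting at the "//" (end of record) line. Each line is printed before the check, so every output file keeps its closing "//".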

**thmourikis** · 05-17-2013, 06:04 AM

Hi all,

I have the same problem but I want to split the file every 1000 entries. My file has 500,000 records and I want 500 files of 1000 records each. Any suggestions?

Thanks in advance.
Thanos

**maubp** · 05-17-2013, 06:19 AM

Thanos - which scripting languages do you know? GenBank records end with a // line (which is what Richard's awk command exploits) so it is very simple to split up a file into sub-files named however you like using Perl, Python or Ruby.
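To illustrate the point about the // line, here is a minimal plain-Python sketch (no Biopython needed) that writes a fixed number of records per output file by watching for the terminator. The function name `split_genbank`, the `batch_size` and `prefix` parameters, and the `prefix_N.gbk` naming scheme are made up for this example; it also assumes that no line other than the record terminator starts with //.

```python
def split_genbank(lines, batch_size=1000, prefix="batch"):
    """Write GenBank records from an iterable of lines into batch files.

    Each output file gets up to batch_size complete records, including
    their closing // lines. Returns the list of file names written.
    The prefix_N.gbk naming scheme is only an illustration.
    """
    written = []
    out = None
    in_batch = 0  # complete records written to the current file
    for line in lines:
        if out is None:
            if not line.strip():
                continue  # ignore blank lines between batches or at the end
            name = f"{prefix}_{len(written) + 1}.gbk"
            out = open(name, "w")
            written.append(name)
        out.write(line)
        if line.startswith("//"):  # end of one record
            in_batch += 1
            if in_batch == batch_size:
                out.close()
                out = None
                in_batch = 0
    if out is not None:
        out.close()  # last, possibly partial, batch
    return written
```

Something like `split_genbank(open("input.gbk"), batch_size=1000)` would then produce batch_1.gbk, batch_2.gbk and so on in the current directory.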

**thmourikis** · 05-17-2013, 06:29 AM

Hi Peter and thank you for your immediate reply.

I currently use Perl (not very experienced though). I guess I can try to alter Richard's awk command and implement it in a Perl script for renaming etc.

Thank you once again.

**Richard Finney** · 05-20-2013, 06:36 AM

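awk -v n=1 -v p=0 '{print > ("out" n)} /^\/\//{p++; if (p%1000==0) {close("out" n); n++}}' yourfilename.gbk

Splits into files of 1000 records each (out1, out2, ...). Every record keeps its closing "//" line, including the last one in each file.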
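A quick way to sanity-check the result is to count the // terminator lines in each output file. This is only a rough sketch: the helper names `count_records` and `report` are invented here, and the default `out*` pattern assumes the out1, out2, ... file names from the awk command.

```python
import glob


def count_records(path):
    """Count GenBank record terminators (lines starting with //) in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith("//"))


def report(pattern="out*"):
    """Map each file matching the glob pattern to its record count."""
    return {name: count_records(name) for name in sorted(glob.glob(pattern))}
```

Calling `report()` in the directory with the split files should give 1000 for every file except possibly the last one.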

**thmourikis** · 05-20-2013, 06:39 AM

Thanks a lot Richard! I really appreciate that!

Best,
Thanos

splitting big genbank file
