**muthu545** · 05-28-2013, 03:18 PM

Just a Suggestion...
You could use cat to concatenate the files and then sort on the coordinates to put them in order.

Thanks
--

**mgaldos** · 05-28-2013, 03:23 PM

Thanks for the quick reply muthu.

I guess that's not a bad idea but it won't work for me. I think I wasn't being very clear.

I have a different GFF file for each scaffold that I'm working with. What I am trying to do is put all the scaffolds together into a giant one and preserve the coordinate scheme. The problem is that each GFF file has its own coordinates starting at 0 and ending at some number. I need to make it so that I can merge all the GFF files and make a single file with continuous coordinates.

**muthu545** · 05-28-2013, 03:34 PM

If I understand it right, what you want to get to....
Input:
File 1: 0 - 1000
File 2: 0 - 1000

Output:
File: 0 - 2000

For this you must add the end coordinate of File 1 to all of the File 2 coordinates [I'm just thinking aloud].

If you could run head -n 5 and tail -n 5 on both files and paste the output here, it would be easier.

Thanks
--
Muthu

**mgaldos** · 05-28-2013, 03:42 PM

That's exactly what I'm trying to do, except that it's for 13,000 files. I'll send you the top and bottom from the first two files tomorrow since I don't have them with me right now.

Thanks a bunch!

**muthu545** · 05-28-2013, 03:46 PM

Sure, once you have the head and tail of 3 files, we can figure out some code to concatenate all 13000 of your files into one.

Thanks
--
Muthu

**sdriscoll** · 05-28-2013, 05:16 PM

unfortunately these gene annotation formats are not strictly sorted by position, so checking the end of the file for the offset value for the next file may not be reliable. additionally, it may be a hassle to sort the files by position to find that value because then you'll have to re-sort them back by feature.

i think a brute force attack may be appropriate. for example: parse the first file and find the maximum position value in the 5th column (feature end coordinate) while at the same time printing its content out to the new concatenated file. increment that maximum position to get the offset, then parse the second file, translating its coordinates by that offset while simultaneously tracking the maximum position from its translated coordinates to use for the next file.

it's quite possible the script below will work (or at least it's a good start). you want to pass all of the GTF file names to it at once, so the usage string I included at the top is appropriate. if your GTF files are scattered around in folders you could replace 'ls *.gtf' with 'find . -name "*.gtf"' run from the top-level folder that contains them all. hope it works!

Code:

#!/usr/bin/perl
#
# concatenates  and translates GFF/GTF and sends output to stdout
# as it goes
#
# WARNING: UNTESTED
#
# Usage: ls *.gtf | xargs ./this-script.pl > concatenated.gtf
#

use strict;

my $offset = 0;
my $max_pos = 0;
my @arl;
my $fname;

#
# get first offset
#

$fname = shift @ARGV;
open FIN, '<', $fname or die($!);
while(<FIN>) {
	# print out
	print STDOUT $_;
	
	# process offset
	chomp;
	@arl = split(/\t/);
	if($arl[4] > $max_pos) {
		$max_pos = $arl[4];
	}
}

close FIN;

# shift offset forward a base
$offset = $max_pos+1;

while(scalar @ARGV) {

	$fname = shift @ARGV;
	$max_pos = 0;
	open FIN, '<', $fname or die($!);
	
	while(<FIN>) {
		chomp;
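		@arl = split(/\t/);
		
		# pass comment/header lines and short lines through unchanged
		if(/^#/ || scalar(@arl) < 5) {
			print STDOUT "$_\n";
			next;
		}
		
		# translate this line's coordinates
		$arl[3] += $offset;
		$arl[4] += $offset;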
		
		# update max position from translated file
		if($arl[4] > $max_pos) {
			$max_pos = $arl[4];
		}
		
		# print translated line out
		print STDOUT join("\t", @arl) . "\n";
		
	}
	
	close FIN;
	
	# update offset for next file
	$offset = $max_pos + 1;

}

**malcook** · 05-28-2013, 10:05 PM

perl fu bwahaha

assuming all your files are in the contig directory, how about this perl one-liner:

Code:

perl -F'\t' -lape '$F[4]+=$o; $F[3]+=$o; $_=join("\t",@F); $m=($m,$F[4])[$m < $F[4]]; $o=$m if eof;' contig/*.gff > contigs.gff

the approach is to add an offset, $o, to the 4th and 5th columns, resetting the offset to the maximum shifted value seen in column 5, $m, at each file boundary.

**sdriscoll** · 05-28-2013, 10:12 PM

ah, very nice! now there's a readable and non-readable option.

**mgaldos** · 05-29-2013, 08:13 AM

Wow guys, this is awesome. I'll try out the suggestions now and let you know how it all worked.

Thanks a lot!

**mgaldos** · 05-29-2013, 08:28 AM

Alright, I tried both suggestions but both gave me the same error:

-bash: /usr/bin/perl: Argument list too long

I think that perl just refuses to work with 13000 files at a time. Is there any way to bypass this?

**muthu545** · 05-29-2013, 08:55 AM

Hi,
If the issue is only the number of files, you could try combining them in batches of 100, which will leave you with 130 combined files (13000/100). Then you could combine these 130 files into one file.


Concatenate GFF Files
