
  • Concatenate GFF Files

    Dear everyone,
    I went through the threads but couldn't find anyone trying to do the same thing as I am.

    I am working with around 13k GFF files that need to be concatenated into a single one. Normally, a simple "cat" function would do, but I am trying to actually turn all those files into a single one that will have a new coordinate system.

    For example, if one of the GFF files has annotations that range from 0 to 1000 kb, the next GFF file's table should be appended to that one and its annotations should begin at 1000 kb+.

    I've been looking everywhere for a way to do this and have had no luck.

    If anyone has any suggestions I'd greatly appreciate it.

    Thanks a bunch!

  • #2
    Just a Suggestion...
    You could concatenate the files with cat and then use sort on the coordinates to put them in order.

    Thanks
    --

    • #3
      Thanks for the quick reply muthu.

      I guess that's not a bad idea but it won't work for me. I think I wasn't being very clear.

      I have a different GFF file for each scaffold that I'm working with. What I am trying to do is put all the scaffolds together into a giant one and preserve the coordinate scheme. The problem is that each GFF file has its own coordinates starting at 0 and ending at some number. I need to make it so that I can merge all the GFF files and make a single file with continuous coordinates.
      Last edited by mgaldos; 05-28-2013, 03:31 PM.

      • #4
        If I understand it right, this is what you want to get to:
        Input:
        File 1: 0 - 1000
        File 2: 0 - 1000

        Output:
        File: 0 - 2000

        For this you must add the end coordinate of File 1 to all of File 2's coordinates [I'm just thinking aloud].
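
As a rough illustration of that arithmetic, here is a minimal Python sketch. The feature coordinates and the helper name `shift_features` are made up for the example; only the 0 - 1000 ranges come from the post above.

```python
def shift_features(features, offset):
    """Return (start, end) pairs moved forward by `offset`."""
    return [(start + offset, end + offset) for start, end in features]

# Hypothetical features for two files, each spanning 0 - 1000.
file1 = [(0, 400), (500, 1000)]
file2 = [(0, 250), (300, 1000)]

# The offset for File 2 is the largest end coordinate seen in File 1.
offset = max(end for _, end in file1)

merged = file1 + shift_features(file2, offset)
print(merged)
```

With these inputs the last merged feature ends at 2000, matching the "0 - 2000" output described above.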

        If you could run head -n 5 and tail -n 5 on both files and paste the output here, it would be easier.

        Thanks
        --
        Muthu

        • #5
          That's exactly what I'm trying to do, except that it's for 13,000 files. I'll send you the top and bottom of the first two files tomorrow since I don't have them with me right now.

          Thanks a bunch!
          Last edited by mgaldos; 05-28-2013, 03:43 PM. Reason: Forgot to change something as I typed

          • #6
            Sure, once you have the head and tail of 3 Files... We could figure out some code that could concatenate all your 13000 files into one.

            Thanks
            --
            Muthu

            • #7
              unfortunately these gene annotation formats are not strictly sorted by position, so checking the end of the file for the offset value for the next file may not be reliable. additionally, it may be a hassle to sort the files by position to find that value, because then you'll have to re-sort them back by feature.

              i think a brute force approach may be appropriate. for example: parse the first file and find the maximum position value in the 5th column (feature end coordinate) while at the same time printing its content out to the new concatenated file. increment that maximum position and then parse the second file, translating its coordinates by that offset while simultaneously tracking the maximum position from its translated coordinates to use for the next file.

              it's quite possible this will work (or at least it's a good start). You want to pass all of the GTF file names to it at once, so the usage string I included at the top is appropriate. if your GTF files are scattered around in folders you could replace 'ls *.gtf' with 'find . -name "*.gtf"' run from the topmost folder containing them all. hope it works!

              Code:
              #!/usr/bin/perl
              #
              # concatenates  and translates GFF/GTF and sends output to stdout
              # as it goes
              #
              # WARNING: UNTESTED
              #
              # Usage: ls *.gtf | xargs ./this-script.pl > concatenated.gtf
              #
              
              use strict;
              
              my $offset = 0;
              my $max_pos = 0;
              my @arl;
              my $fname;
              
              #
              # get first offset
              #
              
              $fname = shift @ARGV;
              open FIN, '<', $fname or die($!);
              while(<FIN>) {
              	# print out
              	print STDOUT $_;
              	
              	# process offset
              	chomp;
              	@arl = split(/\t/);
              	if($arl[4] > $max_pos) {
              		$max_pos = $arl[4];
              	}
              }
              
              close FIN;
              
              # shift offset forward a base
              $offset = $max_pos+1;
              
              while(scalar @ARGV) {
              
              	$fname = shift @ARGV;
              	$max_pos = 0;
              	open FIN, '<', $fname or die($!);
              	
              	while(<FIN>) {
              		# pass comment and blank lines through untouched
              		if(/^#/ || /^\s*$/) {
              			print STDOUT $_;
              			next;
              		}
              		chomp;
              		@arl = split(/\t/);
              		
              		# translate this line's coordinates
              		$arl[3] += $offset;
              		$arl[4] += $offset;
              		
              		# update max position from translated file
              		if($arl[4] > $max_pos) {
              			$max_pos = $arl[4];
              		}
              		
              		# print translated line out
              		print STDOUT join("\t", @arl) . "\n";
              		
              	}
              	
              	close FIN;
              	
              	# update offset for next file
              	$offset = $max_pos + 1;
              
              }
              Last edited by sdriscoll; 05-28-2013, 05:24 PM. Reason: forgot something
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              • #8
                perl fu bwahaha

                assuming all your files are in the contig directory, how about this perl one-liner:

                Code:
                perl -F'\t' -lape '$F[4]+=$o; $F[3]+=$o; $_=join("\t",@F); $m=($m,$F[4])[$m < $F[4]]; $o=$m if eof;'  contig/*.gff > contigs.gff
                The approach is to add an offset, $o, to the 4th and 5th columns, resetting the offset at each file boundary to $m, the maximum shifted value seen so far in column 5.
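
For readers who prefer something more spelled out, the same idea might look like the following Python sketch. It is not from the thread: the function name `concatenate_gff` is hypothetical, and it assumes tab-separated GFF lines with start and end in columns 4 and 5, lines starting with `#` that should pass through unchanged, and an offset equal to the largest shifted end seen so far (as in the one-liner, with no extra +1).

```python
import io

def concatenate_gff(handles, out):
    """Write every file to `out`, shifting columns 4 and 5 by a running offset."""
    offset = 0
    for handle in handles:
        max_end = offset
        for line in handle:
            line = line.rstrip("\n")
            # Comment and blank lines carry no coordinates; copy them as-is.
            if not line or line.startswith("#"):
                out.write(line + "\n")
                continue
            fields = line.split("\t")
            fields[3] = str(int(fields[3]) + offset)
            fields[4] = str(int(fields[4]) + offset)
            max_end = max(max_end, int(fields[4]))
            out.write("\t".join(fields) + "\n")
        # The next file starts after the largest end coordinate written so far.
        offset = max_end

# Tiny in-memory example with made-up records.
first = io.StringIO("s1\tsrc\tgene\t1\t100\t.\t+\t.\tID=a\n")
second = io.StringIO("# comment\ns2\tsrc\tgene\t1\t50\t.\t+\t.\tID=b c\n")
result = io.StringIO()
concatenate_gff([first, second], result)
print(result.getvalue(), end="")
```

Note that the largest feature end is only a stand-in for the scaffold length; if a scaffold extends past its last annotated feature, the true lengths would have to come from somewhere else, such as the sequence file.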

                • #9
                  ah, very nice! now there's a readable and non-readable option.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  • #10
                    Wow guys, this is awesome. I'll try out the suggestions now and let you know how it all worked.

                    Thanks a lot!

                    • #11
                      Alright, I tried both suggestions but both gave me the same error:

                      -bash: /usr/bin/perl: Argument list too long

                      I think that perl just refuses to work with 13,000 files at a time. Is there any way to bypass this?

                      • #12
                        Hi,
                        If the issue is only the number of files, you could try combining them in batches of 100, which will leave you with 130 combined files (13000/100). You could then combine those 130 files into 1 file.
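
The "Argument list too long" message usually comes from the operating system's limit on command-line length, which the shell hits when it expands the glob, rather than from Perl itself. One way around it, sketched here in Python under that assumption, is to let the script list the files itself and, if wanted, split them into batches. The helper names `list_gff_files` and `batches`, and the `scaffold_N.gff` file names, are hypothetical.

```python
import tempfile
from pathlib import Path

def list_gff_files(directory, pattern="*.gff"):
    """Expand the pattern inside Python, so no long command line is ever built."""
    return sorted(Path(directory).glob(pattern))

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Small demonstration with five empty, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for n in range(5):
        (Path(tmp) / f"scaffold_{n}.gff").touch()
    paths = list_gff_files(tmp)
    names = [p.name for p in paths]
    groups = [[p.name for p in g] for g in batches(paths, 2)]

print(names)
print(groups)
```

The paths can then be handed, all at once or batch by batch, to whichever concatenation routine is used. `sorted` orders names as plain strings, so `scaffold_10.gff` would come before `scaffold_2.gff`; if scaffold order matters, a numeric sort key would be needed.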
                        Last edited by muthu545; 05-29-2013, 09:03 AM.
