Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • apfejes
    Senior Member
    • Feb 2008
    • 236

    #16
    Thanks ECO.. it was original. (=

    In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.
    The more you know, the more you know you don't know. —Aristotle

    Comment

    • rdeborja
      Member
      • Aug 2008
      • 42

      #17
      Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

      Here's some sample usage:

      rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
      Processing file s_8_1_export.txt...
      Total collapsed unique aligned reads 656
      Completed processing file s_8_1_export.txt


      Hope it helps!

      Code:
      #!/usr/bin/perl
      
      #
       # Created by Richard de Borja on 2011-06-10.
       # Copyright (C) 2011 The Ontario Institute for Cancer Research
       #
       # This program is free software; you can redistribute it and/or modify
       # it under the terms of the GNU General Public License as published by
       # the Free Software Foundation; either version 3 of the License, or (at
       # your option) any later version.
       #
       # This program is distributed in the hope that it will be useful, but
       # WITHOUT ANY WARRANTY; without even the implied warranty of
       # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       # General Public License for more details.
       #
       # You should have received a copy of the GNU General Public License
       # along with this program; if not, write to the Free Software
       # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
       # MA 02110-1301, USA.
      #
      # $Id export2bed.pl rdeborja$
      
      use strict;
      use Carp;
      use Getopt::Long;
      use Pod::Usage;
      use Data::Dumper;
      
      # command line default arguments
      our %opts = (
      	# list of arguments and default values
      	"bed"               => undef,
      	"length"            => 101,
      
      );
      
      our %_exportfield = (
          # list of columns for export file
          # See CASAVA 1.7 User Guide page 61
          "machine"           => 0,
          "run_number"        => 1,
          "lane"              => 2,
          "tile"              => 3,
          "x_coord_cluster"   => 4,
          "y_coord_cluster"   => 5,
          "index_seq"         => 6,
          "read_number"       => 7,
          "sequence"          => 8,
          "quality"           => 9,
          "match_chr"         => 10,
          "match_contig"      => 11,
          "match_position"    => 12,
          "match_strand"      => 13,
          "match_desc"        => 14,
          "single_read_score" => 15,
          "paired_read_score" => 16,
          "partner_chr"       => 17,
          "partner_contig"    => 18,
          "partner_offset"    => 19,
          "partner_strand"    => 20,
          "filtering"         => 21,
      );
      
      my $result = main();
      exit $result;
      
      ##############################
      # Main
      ##############################
      sub main {
      	# get command line arguments
      	GetOptions(
      		\%opts,
      		"help|?",
      		"man",
      		"bed=s",
      		"length:i",
      	) or pod2usage(64);		# 64=>EX_USAGE
      
      	pod2usage(1) if $opts{'help'};
      	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};
      
      	while(my ($arg, $value) = each(%opts)) {
      		if(!defined $value) {
      			print "ERROR: Missing argument $arg\n";
      			pod2usage(128);
      		}
      	}
          
          my $read_length = $opts{'length'};
          my %unique_read_hash;
          my $output_file = $opts{'bed'};
          
          foreach my $file (@ARGV) {
              print "Processing file $file...\n";
              my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
              print "Completed processing file $file\n";
          }
          
          my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);
      
      	return 0;
      }
      
      ##############################
      # Functions
      ##############################
      
      sub process_export_file {
          my $export_file = shift;
          my $read_length = shift;
          my $unique_read_hash = shift;
          my $output_file;
          
          my $output_file = $export_file;
          $output_file =~ s/txt/bed/;
          
          open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
          
          while(my $line = <$ifh>) {
              $line =~ s/^\s+//;
              $line =~ s/\s+$//;
              
              my @input_line = split(/\t/, $line);
              
              my $chr = $input_line[$_exportfield{'match_chr'}];
              my $start = $input_line[$_exportfield{'match_position'}];
              my $end = $start + $read_length - 1;
              my $pf = $input_line[$_exportfield{'filtering'}];
              
              # Skip any reads that do not align, do not uniquely align, or pass 
              # the Illumina filtering
              if (($chr =~ /chr/) && ($pf eq "Y")) {
                  $chr =~ s/\.fa//g;
                  my $bed_line = join("_", $chr, $start, $end);
                  $unique_read_hash->{$bed_line} = 1;
              } else {
                  next;
              }
          }
          close($ifh);
          
          #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);
      
          my $count = scalar(keys(%$unique_read_hash));
          print "Total collapsed unique aligned reads $count\n";
          return 0;
      }
      
      
      sub print_bed_file {
          my $bed_file = shift;
          my $read_hash = shift;
          
          open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
          foreach my $bed_entry (sort(keys(%$read_hash))) {
              my ($chr, $start, $end) = split("_", $bed_entry);
              print {$ofh} "$chr\t$start\t$end\n";
          }
          close($ofh);
      }
      
      __END__
      
      =head1 NAME
      
      export2bed.pl
      
      =head1 SYNOPSIS
      
      B<export2bed.pl> [options] [file ...]
      
      	Options:
      	--help			brief help message
      	--man			full documentation
      	--bed           name of output BED file
      	--length        read length (optional: default 101bp)
      
      =head1 OPTIONS
      
      =over 8
      
      =item B<--help>
      
      Print a brief help message and exit.
      
      =item B<--man>
      
      Print the manual page.
      
      =item B<--bed>
      
      Name of output BED file
      
      =item B<--length>
      
      Read length (optional, default is 101bp)
      
      =back
      
      =head1 DESCRIPTION
      
      B<export2bed.pl> convert an Illumina export.txt file to a BED file
      
      =head1 EXAMPLES
      
      export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt
      
      =head1 AUTHORS
      
      The Ontario Institute for Cancer Research
      
      =head1 SEE ALSO
      
      =cut
      Originally posted by joseph View Post
      Can anybody share an algorithm that takes eland files and make bed files out of them?
      Thanks
      Joseph

      Comment

      • mgogol
        Senior Member
        • Mar 2008
        • 197

        #18
        Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

        Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

        perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

        Code:
        #!/usr/bin/perl
        # Program to convert eland export format to BED format
        # for running MACS
        # Chris Seidel, June 2009
        #
        # Requires tab delim file of chromosome names in the format:
        # chr_name chr_length
        # corrects for alignments that go off the ends of the chrs
        # negative bases are trimmed to 1, 
        # bases > chr_length are set to chr_length
        # (I know the former exist, I don't know if the latter exist)
        # results are not sorted, but can be sorted in linux by:
        # sort -o infile.bed -k1,1 -k2,2n infile.bed
        # (sort in place, first column, then by second column numeric)
        
        use File::Basename;
        
        die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
        
        # get info on chromosomes
        open(cmap, $ARGV[0]) || die("no chromosome length file!");
        %chrmap = {};
        while($line = <cmap>){
            chomp($line);
            ($chrom, $size) = split(/\t/, $line);
            $chrmap{$chrom} = $chrom;
            $chrsize{$chrom} = $size;
        }
        
        # open input file
        open(fp, $ARGV[1]) || die("can't open eland file");
        
        $lines = 0;
        while(<fp>){
            chop;
            ++$lines;
            @bits = split(/\t/);
            # skip reads that didn't pass filtering
            next if($bits[21] eq "N");
            # get match name
            $seqname = $bits[10];
            # skip No Matches or QC failures
            # next if($seqname =~ /NM|QC/);
            # skip repeat matches
            # next if($seqname =~ /\d+:\d+:\d+/);
            # we're only interested in sequences that match our chrs
        
            $chrom = $bits[10];
            $seqlen = length($bits[8]);
            next unless(exists($chrmap{$chrom}));
            $start = $bits[12];
            $end = $start + $seqlen - 1;
            $strand = $bits[13];
            #print "start:$start end:$end strand:$strand\n";
        
            # parse match descriptor
            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
            # skip reads beyond a certain threshold
            next if($n > 2);
            $read_code = "U".$n;
        
            # correct for alignments off the chromosome ends
            if( $start <= 0 ){
           print STDERR "start less than or equal to 0:   ", $start, "\n";
           print STDERR join("\t", @bits), "\n";
           $start = 1;
            }
        
            if($end > $chrsize{$seqname}){
           print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
           print STDERR join("\t", @bits), "\n";
           $end = $chrsize{$seqname};
            }
        
            if($strand eq "F"){
           $strand = "+";
           $color = "0,0,255";
            }
            else{
           $strand = "-";
           $color = "255,0,0";
            }
        
            $score = 1;
            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
            my $chr = $chrmap{$seqname};
            $chr =~ s/\.fa$//g;
            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
        
            # give some feedback
            print STDERR "$lines processed\n" if(!($lines % 100000));
        }

        Comment

        • diablito
          Junior Member
          • Apr 2012
          • 5

          #19
          Xenopus table

          Originally posted by mgogol View Post
          Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

          Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

          perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

          .........

          [/CODE]
          Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line

          Comment

          • diablito
            Junior Member
            • Apr 2012
            • 5

            #20
            Originally posted by apfejes View Post

            The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
            Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((

            Comment

            • mgogol
              Senior Member
              • Mar 2008
              • 197

              #21
              The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

              Comment

              • diablito
                Junior Member
                • Apr 2012
                • 5

                #22
                Originally posted by mgogol View Post
                The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.
                Thank you. I have no Samtools installed, but I found the Samtools:Fastaindex utility at the Genepattern site. So, I am running your script....and getting empty outputs, after quite intensive memory use and normal exit of the script. Any suggestions? Please?

                Comment

                • mgogol
                  Senior Member
                  • Mar 2008
                  • 197

                  #23
                  Hm, I don't know, it's quite likely the eland export format has changed. If you post part of a file somewhere I could try to figure it out.

                  Comment

                  • diablito
                    Junior Member
                    • Apr 2012
                    • 5

                    #24
                    Oh, thanks. On the contrary, it is a pretty old file, circa 2009. I attached the index and a chunk of the export. I'd greatly appreciate, if you can take a look
                    Attached Files

                    Comment

                    • mgogol
                      Senior Member
                      • Mar 2008
                      • 197

                      #25
                      Looks like I'm looking in a different column than your chromosome name seems to be in... Try this:

                      Code:
                      #!/usr/bin/perl
                      # Program to convert eland export format to BED format
                      # for running MACS
                      # Chris Seidel, June 2009
                      #
                      # Requires tab delim file of chromosome names in the format:
                      # chr_name chr_length
                      # corrects for alignments that go off the ends of the chrs
                      # negative bases are trimmed to 1, 
                      # bases > chr_length are set to chr_length
                      # (I know the former exist, I don't know if the latter exist)
                      # results are not sorted, but can be sorted in linux by:
                      # sort -o infile.bed -k1,1 -k2,2n infile.bed
                      # (sort in place, first column, then by second column numeric)
                      
                      use File::Basename;
                      
                      die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
                      
                      # get info on chromosomes
                      open(cmap, $ARGV[0]) || die("no chromosome length file!");
                      %chrmap = {};
                      while($line = <cmap>){
                          chomp($line);
                          ($chrom, $size) = split(/\t/, $line);
                          $chrmap{$chrom} = $chrom;
                          $chrsize{$chrom} = $size;
                      }
                      
                      # open input file
                      open(fp, $ARGV[1]) || die("can't open eland file");
                      
                      $lines = 0;
                      while(<fp>){
                          chop;
                          ++$lines;
                          @bits = split(/\t/);
                          # skip reads that didn't pass filtering
                          next if($bits[21] eq "N");
                          # get match name
                          $seqname = $bits[11];
                          # skip No Matches or QC failures
                          # next if($seqname =~ /NM|QC/);
                          # skip repeat matches
                          # next if($seqname =~ /\d+:\d+:\d+/);
                          # we're only interested in sequences that match our chrs
                      
                          $chrom = $bits[11];
                          $seqlen = length($bits[8]);
                          next unless(exists($chrmap{$chrom}));
                          $start = $bits[12];
                          $end = $start + $seqlen - 1;
                          $strand = $bits[13];
                          #print "start:$start end:$end strand:$strand\n";
                      
                          # parse match descriptor
                          $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                          # skip reads beyond a certain threshold
                          next if($n > 2);
                          $read_code = "U".$n;
                      
                          # correct for alignments off the chromosome ends
                          if( $start <= 0 ){
                         print STDERR "start less than or equal to 0:   ", $start, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $start = 1;
                          }
                      
                          if($end > $chrsize{$seqname}){
                         print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $end = $chrsize{$seqname};
                          }
                      
                          if($strand eq "F"){
                         $strand = "+";
                         $color = "0,0,255";
                          }
                          else{
                         $strand = "-";
                         $color = "255,0,0";
                          }
                      
                          $score = 1;
                          $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                          my $chr = $chrmap{$seqname};
                          $chr =~ s/\.fa$//g;
                          print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                      
                          # give some feedback
                          print STDERR "$lines processed\n" if(!($lines % 100000));
                      }

                      Comment

                      • diablito
                        Junior Member
                        • Apr 2012
                        • 5

                        #26
                        It works now! Thank you. I should have figured it out myself, of course....

                        Comment

                        • mgogol
                          Senior Member
                          • Mar 2008
                          • 197

                          #27
                          Yay! That's okay. Glad it works now.

                          Comment

                          • zzz2010
                            Junior Member
                            • Mar 2010
                            • 1

                            #28
                            #!/usr/bin/perl
                            #modified by zzz2010, no need chromosome length file
                            # Program to convert eland export format to BED format
                            # for running MACS
                            # Chris Seidel, June 2009
                            #
                            # Requires tab delim file of chromosome names in the format:
                            # chr_name chr_length
                            # corrects for alignments that go off the ends of the chrs
                            # negative bases are trimmed to 1,
                            # bases > chr_length are set to chr_length
                            # (I know the former exist, I don't know if the latter exist)
                            # results are not sorted, but can be sorted in linux by:
                            # sort -o infile.bed -k1,1 -k2,2n infile.bed
                            # (sort in place, first column, then by second column numeric)

                            use File::Basename;

                            die("usage: $0 eland_export.txt") unless(scalar(@ARGV) == 1);



                            # open input file
                            open(fp, $ARGV[0]) || die("can't open eland file");

                            $lines = 0;
                            while(<fp>){
                            chop;
                            ++$lines;
                            @bits = split(/\t/);
                            # skip reads that didn't pass filtering
                            next if($bits[21] eq "N");
                            # get match name
                            $seqname = $bits[10];
                            # skip No Matches or QC failures
                            # next if($seqname =~ /NM|QC/);
                            # skip repeat matches
                            # next if($seqname =~ /\d+:\d+:\d+/);
                            # we're only interested in sequences that match our chrs

                            $chrom = $bits[10];
                            $seqlen = length($bits[8]);
                            $start = $bits[12];
                            $end = $start + $seqlen - 1;
                            $strand = $bits[13];
                            #print "start:$start end:$end strand:$strand\n";

                            # parse match descriptor
                            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                            # skip reads beyond a certain threshold
                            next if($n > 2);
                            $read_code = "U".$n;

                            # correct for alignments off the chromosome ends
                            if( $start <= 0 ){
                            next;
                            }



                            if($strand eq "F"){
                            $strand = "+";
                            $color = "0,0,255";
                            }
                            else{
                            $strand = "-";
                            $color = "255,0,0";
                            }

                            $score = 1;
                            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                            my $chr = $seqname;
                            $chr =~ s/\.fa$//g;
                            $chr =~ s/[\S]+[Cc]hr/chr/g;
                            if (index($chr, "chr") != -1)
                            {
                            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                            }
                            # give some feedback
                            print STDERR "$lines processed\n" if(!($lines % 100000));
                            }

                            Comment

                            Latest Articles

                            Collapse

                            • SEQadmin2
                              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                              by SEQadmin2


                              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                              ...
                              06-02-2026, 10:05 AM
                            • SEQadmin2
                              Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                              by SEQadmin2


                              With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                              Introduction

                              Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                              05-22-2026, 06:42 AM
                            • SEQadmin2
                              Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                              by SEQadmin2

                              Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                              Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                              05-06-2026, 09:04 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, Yesterday, 08:59 AM
                            0 responses
                            14 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 12:03 PM
                            0 responses
                            22 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 11:40 AM
                            0 responses
                            19 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 05-28-2026, 11:40 AM
                            0 responses
                            32 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...