Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • apfejes
    Senior Member
    • Feb 2008
    • 236

    #16
    Thanks ECO.. it was original. (=

    In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.
    The more you know, the more you know you don't know. —Aristotle

    Comment

    • rdeborja
      Member
      • Aug 2008
      • 42

      #17
      Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

      Here's some sample usage:

      rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
      Processing file s_8_1_export.txt...
      Total collapsed unique aligned reads 656
      Completed processing file s_8_1_export.txt


      Hope it helps!

      Code:
      #!/usr/bin/perl
      
      #
       # Created by Richard de Borja on 2011-06-10.
       # Copyright (C) 2011 The Ontario Institute for Cancer Research
       #
       # This program is free software; you can redistribute it and/or modify
       # it under the terms of the GNU General Public License as published by
       # the Free Software Foundation; either version 3 of the License, or (at
       # your option) any later version.
       #
       # This program is distributed in the hope that it will be useful, but
       # WITHOUT ANY WARRANTY; without even the implied warranty of
       # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       # General Public License for more details.
       #
       # You should have received a copy of the GNU General Public License
       # along with this program; if not, write to the Free Software
       # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
       # MA 02110-1301, USA.
      #
      # $Id export2bed.pl rdeborja$
      
      use strict;
      use Carp;
      use Getopt::Long;
      use Pod::Usage;
      use Data::Dumper;
      
      # command line default arguments
      our %opts = (
      	# list of arguments and default values
      	"bed"               => undef,
      	"length"            => 101,
      
      );
      
      our %_exportfield = (
          # list of columns for export file
          # See CASAVA 1.7 User Guide page 61
          "machine"           => 0,
          "run_number"        => 1,
          "lane"              => 2,
          "tile"              => 3,
          "x_coord_cluster"   => 4,
          "y_coord_cluster"   => 5,
          "index_seq"         => 6,
          "read_number"       => 7,
          "sequence"          => 8,
          "quality"           => 9,
          "match_chr"         => 10,
          "match_contig"      => 11,
          "match_position"    => 12,
          "match_strand"      => 13,
          "match_desc"        => 14,
          "single_read_score" => 15,
          "paired_read_score" => 16,
          "partner_chr"       => 17,
          "partner_contig"    => 18,
          "partner_offset"    => 19,
          "partner_strand"    => 20,
          "filtering"         => 21,
      );
      
      my $result = main();
      exit $result;
      
      ##############################
      # Main
      ##############################
      sub main {
      	# get command line arguments
      	GetOptions(
      		\%opts,
      		"help|?",
      		"man",
      		"bed=s",
      		"length:i",
      	) or pod2usage(64);		# 64=>EX_USAGE
      
      	pod2usage(1) if $opts{'help'};
      	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};
      
      	while(my ($arg, $value) = each(%opts)) {
      		if(!defined $value) {
      			print "ERROR: Missing argument $arg\n";
      			pod2usage(128);
      		}
      	}
          
          my $read_length = $opts{'length'};
          my %unique_read_hash;
          my $output_file = $opts{'bed'};
          
          foreach my $file (@ARGV) {
              print "Processing file $file...\n";
              my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
              print "Completed processing file $file\n";
          }
          
          my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);
      
      	return 0;
      }
      
      ##############################
      # Functions
      ##############################
      
      sub process_export_file {
          my $export_file = shift;
          my $read_length = shift;
          my $unique_read_hash = shift;
          my $output_file;
          
          my $output_file = $export_file;
          $output_file =~ s/txt/bed/;
          
          open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
          
          while(my $line = <$ifh>) {
              $line =~ s/^\s+//;
              $line =~ s/\s+$//;
              
              my @input_line = split(/\t/, $line);
              
              my $chr = $input_line[$_exportfield{'match_chr'}];
              my $start = $input_line[$_exportfield{'match_position'}];
              my $end = $start + $read_length - 1;
              my $pf = $input_line[$_exportfield{'filtering'}];
              
              # Skip any reads that do not align, do not uniquely align, or pass 
              # the Illumina filtering
              if (($chr =~ /chr/) && ($pf eq "Y")) {
                  $chr =~ s/\.fa//g;
                  my $bed_line = join("_", $chr, $start, $end);
                  $unique_read_hash->{$bed_line} = 1;
              } else {
                  next;
              }
          }
          close($ifh);
          
          #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);
      
          my $count = scalar(keys(%$unique_read_hash));
          print "Total collapsed unique aligned reads $count\n";
          return 0;
      }
      
      
      sub print_bed_file {
          my $bed_file = shift;
          my $read_hash = shift;
          
          open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
          foreach my $bed_entry (sort(keys(%$read_hash))) {
              my ($chr, $start, $end) = split("_", $bed_entry);
              print {$ofh} "$chr\t$start\t$end\n";
          }
          close($ofh);
      }
      
      __END__
      
      =head1 NAME
      
      export2bed.pl
      
      =head1 SYNOPSIS
      
      B<export2bed.pl> [options] [file ...]
      
      	Options:
      	--help			brief help message
      	--man			full documentation
      	--bed           name of output BED file
      	--length        read length (optional: default 101bp)
      
      =head1 OPTIONS
      
      =over 8
      
      =item B<--help>
      
      Print a brief help message and exit.
      
      =item B<--man>
      
      Print the manual page.
      
      =item B<--bed>
      
      Name of output BED file
      
      =item B<--length>
      
      Read length (optional, default is 101bp)
      
      =back
      
      =head1 DESCRIPTION
      
      B<export2bed.pl> convert an Illumina export.txt file to a BED file
      
      =head1 EXAMPLES
      
      export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt
      
      =head1 AUTHORS
      
      The Ontario Institute for Cancer Research
      
      =head1 SEE ALSO
      
      =cut
      Originally posted by joseph View Post
      Can anybody share an algorithm that takes eland files and make bed files out of them?
      Thanks
      Joseph

      Comment

      • mgogol
        Senior Member
        • Mar 2008
        • 197

        #18
        Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

        Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

        perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

        Code:
        #!/usr/bin/perl
        # Program to convert eland export format to BED format
        # for running MACS
        # Chris Seidel, June 2009
        #
        # Requires tab delim file of chromosome names in the format:
        # chr_name chr_length
        # corrects for alignments that go off the ends of the chrs
        # negative bases are trimmed to 1, 
        # bases > chr_length are set to chr_length
        # (I know the former exist, I don't know if the latter exist)
        # results are not sorted, but can be sorted in linux by:
        # sort -o infile.bed -k1,1 -k2,2n infile.bed
        # (sort in place, first column, then by second column numeric)
        
        use File::Basename;
        
        die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
        
        # get info on chromosomes
        open(cmap, $ARGV[0]) || die("no chromosome length file!");
        %chrmap = {};
        while($line = <cmap>){
            chomp($line);
            ($chrom, $size) = split(/\t/, $line);
            $chrmap{$chrom} = $chrom;
            $chrsize{$chrom} = $size;
        }
        
        # open input file
        open(fp, $ARGV[1]) || die("can't open eland file");
        
        $lines = 0;
        while(<fp>){
            chop;
            ++$lines;
            @bits = split(/\t/);
            # skip reads that didn't pass filtering
            next if($bits[21] eq "N");
            # get match name
            $seqname = $bits[10];
            # skip No Matches or QC failures
            # next if($seqname =~ /NM|QC/);
            # skip repeat matches
            # next if($seqname =~ /\d+:\d+:\d+/);
            # we're only interested in sequences that match our chrs
        
            $chrom = $bits[10];
            $seqlen = length($bits[8]);
            next unless(exists($chrmap{$chrom}));
            $start = $bits[12];
            $end = $start + $seqlen - 1;
            $strand = $bits[13];
            #print "start:$start end:$end strand:$strand\n";
        
            # parse match descriptor
            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
            # skip reads beyond a certain threshold
            next if($n > 2);
            $read_code = "U".$n;
        
            # correct for alignments off the chromosome ends
            if( $start <= 0 ){
           print STDERR "start less than or equal to 0:   ", $start, "\n";
           print STDERR join("\t", @bits), "\n";
           $start = 1;
            }
        
            if($end > $chrsize{$seqname}){
           print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
           print STDERR join("\t", @bits), "\n";
           $end = $chrsize{$seqname};
            }
        
            if($strand eq "F"){
           $strand = "+";
           $color = "0,0,255";
            }
            else{
           $strand = "-";
           $color = "255,0,0";
            }
        
            $score = 1;
            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
            my $chr = $chrmap{$seqname};
            $chr =~ s/\.fa$//g;
            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
        
            # give some feedback
            print STDERR "$lines processed\n" if(!($lines % 100000));
        }

        Comment

        • diablito
          Junior Member
          • Apr 2012
          • 5

          #19
          Xenopus table

          Originally posted by mgogol View Post
          Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

          Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

          perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

          .........

          [/CODE]
          Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line

          Comment

          • diablito
            Junior Member
            • Apr 2012
            • 5

            #20
            Originally posted by apfejes View Post

            The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
            Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((

            Comment

            • mgogol
              Senior Member
              • Mar 2008
              • 197

              #21
              The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

              Comment

              • diablito
                Junior Member
                • Apr 2012
                • 5

                #22
                Originally posted by mgogol View Post
                The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.
                Thank you. I have no Samtools installed, but I found the Samtools:Fastaindex utility at the Genepattern site. So, I am running your script....and getting empty outputs, after quite intensive memory use and normal exit of the script. Any suggestions? Please?

                Comment

                • mgogol
                  Senior Member
                  • Mar 2008
                  • 197

                  #23
                  Hm, I don't know, it's quite likely the eland export format has changed. If you post part of a file somewhere I could try to figure it out.

                  Comment

                  • diablito
                    Junior Member
                    • Apr 2012
                    • 5

                    #24
                    Oh, thanks. On the contrary, it is a pretty old file, circa 2009. I attached the index and a chunk of the export. I'd greatly appreciate, if you can take a look
                    Attached Files

                    Comment

                    • mgogol
                      Senior Member
                      • Mar 2008
                      • 197

                      #25
                      Looks like I'm looking in a different column than your chromosome name seems to be in... Try this:

                      Code:
                      #!/usr/bin/perl
                      # Program to convert eland export format to BED format
                      # for running MACS
                      # Chris Seidel, June 2009
                      #
                      # Requires tab delim file of chromosome names in the format:
                      # chr_name chr_length
                      # corrects for alignments that go off the ends of the chrs
                      # negative bases are trimmed to 1, 
                      # bases > chr_length are set to chr_length
                      # (I know the former exist, I don't know if the latter exist)
                      # results are not sorted, but can be sorted in linux by:
                      # sort -o infile.bed -k1,1 -k2,2n infile.bed
                      # (sort in place, first column, then by second column numeric)
                      
                      use File::Basename;
                      
                      die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
                      
                      # get info on chromosomes
                      open(cmap, $ARGV[0]) || die("no chromosome length file!");
                      %chrmap = {};
                      while($line = <cmap>){
                          chomp($line);
                          ($chrom, $size) = split(/\t/, $line);
                          $chrmap{$chrom} = $chrom;
                          $chrsize{$chrom} = $size;
                      }
                      
                      # open input file
                      open(fp, $ARGV[1]) || die("can't open eland file");
                      
                      $lines = 0;
                      while(<fp>){
                          chop;
                          ++$lines;
                          @bits = split(/\t/);
                          # skip reads that didn't pass filtering
                          next if($bits[21] eq "N");
                          # get match name
                          $seqname = $bits[11];
                          # skip No Matches or QC failures
                          # next if($seqname =~ /NM|QC/);
                          # skip repeat matches
                          # next if($seqname =~ /\d+:\d+:\d+/);
                          # we're only interested in sequences that match our chrs
                      
                          $chrom = $bits[11];
                          $seqlen = length($bits[8]);
                          next unless(exists($chrmap{$chrom}));
                          $start = $bits[12];
                          $end = $start + $seqlen - 1;
                          $strand = $bits[13];
                          #print "start:$start end:$end strand:$strand\n";
                      
                          # parse match descriptor
                          $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                          # skip reads beyond a certain threshold
                          next if($n > 2);
                          $read_code = "U".$n;
                      
                          # correct for alignments off the chromosome ends
                          if( $start <= 0 ){
                         print STDERR "start less than or equal to 0:   ", $start, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $start = 1;
                          }
                      
                          if($end > $chrsize{$seqname}){
                         print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $end = $chrsize{$seqname};
                          }
                      
                          if($strand eq "F"){
                         $strand = "+";
                         $color = "0,0,255";
                          }
                          else{
                         $strand = "-";
                         $color = "255,0,0";
                          }
                      
                          $score = 1;
                          $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                          my $chr = $chrmap{$seqname};
                          $chr =~ s/\.fa$//g;
                          print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                      
                          # give some feedback
                          print STDERR "$lines processed\n" if(!($lines % 100000));
                      }

                      Comment

                      • diablito
                        Junior Member
                        • Apr 2012
                        • 5

                        #26
                        It works now! Thank you. I should have figured it out myself, of course....

                        Comment

                        • mgogol
                          Senior Member
                          • Mar 2008
                          • 197

                          #27
                          Yay! That's okay. Glad it works now.

                          Comment

                          • zzz2010
                            Junior Member
                            • Mar 2010
                            • 1

                            #28
                            #!/usr/bin/perl
                            #modified by zzz2010, no need chromosome length file
                            # Program to convert eland export format to BED format
                            # for running MACS
                            # Chris Seidel, June 2009
                            #
                            # Requires tab delim file of chromosome names in the format:
                            # chr_name chr_length
                            # corrects for alignments that go off the ends of the chrs
                            # negative bases are trimmed to 1,
                            # bases > chr_length are set to chr_length
                            # (I know the former exist, I don't know if the latter exist)
                            # results are not sorted, but can be sorted in linux by:
                            # sort -o infile.bed -k1,1 -k2,2n infile.bed
                            # (sort in place, first column, then by second column numeric)

                            use File::Basename;

                            die("usage: $0 eland_export.txt") unless(scalar(@ARGV) == 1);



                            # open input file
                            open(fp, $ARGV[0]) || die("can't open eland file");

                            $lines = 0;
                            while(<fp>){
                            chop;
                            ++$lines;
                            @bits = split(/\t/);
                            # skip reads that didn't pass filtering
                            next if($bits[21] eq "N");
                            # get match name
                            $seqname = $bits[10];
                            # skip No Matches or QC failures
                            # next if($seqname =~ /NM|QC/);
                            # skip repeat matches
                            # next if($seqname =~ /\d+:\d+:\d+/);
                            # we're only interested in sequences that match our chrs

                            $chrom = $bits[10];
                            $seqlen = length($bits[8]);
                            $start = $bits[12];
                            $end = $start + $seqlen - 1;
                            $strand = $bits[13];
                            #print "start:$start end:$end strand:$strand\n";

                            # parse match descriptor
                            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                            # skip reads beyond a certain threshold
                            next if($n > 2);
                            $read_code = "U".$n;

                            # correct for alignments off the chromosome ends
                            if( $start <= 0 ){
                            next;
                            }



                            if($strand eq "F"){
                            $strand = "+";
                            $color = "0,0,255";
                            }
                            else{
                            $strand = "-";
                            $color = "255,0,0";
                            }

                            $score = 1;
                            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                            my $chr = $seqname;
                            $chr =~ s/\.fa$//g;
                            $chr =~ s/[\S]+[Cc]hr/chr/g;
                            if (index($chr, "chr") != -1)
                            {
                            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                            }
                            # give some feedback
                            print STDERR "$lines processed\n" if(!($lines % 100000));
                            }

                            Comment

                            Latest Articles

                            Collapse

                            • SEQadmin2
                              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                              by SEQadmin2


                              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                              Here are nine questions we think about, in roughly the order they matter, before...
                              06-18-2026, 07:11 AM
                            • SEQadmin2
                              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                              by SEQadmin2


                              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                              ...
                              06-02-2026, 10:05 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, 06-26-2026, 11:10 AM
                            0 responses
                            11 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-17-2026, 06:09 AM
                            0 responses
                            45 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-09-2026, 11:58 AM
                            0 responses
                            105 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-05-2026, 10:09 AM
                            0 responses
                            125 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...