Unconfigured Ad

**apfejes** · 06-10-2011, 01:02 AM

Thanks ECO.. it was original. (=

In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.

**rdeborja** · 06-10-2011, 05:12 PM

Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

Here's some sample usage:

rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
Processing file s_8_1_export.txt...
Total collapsed unique aligned reads 656
Completed processing file s_8_1_export.txt

Hope it helps!

Code:

#!/usr/bin/perl

#
 # Created by Richard de Borja on 2011-06-10.
 # Copyright (C) 2011 The Ontario Institute for Cancer Research
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation; either version 3 of the License, or (at
 # your option) any later version.
 #
 # This program is distributed in the hope that it will be useful, but
 # WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 # General Public License for more details.
 #
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
 # MA 02110-1301, USA.
#
# $Id export2bed.pl rdeborja$

use strict;
use Carp;
use Getopt::Long;
use Pod::Usage;
use Data::Dumper;

# command line default arguments
our %opts = (
	# list of arguments and default values
	"bed"               => undef,
	"length"            => 101,

);

our %_exportfield = (
    # list of columns for export file
    # See CASAVA 1.7 User Guide page 61
    "machine"           => 0,
    "run_number"        => 1,
    "lane"              => 2,
    "tile"              => 3,
    "x_coord_cluster"   => 4,
    "y_coord_cluster"   => 5,
    "index_seq"         => 6,
    "read_number"       => 7,
    "sequence"          => 8,
    "quality"           => 9,
    "match_chr"         => 10,
    "match_contig"      => 11,
    "match_position"    => 12,
    "match_strand"      => 13,
    "match_desc"        => 14,
    "single_read_score" => 15,
    "paired_read_score" => 16,
    "partner_chr"       => 17,
    "partner_contig"    => 18,
    "partner_offset"    => 19,
    "partner_strand"    => 20,
    "filtering"         => 21,
);

my $result = main();
exit $result;

##############################
# Main
##############################
sub main {
	# get command line arguments
	GetOptions(
		\%opts,
		"help|?",
		"man",
		"bed=s",
		"length:i",
	) or pod2usage(64);		# 64=>EX_USAGE

	pod2usage(1) if $opts{'help'};
	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};

	while(my ($arg, $value) = each(%opts)) {
		if(!defined $value) {
			print "ERROR: Missing argument $arg\n";
			pod2usage(128);
		}
	}
    
    my $read_length = $opts{'length'};
    my %unique_read_hash;
    my $output_file = $opts{'bed'};
    
    foreach my $file (@ARGV) {
        print "Processing file $file...\n";
        my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
        print "Completed processing file $file\n";
    }
    
    my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);

	return 0;
}

##############################
# Functions
##############################

sub process_export_file {
    my $export_file = shift;
    my $read_length = shift;
    my $unique_read_hash = shift;
    my $output_file;
    
    my $output_file = $export_file;
    $output_file =~ s/txt/bed/;
    
    open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
    
    while(my $line = <$ifh>) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        
        my @input_line = split(/\t/, $line);
        
        my $chr = $input_line[$_exportfield{'match_chr'}];
        my $start = $input_line[$_exportfield{'match_position'}];
        my $end = $start + $read_length - 1;
        my $pf = $input_line[$_exportfield{'filtering'}];
        
        # Skip any reads that do not align, do not uniquely align, or pass 
        # the Illumina filtering
        if (($chr =~ /chr/) && ($pf eq "Y")) {
            $chr =~ s/\.fa//g;
            my $bed_line = join("_", $chr, $start, $end);
            $unique_read_hash->{$bed_line} = 1;
        } else {
            next;
        }
    }
    close($ifh);
    
    #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);

    my $count = scalar(keys(%$unique_read_hash));
    print "Total collapsed unique aligned reads $count\n";
    return 0;
}


sub print_bed_file {
    my $bed_file = shift;
    my $read_hash = shift;
    
    open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
    foreach my $bed_entry (sort(keys(%$read_hash))) {
        my ($chr, $start, $end) = split("_", $bed_entry);
        print {$ofh} "$chr\t$start\t$end\n";
    }
    close($ofh);
}

__END__

=head1 NAME

export2bed.pl

=head1 SYNOPSIS

B<export2bed.pl> [options] [file ...]

	Options:
	--help			brief help message
	--man			full documentation
	--bed           name of output BED file
	--length        read length (optional: default 101bp)

=head1 OPTIONS

=over 8

=item B<--help>

Print a brief help message and exit.

=item B<--man>

Print the manual page.

=item B<--bed>

Name of output BED file

=item B<--length>

Read length (optional, default is 101bp)

=back

=head1 DESCRIPTION

B<export2bed.pl> convert an Illumina export.txt file to a BED file

=head1 EXAMPLES

export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt

=head1 AUTHORS

The Ontario Institute for Cancer Research

=head1 SEE ALSO

=cut

Originally posted by joseph View Post

Can anybody share an algorithm that takes eland files and make bed files out of them?
Thanks
Joseph

**mgogol** · 07-01-2011, 07:03 AM

Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

Of course, now you're getting a script hacked away at by two people....

You could try other solutions suggested above.

perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

Code:

#!/usr/bin/perl
# Program to convert eland export format to BED format
# for running MACS
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome names in the format:
# chr_name chr_length
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1, 
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k1,1 -k2,2n infile.bed
# (sort in place, first column, then by second column numeric)

use File::Basename;

die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);

# get info on chromosomes
open(cmap, $ARGV[0]) || die("no chromosome length file!");
%chrmap = {};
while($line = <cmap>){
    chomp($line);
    ($chrom, $size) = split(/\t/, $line);
    $chrmap{$chrom} = $chrom;
    $chrsize{$chrom} = $size;
}

# open input file
open(fp, $ARGV[1]) || die("can't open eland file");

$lines = 0;
while(<fp>){
    chop;
    ++$lines;
    @bits = split(/\t/);
    # skip reads that didn't pass filtering
    next if($bits[21] eq "N");
    # get match name
    $seqname = $bits[10];
    # skip No Matches or QC failures
    # next if($seqname =~ /NM|QC/);
    # skip repeat matches
    # next if($seqname =~ /\d+:\d+:\d+/);
    # we're only interested in sequences that match our chrs

    $chrom = $bits[10];
    $seqlen = length($bits[8]);
    next unless(exists($chrmap{$chrom}));
    $start = $bits[12];
    $end = $start + $seqlen - 1;
    $strand = $bits[13];
    #print "start:$start end:$end strand:$strand\n";

    # parse match descriptor
    $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
    # skip reads beyond a certain threshold
    next if($n > 2);
    $read_code = "U".$n;

    # correct for alignments off the chromosome ends
    if( $start <= 0 ){
   print STDERR "start less than or equal to 0:   ", $start, "\n";
   print STDERR join("\t", @bits), "\n";
   $start = 1;
    }

    if($end > $chrsize{$seqname}){
   print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
   print STDERR join("\t", @bits), "\n";
   $end = $chrsize{$seqname};
    }

    if($strand eq "F"){
   $strand = "+";
   $color = "0,0,255";
    }
    else{
   $strand = "-";
   $color = "255,0,0";
    }

    $score = 1;
    $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
    my $chr = $chrmap{$seqname};
    $chr =~ s/\.fa$//g;
    print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";

    # give some feedback
    print STDERR "$lines processed\n" if(!($lines % 100000));
}

**diablito** · 04-18-2012, 02:04 AM

Xenopus table

Originally posted by mgogol View Post

Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

Of course, now you're getting a script hacked away at by two people....

You could try other solutions suggested above.

perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

.........

[/CODE]

Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line

**diablito** · 04-18-2012, 02:09 AM

Originally posted by apfejes View Post

The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.

Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((

**mgogol** · 05-03-2012, 10:26 AM

The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

**diablito** · 05-06-2012, 10:41 PM

Originally posted by mgogol View Post

The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

Thank you. I have no Samtools installed, but I found the Samtools:Fastaindex utility at the Genepattern site. So, I am running your script....and getting empty outputs, after quite intensive memory use and normal exit of the script. Any suggestions? Please?

**mgogol** · 05-07-2012, 05:23 AM

Hm, I don't know, it's quite likely the eland export format has changed. If you post part of a file somewhere I could try to figure it out.

**diablito** · 05-07-2012, 05:33 PM

Oh, thanks. On the contrary, it is a pretty old file, circa 2009. I attached the index and a chunk of the export. I'd greatly appreciate, if you can take a look

Attached Files

**mgogol** · 05-08-2012, 09:23 AM

Looks like I'm looking in a different column than your chromosome name seems to be in... Try this:

Code:

#!/usr/bin/perl
# Program to convert eland export format to BED format
# for running MACS
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome names in the format:
# chr_name chr_length
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1, 
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k1,1 -k2,2n infile.bed
# (sort in place, first column, then by second column numeric)

use File::Basename;

die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);

# get info on chromosomes
open(cmap, $ARGV[0]) || die("no chromosome length file!");
%chrmap = {};
while($line = <cmap>){
    chomp($line);
    ($chrom, $size) = split(/\t/, $line);
    $chrmap{$chrom} = $chrom;
    $chrsize{$chrom} = $size;
}

# open input file
open(fp, $ARGV[1]) || die("can't open eland file");

$lines = 0;
while(<fp>){
    chop;
    ++$lines;
    @bits = split(/\t/);
    # skip reads that didn't pass filtering
    next if($bits[21] eq "N");
    # get match name
    $seqname = $bits[11];
    # skip No Matches or QC failures
    # next if($seqname =~ /NM|QC/);
    # skip repeat matches
    # next if($seqname =~ /\d+:\d+:\d+/);
    # we're only interested in sequences that match our chrs

    $chrom = $bits[11];
    $seqlen = length($bits[8]);
    next unless(exists($chrmap{$chrom}));
    $start = $bits[12];
    $end = $start + $seqlen - 1;
    $strand = $bits[13];
    #print "start:$start end:$end strand:$strand\n";

    # parse match descriptor
    $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
    # skip reads beyond a certain threshold
    next if($n > 2);
    $read_code = "U".$n;

    # correct for alignments off the chromosome ends
    if( $start <= 0 ){
   print STDERR "start less than or equal to 0:   ", $start, "\n";
   print STDERR join("\t", @bits), "\n";
   $start = 1;
    }

    if($end > $chrsize{$seqname}){
   print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
   print STDERR join("\t", @bits), "\n";
   $end = $chrsize{$seqname};
    }

    if($strand eq "F"){
   $strand = "+";
   $color = "0,0,255";
    }
    else{
   $strand = "-";
   $color = "255,0,0";
    }

    $score = 1;
    $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
    my $chr = $chrmap{$seqname};
    $chr =~ s/\.fa$//g;
    print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";

    # give some feedback
    print STDERR "$lines processed\n" if(!($lines % 100000));
}

**diablito** · 05-08-2012, 10:20 PM

It works now! Thank you. I should have figured it out myself, of course....

**mgogol** · 05-09-2012, 05:32 AM

Yay! That's okay. Glad it works now.

**zzz2010** · 11-19-2012, 07:12 PM

#!/usr/bin/perl
#modified by zzz2010, no need chromosome length file
# Program to convert eland export format to BED format
# for running MACS
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome names in the format:
# chr_name chr_length
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1,
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k1,1 -k2,2n infile.bed
# (sort in place, first column, then by second column numeric)

use File::Basename;

die("usage: $0 eland_export.txt") unless(scalar(@ARGV) == 1);

# open input file
open(fp, $ARGV[0]) || die("can't open eland file");

$lines = 0;
while(<fp>){
chop;
++$lines;
@bits = split(/\t/);
# skip reads that didn't pass filtering
next if($bits[21] eq "N");
# get match name
$seqname = $bits[10];
# skip No Matches or QC failures
# next if($seqname =~ /NM|QC/);
# skip repeat matches
# next if($seqname =~ /\d+:\d+:\d+/);
# we're only interested in sequences that match our chrs

$chrom = $bits[10];
$seqlen = length($bits[8]);
$start = $bits[12];
$end = $start + $seqlen - 1;
$strand = $bits[13];
#print "start:$start end:$end strand:$strand\n";

# parse match descriptor
$n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
# skip reads beyond a certain threshold
next if($n > 2);
$read_code = "U".$n;

# correct for alignments off the chromosome ends
if( $start <= 0 ){
next;
}

if($strand eq "F"){
$strand = "+";
$color = "0,0,255";
}
else{
$strand = "-";
$color = "255,0,0";
}

$score = 1;
$name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
my $chr = $seqname;
$chr =~ s/\.fa$//g;
$chr =~ s/[\S]+[Cc]hr/chr/g;
if (index($chr, "chr") != -1)
{
print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
}
# give some feedback
print STDERR "$lines processed\n" if(!($lines % 100000));
}

Topics	Statistics	Last Post
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 45 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 105 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 125 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News