**poisson200** · 08-02-2010, 11:08 AM

If Perl is acceptable, this should work:

#!/usr/bin/perl
use strict;
use warnings;

# Merge two "name value" files into one table; '-' marks a missing value.
die "Usage: $0 file_one file_two\n" unless @ARGV == 2;
my ($file_one, $file_two) = @ARGV;

my $data = {};

read_file_fill_hash($file_one, 'first',  $data);
read_file_fill_hash($file_two, 'second', $data);

print_data($data);

# Print one row per miRNA, sorted by name.
sub print_data {
    my $hash = shift;
    print "miRNA\t721Es\t162Es\n";
    foreach my $mirna (sort keys %{$hash}) {
        print "$mirna\t$hash->{$mirna}{first}\t$hash->{$mirna}{second}\n";
    }
}

# Read "name value" lines from $file into the 'first' or 'second' slot.
sub read_file_fill_hash {
    my ($file, $which, $reference) = @_;
    open(my $han, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$han>) {
        my ($mirna, $result) = split ' ', $line;
        next unless defined $result;    # skip blank or one-column lines
        if ($which eq 'first') {
            $reference->{$mirna}{first}  = $result;
            $reference->{$mirna}{second} = '-';
        } else {
            $reference->{$mirna}{first} = '-' if !exists $reference->{$mirna}{first};
            $reference->{$mirna}{second} = $result;
        }
    }
    close $han;
}
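Since the thread title asks about awk, here is a rough awk sketch of the same merge. It assumes each file has the name in column 1 and the value in column 2; the file names and sample rows are made up for illustration. Like the Perl version, it does not need sorted input.

```shell
#!/bin/sh
# Hypothetical sample inputs: name<TAB>count, one miRNA per line.
tmp=$(mktemp -d)
printf 'miR-1\t10\nmiR-2\t5\n' > "$tmp/721Es.txt"
printf 'miR-2\t7\nmiR-3\t2\n'  > "$tmp/162Es.txt"

merged=$(awk -v OFS='\t' '
    FNR == 1   { nfile++ }               # a new input file has started
    nfile == 1 { first[$1]  = $2 }       # values from the first file
    nfile == 2 { second[$1] = $2 }       # values from the second file
               { seen[$1] = 1 }          # every name seen in either file
    END {
        for (k in seen)
            print k, ((k in first) ? first[k] : "-"), ((k in second) ? second[k] : "-")
    }
' "$tmp/721Es.txt" "$tmp/162Es.txt" | sort)

printf 'miRNA\t721Es\t162Es\n%s\n' "$merged"
rm -rf "$tmp"
```

One caveat: the file counter only advances when a file has at least one line, so an empty first file would be mistaken for the second.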

**adamdeluca** · 08-02-2010, 12:28 PM

Or you can use join:

Code:

join -a1 -a2 -o 0,1.2,2.2 file1 file2

Pipe the output into tr and sed if you really want dashes in place of the blanks:

Code:

tr " " "\t" |sed "s/\t\t/\t-\t/" | sed "s/\t$/\t-/"
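As an alternative to the sed step, join has a standard `-e` option that fills missing output fields itself when an explicit `-o` list is given. A small sketch, using made-up, already sorted sample files:

```shell
#!/bin/sh
# Hypothetical inputs, already sorted on the first field.
tmp=$(mktemp -d)
printf 'miR-1 10\nmiR-2 5\n' > "$tmp/file1"
printf 'miR-2 7\nmiR-3 2\n'  > "$tmp/file2"

# -a1 -a2 keep unpaired lines from both files; -e '-' fills the gaps.
joined=$(LC_ALL=C join -a1 -a2 -e '-' -o 0,1.2,2.2 "$tmp/file1" "$tmp/file2" | tr ' ' '\t')

printf '%s\n' "$joined"
rm -rf "$tmp"
```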

**poisson200** · 08-02-2010, 12:31 PM

Wow, that is cool

**DrD2009** · 08-02-2010, 02:29 PM

Thank you both.

@adamdeluca:

Code:

join 162Es/162Es.dsap.rfam.txt 721Es/721Es.dsap.rfam.txt -a1 -a2 -o 0 -o1.2 -o2.2 > 721Es.172Es.rfam
join: file 1 is not in sorted order
join: file 2 is not in sorted order

Any ideas?

**adamdeluca** · 08-02-2010, 02:49 PM

join needs both input files sorted on the join field (sort file2 the same way):

Code:

sort -k1,1 file1 > file1.sorted
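A sketch of the full sort-then-join sequence, with made-up sample data. Two details are assumptions worth checking on your system: `-k1,1` limits the sort key to the first field, and `LC_ALL=C` makes sort and join agree on the ordering, which is a common cause of the "not in sorted order" message.

```shell
#!/bin/sh
# Hypothetical unsorted inputs.
tmp=$(mktemp -d)
printf 'miR-3 2\nmiR-10 4\nmiR-2 7\n' > "$tmp/file1"
printf 'miR-2 5\nmiR-10 1\n'          > "$tmp/file2"

# Use one locale for both tools so their idea of "sorted" matches.
export LC_ALL=C
sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

sorted_join=$(join -a1 -a2 -o 0,1.2,2.2 "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$sorted_join"
rm -rf "$tmp"
```

Note that in the C locale "miR-10" sorts before "miR-2"; that is fine for join as long as both files use the same order.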

**DrD2009** · 08-02-2010, 04:08 PM

I guess I forgot to mention that not all ncRNAs are found in both files.

Some ncRNAs are in one file and not the other. That is what caused the '-' entries in the combined file, which was created using DSAP's Comparative miRNAomics tool.

Any idea how I should process the files given that problem?

**adamdeluca** · 08-02-2010, 04:17 PM

That's fine.
The -a1 option keeps unmatched lines from the first file, and -a2 keeps unmatched lines from the second.

If you want dashes instead of the blank columns, use the sed commands above.

**DrD2009** · 08-02-2010, 04:44 PM

That works. Thank you so much.

**DrD2009** · 08-12-2010, 12:50 AM

Adam,

I was wondering if you could help me out? I'm trying to do the exact same thing, but with larger files and by matching multiple columns.

File 1:

Code:

Chr5	1522433	1522454	721	1	+	AGGAGAAGGAACAGAATCCAA	.	-1	-1	.	-1	.	.	0
Chr2	1526280	1526301	721	1	-	TGCGCCGCCGCTCACCTTCTC	.	-1	-1	.	-1	.	.	0
Chr2	1526352	1526373	721	1	+	CGAGAGCTCGAAGACGAGGCA	.	-1	-1	.	-1	.	.	0
Chr4	1528147	1528168	721	6	-	AATACTACAATTTCTTCCATA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
Chr4	1528149	1528169	721	2	-	TACTACAATTTCTTCCATAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	20
Chr4	1528168	1528189	721	5	-	AAGCCCCTTCTTATATCGAGT	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
Chr4	1528189	1528210	721	3	-	CAACAAAACATCTCGTCCCCA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
Chr4	1528189	1528211	721	4	-	CAACAAAACATCTCGTCCCCAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	22
Chr4	1528191	1528211	721	2	-	ACAAAACATCTCGTCCCCAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	20

File 2:

Code:

chloroplast	1375	1402	721	1	-	GCTAGTTATCCAGTTACAGAAGCGACC	.	-1	-1	.	-1	.	.	0
chloroplast	1376	1394	721	1	-	CTAGTTATCCAGTTACAG	.	-1	-1	.	-1	.	.	0
Chr2	1379	1401	721	1	+	CGACCAGGACGATGAATGGGCG	Chr2	1378	1400	ASRP	ncRNA_Carrington	+	Name=ASRP27130;Note=small	21
Chr2	1379	1401	721	1	+	CGACCAGGACGATGAATGGGCG	Chr2	1380	1402	ASRP	ncRNA_Carrington	+	Name=ASRP150295;Note=small	21
Chr2	1379	1402	721	1	+	CGACCAGGACGATGAATGGGCGA	Chr2	1380	1402	ASRP	ncRNA_Carrington	+	Name=ASRP150295;Note=small	22
chloroplast	1379	1404	721	1	-	GTTATCCAGTTACAGAAGCGACCCC	.	-1	-1	.	-1	.	.	0

In both files, the first 7 columns hold smRNA data from a sample, and the last 7 columns hold annotations from different databases.

What I would like to do is match rows on the first seven columns of both files and then append the last seven columns from each file to the matching sequences.

So basically it would be in the format:

[sample smRNAs (7 columns)] [database 1 (7 columns)] [database 2 (7 columns)]

I've been trying to adapt the previous strategy to this problem, but thus far I've been unsuccessful.

Any help would be greatly appreciated. Thanks.

**adamdeluca** · 08-12-2010, 05:01 AM

Code:

awk '{print $1"_"$2"_"$3"_"$4"_"$5"_"$6"_"$7"\t"$0}' file1

This concatenates the first 7 columns into a single key field that you can use for join. Run it on both files and sort each result on that key before joining.
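Putting the pieces together, a sketch of the whole pipeline might look like the following. The sample rows are shortened, hypothetical versions of the data above (7 smRNA columns plus 2 annotation columns), and the helper name `add_key` is made up. It assumes tab-separated input and that "_" is acceptable as a key separator.

```shell
#!/bin/sh
# Sketch: inner-join two tab-separated files on their first 7 columns.
tab=$(printf '\t')
tmp=$(mktemp -d)

# Hypothetical, shortened rows: 7 smRNA columns + 2 annotation columns.
printf 'Chr5\t1522433\t1522454\t721\t1\t+\tAGGAG\t.\t.\n'             >  "$tmp/file1"
printf 'Chr4\t1528147\t1528168\t721\t6\t-\tAATAC\tmiRNA\tMI0002407\n' >> "$tmp/file1"
printf 'Chr4\t1528147\t1528168\t721\t6\t-\tAATAC\tncRNA\tASRP1\n'     >  "$tmp/file2"
printf 'Chr2\t1379\t1401\t721\t1\t+\tCGACC\tncRNA\tASRP2\n'           >> "$tmp/file2"

# Emit "key<TAB>columns 8..NF", sorted on the key in the C locale.
add_key() {
    awk -F'\t' -v OFS='\t' '{
        key = $1
        for (i = 2; i <= 7; i++) key = key "_" $i
        rest = $8
        for (i = 9; i <= NF; i++) rest = rest OFS $i
        print key, rest
    }' "$1" | LC_ALL=C sort -t "$tab" -k1,1
}

add_key "$tmp/file1" > "$tmp/file1.keyed"
add_key "$tmp/file2" > "$tmp/file2.keyed"

# Output: key, annotation columns from file1, annotation columns from file2.
combined=$(LC_ALL=C join -t "$tab" "$tmp/file1.keyed" "$tmp/file2.keyed")

printf '%s\n' "$combined"
rm -rf "$tmp"
```

If the same 7-column key occurs several times in a file with different annotations, join pairs every occurrence in one file with every occurrence in the other, so the output can contain more rows than either input.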

**lix** · 08-14-2010, 05:54 PM

Wonderful!
This is convenient for two files, but what about three or more files?

**adamdeluca** · 08-14-2010, 06:05 PM

Just repeat the same process: join the output of the first command to file3, and so on.

(((f1+f2)+f3)+f4)...
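The chaining can be sketched as a loop. This assumes every file has exactly two columns (key and value) and is already sorted on the key; the file names and contents are made up. The `-o` list is rebuilt on each pass so that rows missing from some files still get a '-' in the right column.

```shell
#!/bin/sh
# Hypothetical sorted two-column inputs.
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmp/f1"
printf 'b 3\nc 4\n' > "$tmp/f2"
printf 'a 5\nc 6\n' > "$tmp/f3"

export LC_ALL=C
cp "$tmp/f1" "$tmp/acc"      # accumulated result so far
ncols=1                      # value columns currently in acc

for f in "$tmp/f2" "$tmp/f3"; do
    # Build the output list: 0,1.2,...,1.(ncols+1),2.2
    fmt=0
    i=2
    while [ "$i" -le $((ncols + 1)) ]; do
        fmt="$fmt,1.$i"
        i=$((i + 1))
    done
    fmt="$fmt,2.2"

    join -a1 -a2 -e '-' -o "$fmt" "$tmp/acc" "$f" > "$tmp/next"
    mv "$tmp/next" "$tmp/acc"
    ncols=$((ncols + 1))
done

all=$(cat "$tmp/acc")
printf '%s\n' "$all"
rm -rf "$tmp"
```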

**DrD2009** · 08-15-2010, 06:26 PM

Adam,

Thanks for that solution.

**rajeshkmaurya08** · 02-08-2017, 08:38 PM

How to find SNPs by comparing two FASTA files using Perl code?

Hello,
I am just a beginner in Perl.
I have two FASTA files of different lengths.
I would like to align them to find the differences in nucleotide position.

The output should look like this:
Total length of FASTA files
First reference file: 1253630 bp
Second file: 4523366 bp
If they match: the 2nd file is the same as the 1st reference file.
If they do not match, the output should look like this:
Mismatch position of base pair
A-C 100025
C-T 600045
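A very rough starting point, using awk rather than Perl to match the rest of this thread: the sketch below compares two single-sequence FASTA files position by position over their shared length. This is not an alignment. Any insertion or deletion shifts every later position, so for real SNP calling between sequences of different lengths a proper aligner is needed. The file names and sequences are made up.

```shell
#!/bin/sh
# Hypothetical single-record FASTA files.
tmp=$(mktemp -d)
printf '>ref\nACGTAC\nGT\n'     > "$tmp/ref.fa"
printf '>query\nACCTAC\nGTAA\n' > "$tmp/query.fa"

report=$(awk '
    /^>/ { next }                                       # skip header lines
    FILENAME == ARGV[1] { ref = ref toupper($0); next } # first file: reference
    { qry = qry toupper($0) }                           # second file: query
    END {
        printf "First reference file: %d bp\n", length(ref)
        printf "Second file: %d bp\n", length(qry)
        n = (length(ref) < length(qry)) ? length(ref) : length(qry)
        for (i = 1; i <= n; i++) {
            r = substr(ref, i, 1)
            q = substr(qry, i, 1)
            if (r != q) printf "%s-%s %d\n", r, q, i    # ref-query position
        }
    }
' "$tmp/ref.fa" "$tmp/query.fa")

printf '%s\n' "$report"
rm -rf "$tmp"
```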


Compare two files with Awk?
