
  • Compare two files with Awk?

    Hello everyone,

    I have two files of ncRNAs from two different samples. I would like to compare them to each other by creating a single file that contains all the found ncRNAs in a file format such as this:

    Code:
    ncRNA     sample1     sample2
    The files are currently in the format of:

    Code:
    ncRNA     sample1
    and

    Code:
    ncRNA     sample2
    To make a file similar to this:

    Code:
    miRNA	721Es	    162Es
    ath-miR173	1	-
    ath-miR1886.1	1	-
    ath-miR1886.2	3	-
    ath-miR319a	1	-
    ath-miR390a	59	15
    ath-miR396a	1	1
    ath-miR822	1	2
    ath-miR824	4	5
    ath-miR825	-	1
    ath-miR837-3p	4	-

    Any help on this would be great. A command-line awk script or something similar would be preferred.

    Thanks,
    Brandon

  • #2
    If Perl is acceptable, this should work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Both input files are required; $! carries no useful message here,
    # so print a usage line instead.
    my $file_one = $ARGV[0] or die "Usage: $0 file_one file_two\n";
    my $file_two = $ARGV[1] or die "Usage: $0 file_one file_two\n";

    my $data = {};

    read_file_fill_hash($file_one, 'first',  $data);
    read_file_fill_hash($file_two, 'second', $data);

    print_data($data);

    # Print one row per miRNA, sorted by name.
    sub print_data {
        my $hash = shift;
        print "miRNA\t721Es\t162Es\n";
        foreach my $mirna (sort keys %{$hash}) {
            print "$mirna\t$hash->{$mirna}{first}\t$hash->{$mirna}{second}\n";
        }
    }

    # Read a two-column file into the shared hash; '-' marks a missing count.
    sub read_file_fill_hash {
        my ($file, $which, $reference) = @_;
        open(my $han, '<', $file) or die "Cannot open $file: $!";
        while (my $line = <$han>) {
            my ($mirna, $result) = split(' ', $line);
            next unless defined $result;    # skip blank or one-column lines
            if ($which eq 'first') {
                $reference->{$mirna}{first}  = $result;
                $reference->{$mirna}{second} = '-';
            } else {
                $reference->{$mirna}{first}  = '-' if !exists $reference->{$mirna}{first};
                $reference->{$mirna}{second} = $result;
            }
        }
        close $han;
    }
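
Since the original question asked for awk, the same two-hash idea can also be sketched there. This is only a sketch under assumptions: `merge_counts` is a hypothetical helper name, both inputs are taken to be two-column, whitespace-separated files with no header line, and the header row of the desired output is left out.

```shell
# merge_counts FILE1 FILE2
# Hypothetical helper: prints "name<TAB>count1<TAB>count2", with "-" where a
# name is missing from one of the files. Assumes FILE1 and FILE2 are
# different paths.
merge_counts() {
    awk '
        BEGIN { OFS = "\t" }
        NF < 2 { next }                          # skip blank or one-column lines
        { seen[$1] = 1 }                         # every name from either file
        FILENAME == ARGV[1] { first[$1] = $2; next }
        { second[$1] = $2 }
        END {
            for (name in seen) {
                a = (name in first)  ? first[name]  : "-"
                b = (name in second) ? second[name] : "-"
                print name, a, b
            }
        }
    ' "$1" "$2" | LC_ALL=C sort -k1,1            # awk array order is unspecified
}
```

It would be called as `merge_counts sample1.txt sample2.txt > combined.txt`, where the file names are placeholders. Unlike join, this does not need sorted input, because both files are held in memory.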



    • #3
      or you can use join
      Code:
      join file1 file2 -a1 -a2 -o 0 -o1.2 -o2.2
      pipe into sed if you really want the dash for the blanks

      Code:
      tr " " "\t" |sed "s/\t\t/\t-\t/" | sed "s/\t$/\t-/"
      Last edited by adamdeluca; 08-02-2010, 12:34 PM.



      • #4
        Wow, that is cool



        • #5
          Thank you both.

          @adamdeluca:
          Code:
          join 162Es/162Es.dsap.rfam.txt 721Es/721Es.dsap.rfam.txt -a1 -a2 -o 0 -o1.2 -o2.2 > 721Es.172Es.rfam
          join: file 1 is not in sorted order
          join: file 2 is not in sorted order
          Any ideas?



          • #6
            join needs both input files sorted on the join field (the first column here):

            Code:
            sort -k1,1 file1 > file1.sorted
            sort -k1,1 file2 > file2.sorted
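
The sort and join steps can be wrapped together so that both run under the same collation, since join can still report unsorted input when sort used a different locale. A minimal sketch, assuming two-column whitespace-separated files; `sorted_join` is a hypothetical helper name.

```shell
# sorted_join FILE1 FILE2
# Hypothetical helper: sorts both inputs on field 1 and joins them,
# keeping names that appear in only one file.
sorted_join() {
    sorted1=$(mktemp)
    sorted2=$(mktemp)
    # Sort on field 1 only, in the same locale that join will use.
    LC_ALL=C sort -k1,1 "$1" > "$sorted1"
    LC_ALL=C sort -k1,1 "$2" > "$sorted2"
    LC_ALL=C join -a1 -a2 -o 0,1.2,2.2 "$sorted1" "$sorted2"
    rm -f "$sorted1" "$sorted2"
}
```

Temporary files are used rather than process substitution so the sketch also runs in a plain POSIX sh.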



            • #7
              I guess I forgot to mention that not all ncRNAs are found in both files.

              Some ncRNAs are in one file and not the other. That is what caused the '-' in the combined file, which was created using DSAP's Comparative miRNAomics.

              Any idea how I should process the files given that?



              • #8
                That's fine.
                The -a1 option keeps unmatched lines from the first file, and -a2 keeps unmatched lines from the second.

                If you want dashes instead of the blank columns, use the sed commands above.
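
As an alternative to post-processing with sed, join's own `-e` option can fill the missing fields when it is combined with `-o`. A minimal sketch, assuming both inputs are already sorted on the first column under `LC_ALL=C`; `join_with_dashes` is a hypothetical helper name.

```shell
# join_with_dashes SORTED1 SORTED2
# Hypothetical helper: full outer join on field 1, "-" for missing counts,
# tab-separated output.
join_with_dashes() {
    # -e '-' replaces fields that are missing from one side; because no
    # output field is empty afterwards, turning spaces into tabs is safe.
    LC_ALL=C join -a1 -a2 -e '-' -o 0,1.2,2.2 "$1" "$2" | tr ' ' '\t'
}
```

Because every field is filled before `tr` runs, the two sed substitutions are no longer needed.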



                • #9


                  That works. Thank you so much.



                  • #10
                    Adam,

                    I was wondering if you could help me out? I'm trying to do the exact same thing, but with larger files and by matching multiple columns.

                    File 1:
                    Code:
                    Chr5	1522433	1522454	721	1	+	AGGAGAAGGAACAGAATCCAA	.	-1	-1	.	-1	.	.	0
                    Chr2	1526280	1526301	721	1	-	TGCGCCGCCGCTCACCTTCTC	.	-1	-1	.	-1	.	.	0
                    Chr2	1526352	1526373	721	1	+	CGAGAGCTCGAAGACGAGGCA	.	-1	-1	.	-1	.	.	0
                    Chr4	1528147	1528168	721	6	-	AATACTACAATTTCTTCCATA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
                    Chr4	1528149	1528169	721	2	-	TACTACAATTTCTTCCATAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	20
                    Chr4	1528168	1528189	721	5	-	AAGCCCCTTCTTATATCGAGT	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
                    Chr4	1528189	1528210	721	3	-	CAACAAAACATCTCGTCCCCA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
                    Chr4	1528189	1528211	721	4	-	CAACAAAACATCTCGTCCCCAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	22
                    Chr4	1528191	1528211	721	2	-	ACAAAACATCTCGTCCCCAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	20

                    File 2:
                    Code:
                    chloroplast	1375	1402	721	1	-	GCTAGTTATCCAGTTACAGAAGCGACC	.	-1	-1	.	-1	.	.	0
                    chloroplast	1376	1394	721	1	-	CTAGTTATCCAGTTACAG	.	-1	-1	.	-1	.	.	0
                    Chr2	1379	1401	721	1	+	CGACCAGGACGATGAATGGGCG	Chr2	1378	1400	ASRP	ncRNA_Carrington	+	Name=ASRP27130;Note=small	21
                    Chr2	1379	1401	721	1	+	CGACCAGGACGATGAATGGGCG	Chr2	1380	1402	ASRP	ncRNA_Carrington	+	Name=ASRP150295;Note=small	21
                    Chr2	1379	1402	721	1	+	CGACCAGGACGATGAATGGGCGA	Chr2	1380	1402	ASRP	ncRNA_Carrington	+	Name=ASRP150295;Note=small	22
                    chloroplast	1379	1404	721	1	-	GTTATCCAGTTACAGAAGCGACCCC	.	-1	-1	.	-1	.	.	0
                    These two files contain smRNA data from a sample in the first 7 columns, and the last 7 columns contain annotations from different databases.

                    What I would like to do is match the first seven columns from both files and then have the last seven columns from each file added to the matching sequences.

                    So basically it would be in the format:

                    [sample smRNAs (7 columns)] [database 1 (7 columns)] [database 2 (7 columns)]

                    I've been trying to adapt the previous strategy to this problem, but thus far I've been unsuccessful.

                    Any help would be greatly appreciated. Thanks.



                    • #11
                      Code:
                      awk '{print $1"_"$2"_"$3"_"$4"_"$5"_"$6"_"$7"\t"$0}' file1
                      will concatenate the first 7 columns giving you a field to use for join.
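
Putting the key step together with sort and join, the whole merge might look like the sketch below. Several things here are assumptions rather than anything stated in the thread: the helper names `add_key` and `merge_annotations` are hypothetical, the inputs are taken to be tab-separated, ":" is used as the key separator on the assumption that it never occurs in columns 1-7, and "NA" is an arbitrary filler for rows found in only one file. The sample rows shown appear to carry eight fields after the sequence, so the number of annotation columns defaults to 8 and can be overridden with a third argument.

```shell
TAB=$(printf '\t')

# add_key FILE
# Prefix each line with a key built from columns 1-7, then sort on that key.
add_key() {
    awk 'BEGIN { FS = OFS = "\t" }
         { print $1 ":" $2 ":" $3 ":" $4 ":" $5 ":" $6 ":" $7, $0 }' "$1" |
        LC_ALL=C sort -t "$TAB" -k1,1
}

# merge_annotations FILE1 FILE2 [N_ANNOT]
# Output: the 7 sample columns, then N_ANNOT annotation columns from FILE1,
# then N_ANNOT annotation columns from FILE2.
merge_annotations() {
    n_annot=${3:-8}
    # In a keyed line, field 1 is the key, fields 2-8 are the sample columns,
    # and the annotation columns start at field 9.
    spec=0
    for side in 1 2; do
        i=1
        while [ "$i" -le "$n_annot" ]; do
            spec="$spec,$side.$((i + 8))"
            i=$((i + 1))
        done
    done
    keyed1=$(mktemp)
    keyed2=$(mktemp)
    add_key "$1" > "$keyed1"
    add_key "$2" > "$keyed2"
    # Join on the key, then split the key back into its 7 columns.
    LC_ALL=C join -t "$TAB" -a1 -a2 -e NA -o "$spec" "$keyed1" "$keyed2" |
        awk 'BEGIN { FS = OFS = "\t" } { gsub(/:/, "\t", $1); print }'
    rm -f "$keyed1" "$keyed2"
}
```

Note that join pairs every matching line from one file with every matching line from the other, so a key that occurs twice in one file, as in the File 2 sample, yields two output rows.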



                      • #12
                        Originally posted by adamdeluca
                        or you can use join
                        Code:
                        join file1 file2 -a1 -a2 -o 0 -o1.2 -o2.2
                        pipe into sed if you really want the dash for the blanks

                        Code:
                        tr " " "\t" |sed "s/\t\t/\t-\t/" | sed "s/\t$/\t-/"

                        Wonderful!
                        This is convenient for two files, but what about three or more files?



                        • #13
                          Originally posted by lix
                          Wonderful!
                          This is convenient for two files, but what about three or more files?
                          Just repeat the same process, join the output of the first command to file3 etc.

                          (((f1+f2)+f3)+f4)...
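
For three files, the chaining could be sketched as follows. It assumes all three inputs are two-column files already sorted on the first column under `LC_ALL=C`; `join_three` is a hypothetical helper name. The only change in the second step is that the intermediate file now contributes two count columns.

```shell
# join_three SORTED1 SORTED2 SORTED3
# Hypothetical helper: (f1 + f2) + f3, with "-" for missing counts.
join_three() {
    step=$(mktemp)
    # Step 1: f1 + f2 gives name, count1, count2.
    LC_ALL=C join -a1 -a2 -e '-' -o 0,1.2,2.2 "$1" "$2" > "$step"
    # Step 2: (f1 + f2) + f3 keeps both earlier counts and adds count3.
    LC_ALL=C join -a1 -a2 -e '-' -o 0,1.2,1.3,2.2 "$step" "$3" | tr ' ' '\t'
    rm -f "$step"
}
```

The intermediate result needs no extra sort, because join writes its output in key order.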



                          • #14
                            Adam,

                            Thanks for that solution.



                            • #15
                              How can I find SNPs by comparing two FASTA files using Perl?

                              Hello,
                              I am just a beginner in Perl.
                              I have two FASTA files of different lengths.
                              I would like to align them to find the nucleotide positions where they differ.

                              The output should look like this:
                              Total length of FASTA files
                              First reference file: 1253630 bp
                              Second file: 4523366 bp
                              If they match: 2nd file is the same as 1st reference file.
                              If they do not match, the output should look like this:
                              Mismatch position of base pair
                              A-C 100025
                              C-T 600045
