Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • perl_merge fasta and group files (454)

    Hi

    I am trying to put data from 454 that I have de noised and compiled with mothur to start using qiime and PICRUSt. So I need to merge two files that look like this:

    A fasta file:
    >HCZRAFT02HJAN7
    A-G-A-T----AAC------AG-C-C-C-AC-C-A-----A-GG-C-G--A-T--G-A---T--AC-A--T--A--G---C-T-G-G-TCT--G-AG---A--G-G-AT--G-AA-C-A-G-CCA-C-.....
    >HCZRAFT02HUO5K
    A-G-G-T----AAC------GG-C-T-C-AC-C-A-C.......

    and a group file specifying which sequence is found in what sample:

    HCZRAFT02JI33I C_129
    HCZRAFT02IMY3H C_129
    HCZRAFT02HE28V C_129
    HCZRAFT02IM5RX C_129
    HCZRAFT02JYNIP C_129
    HCZRAFT02JFRVC C_129
    HCZRAFT02HRTKG C_129
    HCZRAFT02ISION C_129
    HCZRAFT02IIL63 C_129
    HCZRAFT02HQBFW C_129


    I'd like to end up with a file that looks like this:

    >C_129_HCZRAFT02JI33I
    A-G-A-T----AAC------AG-C-C-C-AC-C-A-----A-GG-C-G--A-T--G-A---T--AC-A--T--A--G---C-T-G-G-TCT--G-AG---A--G-G-AT--G-AA-C-A-G-CCA-C-.....

    Any help would be greatly appreciated

    Andres

  • #2
    Not pretty, but works for me.

    Code:
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    open (my $codes, "<", "codes.txt");
    open (my $seqs, "<", "test.fasta");
    open (my $output, ">", "merged.txt");
    
    my %codeHash;
    while (my $code = <$codes>) {
        chomp $code;
        if ($code =~ /(\S+)\s(\S+)/) {
            print $1 . "  " . $2 . "\n";
            $codeHash{$1} = $2;
        }
    }
    
    while (my $seq = <$seqs>) {
        chomp $seq;
        if ($seq =~ />(\S+)/) {
            if (exists $codeHash{$1}) {
                print $output ">" . $codeHash{$1} . "_" . $1 . "\n";
            } else {
            print $output $seq . "\n";
            }
        } else {
            print $output $seq . "\n";
        }
    }

    Comment


    • #3
      Thanks, can you give me a hand in usage?

      Comment


      • #4
        I tried running the code you provided as:
        Desktop $ perl merge.pl -code Howler.final.groups -seq Howler.final.fasta -output >merged.txt

        and goy this error:

        readline() on closed filehandle $codes at merge.pl line 11.
        readline() on closed filehandle $seqs at merge.pl line 19.


        Can you please give me a clue on what has happened?

        Comment


        • #5
          Originally posted by Andres_Gomez View Post
          I tried running the code you provided as:
          Desktop $ perl merge.pl -code Howler.final.groups -seq Howler.final.fasta -output >merged.txt

          and goy this error:

          readline() on closed filehandle $codes at merge.pl line 11.
          readline() on closed filehandle $seqs at merge.pl line 19.


          Can you please give me a clue on what has happened?
          The files are hard coded, ie, the script isn't using Getopt::Long to accept flags. So, you need to change the names in the script.

          Comment


          • #6
            Any clues on where I should change the names? is it lines 11 and 19?

            Comment


            • #7
              Originally posted by Andres_Gomez View Post
              Any clues on where I should change the names? is it lines 11 and 19?
              Take a look at the open() statements. That is, lines 6, 7, and 8 are the lines where you need to change the file names. Also, it is best practices to test that you could open the file by putting an "or die ...." statement after open. That way, the program will halt and you can easily see what is going wrong (and it's trying to open a specific file).

              Comment


              • #8
                after changing the names of the files in the script, you want to call the script as follows:

                Code:
                Desktop $ perl merge.pl Howler.final.groups Howler.final.fasta merged.txt

                Comment


                • #9
                  Originally posted by mastal View Post
                  after changing the names of the files in the script, you want to call the script as follows:

                  Code:
                  Desktop $ perl merge.pl Howler.final.groups Howler.final.fasta merged.txt
                  No, you wouldn't. If the names of the files are hardcoded in the script there is no purpose to putting them on the command line. The will simply be ignored since the script does not have any code to do anything with command line arguments. Assuming you have hardcoded the file names in the script (ignoring that that is BAD(tm)) you would simply call the script as:

                  Code:
                  Desktop $ perl merge.pl
                  The files Howler.final.groups and Howler.final.fasta must be present in the current working directory or the script will throw an error. The file merged.txt will be created by script, OVERWRITING any previous version of the file present in the current working directory.

                  Comment


                  • #10
                    yes, kmcarr is right.

                    Comment


                    • #11
                      Sorry about that--just saw this. Here's a better, safer version. Won't overwrite output file if it already exists, and uses command line flags for the files.

                      Usage:
                      perl merge.pl --code codefile.txt --seq seqfile.fasta --out mergedOutput.fasta

                      Code:
                      #!/usr/bin/perl
                      
                      use strict;
                      use warnings;
                      use Getopt::Long;
                      
                      my $codes;
                      my $seqs;
                      my $out;
                      my $usage = "Usage: perl merge.pl --code <file_with_codes.txt> --seq <sequences.fasta> --out <outfile.txt>\n";
                      
                      GetOptions  ("code=s" => \$codes,
                                   "seq=s" => \$seqs,
                                   "out=s" => \$out);
                      
                      if (!defined $codes) {
                          print "Must supply codes file name.\n";
                          die "$usage";
                      } elsif (!defined $seqs) {
                          print "Must supply sequences file name.\n";
                          die "$usage";
                      } elsif (!defined $out) {
                          print "Must supply output file name.\n";
                      }
                      
                      if(-e "$out") {
                          die "File $out already exists--stopping so you don't overwrite.\n";
                      }   
                      
                      open (my $codesFH, "<", "$codes");
                      open (my $seqsFH, "<", "$seqs");
                      open (my $outputFH, ">", "$out");
                      
                      my %codeHash;
                      while (my $code = <$codesFH>) {
                          chomp $code;
                          if ($code =~ /(\S+)\s(\S+)/) {
                              $codeHash{$1} = $2;
                          }
                      }
                      
                      while (my $seq = <$seqsFH>) {
                          chomp $seq;
                          if ($seq =~ />(\S+)/) {
                              if (exists $codeHash{$1}) {
                                  print $outputFH ">" . $codeHash{$1} . "_" . $1 . "\n";
                              } else {
                              print $outputFH $seq . "\n";
                              }
                          } else {
                              print $outputFH $seq . "\n";
                          }
                      }
                      print "Finished executing.\n";

                      Comment


                      • #12
                        Also, what do you want it to do if it doesn't find the corresponding name in the codes file. Currently it just prints unmatched names without a C_XXX prefix in the output file if it doesn't find a match.

                        Comment


                        • #13
                          Worked wonders! thanks a lot atcghelix, mastal, kmcarr, SES and everybody else giving advice.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin




                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                            04-22-2024, 07:01 AM
                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-25-2024, 11:49 AM
                          0 responses
                          19 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-24-2024, 08:47 AM
                          0 responses
                          20 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          62 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          60 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X