Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • perl_merge fasta and group files (454)

    Hi

    I am trying to put data from 454 that I have de noised and compiled with mothur to start using qiime and PICRUSt. So I need to merge two files that look like this:

    A fasta file:
    >HCZRAFT02HJAN7
    A-G-A-T----AAC------AG-C-C-C-AC-C-A-----A-GG-C-G--A-T--G-A---T--AC-A--T--A--G---C-T-G-G-TCT--G-AG---A--G-G-AT--G-AA-C-A-G-CCA-C-.....
    >HCZRAFT02HUO5K
    A-G-G-T----AAC------GG-C-T-C-AC-C-A-C.......

    and a group file specifying which sequence is found in what sample:

    HCZRAFT02JI33I C_129
    HCZRAFT02IMY3H C_129
    HCZRAFT02HE28V C_129
    HCZRAFT02IM5RX C_129
    HCZRAFT02JYNIP C_129
    HCZRAFT02JFRVC C_129
    HCZRAFT02HRTKG C_129
    HCZRAFT02ISION C_129
    HCZRAFT02IIL63 C_129
    HCZRAFT02HQBFW C_129


    I'd like to end up with a file that looks like this:

    >C_129_HCZRAFT02JI33I
    A-G-A-T----AAC------AG-C-C-C-AC-C-A-----A-GG-C-G--A-T--G-A---T--AC-A--T--A--G---C-T-G-G-TCT--G-AG---A--G-G-AT--G-AA-C-A-G-CCA-C-.....

    Any help would be greatly appreciated

    Andres

  • #2
    Not pretty, but works for me.

    Code:
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    open (my $codes, "<", "codes.txt");
    open (my $seqs, "<", "test.fasta");
    open (my $output, ">", "merged.txt");
    
    my %codeHash;
    while (my $code = <$codes>) {
        chomp $code;
        if ($code =~ /(\S+)\s(\S+)/) {
            print $1 . "  " . $2 . "\n";
            $codeHash{$1} = $2;
        }
    }
    
    while (my $seq = <$seqs>) {
        chomp $seq;
        if ($seq =~ />(\S+)/) {
            if (exists $codeHash{$1}) {
                print $output ">" . $codeHash{$1} . "_" . $1 . "\n";
            } else {
            print $output $seq . "\n";
            }
        } else {
            print $output $seq . "\n";
        }
    }

    Comment


    • #3
      Thanks, can you give me a hand in usage?

      Comment


      • #4
        I tried running the code you provided as:
        Desktop $ perl merge.pl -code Howler.final.groups -seq Howler.final.fasta -output >merged.txt

        and goy this error:

        readline() on closed filehandle $codes at merge.pl line 11.
        readline() on closed filehandle $seqs at merge.pl line 19.


        Can you please give me a clue on what has happened?

        Comment


        • #5
          Originally posted by Andres_Gomez View Post
          I tried running the code you provided as:
          Desktop $ perl merge.pl -code Howler.final.groups -seq Howler.final.fasta -output >merged.txt

          and goy this error:

          readline() on closed filehandle $codes at merge.pl line 11.
          readline() on closed filehandle $seqs at merge.pl line 19.


          Can you please give me a clue on what has happened?
          The files are hard coded, ie, the script isn't using Getopt::Long to accept flags. So, you need to change the names in the script.

          Comment


          • #6
            Any clues on where I should change the names? is it lines 11 and 19?

            Comment


            • #7
              Originally posted by Andres_Gomez View Post
              Any clues on where I should change the names? is it lines 11 and 19?
              Take a look at the open() statements. That is, lines 6, 7, and 8 are the lines where you need to change the file names. Also, it is best practices to test that you could open the file by putting an "or die ...." statement after open. That way, the program will halt and you can easily see what is going wrong (and it's trying to open a specific file).

              Comment


              • #8
                after changing the names of the files in the script, you want to call the script as follows:

                Code:
                Desktop $ perl merge.pl Howler.final.groups Howler.final.fasta merged.txt

                Comment


                • #9
                  Originally posted by mastal View Post
                  after changing the names of the files in the script, you want to call the script as follows:

                  Code:
                  Desktop $ perl merge.pl Howler.final.groups Howler.final.fasta merged.txt
                  No, you wouldn't. If the names of the files are hardcoded in the script there is no purpose to putting them on the command line. The will simply be ignored since the script does not have any code to do anything with command line arguments. Assuming you have hardcoded the file names in the script (ignoring that that is BAD(tm)) you would simply call the script as:

                  Code:
                  Desktop $ perl merge.pl
                  The files Howler.final.groups and Howler.final.fasta must be present in the current working directory or the script will throw an error. The file merged.txt will be created by script, OVERWRITING any previous version of the file present in the current working directory.

                  Comment


                  • #10
                    yes, kmcarr is right.

                    Comment


                    • #11
                      Sorry about that--just saw this. Here's a better, safer version. Won't overwrite output file if it already exists, and uses command line flags for the files.

                      Usage:
                      perl merge.pl --code codefile.txt --seq seqfile.fasta --out mergedOutput.fasta

                      Code:
                      #!/usr/bin/perl
                      
                      use strict;
                      use warnings;
                      use Getopt::Long;
                      
                      my $codes;
                      my $seqs;
                      my $out;
                      my $usage = "Usage: perl merge.pl --code <file_with_codes.txt> --seq <sequences.fasta> --out <outfile.txt>\n";
                      
                      GetOptions  ("code=s" => \$codes,
                                   "seq=s" => \$seqs,
                                   "out=s" => \$out);
                      
                      if (!defined $codes) {
                          print "Must supply codes file name.\n";
                          die "$usage";
                      } elsif (!defined $seqs) {
                          print "Must supply sequences file name.\n";
                          die "$usage";
                      } elsif (!defined $out) {
                          print "Must supply output file name.\n";
                      }
                      
                      if(-e "$out") {
                          die "File $out already exists--stopping so you don't overwrite.\n";
                      }   
                      
                      open (my $codesFH, "<", "$codes");
                      open (my $seqsFH, "<", "$seqs");
                      open (my $outputFH, ">", "$out");
                      
                      my %codeHash;
                      while (my $code = <$codesFH>) {
                          chomp $code;
                          if ($code =~ /(\S+)\s(\S+)/) {
                              $codeHash{$1} = $2;
                          }
                      }
                      
                      while (my $seq = <$seqsFH>) {
                          chomp $seq;
                          if ($seq =~ />(\S+)/) {
                              if (exists $codeHash{$1}) {
                                  print $outputFH ">" . $codeHash{$1} . "_" . $1 . "\n";
                              } else {
                              print $outputFH $seq . "\n";
                              }
                          } else {
                              print $outputFH $seq . "\n";
                          }
                      }
                      print "Finished executing.\n";

                      Comment


                      • #12
                        Also, what do you want it to do if it doesn't find the corresponding name in the codes file. Currently it just prints unmatched names without a C_XXX prefix in the output file if it doesn't find a match.

                        Comment


                        • #13
                          Worked wonders! thanks a lot atcghelix, mastal, kmcarr, SES and everybody else giving advice.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Advanced Methods for the Detection of Infectious Disease
                            by seqadmin




                            The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
                            ...
                            11-27-2023, 01:15 PM
                          • seqadmin
                            Strategies for Investigating the Microbiome
                            by seqadmin




                            Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...
                            11-09-2023, 07:02 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 02:24 PM
                          0 responses
                          11 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 07:37 AM
                          0 responses
                          21 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-04-2023, 08:23 AM
                          0 responses
                          8 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-01-2023, 09:55 AM
                          0 responses
                          24 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X