Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • perl_merge fasta and group files (454)

    Hi

    I am trying to put data from 454 that I have de noised and compiled with mothur to start using qiime and PICRUSt. So I need to merge two files that look like this:

    A fasta file:
    >HCZRAFT02HJAN7
    A-G-A-T----AAC------AG-C-C-C-AC-C-A-----A-GG-C-G--A-T--G-A---T--AC-A--T--A--G---C-T-G-G-TCT--G-AG---A--G-G-AT--G-AA-C-A-G-CCA-C-.....
    >HCZRAFT02HUO5K
    A-G-G-T----AAC------GG-C-T-C-AC-C-A-C.......

    and a group file specifying which sequence is found in what sample:

    HCZRAFT02JI33I C_129
    HCZRAFT02IMY3H C_129
    HCZRAFT02HE28V C_129
    HCZRAFT02IM5RX C_129
    HCZRAFT02JYNIP C_129
    HCZRAFT02JFRVC C_129
    HCZRAFT02HRTKG C_129
    HCZRAFT02ISION C_129
    HCZRAFT02IIL63 C_129
    HCZRAFT02HQBFW C_129


    I'd like to end up with a file that looks like this:

    >C_129_HCZRAFT02JI33I
    A-G-A-T----AAC------AG-C-C-C-AC-C-A-----A-GG-C-G--A-T--G-A---T--AC-A--T--A--G---C-T-G-G-TCT--G-AG---A--G-G-AT--G-AA-C-A-G-CCA-C-.....

    Any help would be greatly appreciated

    Andres

  • #2
    Not pretty, but works for me.

    Code:
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    open (my $codes, "<", "codes.txt");
    open (my $seqs, "<", "test.fasta");
    open (my $output, ">", "merged.txt");
    
    my %codeHash;
    while (my $code = <$codes>) {
        chomp $code;
        if ($code =~ /(\S+)\s(\S+)/) {
            print $1 . "  " . $2 . "\n";
            $codeHash{$1} = $2;
        }
    }
    
    while (my $seq = <$seqs>) {
        chomp $seq;
        if ($seq =~ />(\S+)/) {
            if (exists $codeHash{$1}) {
                print $output ">" . $codeHash{$1} . "_" . $1 . "\n";
            } else {
            print $output $seq . "\n";
            }
        } else {
            print $output $seq . "\n";
        }
    }

    Comment


    • #3
      Thanks, can you give me a hand in usage?

      Comment


      • #4
        I tried running the code you provided as:
        Desktop $ perl merge.pl -code Howler.final.groups -seq Howler.final.fasta -output >merged.txt

        and goy this error:

        readline() on closed filehandle $codes at merge.pl line 11.
        readline() on closed filehandle $seqs at merge.pl line 19.


        Can you please give me a clue on what has happened?

        Comment


        • #5
          Originally posted by Andres_Gomez View Post
          I tried running the code you provided as:
          Desktop $ perl merge.pl -code Howler.final.groups -seq Howler.final.fasta -output >merged.txt

          and goy this error:

          readline() on closed filehandle $codes at merge.pl line 11.
          readline() on closed filehandle $seqs at merge.pl line 19.


          Can you please give me a clue on what has happened?
          The files are hard coded, ie, the script isn't using Getopt::Long to accept flags. So, you need to change the names in the script.

          Comment


          • #6
            Any clues on where I should change the names? is it lines 11 and 19?

            Comment


            • #7
              Originally posted by Andres_Gomez View Post
              Any clues on where I should change the names? is it lines 11 and 19?
              Take a look at the open() statements. That is, lines 6, 7, and 8 are the lines where you need to change the file names. Also, it is best practices to test that you could open the file by putting an "or die ...." statement after open. That way, the program will halt and you can easily see what is going wrong (and it's trying to open a specific file).

              Comment


              • #8
                after changing the names of the files in the script, you want to call the script as follows:

                Code:
                Desktop $ perl merge.pl Howler.final.groups Howler.final.fasta merged.txt

                Comment


                • #9
                  Originally posted by mastal View Post
                  after changing the names of the files in the script, you want to call the script as follows:

                  Code:
                  Desktop $ perl merge.pl Howler.final.groups Howler.final.fasta merged.txt
                  No, you wouldn't. If the names of the files are hardcoded in the script there is no purpose to putting them on the command line. The will simply be ignored since the script does not have any code to do anything with command line arguments. Assuming you have hardcoded the file names in the script (ignoring that that is BAD(tm)) you would simply call the script as:

                  Code:
                  Desktop $ perl merge.pl
                  The files Howler.final.groups and Howler.final.fasta must be present in the current working directory or the script will throw an error. The file merged.txt will be created by script, OVERWRITING any previous version of the file present in the current working directory.

                  Comment


                  • #10
                    yes, kmcarr is right.

                    Comment


                    • #11
                      Sorry about that--just saw this. Here's a better, safer version. Won't overwrite output file if it already exists, and uses command line flags for the files.

                      Usage:
                      perl merge.pl --code codefile.txt --seq seqfile.fasta --out mergedOutput.fasta

                      Code:
                      #!/usr/bin/perl
                      
                      use strict;
                      use warnings;
                      use Getopt::Long;
                      
                      my $codes;
                      my $seqs;
                      my $out;
                      my $usage = "Usage: perl merge.pl --code <file_with_codes.txt> --seq <sequences.fasta> --out <outfile.txt>\n";
                      
                      GetOptions  ("code=s" => \$codes,
                                   "seq=s" => \$seqs,
                                   "out=s" => \$out);
                      
                      if (!defined $codes) {
                          print "Must supply codes file name.\n";
                          die "$usage";
                      } elsif (!defined $seqs) {
                          print "Must supply sequences file name.\n";
                          die "$usage";
                      } elsif (!defined $out) {
                          print "Must supply output file name.\n";
                      }
                      
                      if(-e "$out") {
                          die "File $out already exists--stopping so you don't overwrite.\n";
                      }   
                      
                      open (my $codesFH, "<", "$codes");
                      open (my $seqsFH, "<", "$seqs");
                      open (my $outputFH, ">", "$out");
                      
                      my %codeHash;
                      while (my $code = <$codesFH>) {
                          chomp $code;
                          if ($code =~ /(\S+)\s(\S+)/) {
                              $codeHash{$1} = $2;
                          }
                      }
                      
                      while (my $seq = <$seqsFH>) {
                          chomp $seq;
                          if ($seq =~ />(\S+)/) {
                              if (exists $codeHash{$1}) {
                                  print $outputFH ">" . $codeHash{$1} . "_" . $1 . "\n";
                              } else {
                              print $outputFH $seq . "\n";
                              }
                          } else {
                              print $outputFH $seq . "\n";
                          }
                      }
                      print "Finished executing.\n";

                      Comment


                      • #12
                        Also, what do you want it to do if it doesn't find the corresponding name in the codes file. Currently it just prints unmatched names without a C_XXX prefix in the output file if it doesn't find a match.

                        Comment


                        • #13
                          Worked wonders! thanks a lot atcghelix, mastal, kmcarr, SES and everybody else giving advice.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Exploring the Dynamics of the Tumor Microenvironment
                            by seqadmin




                            The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                            07-08-2024, 03:19 PM
                          • seqadmin
                            Exploring Human Diversity Through Large-Scale Omics
                            by seqadmin


                            In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                            06-25-2024, 06:43 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 07-19-2024, 07:20 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 07-16-2024, 05:49 AM
                          0 responses
                          41 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 07-15-2024, 06:53 AM
                          0 responses
                          46 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 07-10-2024, 07:30 AM
                          0 responses
                          43 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X