Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • gsgs
    Senior Member
    • Oct 2009
    • 139

    #16
    I usually write a small basic program for such problems.

    post/send the file, I send the result ?

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #17
      Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

      Comment

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #18
        @entomology: Try the following

        Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

        Code:
        $  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

        Comment

        • SES
          Senior Member
          • Mar 2010
          • 275

          #19
          Here is a simple script that uses an iterator to fetch records by sequence. This would be likely faster and less error-prone than grep:

          Code:
          #!/usr/bin/env perl
          
          use strict;
          use warnings;
          use File::Basename;
          
          my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
          my $infilei = shift or die $usage;
          my $infilej = shift or die $usage;
          
          my %hash;
          open my $ini, '<', $infilei or die $!;
          while (my ($id, $seq) = fasta_it(\*$ini)) {
              $hash{$seq} = $id;
          }
          close $ini;
          
          open my $inj, '<', $infilej or die $!;
          while (my ($id, $seq) = fasta_it(\*$inj)) {
              if (exists $hash{$seq}) {
          	print join "\n", ">".$hash{$seq}, "$seq\n";
              }
          }
          close $inj;
          
          sub fasta_it {
              my ($fh) = @_;
              
              local $/ = "\n>";
              return unless my $entry = $fh->getline;
              chomp $entry;
          
              my ($id, $seq) = split /\n/, $entry, 2;
              defined $id && $id =~ s/>//g;
              return ($id, $seq);
          }
          Here is the gist for easier download: https://gist.github.com/sestaton/889cba88b5279a58d997

          The output:

          Code:
          perl fetch_by_seq.pl i.fas j.fas 
          >1123-11234
          aaaaaa
          >232-23424
          tttttt
          >416-2
          gggggg
          >13424241234-23423
          cccccc
          Depending on the size of the original file you may want to think about using an SQLite database but this should work fine for most uses.

          Comment

          • entomology
            Member
            • May 2011
            • 34

            #20
            I've upload the two file, the expected output is like this:

            >1123-11234
            aaaaaa
            >232-23424
            tttttt
            >416-2
            gggggg
            >13424241234-23423
            cccccc

            Thanks!

            Originally posted by gsgs View Post
            I usually write a small basic program for such problems.

            post/send the file, I send the result ?
            Attached Files

            Comment

            • entomology
              Member
              • May 2011
              • 34

              #21
              Actually, I've deal with some small rna sequence which is with length of 18-30. Anyway, thank you for your kindness.


              Originally posted by Brian Bushnell View Post
              Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

              Comment

              • SES
                Senior Member
                • Mar 2010
                • 275

                #22
                Originally posted by entomology View Post
                I've upload the two file, the expected output is like this:

                >1123-11234
                aaaaaa
                >232-23424
                tttttt
                >416-2
                gggggg
                >13424241234-23423
                cccccc

                Thanks!
                That is what the post above (#19) produces. The question and expected results seem pretty simple, maybe you missed the previous post or are looking for another way?

                Comment

                • entomology
                  Member
                  • May 2011
                  • 34

                  #23
                  Forgive my poor programming skill, still I got some error message as below

                  -bash: syntax error near unexpected token `do'



                  Originally posted by GenoMax View Post
                  @entomology: Try the following

                  Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

                  Code:
                  $  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

                  Comment

                  • entomology
                    Member
                    • May 2011
                    • 34

                    #24
                    I've got below message when I running the script, do i miss some module?

                    Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

                    Originally posted by SES View Post
                    That is what the post above (#19) produces. The question and expected results seem pretty simple, maybe you missed the previous post or are looking for another way?

                    Comment

                    • SES
                      Senior Member
                      • Mar 2010
                      • 275

                      #25
                      Originally posted by entomology View Post
                      I've got below message when I running the script, do i miss some module?

                      Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30
                      Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

                      Code:
                      return unless my $entry = $fh->getline;
                      with

                      Code:
                      return unless my $entry = <$fh>;
                      should do the trick.

                      Comment

                      • entomology
                        Member
                        • May 2011
                        • 34

                        #26
                        I'm using centos

                        2.6.32-573.7.1.el6.centos.plus.x86_64

                        when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

                        ./fetch_fasta.test seq.test seq.test.original.fas
                        Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
                        >1
                        aaaaaa
                        >2
                        tttttt
                        >3
                        gggggg

                        I expected the result like this:

                        >1123-11234
                        aaaaaa
                        >232-23424
                        tttttt
                        >416-2
                        gggggg
                        >13424241234-23423
                        cccccc



                        Originally posted by SES View Post
                        Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

                        Code:
                        return unless my $entry = $fh->getline;
                        with

                        Code:
                        return unless my $entry = <$fh>;
                        should do the trick.

                        Comment

                        • SES
                          Senior Member
                          • Mar 2010
                          • 275

                          #27
                          Originally posted by entomology View Post
                          I'm using centos

                          2.6.32-573.7.1.el6.centos.plus.x86_64

                          when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

                          ./fetch_fasta.test seq.test seq.test.original.fas
                          Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
                          >1
                          aaaaaa
                          >2
                          tttttt
                          >3
                          gggggg

                          I expected the result like this:

                          >1123-11234
                          aaaaaa
                          >232-23424
                          tttttt
                          >416-2
                          gggggg
                          >13424241234-23423
                          cccccc
                          You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

                          Code:
                          #!/usr/bin/env perl
                          
                          use strict;
                          use warnings;
                          use File::Basename;
                          
                          my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
                          my $infilei = shift or die $usage;
                          my $infilej = shift or die $usage;
                          
                          my $hash = index_seq($infilei);
                          open my $inj, '<', $infilej or die $!;
                          {
                              local $/ = '>';
                          
                              while (my $line = <$inj>) {
                          	chomp $line;
                          	my ($seqid, @seqparts) = split /\n/, $line;
                          	my $seq = join '', @seqparts;
                          	next unless defined $seqid && defined $seq;
                          	if (exists $hash->{$seq}) {
                          	    print join "\n", ">".$hash->{$seq}, "$seq\n";
                          	}
                              }
                          }
                          close $inj;
                          
                          sub index_seq {
                              my ($file) = @_;
                              open my $in, '<', $file or die $!;
                          
                              my %hash;
                              {
                          	local $/ = '>';
                          
                          	while (my $line = <$in>) {
                          	    chomp $line;
                          	    my ($seqid, @seqparts) = split /\n/, $line;
                          	    my $seq = join '', @seqparts;
                          	    next unless defined $seqid && defined $seq;
                          	    $hash{$seq} = $seqid;
                          	}
                              }
                              close $in;
                          
                              return \%hash;
                          }
                          By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.
                          Last edited by SES; 12-18-2015, 02:14 PM.

                          Comment

                          • entomology
                            Member
                            • May 2011
                            • 34

                            #28
                            Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

                            the command test gives the result as below:

                            perl -v

                            This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

                            perl -MIO::Handle -e 'print IO::Handle->VERSION'
                            1.28

                            thanks again for your patience and useful help!!!

                            Originally posted by SES View Post
                            You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

                            Code:
                            #!/usr/bin/env perl
                            
                            use strict;
                            use warnings;
                            use File::Basename;
                            
                            my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
                            my $infilei = shift or die $usage;
                            my $infilej = shift or die $usage;
                            
                            my $hash = index_seq($infilei);
                            open my $inj, '<', $infilej or die $!;
                            {
                                local $/ = '>';
                            
                                while (my $line = <$inj>) {
                            	chomp $line;
                            	my ($seqid, @seqparts) = split /\n/, $line;
                            	my $seq = join '', @seqparts;
                            	next unless defined $seqid && defined $seq;
                            	if (exists $hash->{$seq}) {
                            	    print join "\n", ">".$hash->{$seq}, "$seq\n";
                            	}
                                }
                            }
                            close $inj;
                            
                            sub index_seq {
                                my ($file) = @_;
                                open my $in, '<', $file or die $!;
                            
                                my %hash;
                                {
                            	local $/ = '>';
                            
                            	while (my $line = <$in>) {
                            	    chomp $line;
                            	    my ($seqid, @seqparts) = split /\n/, $line;
                            	    my $seq = join '', @seqparts;
                            	    next unless defined $seqid && defined $seq;
                            	    $hash{$seq} = $seqid;
                            	}
                                }
                                close $in;
                            
                                return \%hash;
                            }
                            By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.

                            Comment

                            • SES
                              Senior Member
                              • Mar 2010
                              • 275

                              #29
                              Originally posted by entomology View Post
                              Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

                              the command test gives the result as below:

                              perl -v

                              This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

                              perl -MIO::Handle -e 'print IO::Handle->VERSION'
                              1.28

                              thanks again for your patience and useful help!!!
                              Great, thanks for the info. I'm creating a CentOS 6 image now to test your errors above.

                              Comment

                              • GenoMax
                                Senior Member
                                • Feb 2008
                                • 7142

                                #30
                                Originally posted by entomology View Post
                                Forgive my poor programming skill, still I got some error message as below

                                -bash: syntax error near unexpected token `do'
                                Are you running bash shell? If you are not then try explicitly going into bash like this

                                Code:
                                $ /bin/bash
                                $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  Yesterday, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Yesterday, 12:03 PM
                                0 responses
                                19 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, Yesterday, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-26-2026, 10:12 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...